CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

July, 2019

2019-07-01

  • Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
  • Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
  • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
  • If I change the parameters to 2019 I see stats, so I’m really thinking it has something to do with the sharded yearly Solr statistics cores
    • I checked the Solr admin UI and I see all Solr cores loaded, so I don’t know what it could be
    • When I check the Atmire content and usage module it seems obvious that there is a problem with the old cores because I don’t have anything before 2019-01

Atmire CUA 2018 stats missing

  • I don’t see anyone logged in right now so I’m going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
  • I decided to run all system updates on the server (linode18) and reboot it

    • After rebooting, Tomcat came back up, but the Solr statistics cores were not all loaded
    • The error is always (with a different core):

      org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
      
  • I restarted Tomcat ten times and it never worked…
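
  • For reference, a minimal Python sketch of how the loaded and failed cores could be summarized from a script instead of the admin UI. It assumes the JSON body comes from Solr's CoreAdmin STATUS request (`/solr/admin/cores?action=STATUS&wt=json` on whatever host and port Solr listens on) and that the body contains `status` and `initFailures` maps. The sample payload is a trimmed, hypothetical response, and `summarize_core_status` is a name made up for this example:

```python
import json

def summarize_core_status(payload):
    """Split a CoreAdmin STATUS response into loaded core names and init failures."""
    data = json.loads(payload)
    # Cores that loaded appear under "status" with a non-empty description
    loaded = sorted(name for name, info in data.get("status", {}).items() if info)
    # Cores that failed to initialize appear under "initFailures" with the error text
    failed = dict(data.get("initFailures", {}))
    return loaded, failed

# Hypothetical, trimmed response body used only for illustration
sample = json.dumps({
    "status": {
        "statistics": {"name": "statistics"},
        "statistics-2018": {"name": "statistics-2018"},
    },
    "initFailures": {
        "statistics-2010": "SolrException: Unable to create core [statistics-2010]",
    },
})

loaded, failed = summarize_core_status(sample)
print("loaded:", loaded)
print("failed:", sorted(failed))
```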

  • I tried to stop Tomcat and delete the write locks:

    # systemctl stop tomcat7
    # find /dspace/solr/statistics* -iname "*.lock" -print -delete
    /dspace/solr/statistics/data/index/write.lock
    /dspace/solr/statistics-2010/data/index/write.lock
    /dspace/solr/statistics-2011/data/index/write.lock
    /dspace/solr/statistics-2012/data/index/write.lock
    /dspace/solr/statistics-2013/data/index/write.lock
    /dspace/solr/statistics-2014/data/index/write.lock
    /dspace/solr/statistics-2015/data/index/write.lock
    /dspace/solr/statistics-2016/data/index/write.lock
    /dspace/solr/statistics-2017/data/index/write.lock
    /dspace/solr/statistics-2018/data/index/write.lock
    # find /dspace/solr/statistics* -iname "*.lock" -print -delete
    # systemctl start tomcat7
    
  • But it still didn’t work!
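
  • The same lock cleanup could be sketched in Python, which makes it easy to do a dry run first. This is only an approximation of the `find` command above: it looks at regular files only, `find_lock_files` and `delete_lock_files` are hypothetical helper names, and the demonstration runs against a throwaway directory rather than the real Solr home. As with the shell version, it should only be run while Tomcat is stopped:

```python
import tempfile
from pathlib import Path

def find_lock_files(solr_dir, core_prefix="statistics"):
    """List *.lock files (case-insensitive) under core directories named core_prefix*."""
    locks = []
    for core_dir in sorted(Path(solr_dir).glob(core_prefix + "*")):
        if not core_dir.is_dir():
            continue
        for path in sorted(core_dir.rglob("*")):
            if path.is_file() and path.name.lower().endswith(".lock"):
                locks.append(path)
    return locks

def delete_lock_files(solr_dir, dry_run=True):
    """Return the lock files found; only unlink them when dry_run is False."""
    locks = find_lock_files(solr_dir)
    if not dry_run:
        for path in locks:
            path.unlink()
    return locks

# Demonstration against a temporary directory tree, not a real Solr home
with tempfile.TemporaryDirectory() as tmp:
    for core in ("statistics", "statistics-2010", "search"):
        index_dir = Path(tmp, core, "data", "index")
        index_dir.mkdir(parents=True)
        (index_dir / "write.lock").touch()
    deleted = delete_lock_files(tmp, dry_run=False)
    found = [p.relative_to(tmp).as_posix() for p in deleted]
    remaining = [p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.lock")]

print(found)
print(remaining)
```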

  • I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:

    <lockType>${solr.lock.type:simple}</lockType>
    
  • And after restarting Tomcat it still doesn’t work

  • Now I’ll try going back to “native” locking with unlockOnStartup:

    <unlockOnStartup>true</unlockOnStartup>
    
  • Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018

  • I filed an issue with Atmire, so let’s see if they can help

  • And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace

  • The old ones were:

    -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
    
  • And the new ones come from Solr 4.10.x’s startup scripts:

    -Djava.awt.headless=true
    -Xms8192m -Xmx8192m
    -Dfile.encoding=UTF-8
    -XX:NewRatio=3
    -XX:SurvivorRatio=4
    -XX:TargetSurvivorRatio=90
    -XX:MaxTenuringThreshold=8
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
    -XX:+CMSScavengeBeforeRemark
    -XX:PretenureSizeThreshold=64m
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=50
    -XX:CMSMaxAbortablePrecleanTime=6000
    -XX:+CMSParallelRemarkEnabled
    -XX:+ParallelRefProcEnabled
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=1337
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false
    

2019-07-02

  • Help upload twenty-seven posters from the 2019-05 Sharefair to CGSpace

    • Sisay had already done the SAFBundle so I made some minor corrections and uploaded them to a temporary collection so I could check them in OpenRefine:

      $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
      $ for item in item_*; do echo "10568/101992" >> "$item/collections"; done
      $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
      
  • I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection

  • Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:

    $ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
    
  • Peter pointed out that the Sharefair dates I fixed were not actually fixed

    • It seems there is a bug that causes DSpace to not detect changes if the values are the same like “2019-05||2019-05” and you try to remove one
    • To get it to work I had to change some of them to 2019-01, then remove them
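
  • A small sketch of how repeated values in a multi-value cell could be collapsed in a CSV before it is uploaded, assuming “||” is the separator as in the example above. The function name `dedupe_multivalue` is made up for this illustration, and this only cleans the exported values; it does not work around the change-detection problem described above:

```python
def dedupe_multivalue(cell, separator="||"):
    """Drop repeated values from a multi-value cell, keeping first-seen order."""
    unique = []
    for value in cell.split(separator):
        value = value.strip()
        if value and value not in unique:
            unique.append(value)
    return separator.join(unique)

print(dedupe_multivalue("2019-05||2019-05"))
print(dedupe_multivalue("2019-05||2019-06||2019-05"))
```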

2019-07-03

  • Atmire responded about the Solr issue and said they would be willing to help

2019-07-04

  • Maria Garruccio sent me some new ORCID identifiers for Bioversity authors

    • I combined them with our existing list and then used my resolve-orcids.py script to update the names from ORCID.org:

      $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
      $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
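
  • The extraction step in the first command could also be sketched in Python. Note that the pattern here is deliberately stricter than the grep above, which is my own assumption: it only accepts digits, with an optional X as the final character. The sample text is illustrative and `extract_orcids` is a hypothetical helper, not part of resolve-orcids.py:

```python
import re

# Four groups of four digits; the last character may be the check character X
ORCID_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")

def extract_orcids(text):
    """Return the unique ORCID identifiers found in text, sorted."""
    return sorted(set(ORCID_PATTERN.findall(text)))

# Illustrative input mixing names and identifiers, with one repeat
sample_text = """
Marius Ekué: 0000-0002-5829-6321
Mwungu: 0000-0001-6181-8445
Mwungu: 0000-0003-1658-287X
0000-0002-5829-6321
"""

orcids = extract_orcids(sample_text)
print("\n".join(orcids))
```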
      
  • Send and merge a pull request for the new ORCID identifiers (#428)

  • I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:

    cg.creator.id,correct
    "Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
    "Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
    "Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
    
  • But when I ran fix-metadata-values.py I didn’t see any changes:

    $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    

2019-07-06

2019-07-08

  • Communicate with Atmire about the Solr statistics cores issue
    • I suspect we might need to get more disk space on DSpace Test so we can try to replicate the production environment more closely
  • Meeting with AgroKnow and CTA about their new ICT Update storytelling thing
    • AgroKnow has developed a React application to display tag clouds based on harvesting metadata and full text from CGSpace items
    • We discussed how to host it technically; perhaps we purchase a server to run it on and just give the AgroKnow guys access
  • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:

    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1$'
    field,value,count
    cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
    $ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1$'
    field,value,count
    dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
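
  • The same duplicate check could be sketched with Python's csv module, counting how often each non-empty value of a column appears and keeping those seen more than once. The `find_duplicates` helper and the three-row sample CSV are assumptions for illustration; only the DOI value comes from the output above:

```python
import csv
import io
from collections import Counter

def find_duplicates(rows, field):
    """Map each non-empty value of field that occurs more than once to its count."""
    values = ((row.get(field) or "").strip() for row in rows)
    counts = Counter(value for value in values if value)
    return {value: count for value, count in counts.items() if count > 1}

# Hypothetical sample standing in for the exported metadata CSV
sample_csv = """id,cg.identifier.doi,dc.title
1,https://doi.org/10.1016/j.agwat.2018.06.018,First copy
2,https://doi.org/10.1016/j.agwat.2018.06.018,Second copy
3,,No DOI here
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
duplicate_dois = find_duplicates(rows, "cg.identifier.doi")
print(duplicate_dois)
```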
    
  • Or perhaps if DOIs are valid or not (having doi.org in the URL):

    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
    field,value,count
    cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
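
  • A slightly stricter variant of this check, sketched in Python: instead of looking for “doi.org” anywhere in the string, parse the URL and look at the host and the start of the path. The accepted hosts and the requirement that the path starts with “/10.” are my assumptions about what a well-formed DOI link looks like, and `looks_like_doi_url` is a hypothetical name:

```python
from urllib.parse import urlparse

# Assumption: these are the only hosts we want to accept for DOI links
DOI_HOSTS = {"doi.org", "dx.doi.org"}

def looks_like_doi_url(value):
    """Return True if value is an http(s) URL on a DOI host whose path starts with /10."""
    parsed = urlparse(value.strip())
    return (
        parsed.scheme in ("http", "https")
        and parsed.hostname in DOI_HOSTS
        and parsed.path.startswith("/10.")
    )

for url in (
    "https://doi.org/10.1016/j.agwat.2018.06.018",
    "https://hdl.handle.net/10520/EJC-1236ac700f",
):
    print(url, looks_like_doi_url(url))
```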
    
  • Or perhaps items with invalid ISSNs (according to the ISSN code format):

    $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
    dc.identifier.issn
    978-3-319-71997-9
    978-3-319-71997-9
    978-3-319-71997-9
    978-3-319-58789-9
    2320-7035 
    2593-9173
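
  • A Python sketch of the same format check that also verifies the ISSN check digit, which goes beyond what the grep above does. The check digit uses the standard weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10. The function names are made up for this example, and values are matched exactly, so anything with stray whitespace is rejected just like an ISBN would be:

```python
import re

ISSN_FORMAT = re.compile(r"[0-9]{4}-[0-9]{3}[0-9Xx]")

def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def is_valid_issn(value):
    """Return True if value matches NNNN-NNNC exactly and the check digit is right."""
    if not ISSN_FORMAT.fullmatch(value):
        return False
    digits = value.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7].upper()

for candidate in ("2320-7035", "2320-7035 ", "2593-9173", "978-3-319-71997-9"):
    print(repr(candidate), is_valid_issn(candidate))
```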
    

2019-07-09