July, 2019

Mon Jul 01, 2019 by Alan Orth in Notes

2019-07-01

Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
- DSpace Test
- CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

If I change the parameters to 2019 I see stats, so I’m really thinking it has something to do with the sharded yearly Solr statistics cores
- I checked the Solr admin UI and I see all Solr cores loaded, so I don’t know what it could be
- When I check the Atmire content and usage module it seems obvious that there is a problem with the old cores because I dont have anything before 2019-01

Atmire CUA 2018 stats missing

I don’t see anyone logged in right now so I’m going to try to restart Tomcat and see if the stats are accessible after Solr comes back up

I decided to run all system updates on the server (linode18) and reboot it

After rebooting Tomcat came back up, but the the Solr statistics cores were not all loaded

The error is always (with a different core):

org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock

I restarted Tomcat ten times and it never worked…

I tried to stop Tomcat and delete the write locks:

# systemctl stop tomcat7
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
/dspace/solr/statistics/data/index/write.lock
/dspace/solr/statistics-2010/data/index/write.lock
/dspace/solr/statistics-2011/data/index/write.lock
/dspace/solr/statistics-2012/data/index/write.lock
/dspace/solr/statistics-2013/data/index/write.lock
/dspace/solr/statistics-2014/data/index/write.lock
/dspace/solr/statistics-2015/data/index/write.lock
/dspace/solr/statistics-2016/data/index/write.lock
/dspace/solr/statistics-2017/data/index/write.lock
/dspace/solr/statistics-2018/data/index/write.lock
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
# systemctl start tomcat7

But it still didn’t work!
I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:
```
<lockType>${solr.lock.type:simple}</lockType>
```
And after restarting Tomcat it still doesn’t work
Now I’ll try going back to “native” locking with unlockAtStartup:
```
<unlockOnStartup>true</unlockOnStartup>
```
Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018
I filed an issue with Atmire, so let’s see if they can help
And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace

The old ones were:

-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

And the new ones come from Solr 4.10.x’s startup scripts:

-Djava.awt.headless=true
-Xms8192m -Xmx8192m
-Dfile.encoding=UTF-8
-XX:NewRatio=3
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-XX:MaxTenuringThreshold=8
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
-XX:+CMSScavengeBeforeRemark
-XX:PretenureSizeThreshold=64m
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=1337
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

2019-07-02

Help upload twenty-seven posters from the 2019-05 Sharefair to CGSpace
- Sisay had already done the SAFBundle so I did some minor corrections to and uploaded them to a temporary collection so I could check them in OpenRefine:
```
$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
$ echo "10568/101992" >> item_*/collections
$ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
```
I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection

Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:

$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat

Peter pointed out that the Sharefair dates I fixed were not actually fixed
- It seems there is a bug that causes DSpace to not detect changes if the values are the same like “2019-05||2019-05” and you try to remove one
- To get it to work I had to change some of them to 2019-01, then remove them

2019-07-03

Atmire responded about the Solr issue and said they would be willing to help

2019-07-04

Maria Garruccio sent me some new ORCID identifiers for Bioversity authors

I combined them with our existing list and then used my resolve-orcids.py script to update the names from ORCID.org:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d

Send and merge a pull request for the new ORCID identifiers (#428)

I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the databse:

cg.creator.id,correct
"Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
"Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
"Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"

But when I ran fix-metadata-values.py I didn’t see any changes:

$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d

2019-07-06

Send a reminder to Marie about my notes on the CG Core v2 issue I created two weeks ago

2019-07-08

Communicate with Atmire about the Solr statistics cores issue
- I suspect we might need to get more disk space on DSpace Test so we can try to replicate the production environment more closely
Meeting with AgroKnow and CTA about their new ICT Update story telling thing
- AgroKnow has developed a React application to display tag clouds based on harvesting metadata and full text from CGSpace items
- We discussed how to host it technically, perhaps we purchase a server to run it on and just give AgroKnow guys access

Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:

$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
field,value,count
cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
field,value,count
dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2

Or perhaps if DOIs are valid or not (having doi.org in the URL):

$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
field,value,count
cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1

Or perhaps items with invalid ISSNs (according to the ISSN code format):

$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
dc.identifier.issn
978-3-319-71997-9
978-3-319-71997-9
978-3-319-71997-9
978-3-319-58789-9
2320-7035 
2593-9173

2019-07-09

Thinking about data cleaning automation again and found some resources about Python and Pandas:
- https://realpython.com/python-data-cleaning-numpy-pandas/
- https://mode.com/blog/python-data-cleaning-libraries