July, 2019

Mon Jul 01, 2019 in Notes

2019-07-01

Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
- DSpace Test
- CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

If I change the parameters to 2019 I see stats, so I’m really thinking it has something to do with the sharded yearly Solr statistics cores
- I checked the Solr admin UI and I see all Solr cores loaded, so I don’t know what it could be
- When I check the Atmire content and usage module it seems obvious that there is a problem with the old cores because I don’t have anything before 2019-01

Atmire CUA 2018 stats missing

I don’t see anyone logged in right now so I’m going to try to restart Tomcat and see if the stats are accessible after Solr comes back up
I decided to run all system updates on the server (linode18) and reboot it
- After rebooting, Tomcat came back up, but the Solr statistics cores were not all loaded
- The error is always (with a different core):

org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock

I restarted Tomcat ten times and it never worked…
I tried to stop Tomcat and delete the write locks:

# systemctl stop tomcat7
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
/dspace/solr/statistics/data/index/write.lock
/dspace/solr/statistics-2010/data/index/write.lock
/dspace/solr/statistics-2011/data/index/write.lock
/dspace/solr/statistics-2012/data/index/write.lock
/dspace/solr/statistics-2013/data/index/write.lock
/dspace/solr/statistics-2014/data/index/write.lock
/dspace/solr/statistics-2015/data/index/write.lock
/dspace/solr/statistics-2016/data/index/write.lock
/dspace/solr/statistics-2017/data/index/write.lock
/dspace/solr/statistics-2018/data/index/write.lock
# find /dspace/solr/statistics* -iname "*.lock" -print -delete
# systemctl start tomcat7

But it still didn’t work!
I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:

<lockType>${solr.lock.type:simple}</lockType>

And after restarting Tomcat it still doesn’t work
Now I’ll try going back to “native” locking with unlockOnStartup:

<unlockOnStartup>true</unlockOnStartup>

Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018
I filed an issue with Atmire, so let’s see if they can help
And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace
The old ones were:

-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

And the new ones come from Solr 4.10.x’s startup scripts:

    -Djava.awt.headless=true
    -Xms8192m -Xmx8192m
    -Dfile.encoding=UTF-8
    -XX:NewRatio=3
    -XX:SurvivorRatio=4
    -XX:TargetSurvivorRatio=90
    -XX:MaxTenuringThreshold=8
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
    -XX:+CMSScavengeBeforeRemark
    -XX:PretenureSizeThreshold=64m
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=50
    -XX:CMSMaxAbortablePrecleanTime=6000
    -XX:+CMSParallelRemarkEnabled
    -XX:+ParallelRefProcEnabled
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=1337
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.authenticate=false

2019-07-02

Help upload twenty-seven posters from the 2019-05 Sharefair to CGSpace
- Sisay had already done the SAFBundle so I made some minor corrections and uploaded them to a temporary collection so I could check them in OpenRefine:

$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
$ for item in item_*; do echo "10568/101992" >> "$item/collections"; done
$ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped

I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection
Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:

$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat

Peter pointed out that the Sharefair dates I fixed were not actually fixed
- It seems there is a bug that causes DSpace to not detect changes if the values are the same like “2019-05||2019-05” and you try to remove one
- To get it to work I had to change some of them to 2019-01, then remove them
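
One way to sidestep this kind of repair could be to collapse repeated values in the metadata before it is imported. A minimal sketch, assuming values are joined with “||” as in the example above; the function name `dedupe_multivalue` is hypothetical and not part of DSpace:

```python
def dedupe_multivalue(cell, separator="||"):
    """Drop repeated values from a multi-value cell, keeping first-seen order."""
    seen = []
    for value in cell.split(separator):
        value = value.strip()
        # Skip empty fragments and anything we have already kept
        if value and value not in seen:
            seen.append(value)
    return separator.join(seen)
```

For example, `dedupe_multivalue("2019-05||2019-05")` gives back a single `2019-05`, while a cell with two different dates is left alone.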

2019-07-03

Atmire responded about the Solr issue and said they would be willing to help

2019-07-04

Maria Garruccio sent me some new ORCID identifiers for Bioversity authors
- I combined them with our existing list and then used my resolve-orcids.py script to update the names from ORCID.org:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
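
The cat / grep / sort step above can also be sketched in Python. This is only an illustration and not what resolve-orcids.py does internally; the name `extract_orcids` is hypothetical, and the pattern is deliberately a little stricter than the grep one (digits only, with “X” allowed in the last position):

```python
import re

# Four groups of four characters: digits only, except that the very last
# character may be the check character "X"
ORCID_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def extract_orcids(text):
    """Return the sorted, de-duplicated ORCID identifiers found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))
```

Feeding it the concatenated contents of the controlled vocabulary and the new list should give the same kind of unique, sorted list as `sort -u`.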

Send and merge a pull request for the new ORCID identifiers (#428)
I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:

cg.creator.id,correct
"Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
"Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
"Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"

But when I ran fix-metadata-values.py I didn’t see any changes:

$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d

2019-07-06

Send a reminder to Marie about my notes on the CG Core v2 issue I created two weeks ago

2019-07-08

Communicate with Atmire about the Solr statistics cores issue
- I suspect we might need to get more disk space on DSpace Test so we can try to replicate the production environment more closely
Meeting with AgroKnow and CTA about their new ICT Update storytelling thing
- AgroKnow has developed a React application to display tag clouds based on harvesting metadata and full text from CGSpace items
- We discussed how to host it technically; perhaps we purchase a server to run it on and just give the AgroKnow guys access
Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:

$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1$'
field,value,count
cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1$'
field,value,count
dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2

Or perhaps if DOIs are valid or not (having doi.org in the URL):

$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
field,value,count
cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1

Or perhaps items with invalid ISSNs (according to the ISSN code format):

$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
dc.identifier.issn
978-3-319-71997-9
978-3-319-71997-9
978-3-319-71997-9
978-3-319-58789-9
2320-7035 
 2593-9173
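
The same three checks (duplicates, DOI host, ISSN format) could be sketched with only the Python standard library. This is a rough illustration rather than a replacement for xsv; all function names are hypothetical, and treating dx.doi.org as acceptable is my own assumption:

```python
import csv
import re
from collections import Counter
from urllib.parse import urlparse

ISSN_FORMAT = re.compile(r"[0-9]{4}-[0-9]{3}[0-9xX]")


def column_values(csv_file, column):
    """Yield the non-empty raw values of one column from an open CSV file."""
    for row in csv.DictReader(csv_file):
        value = row.get(column) or ""
        if value:
            yield value


def duplicates(values):
    """Return {value: count} for values that appear more than once."""
    return {v: n for v, n in Counter(values).items() if n > 1}


def is_doi_url(value):
    """True if the value is an http(s) URL on doi.org (or dx.doi.org)."""
    parsed = urlparse(value.strip())
    hosts = ("doi.org", "dx.doi.org")  # assumption: both count as DOI links
    return parsed.scheme in ("http", "https") and parsed.netloc.lower() in hosts


def is_issn_format(value):
    """True only for an exact NNNN-NNNC match, so stray spaces fail too."""
    return ISSN_FORMAT.fullmatch(value) is not None
```

Like the grep above, `is_issn_format` only checks the shape of the value, so ISBNs and values with leading or trailing spaces are both reported.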

2019-07-09

Thinking about data cleaning automation again and found some resources about Python and Pandas:
- https://realpython.com/python-data-cleaning-numpy-pandas/
- https://mode.com/blog/python-data-cleaning-libraries

2019-07-11

Skype call with Marie Angelique about CG Core v2
- We discussed my comments and suggestions from last week
- One comment she had was that we should try to move our center-specific subjects into DCTERMS.subject and normalize them against AGROVOC
- I updated my gist about CGSpace metadata changes
Skype call with Jane Poole to discuss OpenRXV/AReS Phase II TORs
- I need to follow up with Moayad about the reporting functionality
- Also, I need to email Harrison my notes on the CG Core v2 stuff
- Also, Jane asked me to check the Data Portal to see which email address requests for confidential data are going to
Yesterday Theirry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”
I looked in the DSpace logs and found this right around the time of the screenshot he sent me:

2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658

I’m assuming something happened in his browser (like a refresh) after the item was submitted…

2019-07-12

Atmire responded with some initial feedback about our Tomcat configuration related to the Solr issue I raised recently
- Unfortunately there is no concrete feedback yet
- I think we need to upgrade our DSpace Test server so we can fit all the Solr cores…
- Actually, I looked and there were over 40 GB free on DSpace Test so I copied the Solr statistics cores for the years 2017 to 2010 from CGSpace to DSpace Test because they weren’t actually very large
- I re-deployed DSpace for good measure, and I think all Solr cores are loading… I will do more tests later
Run all system updates on DSpace Test (linode19) and reboot it
Tried to run dspace cleanup -v on CGSpace and ran into an error:

Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle".

The solution is, as always:

# su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
UPDATE 1
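
Since this fix comes up repeatedly, the SQL could be generated from the error text itself. A minimal sketch, assuming the error always has the “Key (bitstream_id)=(…)” form shown above; the function name `primary_bitstream_fix` is hypothetical, and the statement should still be reviewed before running it with psql:

```python
import re


def primary_bitstream_fix(error_text):
    """Build the UPDATE statement for the bitstream IDs named in the error."""
    ids = re.findall(r"Key \(bitstream_id\)=\((\d+)\)", error_text)
    if not ids:
        return None  # nothing recognizable in the error text
    # De-duplicate and sort numerically so the statement is stable
    id_list = ", ".join(sorted(set(ids), key=int))
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )
```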

2019-07-16

Completely reset the Podman configuration on my laptop because there were some layers that I couldn’t delete and it had been some time since I did a cleanup:

$ podman system prune -a -f --volumes
$ sudo rm -rf ~/.local/share/containers

Then pull a new PostgreSQL 9.6 image and load a CGSpace database dump into a new local test container:

$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2019-07-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest

Start working on implementing the CG Core v2 changes on my local DSpace test environment
Make a pull request to CG Core v2 with some fixes for typos in the specification (#5)

2019-07-18

Talk to Moayad about the remaining issues for OpenRXV / AReS
- He sent a pull request with some changes for the bar chart and documentation about configuration, and said he’d finish the export feature next week
Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:

$ dspace test-email

About to send test email:
 - To: blahh@cgiar.org
 - Subject: DSpace test email
 - Server: smtp.office365.com

Error sending email:
 - Error: javax.mail.AuthenticationFailedException

Please see the DSpace documentation for assistance.

I emailed ICT to ask them to reset it and make the expiration period longer if possible

2019-07-19

ICT reset the password for the CGSpace support account and apparently removed the expiry requirement
- I tested the account and it’s working

2019-07-20

Create an account for Lionelle Samnick on CGSpace because the registration isn’t working for some reason:

$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'

I added her as a submitter to CTA ISF Pro-Agro series
Start looking at 1429 records for the Bioversity batch import
- Multiple authors should be specified with multi-value separator (||) instead of ;
- We don’t use “(eds)” as an author
- Same issue with dc.publisher using “;” for multiple values
- Some invalid ISSNs in dc.identifier.issn (they look like ISBNs)
- I see some ISSNs in the dc.identifier.isbn field
- I see some invalid ISBNs that look like Excel errors (9,78E+12)
- For DOI we just use the URL, not “DOI: https://doi.org…”
- I see an invalid “LEAVE BLANK” in the cg.contributor.crp field
- Country field is using “,” for multiple values instead of “||”
- Region field is using “,” for multiple values instead of “||”
- Language field should be lowercase like “en”, and it is using the wrong multiple value separator, and has some invalid values
- What is the cg.identifier.url2 field? You should probably add those as cg.link.reference
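
Two of the problems in this list are mechanical enough to sketch in Python: the wrong multi-value separator and the Excel-mangled ISBNs. Both helper names are hypothetical, and splitting on “,” is only safe for fields where the values themselves never contain a comma:

```python
import re


def normalize_separators(cell, wrong=";", right="||"):
    """Re-join a multi-value cell that used the wrong separator."""
    parts = [part.strip() for part in cell.split(wrong)]
    return right.join(part for part in parts if part)


def looks_like_excel_number(value):
    """True for scientific-notation leftovers such as 9,78E+12 or 9.78E+12."""
    return re.fullmatch(r"\d+([.,]\d+)?[eE][+-]?\d+", value.strip()) is not None
```

For authors the default `wrong=";"` applies, while the country and region fields would need `wrong=","`.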

2019-07-22

Raise an issue on CG Core v2 spec regarding country and region coverage
- The current standard has them implemented as a class like this:

    <dct:coverage>
        <dct:spatial>
            <type>Country</type>
            <dct:identifier>http://sws.geonames.org/192950</dct:identifier>
            <rdfs:label>Kenya</rdfs:label>
        </dct:spatial>
    </dct:coverage>

I left a note saying that DSpace is technically limited to a flat schema so we use cg.coverage.country: Kenya
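
To illustrate the flattening, here is a minimal sketch that maps each nested dct:spatial block to a flat field/value pair. It assumes the XML declares the usual DCMI Terms and RDF Schema namespaces (the fragment above omits the declarations), and deriving the field name from the `<type>` value is my own assumption; `flatten_coverage` is a hypothetical name:

```python
import xml.etree.ElementTree as ET

NS = {
    "dct": "http://purl.org/dc/terms/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def flatten_coverage(xml_text):
    """Map each dct:spatial block to a flat (field, value) pair."""
    root = ET.fromstring(xml_text)
    pairs = []
    for spatial in root.iter("{http://purl.org/dc/terms/}spatial"):
        # <type> is unprefixed in the spec example, e.g. "Country"
        kind = spatial.findtext("type", default="").strip().lower()
        label = spatial.findtext("rdfs:label", default="", namespaces=NS).strip()
        if kind and label:
            pairs.append((f"cg.coverage.{kind}", label))
    return pairs
```

The geonames identifier is simply dropped here, which is exactly the information a flat schema cannot carry alongside the label.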
Do a little more work on CG Core v2 in the input forms

2019-07-25

Generate a list of the ORCID identifiers that we added to CGSpace in 2019 for Sara Jani at ICARDA
Bioversity sent a new file for their migration to CGSpace
- There is always a blank row and blank column at the end
- One invalid type (Brie)
- 824 items with leading/trailing spaces in dc.identifier.citation
- 175 items with a trailing comma in dc.identifier.citation (using custom text facet with GREL value.endsWith(',').toString())
- Fix them with GREL transform: value.replace(/,$/, '')
- A few strange publishers after splitting multi-value cells, like “(Belgium)”
- Deleted four ISSNs that are actually ISBNs and are already present in the ISBN field
- Eight invalid ISBNs
- Convert all DOIs to “https://doi.org” format and fix one invalid DOI
- Fix a handful of incorrect CRPs that seem to have been split on comma “,”
- Lots of strange values in cg.link.reference, and I normalized all DOIs to https://doi.org format
  - There are lots of invalid links here, like “36” and “recordlink:publications:2606” and “t3://record?identifier=publications&uid=2606”
  - Also there are hundreds of items that use the same value for cg.link.reference AND cg.link.dataurl
- Use https:// for all Bioversity links (reference, data url, permalink)
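
Some of these OpenRefine fixes could also be scripted. A minimal sketch of the trailing-comma and DOI clean-ups in Python; both function names are hypothetical, and the DOI pattern (a “10.” prefix with four to nine digits, then a slash) is a common heuristic rather than a complete definition:

```python
import re


def strip_trailing_comma(value):
    """Python equivalent of the GREL transform value.replace(/,$/, '')."""
    return re.sub(r",$", "", value)


def normalize_doi(value):
    """Rewrite common DOI spellings to the https://doi.org/ URL form."""
    match = re.search(r"10\.\d{4,9}/\S+", value)
    if match is None:
        return value  # leave anything unrecognized for manual review
    return "https://doi.org/" + match.group(0)
```

Values like “36” or “recordlink:publications:2606” pass through `normalize_doi` unchanged, so they still need to be caught by a separate check.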
I might be able to use isbnlib to validate ISBNs in Python:

import isbnlib

# Accept the value if it is a valid ISBN-10 or ISBN-13
if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
    print("Yes")
else:
    print("No")

Or with python-stdnum:

from stdnum import isbn
from stdnum import issn

isbn.validate('978-92-9043-389-7')
issn.validate('1020-3362')
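
For reference, the check-digit arithmetic behind ISBN-13 and ISSN validation is simple enough to sketch with the standard library alone. This is not how isbnlib or python-stdnum are implemented, just the published checksum rules; the function names are hypothetical and ISBN-10 is not covered:

```python
import re


def isbn13_is_valid(value):
    """Check the ISBN-13 check digit (weights alternate 1, 3; sum mod 10)."""
    digits = value.replace("-", "").replace(" ", "")
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def issn_is_valid(value):
    """Check the ISSN check digit (weights 8 down to 2, mod 11, 10 is 'X')."""
    chars = value.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{7}[0-9X]", chars):
        return False
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))
```

Unlike a pure format check, this rejects values that have the right shape but a wrong final digit.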

2019-07-26

Bioversity sent me an updated CSV file that fixes some of the issues I pointed out yesterday
- There are still 1429 records
- There is still one extra row and one extra column
- There are still eight invalid ISBNs (according to my validate.py script)
I figured out a GREL to trim spaces in multi-value cells without splitting them:

value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
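
A rough Python equivalent of that GREL expression, for use outside OpenRefine. The function name `trim_multivalue` is hypothetical, and unlike the GREL version this sketch also trims whitespace at the very start and end of the cell:

```python
def trim_multivalue(value, separator="||"):
    """Strip whitespace around every value in a multi-value cell."""
    return separator.join(part.strip() for part in value.split(separator))
```

Whitespace inside a value, such as the space in “Burkina Faso”, is left untouched.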

I whipped up a quick script using Python Pandas to do whitespace cleanup

2019-07-29

I turned the Pandas script into a proper Python package called csv-metadata-quality
- It supports CSV and Excel files
- It fixes whitespace errors and erroneous multi-value separators ("|") and validates ISSNs, ISBNs, and dates
- Also I added a bunch of other checks/fixes for unnecessary and “suspicious” Unicode characters
- I added fixes to drop duplicate metadata values
- And lastly, I added validation of ISO 639-2 and ISO 639-3 languages
- And lastly lastly, I added AGROVOC validation of subject terms
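
To give a flavor of the kind of checks listed above, here is a minimal standard-library sketch of two of them: fixing a lone “|” separator and validating a date. This is not the package’s actual code; the function names and the accepted date formats (YYYY, YYYY-MM, YYYY-MM-DD) are assumptions for illustration:

```python
import re
from datetime import datetime


def fix_single_pipes(value):
    """Replace a lone "|" with "||" while leaving correct separators alone."""
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


def is_valid_date(value):
    """Accept YYYY, YYYY-MM, or YYYY-MM-DD and reject everything else."""
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue  # try the next, more specific format
    return False
```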
Inform Bioversity that there is an error in their CSV, seemingly caused by quotes in the citation field

2019-07-30

Add support for removing newlines (line feeds) to csv-metadata-quality
On the subject of validating some of our fields like countries and regions, Abenet pointed out that these should all be valid AGROVOC terms, so we can actually try to validate against that!