August, 2019

Sat Aug 03, 2019 by Alan Orth in Notes

2019-08-03

Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

2019-08-04

Deploy ORCID identifier updates requested by Bioversity to CGSpace
Run system updates on CGSpace (linode18) and reboot it
- Before updating it I checked Solr and verified that all statistics cores were loaded properly…
- After rebooting, all statistics cores were loaded… wow, that’s lucky.
Run system updates on DSpace Test (linode19) and reboot it

2019-08-05

Update Tomcat to 7.0.96 in the Ansible infrastructure playbooks
Update PostgreSQL JDBC driver to 42.2.6 in the Ansible infrastrucutre playbooks
Deploy both on DSpace Test (linode19)
Looking at the 1429 records for Bioversity migration again
- The following items use the same exact PDF and seem to be duplicates:
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=10191
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=342
- The following items use the same exact PDF, but one seems to be incorrect:
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=5347
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=5340
- The following PDFs are used by several items incorrectly:
- Report_of_a_Working_Group_on_Allium_7.pdf
- Report_of_a_Working_Group_on_Allium_Fourth_meeting_1696.pdf
- I checked the SHA1 hashes of each PDF and found that some appear more than once…
- The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=433
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=10189
- The following items use the same PDF with a different name, but seem to be duplicates (pick one?):
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=332
- https://www.bioversityinternational.org/index.php?id=244&tx_news_pi1[news]=10187
- There are about thirty PDFs that have French or Spanish filenames and there seems to be an encoding issue
- I asked Francesco if he can give me a PDF URL column instead of a “filename” column so I can download the files myself
- At least the ~50 filenames identified by the following GREL will have issues:
```
or(
isNotNull(value.match(/^.*’.*$/)),
isNotNull(value.match(/^.*é.*$/)),
isNotNull(value.match(/^.*á.*$/)),
isNotNull(value.match(/^.*è.*$/)),
isNotNull(value.match(/^.*í.*$/)),
isNotNull(value.match(/^.*ó.*$/)),
isNotNull(value.match(/^.*ú.*$/)),
isNotNull(value.match(/^.*à.*$/)),
isNotNull(value.match(/^.*û.*$/))
).toString()
```
I tried to extract the filenames and construct a URL to download the PDFs with my generate-thumbnails.py script, but there seem to be several paths for PDFs so I can’t guess it properly
I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test

2019-08-06

Francesca responded to address my feedback yesterday
- I made some changes to the CSV based on her feedback (remove two duplicates, change one PDF file name, change two titles)
- Then I found some items that have PDFs in multiple languages that only list one language in dc.language.iso so I changed them
- Strangley, one item was referring to a 7zip file…
- After removing the two duplicates there are now 1427 records
- Fix one invalid ISSN: 1020-2002→1020-3362

2019-08-07

Daniel Haile-Michael asked about using a logical OR with the DSpace OpenSearch, but I looked in the DSpace manual and it does not seem to be possible

2019-08-08

Moayad noticed that the HTTPS certificate expired on the AReS dev server (linode20)
- The first problem was that there is a Docker container listening on port 80, so it conflicts with the ACME http-01 validation
- The second problem was that we only allow access to port 80 from localhost
- I adjusted the renew-letsencrypt systemd service so it stops/starts the Docker container and firewall:
```
# /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
```
It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains
Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0
Run all system updates on AReS dev server (linode20) and reboot it

Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:

$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs2.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload2.csv
$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt

(the weird sed regex removes color codes, because my generate-thumbnails script prints pretty colors)
Some PDFs are uploaded in different paths so I have to try a few times to get them all:
- /fileadmin/_migrated/uploads/tx_news/
- /fileadmin/user_upload/online_library/publications/pdfs/
- /fileadmin/user_upload/
Even so, there are still 52 items with incorrect filenames, so I can’t derive their PDF URLs…
- For example, Wild_cherry_Prunus_avium_859.pdf is here (with double underscore): https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf
I will proceed with a metadata-only upload first and then let them know about the missing PDFs
Troubleshoot an issue we had with proxying to the new development version of AReS from DSpace Test (linode19)
- For some reason the host header in the proxy pass is not set so nginx on DSpace Test makes a request to the upstream nginx on an IP-based virtual host
- The upstream nginx returns HTTP 444 because we configured it to not answer when a request does not send a valid hostname
- The solution is to set the host header when proxy passing:
```
proxy_set_header Host dev.ares.codeobia.com;
```
Though I am really wondering why this happened now, because the configuration has been working for months…
Improve the output of the suspicious characters check in csv-metadata-quality script and tag version 0.2.0

2019-08-09

Looking at the 128 IITA records (20195TH.xls) that Sisay uploadd to DSpace Test last month: IITA_July_29
- The records are pretty clean because Sisay ran them through the csv-metadata-quality tool
- I fixed one incorrect country (MELBOURNE)
- I normalized all DOIs to be https://doi.org format
- This item is using the wrong Google Books link: https://dspacetest.cgiar.org/handle/10568/102593
- The French abstract here has copy/paste errors: https://dspacetest.cgiar.org/handle/10568/102491
- Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
- $ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
- I asked Bosede to check about twenty-five invalid AGROVOC subjects identified by csv-metadata-quality script
- I still need to check the sponsors and then check for duplicates

2019-08-10

Add checks for uncommon filename extensions and replacements for unneccesary Unicode to the csv-metadata-quality script

2019-08-12

Looking at the 128 IITA records again:
- Validate and normalize affiliations against our 2019-02 list using reconcile-csv and OpenRefine:
- $ lein run ~/src/git/DSpace/2019-02-22-sponsorships.csv name id
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
- I checked the collection for duplicates and found a few:
- https://dspacetest.cgiar.org/handle/10568/102513 is a duplicate of CIAT item: https://cgspace.cgiar.org/handle/10568/44158
- https://dspacetest.cgiar.org/handle/10568/102512 is a duplicate of CIAT item: https://cgspace.cgiar.org/handle/10568/43557

2019-08-13

Create a test user on DSpace Test for Mohammad Salem to attempt depositing:
```
$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
```
Create and merge a pull request (#429) to add eleven new CCAFS Phase II Project Tags to CGSpace
Atmire responded to the Solr cores issue last week, but they could not reproduce the issue
- I told them not to continue, and that we would keep an eye on it and keep troubleshooting it (if neccessary) in the public eye on dspace-tech and Solr mailing lists
Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:
```
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
...
java.lang.OutOfMemoryError: GC overhead limit exceeded
```

I increased the heap size to 1536m and tried again:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com

This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM
(oops, I realize that actually I forgot to delete items I had flagged as duplicates, so the total should be 1,427 items)

2019-08-14

I imported the 1,427 Bioversity records into DSpace Test
- To make sure we didn’t have memory issues I reduced Tomcat’s JVM heap by 512m, increased the import processes’s heap to 512m, and split the input file into two parts with about 700 each
- Then I had to create a few new temporary collections on DSpace Test that had been created on CGSpace after our last sync
- After that the import succeeded:
```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
```
The next step is to check these items for duplicates

2019-08-16

Email Bioversity to let them know that the 1,427 records are on DSpace Test and that Abenet should look over them

2019-08-18

Deploy latest 5_x-prod branch on CGSpace (linode18), including the new CCAFS project tags
Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linde18)

After restarting Tomcat one of the Solr statistics cores failed to start up:

statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher

I decided to run all system updates on the server and reboot it
After reboot the statistics-2018 core failed to load so I restarted tomcat7 again
After this last restart all Solr cores seem to be up and running

2019-08-20

Francesco sent me a new CSV with the raw filenames and paths for the Bioversity migration
- All file paths are relative to the Typo3 upload path of /fileadmin on the Bioversity website
- I create a new column with the derived URL that I can use to download the PDFs with my generate-thumbnails.py script
- Unfortunately now the filename column has paths too, so I have to use a simple Python/Jython script in OpenRefine to get the basename of the files in the filename column:
```
import os

return os.path.basename(value)
```
Then I can try to download all the files again with the script
I also asked Francesco about the strange filenames (.LCK, .zip, and .7z)

2019-08-21

Upload csv-metadata-quality repository to ILRI’s GitHub organization
Fix a few invalid countries in IITA’s July 29 records (aka “20195TH.xls”)
- These were not caught by my csv-metadata-quality check script because of a logic error
- Remove dc.identified.uri fields from test data, set id values to “-1”, add collection mappings according to dc.type, and Upload 126 IITA records to CGSpace

2019-08-22

Transfer original csv-metadata-quality repository to ILRI organization on GitHub

2019-08-23

Run system updates on AReS / OpenRXV dev server (linode20) and reboot it
Fix AReS exports on DSpace Test by adding a new nginx proxy pass

2019-08-26

Peter sent 2,943 corrections to the author dump I had originally sent him on 2019-05-27
- I noticed that one correction had a missing space after the comma, ie “Adamou,A.” so I corrected it
- Also, I should add that as a check to the csv-metadata-quality pipeline
- Apply the corrections to my local dev machine in preparation for the CGSpace:
```
$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
```

Apply the corrections on CGSpace and DSpace Test

After that I started a full Discovery re-indexing on both servers:

$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    81m47.057s 
user    8m5.265s 
sys     2m24.715s

Peter asked me to add related citation aka cg.link.citation to the item view
- I created a pull request with a draft implementation and asked for Peter’s feedback
Add the ability to skip certain fields from the csv-metadata-quality script using --exclude-fields
- For example, when I’m working on the author corrections I want to do the basic checks on the corrected fields, but on the original fields so I would use --exclude-fields dc.contributor.author for example

2019-08-27

File an issue on OpenRXV for the bug when selecting communities
Peter approved the related citation changes so I merged the pull request on GitHub and will deploy it to CGSpace this weekend
Add a safety feature to fix-metadata-values.py that skips correction values that contain the ‘|’ character
Help Francesco from Bioversity with the REST and OAI APIs on CGSpace
- He is contracted by Bioversity to work on the migration from Typo3
- I told him that the OAI interface only exposes Dublin Core fields in its default configuration and that he might want to use OAI to get the latest-changed items, then use REST API to get their metadata
Add a fix for missing space after commas to my csv-metadata-quality script and tag version 0.2.2