2020-01-06
- Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the item that had a lower score than its DOI
- The score is now linked to the DOI
- Another item that had the same problem in 2019 has now also linked to the score for its DOI
2020-01-07
- Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
- The DOI has a score of 259, but the Handle has no score at all
- I tweeted the CGSpace repository link
2020-01-08
- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```
- As I always have encoding issues with files that Peter edits, I tried to convert the export to a Windows encoding first, but got an error:
```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```
- According to this trick, the troublesome character is on line 5227 (iconv writes output up to the point of failure, so the last line of the partial output file shows where it stopped):
```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
```
- According to the blog post linked above, the troublesome byte is probably the “High Octet Preset” (0x81), which vim identifies (using ga on the character) as:

```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
```
- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…
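As a sanity check (a sketch, not anything I ran on the server): the bytes 0xCC 0x81 from the xxd dump actually do decode as UTF-8 — they are U+0301, a combining acute accent — but windows-1252 simply has no byte for that character, which is why iconv gave up at that position:

```python
# The two bytes from the xxd dump above: 0xCC 0x81.
raw = b"\xcc\x81"

# They decode fine as UTF-8: U+0301, a combining acute accent...
char = raw.decode("utf-8")
print(hex(ord(char)))  # 0x301

# ...but there is no equivalent byte in windows-1252, which is
# why iconv stopped at this position.
try:
    char.encode("windows-1252")
except UnicodeEncodeError as e:
    print("not representable:", e.reason)
```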
- Other encodings like windows-1251 and windows-1257 also fail, on different characters like “ž” and “é” that are legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
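A sketch of how such troublesome characters could be located programmatically — scanning a UTF-8 file for the first characters that a target Windows codepage cannot represent (the path and codepage are illustrative):

```python
# Sketch: report characters in a UTF-8 file that a target legacy
# encoding (e.g. windows-1252) cannot represent.
def find_unencodable(path: str, target: str = "windows-1252") -> list:
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for ch in line:
                try:
                    ch.encode(target)
                except UnicodeEncodeError:
                    problems.append((lineno, ch, f"U+{ord(ch):04X}"))
    return problems

# Usage (path is illustrative):
# for lineno, ch, codepoint in find_unencodable("/tmp/2020-01-08-authors.csv"):
#     print(f"line {lineno}: {ch!r} ({codepoint})")
```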
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, using the 5_x-prod (no CG Core v2) branch
2020-01-14
- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said “Moving: 51633080 into core statistics-2019”
- After a few hours it died with the same error that I had seen in the log from the first run:
```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```
- I am not sure how I will fix that shard…
- I discovered a very interesting tool called ftfy that attempts to fix mojibake and other Unicode errors in text
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted accented characters from decomposed sequences (a base letter followed by a combining accent) into precomposed single code points (é→é), which vim identifies as:

```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
<é> 233, Hex 00e9, Oct 351, Digr e'
```
- Ah hah! We need to be normalizing characters into their canonical forms!
```
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
```
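The fix is straightforward with the standard library's unicodedata.normalize (is_normalized needs Python 3.8+) — a minimal sketch:

```python
import unicodedata

# Decomposed form: 'e' followed by U+0301 (combining acute accent)
decomposed = "e\u0301"
# Precomposed form: the single code point U+00E9
composed = "\u00e9"

# They render identically but compare unequal...
print(decomposed == composed)  # False

# ...until both are normalized to NFC (canonical composition)
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```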
2020-01-15
- I added support for Unicode normalization to my csv-metadata-quality tool in v0.4.0
- Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
```
- She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
- I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my fix-metadata-values.py script:

```
$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
```
2020-01-16
- Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
```
- Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
- We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months
- Sisay uploaded the records to DSpace Test as IITA_201907_Jan13
- I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data
- I corrected one invalid AGROVOC subject
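The kinds of checks involved can be sketched in a few lines of Python (the field name and the “||” multi-value separator convention are illustrative; this is not the actual csv-metadata-quality code):

```python
# Sketch of three basic sanity checks: leading/trailing whitespace,
# invalid multi-value separators ("|" instead of "||"), and
# duplicate rows. The field name is illustrative.
def sanity_check(rows, field="dc.subject"):
    issues = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        value = row[field]
        if value != value.strip():
            issues.append((i, "extra whitespace", value))
        # any "|" left after removing valid "||" separators is invalid
        if "|" in value.replace("||", ""):
            issues.append((i, "invalid separator", value))
        key = tuple(row.values())
        if key in seen:
            issues.append((i, "duplicate row", value))
        seen.add(key)
    return issues
```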
- Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
```
$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id
```
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL:

```
if(cell.recon.matched, cell.recon.match.name, value)
```
2020-01-20
- Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
- I forwarded it to Peter et al for their comment
- We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development
- Visit CodeObia to discuss the next phase of AReS development
2020-01-21
- Create two accounts on CGSpace for CTA users
- Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
- I opened a new pull request on the cg-core repository to validate and fix the formatting of the HTML files
- Create more issues for OpenRXV:
2020-01-22
- I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:
```
Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
```
- They started limiting public access to the database in December, 2019 due to GDPR and CCPA
- This will be a problem in the future (see DS-4409)
- Peter sent me his corrections for the list of authors that I had sent him earlier in the month
- There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8
- I will apply them on CGSpace and DSpace Test using my fix-metadata-values.py script:

```
$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
```
- Then I decided to export them again (with two author columns) so I could run the new Unicode normalization mode I added to csv-metadata-quality:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
```
- Peter asked me to send him a list of affiliations to correct
- First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
```
- I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:
```
$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
```
- Then I generated a new list for Peter:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
```
- Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author “Hung, Nguyen”
- I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:
```
$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
  46 hung-nguyen-ares-handles.txt
  56 hung-nguyen-atmire-handles.txt
 102 total
```
- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
- I am curious to check tomorrow to see if they are there
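The list comparison itself can be sketched as a set difference in Python (file names as in the commands above; the helper name is mine):

```python
# Sketch: find Handles present in the Atmire Listing & Reports
# export but missing from the AReS export. Each input file has
# one Handle per line.
def missing_handles(ares_path: str, atmire_path: str) -> set:
    with open(ares_path) as f:
        ares = {line.strip() for line in f if line.strip()}
    with open(atmire_path) as f:
        atmire = {line.strip() for line in f if line.strip()}
    return atmire - ares

# Usage:
# for handle in sorted(missing_handles("hung-nguyen-ares-handles.txt",
#                                      "hung-nguyen-atmire-handles.txt")):
#     print(handle)
```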
2020-01-23
- I checked AReS and I see that there are now 55 items for author “Hung Nguyen-Viet”
- Linode sent an alert that the outbound traffic rate of CGSpace (linode18) was high for several hours this morning around 5AM UTC+1
- I checked the nginx logs this morning for the few hours before and after that using goaccess:
```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
```
- The top two hosts according to the amount of data transferred are:
- 2a01:7e00::f03c:91ff:fe9a:3a37
- 2a01:7e00::f03c:91ff:fe18:7396
- Both are on Linode, and appear to be the new and old ilri.org servers
- I will ask the web team
- Judging from the ILRI publications site it seems they are downloading the PDFs so they can generate higher-quality thumbnails:
- They are apparently using this Drupal module to generate the thumbnails: sites/all/modules/contrib/pdf_to_imagefield
- I see some excellent suggestions in this ImageMagick thread from 2012 that led me to some nice thumbnails (the default PDF density is 72, so supersample to 4X and then resize back to 25%), as well as this blog post:
```
$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
```
- Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using -flatten like DSpace already does
- I wonder if I could hack this into DSpace code to get better thumbnails…
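For reference, the supersample-then-downscale trick could be scripted for a batch of PDFs roughly like this (a sketch that only builds the convert command line; the helper name is mine, and it assumes ImageMagick is installed when actually run):

```python
# Sketch: build the ImageMagick command for the supersample-then-
# downscale thumbnail trick (4x the default 72 DPI, then resize to
# 25%, i.e. back to the original pixel dimensions).
def build_convert_args(pdf: str, jpg: str, supersample: int = 4) -> list:
    density = 72 * supersample            # 288 for the default 4x
    scale = f"{100 // supersample}%"      # 25% to undo the supersampling
    return [
        "convert",
        "-density", str(density),
        "-filter", "lagrange",
        "-thumbnail", scale,
        "-background", "white",
        "-alpha", "remove",
        "-sampling-factor", "1:1",
        "-colorspace", "sRGB",
        f"{pdf}[0]",                      # first page only
        jpg,
    ]

# Usage:
# import subprocess
# subprocess.run(build_convert_args("10568-97925.pdf", "10568-97925.jpg"), check=True)
```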