January, 2020

Mon Jan 06, 2020 by Alan Orth in Notes

2020-01-06

Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
Last week Altmetric responded about the item that had a lower score than its DOI
- The score is now linked to the DOI
- Another item that had the same problem in 2019 is now also linked to the score for its DOI
- Another item that had the same problem in 2019 has also been fixed

2020-01-07

Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
- The DOI has a score of 259, but the Handle has no score at all
- I tweeted the CGSpace repository link

2020-01-08

Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790

As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:

$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779

According to this trick the troublesome character is on line 5227:

$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r

~~According to the blog post linked above the troublesome character is probably the “High Octet Preset” (81)~~, which vim identifies (using ga on the character) as:

<e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401

If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…
Other encodings like windows-1251 and windows-1257 also fail on different characters like “ž” and “é” that are legitimate UTF-8 characters
Then there is the issue of Russian, Chinese, and other characters, which are simply not representable in any of those encodings
I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the 5_x-prod (no CG Core v2) branch
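
The iconv failure earlier in this entry can be reproduced in Python. This is a minimal sketch under stated assumptions, not the procedure used above: `unencodable_chars` is a hypothetical helper, and the sample string is rebuilt from the bytes in the xxd output (an `e` followed by `cc 81`, which is the UTF-8 encoding of U+0301).

```python
import unicodedata


def unencodable_chars(text, encoding="windows-1252"):
    """Return (offset, char, Unicode name) for every character that
    cannot be represented in the target encoding."""
    problems = []
    for offset, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((offset, char, unicodedata.name(char, "UNKNOWN")))
    return problems


# Rebuilt from the xxd dump: O, u, e, then the bytes cc 81, then d, r
sample = b"Oue\xcc\x81dr".decode("utf-8")

for offset, char, name in unencodable_chars(sample):
    print(f"offset {offset}: U+{ord(char):04X} {name}")
```

The bytes decode cleanly as UTF-8, so the input is valid; the character that windows-1252 cannot represent is the combining accent that follows the plain `e`.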

2020-01-14

I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said “Moving: 51633080 into core statistics-2019”
- After a few hours it died with the same error that I had seen in the log from the first run:

Exception: Read timed out
java.net.SocketTimeoutException: Read timed out

I am not sure how I will fix that shard…
I discovered a very interesting tool called ftfy that attempts to fix errors in UTF-8
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
- <e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
- <é> 233, Hex 00e9, Oct 351, Digr e'
Ah hah! We need to be normalizing characters into their canonical forms!
- In Python 3.8 we can even check if the string is normalized using the unicodedata library:

In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
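
The two inputs above look identical on screen, so here is a small sketch of the same check with the code points spelled out as escapes. The variable names and the `to_nfc` helper are only illustrative; `unicodedata.is_normalized` needs Python 3.8 or newer.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (two code points)
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (one code point)
precomposed = "\u00e9"


def to_nfc(value):
    """Return the NFC (canonical composition) form of a string."""
    return unicodedata.normalize("NFC", value)


print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True
print(to_nfc(decomposed) == precomposed)              # True
print(len(decomposed), len(to_nfc(decomposed)))       # 2 1
```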

2020-01-15

I added support for Unicode normalization to my csv-metadata-quality tool in v0.4.0
Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:

dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325

She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my fix-metadata-values.py script:

$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d

2020-01-16

Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:

dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35

Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
- We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months
- Sisay uploaded the records to DSpace Test as IITA_201907_Jan13
- I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data
- I corrected one invalid AGROVOC subject
- Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
  - $ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id
  - I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
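
For reference, the logic of that GREL expression can be sketched in Python. This only mirrors the expression and is not OpenRefine's API: the dictionary layout and the example affiliation strings are assumptions made up for illustration.

```python
def reconciled_value(cell):
    """Mirror of the GREL if(cell.recon.matched, cell.recon.match.name, value):
    use the matched name when reconciliation succeeded, else keep the value."""
    recon = cell.get("recon") or {}
    if recon.get("matched"):
        return recon["match"]["name"]
    return cell["value"]


# Hypothetical cells: one matched against the reference list, one not
matched = {
    "value": "Intl. Livestock Research Inst.",
    "recon": {"matched": True, "match": {"name": "International Livestock Research Institute"}},
}
unmatched = {"value": "Some Unknown Institute", "recon": {"matched": False}}

print(reconciled_value(matched))
print(reconciled_value(unmatched))
```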

2020-01-20

Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
- I forwarded it to Peter et al for their comment
- We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development
Visit CodeObia to discuss the next phase of AReS development

2020-01-21

Create two accounts on CGSpace for CTA users
Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
- Merged: HTML syntax fixes
- Merged: Add LICENSE file
- Merged: Build main.css using npm build
- Approved a wider scope for cg.peer-reviewed (renaming the field and using non-boolean values), but there is more discussion needed
I opened a new pull request on the cg-core repository to validate and fix the formatting of the HTML files
Create more issues for OpenRXV:
- Based on Peter's feedback on the text for labels and tooltips
- Based on Peter's feedback for the export icon
- Based on Peter's feedback for the sort options
- Based on Abenet's feedback that PDF and Word exports are not working

2020-01-22

I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:

Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.

They started limiting public access to the database in December, 2019 due to GDPR and CCPA
- This will be a problem in the future (see DS-4409)
Peter sent me his corrections for the list of authors that I had sent him earlier in the month
- There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8
- I will apply them on CGSpace and DSpace Test using my fix-metadata-values.py script:

$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d

Then I decided to export them again (with two author columns) so I can run the new Unicode normalization mode I added to csv-metadata-quality:

dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
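
The idea behind this step can be sketched with the Python standard library. This is not how csv-metadata-quality is implemented; `nfc_corrections` is a hypothetical helper and the inline CSV is made-up sample data. It returns only the values whose NFC form differs from what is stored, which are the rows that would actually need correcting.

```python
import csv
import io
import unicodedata


def nfc_corrections(csv_text, field):
    """Return (original, normalized) pairs for values in `field`
    that are not already in NFC form."""
    corrections = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        original = row[field]
        normalized = unicodedata.normalize("NFC", original)
        if normalized != original:
            corrections.append((original, normalized))
    return corrections


# Made-up sample: the first value contains e + U+0301, the second is already NFC
sample_csv = (
    "dc.contributor.author,count\n"
    '"Oue\u0301dr",12\n'
    '"Ou\u00e9dr",3\n'
)

for original, normalized in nfc_corrections(sample_csv, "dc.contributor.author"):
    print(repr(original), "->", repr(normalized))
```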

Peter asked me to send him a list of affiliations to correct
- First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:

dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n

I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:

$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

Then I generated a new list for Peter:

dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162

Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author “Hung, Nguyen”
- I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:

$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
  46 hung-nguyen-ares-handles.txt
  56 hung-nguyen-atmire-handles.txt
 102 total

Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
- I am curious to check tomorrow to see if they are there
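
The comparison of the two Handle lists can also be done as a set difference. A minimal sketch, assuming the exports have already been read into strings: the helper names and the sample Handles are made up, and the regular expressions are the same ones used in the sed and grep commands above.

```python
import re

HANDLE_RE = re.compile(r"10568/[0-9]+")


def extract_handles(text):
    """Return the set of 10568/NNN Handles in text, first restoring the
    slash where the export wrote "10568 NNN" (same fix as the sed step)."""
    fixed = re.sub(r"10568 ([0-9]+)", r"10568/\1", text)
    return set(HANDLE_RE.findall(fixed))


def missing_from(reference, other):
    """Handles present in `reference` but absent from `other`, sorted."""
    return sorted(reference - other)


# Made-up sample data standing in for the AReS and L&R exports
ares_text = "10568 100\n10568 101\n"
atmire_text = "http://hdl.handle.net/10568/100\n10568/101\n10568/102\n"

ares = extract_handles(ares_text)
atmire = extract_handles(atmire_text)
print(missing_from(atmire, ares))
```

Running this against the real files would list the Handles that L&R returns and AReS does not, which makes it easy to re-check them after the next indexing run.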