CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

January, 2020

2020-01-06

  • Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
  • Last week Altmetric responded about the item that had a lower score than its DOI
    • The score is now linked to the DOI
    • Another item that had the same problem in 2019 is now also linked to the score for its DOI
    • Another item that had the same problem in 2019 has also been fixed

2020-01-07

  • Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
    • The DOI has a score of 259, but the Handle has no score at all
    • I tweeted the CGSpace repository link

2020-01-08

  • Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
  • As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
  • According to this trick the troublesome character is on line 5227:
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
  • According to the blog post linked above the troublesome character is probably the “High Octet Preset” (81), which vim identifies (using ga on the character) as:
<e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
  • If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…
  • Other encodings like windows-1251 and windows-1257 also fail on different characters like “ž” and “é” that are legitimate UTF-8 characters
  • Then there is the issue of Russian, Chinese, and other characters, which are simply not representable in any of those encodings
  • I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
  • Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the 5_x-prod (no CG Core v2) branch
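
One way to double-check the encoding question above is to decode the exact bytes from the xxd dump in Python and see what they turn into. This is only a minimal sketch: the byte string is copied from the dump, and the helper name unencodable is hypothetical, not something from the original workflow.

```python
import unicodedata

# Bytes copied from the xxd dump above: '"Oue', then 0xcc 0x81, then 'dr'
raw = b'"Oue\xcc\x81dr'

# If this raises UnicodeDecodeError the bytes are not valid UTF-8
text = raw.decode("utf-8")

# Print the code point and Unicode name of every decoded character
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")


def unencodable(line, encoding="windows-1252"):
    """Return the characters in line that the target encoding cannot represent."""
    bad = []
    for ch in line:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


print(unencodable(text))
```

The two bytes cc 81 decode to a single code point, U+0301 (decimal 769), which matches what vim reports with ga. So the sequence is valid UTF-8, and it is the combining accent that windows-1252 has no slot for.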

2020-01-14

  • I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
    • I manually ran it on the server as the DSpace user and it said “Moving: 51633080 into core statistics-2019”
    • After a few hours it died with the same error that I had seen in the log from the first run:
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
  • I am not sure how I will fix that shard…
  • I discovered a very interesting tool called ftfy that attempts to fix errors in UTF-8
    • I'm curious to start checking input files with this to see what it highlights
    • I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
    • <e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
    • <é> 233, Hex 00e9, Oct 351, Digr e'
  • Ah hah! We need to be normalizing characters into their canonical forms!
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
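
A minimal sketch of what normalizing to the canonical composed form could look like with the standard library's unicodedata module. The helper name to_nfc is hypothetical, and the sample string is just the fragment visible in the xxd dump from 2020-01-08, not a full author name.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE


def to_nfc(value):
    """Return value in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", value)


# The two spellings look identical but compare unequal until normalized
print(decomposed == composed)
print(to_nfc(decomposed) == composed)

# Fragment from the authors file: two code points collapse into one
fragment = "Oue\u0301dr"
print(len(fragment), len(to_nfc(fragment)))

# After normalization the fragment fits in windows-1252
encoded = to_nfc(fragment).encode("windows-1252")
print(encoded)
```

Note that unicodedata.is_normalized(), used in the session above, needs Python 3.8 or newer, while unicodedata.normalize() is available in older versions too. Normalizing would only help with the accented Latin characters; Cyrillic or Chinese names would still not be representable in windows-1252.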