---
title: "January, 2020"
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-01-06

- Open [a ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706) with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the [item](https://hdl.handle.net/10568/97087) that had a lower score than its DOI
  - The score is now linked to the DOI
  - Another [item](https://hdl.handle.net/10568/91278) that had the same problem in 2019 is now also linked to the score for its DOI
  - Another [item](https://hdl.handle.net/10568/81236) that had the same problem in 2019 has also been fixed

## 2020-01-07

- Peter Ballantyne highlighted one more WLE [item](https://hdl.handle.net/10568/101286) that is missing the Altmetric score that its DOI has
  - The DOI has a score of 259, but the Handle has no score at all
  - I [tweeted](https://twitter.com/mralanorth/status/1214471427157626881) the CGSpace repository link

## 2020-01-08

- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```

- As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:

```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```

- According to [this trick](https://www.datafix.com.au/BASHing/2018-09-13.html) the troublesome character is on line 5227:

```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
```

- ~~According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81)~~, which vim identifies (using `ga` on the character) as:

```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
```

- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database...
- Other encodings like `windows-1251` and `windows-1257` also fail on different characters like "ž" and "é" that _are_ legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc. characters, which cannot all be represented in any one of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch

## 2020-01-14

- I checked the yearly Solr statistics sharding cron job that should have run in 2020-01 on CGSpace (linode18) and saw that there was an error
  - I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
  - After a few hours it died with the same error that I had seen in the log from the first run:

```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```

- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
  - I'm curious to start checking input files with this to see what it highlights
  - I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-code-point sequences (the decomposed form: a base letter plus a combining accent) to single precomposed characters (é→é), which vim identifies as:
    - `<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401`
    - `<é> 233, Hex 00e9, Oct 351, Digr e'`
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
  - In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html):

```
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
```

## 2020-01-15

- I added support for Unicode normalization to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) tool in [v0.4.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0)
- Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
```

- She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
- I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
```

## 2020-01-16

- Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
```

- Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
  - We had delayed processing them because DSpace Test (linode19) was being used to test the CG Core v2 implementation for the last few months
  - Sisay uploaded the records to DSpace Test as [IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567)
  - I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data
  - I corrected one invalid AGROVOC subject
  - Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
    - `$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id`
    - I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`

## 2020-01-20

- Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
  - I forwarded it to Peter et al for their comment
  - We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development
- Visit CodeObia to discuss the next phase of AReS development

## 2020-01-21

- Create two accounts on CGSpace for CTA users
- Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
  - Merged: [HTML syntax fixes](https://github.com/AgriculturalSemantics/cg-core/pull/16)
  - Merged: [Add LICENSE file](https://github.com/AgriculturalSemantics/cg-core/pull/17)
  - Merged: [Build main.css using npm build](https://github.com/AgriculturalSemantics/cg-core/pull/18)
  - Approved a [wider scope for `cg.peer-reviewed`](https://github.com/AgriculturalSemantics/cg-core/issues/14) (renaming the field and using non-boolean values), but there is more discussion needed
- I opened a new [pull request](https://github.com/AgriculturalSemantics/cg-core/pull/24) on the cg-core repository to validate and fix the formatting of the HTML files
- Create more issues for OpenRXV:
  - Based on Peter's feedback on the [text for labels and tooltips](https://github.com/ilri/OpenRXV/issues/33)
  - Based on Peter's feedback for the [export icon](https://github.com/ilri/OpenRXV/issues/35)
  - Based on Peter's feedback for the [sort options](https://github.com/ilri/OpenRXV/issues/31)
  - Based on Abenet's feedback that [PDF and Word exports are not working](https://github.com/ilri/OpenRXV/issues/34)

## 2020-01-22

- I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:

```
Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
```

- They started [limiting public access to the database in December, 2019 due to GDPR and CCPA](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/)
- This will be a problem in the future (see [DS-4409](https://jira.lyrasis.org/browse/DS-4409))
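
For reference, the Unicode normalization issue from 2020-01-14 can be sketched in a few lines of Python, using the eight bytes from the `xxd` dump on 2020-01-08. This is only a rough illustration and not the code in csv-metadata-quality: the helper names `is_nfc`, `to_nfc`, and `encodes_as` are hypothetical, and the only thing assumed is the standard `unicodedata` module. It suggests why the `windows-1252` conversion failed on the decomposed form but should work once the string is in NFC:

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC (composed) form."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


def encodes_as(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The eight bytes from the xxd dump: a double quote, "Oue", cc 81 (U+0301), "dr"
raw = bytes([0x22, 0x4F, 0x75, 0x65, 0xCC, 0x81, 0x64, 0x72])
decomposed = raw.decode("utf-8")  # decodes cleanly: 'e' followed by U+0301
composed = to_nfc(decomposed)     # 'e' + U+0301 is composed into U+00E9

print(is_nfc(decomposed), encodes_as(decomposed, "windows-1252"))  # False False
print(is_nfc(composed), encodes_as(composed, "windows-1252"))      # True True
```

Normalizing to NFC only helps for characters that have a precomposed form in the target encoding, so Cyrillic or Chinese names would still fail to convert to `windows-1252`.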