mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-10-01 14:44:17 +02:00
---
title: "January, 2020"
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-01-06

- Open [a ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706) with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the [item](https://hdl.handle.net/10568/97087) that had a lower score than its DOI
- The score is now linked to the DOI
- Another [item](https://hdl.handle.net/10568/91278) that had the same problem in 2019 is now also linked to the score for its DOI
- Another [item](https://hdl.handle.net/10568/81236) that had the same problem in 2019 has also been fixed

## 2020-01-07

- Peter Ballantyne highlighted one more WLE [item](https://hdl.handle.net/10568/101286) that is missing the Altmetric score that its DOI has
- The DOI has a score of 259, but the Handle has no score at all
- I [tweeted](https://twitter.com/mralanorth/status/1214471427157626881) the CGSpace repository link

<!--more-->

## 2020-01-08

- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```

- As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:

```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```

- According to [this trick](https://www.datafix.com.au/BASHing/2018-09-13.html) the troublesome character is on line 5227:

```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 "
00000001: 4f O
00000002: 75 u
00000003: 65 e
00000004: cc .
00000005: 81 .
00000006: 64 d
00000007: 72 r
```

- ~~According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81)~~, which vim identifies (using `ga` on the character) as:

```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
```

- If I understand the situation correctly, this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database...
- Other encodings like `windows-1251` and `windows-1257` also fail on different characters like "ž" and "é" that _are_ legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch

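The `iconv` failures above can also be located from Python. This is only a minimal sketch, not what was run at the time: `find_unencodable` is a hypothetical helper and the sample rows are made up. It reports every line and character that a target codec cannot represent, instead of stopping at the first one.

```python
import unicodedata


def find_unencodable(lines, encoding="windows-1252"):
    """Yield (line_number, character, Unicode name) for every character
    in `lines` that cannot be encoded in `encoding`."""
    for line_number, line in enumerate(lines, start=1):
        for char in line:
            try:
                char.encode(encoding)
            except UnicodeEncodeError:
                yield line_number, char, unicodedata.name(char, "UNKNOWN")


# Hypothetical sample rows (not real CGSpace data); the second row uses
# "e" followed by U+0301 rather than a single precomposed character
sample = ['"Example, Alan",10', '"Example, Jose\u0301",3']

for line_number, char, name in find_unencodable(sample):
    print(f"{line_number}: U+{ord(char):04X} {name}")
```

Passing an open file, for example `open('/tmp/2020-01-08-authors.csv', encoding='utf-8')`, in place of `sample` would scan the exported authors list the same way.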
## 2020-01-14

- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
- After a few hours it died with the same error that I had seen in the log from the first run:

```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```

- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
- `<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401`
- `<é> 233, Hex 00e9, Oct 351, Digr e'`
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
- In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html):

```
In [7]: unicodedata.is_normalized('NFC', 'é')  # two code points: e + U+0301
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')  # one code point: U+00E9
Out[8]: True
```

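A minimal sketch of what normalization does to the sequence from the authors file, using only the standard library. The bytes `65 cc 81` are taken from the `xxd` output in the 2020-01-08 entry. The variable names are illustrative.

```python
import unicodedata

# "e" followed by the bytes cc 81 from the xxd dump; they decode cleanly as
# UTF-8 to "e" + U+0301 (combining acute accent), the decomposed form of "é"
decomposed = bytes.fromhex("65cc81").decode("utf-8")
print([unicodedata.name(c) for c in decomposed])

# NFC composes the pair into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# The composed form exists in windows-1252, the decomposed form does not
print(composed.encode("windows-1252"))  # b'\xe9'
try:
    decomposed.encode("windows-1252")
except UnicodeEncodeError as error:
    print("cannot encode:", error.reason)
```

This suggests that normalizing to NFC before running `iconv` would have avoided the error on line 5227, although characters with no equivalent in the target encoding would still fail.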
## 2020-01-15

- I added support for Unicode normalization to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) tool in [v0.4.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0)
- Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
```

- She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
- I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
```

<!-- vim: set sw=2 ts=2: -->