Add notes for 2021-05-25

2021-05-25 21:20:22 +03:00
parent fdf6da4c7f
commit a249e1af51
24 changed files with 272 additions and 33 deletions

@@ -378,4 +378,105 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
- This included the IWMI changes, so I also migrated the `cg.subject.iwmi` metadata to `dcterms.subject` and deleted the subject term
- Then I started a full Discovery reindex
## 2021-05-19
- I realized that I need to lower case the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase
- To my surprise, when I checked, `dcterms.subject` had 47,000 metadata values that were upper or mixed case!
```console
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 47405
```
- That's interesting because we lowercased them all a few months ago, so these must all be new... wow
- We have 405,000 total AGROVOC terms, with 20,600 of them being unique
- I will have to start another Discovery re-indexing to pick up these new changes
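- A minimal Python sketch of the same selection logic, for dry-running the change on an exported list of terms before touching the database (the function name is hypothetical and not part of the ILRI scripts):
```python
def plan_lowercase_updates(terms):
    """Return (old, new) pairs only for terms containing an upper-case letter."""
    return [
        (term, term.lower())
        for term in terms
        # Rough analogue of `text_value ~ '[[:upper:]]'`; Python's isupper() is
        # Unicode-aware, which may differ from PostgreSQL depending on the locale.
        if any(ch.isupper() for ch in term)
    ]
```
- Terms that are already lower case are left out of the result, just as the `WHERE` clause skips them in the `UPDATE`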
## 2021-05-20
- Export the top 5,000 AGROVOC terms to validate them:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT uuid FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
```
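- A rough Python equivalent of the final `csvgrep` step, assuming only that the results CSV has a `number of matches` column as the command implies (the function name and the `term` column in the usage note are hypothetical). It compares the count exactly to `0`, whereas `csvgrep -m`, as far as I know, does substring matching and would also keep counts like `10`:
```python
import csv


def rejected_rows(lines, column="number of matches"):
    """Return the rows whose match count is exactly 0.

    `lines` can be an open text file or any iterable of CSV lines.
    """
    return [row for row in csv.DictReader(lines) if row[column].strip() == "0"]
```
- Usage would be along the lines of `with open("/tmp/2021-05-20-agrovoc-results.csv", newline="") as f: rows = rejected_rows(f)`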
- Meeting with Medha and Pythagoras about the FAIR Workflow tool
- Discussed the need for such a tool, other tools being developed, etc
- I stressed the importance of controlled vocabularies
- No real outcome, except that they will keep us posted and let us know if they need help testing on DSpace
- Meeting with Hector Tobon to discuss issues with CLARISA
- They pushed back a bit, saying they were more focused on the needs of the CG
- They are not against the idea of aligning more closely with ROR, but lack the manpower
- They pointed out that their countries come directly from the [ISO 3166 online browsing platform on the ISO website](https://www.iso.org/iso-3166-country-codes.html)
- Indeed the text value for Russia is "Russian Federation (the)" there... I find that strange
- I filed [an issue](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) on the iso-codes GitLab repository
## 2021-05-24
- Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
```console
$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
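- A minimal pre-flight check for the `cg.creator.identifier` values in a CSV like the one above, sketched in Python. It is not what `add-orcid-identifiers-csv.py` does (I have not assumed anything about that script's internals); the function names are hypothetical, and the checksum is the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers:
```python
import re

# "Display Name: 0000-0000-0000-0000", the format used in the CSV above.
IDENTIFIER_RE = re.compile(r"^(?P<name>.+): (?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def orcid_check_digit(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def parse_identifier(value):
    """Return (name, orcid), or raise ValueError on a bad format or checksum."""
    match = IDENTIFIER_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected format: {value!r}")
    orcid = match.group("orcid")
    digits = orcid.replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        raise ValueError(f"bad ORCID checksum: {orcid}")
    return match.group("name"), orcid
```
- Running every row through `parse_identifier()` before the import would catch a mistyped digit, but not a valid ORCID assigned to the wrong author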
- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
```console
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
```
- The indexes look OK so I started a harvest on AReS
## 2021-05-25
- The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvest:
```console
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
```
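- To compare these numbers before and after a harvest without reading them by eye, a small Python sketch that pulls `docs.count` out of the default `_cat/indices` output (the function name is hypothetical; it assumes the default column order with no header row, as in the output above):
```python
def parse_cat_indices(text):
    """Map index name -> docs.count from default `_cat/indices` output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or unexpected lines
        # Default columns: health status index uuid pri rep docs.count docs.deleted ...
        counts[fields[2]] = int(fields[6])
    return counts
```
- Saving the result before starting a harvest and diffing it afterwards would show whether the counts really are unchanged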
- Update all docker images on the AReS server (linode20):
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
```
- Then run all system updates on the server and reboot it
- Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317... so it was actually correct before!
- For reference, this is how I re-created everything:
```console
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
```
- I will just start a new harvest... sigh
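- One way to avoid this next time would be to count the documents in the backup and compare that with the live index before deleting anything. A minimal Python sketch, assuming the `--type=data` dump from elasticdump is line-delimited JSON with one document per line (the function name is hypothetical):
```python
import json


def count_dump_documents(lines):
    """Count JSON documents in a line-delimited dump, ignoring blank lines.

    `lines` can be an open text file or any iterable of strings.
    """
    count = 0
    for line in lines:
        if line.strip():
            json.loads(line)  # raises ValueError if a line is corrupt
            count += 1
    return count
```
- For example, `with open("/home/aorth/openrxv-items_data.json") as f: print(count_dump_documents(f))`, then compare the result with the `docs.count` reported by `_cat/indices`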
<!-- vim: set sw=2 ts=2: -->