Add notes for 2021-05-25

2025-01-27 05:49:12 +01:00 · 2021-05-25 21:20:22 +03:00
parent fdf6da4c7f
commit a249e1af51
24 changed files with 272 additions and 33 deletions
--- a/content/posts/2021-05.md
+++ b/content/posts/2021-05.md
@@ -378,4 +378,105 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
  - This included the IWMI changes, so I also migrated the `cg.subject.iwmi` metadata to `dcterms.subject` and deleted the subject term
  - Then I started a full Discovery reindex

+## 2021-05-19
+
+- I realized that I need to lowercase the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase
+  - To my surprise, when I checked, `dcterms.subject` had 47,000 metadata values that are upper or mixed case!
+
+```console
+dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
+UPDATE 47405
+```
+
+- That's interesting because we lowercased them all a few months ago, so these must all be new... wow
+  - We have 405,000 total AGROVOC terms, with 20,600 of them being unique
+  - I will have to start another Discovery re-indexing to pick up these new changes
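As a rough illustration of what the `UPDATE` above selects, here is a minimal Python sketch. The function names are hypothetical and not part of DSpace. Python's `str.isupper()` is Unicode-aware, while PostgreSQL's `[[:upper:]]` class depends on the database locale, so the two may not agree on every non-ASCII character.

```python
def needs_lowercasing(value: str) -> bool:
    """Return True if the value contains at least one uppercase character."""
    return any(ch.isupper() for ch in value)


def lowercase_terms(values):
    """Lowercase only the values that contain uppercase characters.

    Returns the new list and the number of values that were changed,
    similar in spirit to the row count reported by UPDATE.
    """
    changed = 0
    result = []
    for value in values:
        if needs_lowercasing(value):
            result.append(value.lower())
            changed += 1
        else:
            result.append(value)
    return result, changed


# Hypothetical sample terms, not taken from the database
terms, changed = lowercase_terms(["MAIZE", "maize", "Food Security"])
print(terms, changed)
```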
+
+## 2021-05-20
+
+- Export the top 5,000 AGROVOC terms to validate them:
+
+```console
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
+COPY 5000
+$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
+$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
+$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
+```
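The last filtering step can also be sketched in Python. One reason to consider this: `csvgrep -m` matches substrings, so `-m 0` would also keep a count such as `10` if one ever occurred, whereas the sketch below compares the whole value. Only the `number of matches` column name comes from the command above. The `term` column and the sample rows are assumptions for illustration.

```python
import csv
import io


def rejected_terms(csv_text: str, column: str = "number of matches"):
    """Return the rows whose match count is exactly zero."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column].strip() == "0"]


# Hypothetical lookup results; the "term" header is an assumption
sample = "term,number of matches\nmaize,1\nfoo bar,0\nsoil,10\n"

for row in rejected_terms(sample):
    print(row["term"])
```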
+
+- Meeting with Medha and Pythagoras about the FAIR Workflow tool
+  - Discussed the need for such a tool, other tools being developed, etc
+  - I stressed the importance of controlled vocabularies
+  - No real outcome, except to keep us posted and let us know if they need help testing on DSpace
+- Meeting with Hector Tobon to discuss issues with CLARISA
+  - They pushed back a bit, saying they were more focused on the needs of the CG
+  - They are not against the idea of aligning closer to ROR, but lack the manpower
+  - They pointed out that their countries come directly from the [ISO 3166 online browsing platform on the ISO website](https://www.iso.org/iso-3166-country-codes.html)
+  - Indeed the text value for Russia is "Russian Federation (the)" there... I find that strange
+  - I filed [an issue](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) on the iso-codes GitLab repository
+
+## 2021-05-24
+
+- Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
+
+```console
+$ cat 2021-05-24-add-orcids.csv 
+dc.contributor.author,cg.creator.identifier
+"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
+"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
+"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
+"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
+"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
+"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
+"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
+"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
+"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
+"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
+"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+```
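Before feeding a CSV like this to the script, the identifiers could be sanity-checked. The sketch below splits the `Name: 0000-...` values and verifies the ORCID check digit using the ISO 7064 MOD 11-2 scheme that ORCID documents. The helper names are hypothetical, and nothing here is meant to describe what `add-orcid-identifiers-csv.py` does internally.

```python
import re

ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def parse_identifier(value: str):
    """Split a 'Name: 0000-0000-0000-0000' value into (name, orcid)."""
    name, _, orcid = value.rpartition(": ")
    return name, orcid


def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the format and the ISO 7064 MOD 11-2 check digit."""
    if not ORCID_RE.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


name, orcid = parse_identifier("Ekta Patel: 0000-0001-9400-6988")
print(name, orcid, orcid_checksum_ok(orcid))
```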
+
+- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
+
+```console
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+```
+
+- The indexes look OK so I started a harvest on AReS
+
+## 2021-05-25
+
+- The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvest:
+
+```console
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
+yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
+yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
+```
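To compare document counts before and after a harvest, the `_cat/indices` output can be parsed into numbers. This is a minimal sketch that assumes the default column order of that endpoint (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size). The function name is hypothetical.

```python
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_doc_counts(cat_output: str) -> dict:
    """Map each index name to its docs.count from _cat/indices output."""
    counts = {}
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) != len(CAT_INDICES_COLUMNS):
            continue  # skip blank or unexpected lines
        row = dict(zip(CAT_INDICES_COLUMNS, fields))
        counts[row["index"]] = int(row["docs.count"])
    return counts


# The two lines shown in the console output above
output = """\
yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
"""
print(parse_doc_counts(output))
```

Saving these counts before a harvest and comparing them afterwards would make it easier to tell whether the harvest changed anything.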
+
+- Update all docker images on the AReS server (linode20):
+
+```console
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker-compose -f docker/docker-compose.yml down
+$ docker-compose -f docker/docker-compose.yml build
+```
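The `sed` and `cut` pipeline above relies on `:` as a separator, so it may mangle repositories that include a registry port, and it would also emit `<none>:<none>` for dangling images. A hedged alternative is to build the `repository:tag` references from whitespace-separated fields, as in this sketch. The function name and the sample listing are hypothetical.

```python
def image_refs(docker_images_output: str):
    """Build repository:tag references from `docker images` output."""
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "REPOSITORY":
            continue  # skip blank lines and the header row
        repo, tag = fields[0], fields[1]
        if "<none>" in (repo, tag):
            continue  # dangling images cannot be pulled by name
        refs.append(f"{repo}:{tag}")
    return refs


# Hypothetical listing for illustration only
listing = """\
REPOSITORY           TAG      IMAGE ID       CREATED       SIZE
nginx                stable   0123456789ab   2 weeks ago   133MB
localhost:5000/app   latest   ba9876543210   3 days ago    250MB
<none>               <none>   aaaaaaaaaaaa   4 days ago    90MB
"""
print(image_refs(listing))
```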
+
+- Then run all system updates on the server and reboot it
+- Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317... so it was actually correct before!
+- For reference, this is how I re-created everything:
+
+```console
+curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+curl -XPUT 'http://localhost:9200/openrxv-items-final'
+curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
+```
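The REST part of the sequence above can be written down as data, which makes the order of operations explicit and lets the alias body be generated rather than typed by hand. This is only a sketch: `restore_plan` is a hypothetical helper that describes the calls without sending them, and the two `elasticdump` restores (mapping first, then data) would still follow afterwards.

```python
import json


def restore_plan(alias: str, final_index: str, temp_index: str):
    """Return the ordered (method, path, body) steps for re-creating the indexes."""
    alias_body = json.dumps(
        {"actions": [{"add": {"index": final_index, "alias": alias}}]}
    )
    return [
        ("DELETE", f"/{final_index}", None),
        ("DELETE", f"/{temp_index}", None),
        ("PUT", f"/{final_index}", None),
        ("PUT", f"/{temp_index}", None),
        ("POST", "/_aliases", alias_body),
    ]


plan = restore_plan("openrxv-items", "openrxv-items-final", "openrxv-items-temp")
for method, path, body in plan:
    print(method, path, body or "")
```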
+
+- I will just start a new harvest... sigh
+
 <!-- vim: set sw=2 ts=2: -->