diff --git a/content/posts/2021-05.md b/content/posts/2021-05.md index a7a78773c..1c7fe879e 100644 --- a/content/posts/2021-05.md +++ b/content/posts/2021-05.md @@ -378,4 +378,105 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa - This included the IWMI changes, so I also migrated the `cg.subject.iwmi` metadata to `dcterms.subject` and deleted the subject term - Then I started a full Discovery reindex +## 2021-05-19 + +- I realized that I need to lower case the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase + - To my surprise I checked `dcterms.subject` has 47,000 metadata fields that are upper or mixed case! + +```console +dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]'; +UPDATE 47405 +``` + +- That's interesting because we lowercased them all a few months ago, so these must all be new... wow + - We have 405,000 total AGROVOC terms, with 20,600 of them being unique + - I will have to start another Discovery re-indexing to pick up these new changes + +## 2021-05-20 + +- Export the top 5,000 AGROVOC terms to validate them: + +```console +localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER; +COPY 5000 +$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt +$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv +$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv +``` + +- Meeting with Medha and Pythagoras about the FAIR Workflow tool + - Discussed the need for such a tool, other tools being developed, etc + - I stressed the 
importance of controlled vocabularies + - No real outcome, except to keep us posted and let us know if they need help testing on DSpace +- Meeting with Hector Tobon to discuss issues with CLARISA + - They pushed back a bit, saying they were more focused on the needs of the CG + - They are not against the idea of aligning closer to ROR, but lack the manpower + - They pointed out that their countries come directly from the [ISO 3166 online browsing platform on the ISO website](https://www.iso.org/iso-3166-country-codes.html) + - Indeed the text value for Russia is "Russian Federation (the)" there... I find that strange + - I filed [an issue](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) on the iso-codes GitLab repository + +## 2021-05-24 + +- Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them: + +```console +$ cat 2021-05-24-add-orcids.csv +dc.contributor.author,cg.creator.identifier +"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988" +"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417" +"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417" +"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776" +"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636" +"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915" +"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971" +"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186" +"Grace, Delia","Delia Grace: 0000-0002-0195-9489" +"Rao, Idupulapati M.","Idupulapati M. 
Rao: 0000-0002-8381-9358" +"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655" +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu' +``` + +- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump: + +```console +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000 +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping +``` + +- The indexes look OK so I started a harvesting on AReS + +## 2021-05-25 + +- The AReS harvest got messed up somehow, as I see the number of items in the indexes are the same as before the harvesting: + +```console +$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items +yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb +yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb +``` + +- Update all docker images on the AReS server (linode20): + +```console +$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull +$ docker-compose -f docker/docker-compose.yml down +$ docker-compose -f docker/docker-compose.yml build +``` + +- Then run all system updates on the server and reboot it +- Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317... so it was actually correct before! 
+- For reference, this is how I re-created everything: + +```console +curl -XDELETE 'http://localhost:9200/openrxv-items-final' +curl -XDELETE 'http://localhost:9200/openrxv-items-temp' +curl -XPUT 'http://localhost:9200/openrxv-items-final' +curl -XPUT 'http://localhost:9200/openrxv-items-temp' +curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}' +elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping +elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000 +``` + +- I will just start a new harvest... sigh + diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html index 4938b6b09..a7b4d38b1 100644 --- a/docs/2021-02/index.html +++ b/docs/2021-02/index.html @@ -32,7 +32,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty - + @@ -70,9 +70,9 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty "@type": "BlogPosting", "headline": "February, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-02/", - "wordCount": "4170", + "wordCount": "4143", "datePublished": "2021-02-01T10:13:54+02:00", - "dateModified": "2021-03-04T22:46:05+02:00", + "dateModified": "2021-05-18T23:21:39+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -233,7 +233,6 @@ java.lang.NullPointerException
resolve-orcids.py
:resolve-orcids.py
:$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index 09a71eaed..d08da3132 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
-
+
@@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
"@type": "BlogPosting",
"headline": "May, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-05/",
- "wordCount": "2245",
+ "wordCount": "3089",
"datePublished": "2021-05-02T09:50:54+03:00",
- "dateModified": "2021-05-16T17:11:48+03:00",
+ "dateModified": "2021-05-18T23:21:39+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -456,6 +456,145 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_E
6_x-prod
(#468)
resolve-orcids.py:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
+$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
+
$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
+
add-orcid-identifiers-csv.py:
$ cat 2021-05-18-add-orcids.csv
+dc.contributor.author,cg.creator.identifier
+"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
+"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
+"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
+"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
+"Giles, James",James Giles: 0000-0003-1899-9206
+"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
+"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
+"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
+"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
+"Templer, Noel",Noel Templer: 0000-0002-3201-9043
+"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
+"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
+"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
+"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
+"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
+$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
+
6_x-prod
branch on CGSpace, ran all system updates, and rebooted the server
+cg.subject.iwmi metadata to dcterms.subject and deleted the subject term
dcterms.subject has 47,000 metadata fields that are upper or mixed case!
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
+UPDATE 47405
+
localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
+COPY 5000
+$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
+$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
+$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
+
$ cat 2021-05-24-add-orcids.csv
+dc.contributor.author,cg.creator.identifier
+"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
+"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
+"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
+"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
+"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
+"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
+"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
+"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
+"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
+"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
+"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
+yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
+
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker-compose -f docker/docker-compose.yml down
+$ docker-compose -f docker/docker-compose.yml build
+
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+curl -XPUT 'http://localhost:9200/openrxv-items-final'
+curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
+