From a249e1af51d62a677f34ea73df7aa454f7444bf1 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 25 May 2021 21:20:22 +0300
Subject: [PATCH] Add notes for 2021-05-25

---
 content/posts/2021-05.md | 101 +++++++++++++++++
 docs/2021-02/index.html | 7 +-
 docs/2021-05/index.html | 145 +++++++++++++++++++++++-
 docs/categories/index.html | 2 +-
 docs/categories/notes/index.html | 2 +-
 docs/categories/notes/page/2/index.html | 2 +-
 docs/categories/notes/page/3/index.html | 2 +-
 docs/categories/notes/page/4/index.html | 2 +-
 docs/categories/notes/page/5/index.html | 2 +-
 docs/index.html | 2 +-
 docs/page/2/index.html | 2 +-
 docs/page/3/index.html | 2 +-
 docs/page/4/index.html | 2 +-
 docs/page/5/index.html | 2 +-
 docs/page/6/index.html | 2 +-
 docs/page/7/index.html | 2 +-
 docs/posts/index.html | 2 +-
 docs/posts/page/2/index.html | 2 +-
 docs/posts/page/3/index.html | 2 +-
 docs/posts/page/4/index.html | 2 +-
 docs/posts/page/5/index.html | 2 +-
 docs/posts/page/6/index.html | 2 +-
 docs/posts/page/7/index.html | 2 +-
 docs/sitemap.xml | 12 +-
 24 files changed, 272 insertions(+), 33 deletions(-)

diff --git a/content/posts/2021-05.md b/content/posts/2021-05.md
index a7a78773c..1c7fe879e 100644
--- a/content/posts/2021-05.md
+++ b/content/posts/2021-05.md
@@ -378,4 +378,105 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
 - This included the IWMI changes, so I also migrated the `cg.subject.iwmi` metadata to `dcterms.subject` and deleted the subject term
 - Then I started a full Discovery reindex
+## 2021-05-19
+
+- I realized that I need to lowercase the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase
+  - To my surprise, when I checked, `dcterms.subject` had 47,000 metadata values that were upper or mixed case!
+
+```console
+dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
+UPDATE 47405
+```
+
+- That's interesting because we lowercased them all a few months ago, so these must all be new... wow
+  - We have 405,000 total AGROVOC terms, with 20,600 of them being unique
+  - I will have to start another Discovery re-indexing to pick up these new changes
+
+## 2021-05-20
+
+- Export the top 5,000 AGROVOC terms to validate them:
+
+```console
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
+COPY 5000
+$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
+$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
+$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
+```
+
+- Meeting with Medha and Pythagoras about the FAIR Workflow tool
+  - Discussed the need for such a tool, other tools being developed, etc.
+  - I stressed the importance of controlled vocabularies
+  - No real outcome, except to keep us posted and let us know if they need help testing on DSpace
+- Meeting with Hector Tobon to discuss issues with CLARISA
+  - They pushed back a bit, saying they were more focused on the needs of the CG
+  - They are not against the idea of aligning closer to ROR, but lack the manpower
+  - They pointed out that their countries come directly from the [ISO 3166 online browsing platform on the ISO website](https://www.iso.org/iso-3166-country-codes.html)
+  - Indeed the text value for Russia is "Russian Federation (the)" there... I find that strange
+  - I filed [an issue](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) on the iso-codes GitLab repository
+
+## 2021-05-24
+
+- Add ORCID identifiers for missing ILRI authors and tag 550 others, based on a few authors I noticed were missing them:
+
+```console
+$ cat 2021-05-24-add-orcids.csv
+dc.contributor.author,cg.creator.identifier
+"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
+"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
+"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
+"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
+"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
+"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
+"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
+"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
+"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
+"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
+"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+```
+
+- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
+
+```console
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+```
+
+- The indexes look OK so I started a harvest on AReS
+
+## 2021-05-25
+
+- The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvesting:
+
+```console
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
+yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
+```
+
+- Update all docker images on the AReS server (linode20):
+
+```console
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker-compose -f docker/docker-compose.yml down
+$ docker-compose -f docker/docker-compose.yml build
+```
+
+- Then run all system updates on the server and reboot it
+- Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317... so it was actually correct before!
+- For reference, this is how I re-created everything:
+
+```console
+curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+curl -XPUT 'http://localhost:9200/openrxv-items-final'
+curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
+```
+
+- I will just start a new harvest... sigh
+
diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html
index 4938b6b09..a7b4d38b1 100644
--- a/docs/2021-02/index.html
+++ b/docs/2021-02/index.html
@@ -32,7 +32,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
-
+
@@ -70,9 +70,9 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
   "@type": "BlogPosting",
   "headline": "February, 2021",
   "url": "https://alanorth.github.io/cgspace-notes/2021-02/",
-  "wordCount": "4170",
+  "wordCount": "4143",
   "datePublished": "2021-02-01T10:13:54+02:00",
-  "dateModified": "2021-03-04T22:46:05+02:00",
+  "dateModified": "2021-05-18T23:21:39+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -233,7 +233,6 @@ java.lang.NullPointerException
  • Communicate more with CodeObia about some fixes for OpenRXV
  • Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD
  • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
  • -
  • Then for the rest, I saved them to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
  • $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
     $ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
    diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
    index 09a71eaed..d08da3132 100644
    --- a/docs/2021-05/index.html
    +++ b/docs/2021-05/index.html
    @@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
     
     
     
    -
    +
     
     
     
    @@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
       "@type": "BlogPosting",
       "headline": "May, 2021",
       "url": "https://alanorth.github.io/cgspace-notes/2021-05/",
    -  "wordCount": "2245",
    +  "wordCount": "3089",
       "datePublished": "2021-05-02T09:50:54+03:00",
    -  "dateModified": "2021-05-16T17:11:48+03:00",
    +  "dateModified": "2021-05-18T23:21:39+03:00",
       "author": {
         "@type": "Person",
         "name": "Alan Orth"
    @@ -456,6 +456,145 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_E
     
    +

    2021-05-18

    + +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
    +$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
    +
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
    +
    +
    $ cat 2021-05-18-add-orcids.csv 
    +dc.contributor.author,cg.creator.identifier
    +"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
    +"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
    +"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
    +"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
    +"Giles, James",James Giles: 0000-0003-1899-9206
    +"Simbare,  Alice",Alice Simbare: 0000-0003-2389-0969
    +"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
    +"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
    +"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
    +"Templer, Noel",Noel Templer: 0000-0002-3201-9043
    +"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
    +"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
    +"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
    +"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
    +"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
    +$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
    +
    +

    2021-05-19

    + +
    dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
    +UPDATE 47405
    +
    +

    2021-05-20

    + +
    localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
    +COPY 5000
    +$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
    +$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
    +$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
    +
    +

    2021-05-24

    + +
    $ cat 2021-05-24-add-orcids.csv 
    +dc.contributor.author,cg.creator.identifier
    +"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
    +"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
    +"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
    +"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
    +"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
    +"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
    +"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
    +"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
    +"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
    +"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
    +"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
    +$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    +
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
    +$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
    +
    +

    2021-05-25

    + +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
    +yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
    +yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
    +
    +
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
    +$ docker-compose -f docker/docker-compose.yml down
    +$ docker-compose -f docker/docker-compose.yml build
    +
    +
    curl -XDELETE 'http://localhost:9200/openrxv-items-final'
    +curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +curl -XPUT 'http://localhost:9200/openrxv-items-final'
    +curl -XPUT 'http://localhost:9200/openrxv-items-temp'
    +curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
    +elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
    +elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
    +
diff --git a/docs/categories/index.html b/docs/categories/index.html
index bc7e899b7..92aa38b40 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 7b622af27..6bb61bd88 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index cd0c293bc..696e5b82d 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 46a3bfd28..c2231d9fc 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 0bd718ea7..9efd351f6 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 2d65cf602..3187c688c 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 1340ade55..9b72a6bee 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 569d7663d..790d53060 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 50b2a51cf..a6003639a 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 5172c4f9f..d215e3600 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index eac31d81c..dba4f2d63 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 3d74b4145..3fc8dff89 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index ce14bf91b..ab201657b 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index cc0d91e55..d1efef46d 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 1243093ca..5b95b072d 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 7a6b52229..12d95a4a1 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index d84c9a9cd..10f1f6a39 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 4d40d3b58..13606e57b 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 5a933e170..91f57ae8c 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 4ad720a4e..e539c9cfe 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d0deafd79..da51040fb 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
-2021-05-16T17:11:48+03:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/
-2021-05-16T17:11:48+03:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-05/
-2021-05-16T17:11:48+03:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2021-05-16T17:11:48+03:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2021-05-16T17:11:48+03:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-04/
 2021-04-28T18:57:48+03:00
@@ -33,7 +33,7 @@
 2021-03-30T09:56:38+03:00
 https://alanorth.github.io/cgspace-notes/2021-02/
-2021-03-04T22:46:05+02:00
+2021-05-18T23:21:39+03:00
 https://alanorth.github.io/cgspace-notes/2021-01/
 2021-01-31T16:32:16+02:00