mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-05-25
@@ -20,7 +20,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-05/" />
<meta property="article:published_time" content="2021-05-02T09:50:54+03:00" />
<meta property="article:modified_time" content="2021-05-16T17:11:48+03:00" />
<meta property="article:modified_time" content="2021-05-18T23:21:39+03:00" />
@@ -46,9 +46,9 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
"@type": "BlogPosting",
"headline": "May, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-05/",
"wordCount": "2245",
"wordCount": "3089",
"datePublished": "2021-05-02T09:50:54+03:00",
"dateModified": "2021-05-16T17:11:48+03:00",
"dateModified": "2021-05-18T23:21:39+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -456,6 +456,145 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_E
</code></pre><ul>
<li>After testing whether this escaped value worked during submission, I created and merged a pull request to <code>6_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/468">#468</a>)</li>
</ul>
<h2 id="2021-05-18">2021-05-18</h2>
<ul>
<li>Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
</code></pre><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</code></pre><ul>
<li>Tag fifty-five items from the Alliance’s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
"Giles, James",James Giles: 0000-0003-1899-9206
"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
"Templer, Noel",Noel Templer: 0000-0002-3201-9043
"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>I deployed the latest <code>6_x-prod</code> branch on CGSpace, ran all system updates, and rebooted the server
<ul>
<li>This included the IWMI changes, so I also migrated the <code>cg.subject.iwmi</code> metadata to <code>dcterms.subject</code> and deleted the subject term</li>
<li>Then I started a full Discovery reindex</li>
</ul>
</li>
</ul>
<h2 id="2021-05-19">2021-05-19</h2>
<ul>
<li>I realized that I need to lowercase the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase
<ul>
<li>I checked and, to my surprise, <code>dcterms.subject</code> has 47,000 metadata values that are upper or mixed case!</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 47405
</code></pre><ul>
<li>That’s interesting because we lowercased them all a few months ago, so these must all be new… wow
<ul>
<li>We have 405,000 total AGROVOC terms, with 20,600 of them being unique</li>
<li>I will have to start another Discovery re-indexing to pick up these new changes</li>
</ul>
</li>
</ul>
<h2 id="2021-05-20">2021-05-20</h2>
<ul>
<li>Export the top 5,000 AGROVOC terms to validate them:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
$ csvgrep -c "number of matches" -m 0 /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
</code></pre><ul>
<li>Meeting with Medha and Pythagoras about the FAIR Workflow tool
<ul>
<li>Discussed the need for such a tool, other tools being developed, etc</li>
<li>I stressed the importance of controlled vocabularies</li>
<li>No real outcome, except to keep us posted and let us know if they need help testing on DSpace</li>
</ul>
</li>
<li>Meeting with Hector Tobon to discuss issues with CLARISA
<ul>
<li>They pushed back a bit, saying they were more focused on the needs of the CG</li>
<li>They are not against the idea of aligning closer to ROR, but lack the manpower</li>
<li>They pointed out that their countries come directly from the <a href="https://www.iso.org/iso-3166-country-codes.html">ISO 3166 online browsing platform on the ISO website</a></li>
<li>Indeed the text value for Russia is “Russian Federation (the)” there… I find that strange</li>
<li>I filed <a href="https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33">an issue</a> on the iso-codes GitLab repository</li>
</ul>
</li>
</ul>
<h2 id="2021-05-24">2021-05-24</h2>
<ul>
<li>Add ORCID identifiers for missing ILRI authors and tag 550 others, based on a few authors I noticed were missing them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
</code></pre><ul>
<li>The indexes look OK so I started a harvest on AReS</li>
</ul>
<h2 id="2021-05-25">2021-05-25</h2>
<ul>
<li>The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
</code></pre><ul>
<li>Update all docker images on the AReS server (linode20):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
</code></pre><ul>
<li>Then run all system updates on the server and reboot it</li>
<li>Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317… so it was actually correct before!</li>
<li>For reference, this is how I re-created everything:</li>
</ul>
<pre><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
<li>I will just start a new harvest… sigh</li>
</ul>
<!-- raw HTML omitted -->