Add notes for 2020-10-28

2020-10-28 17:53:19 +03:00
parent 5f76797488
commit a0368b4e52
22 changed files with 352 additions and 29 deletions


@ -23,7 +23,7 @@ During the FlywayDB migration I got an error:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
<meta property="article:modified_time" content="2020-10-24T22:23:06+03:00" />
<meta property="article:modified_time" content="2020-10-26T16:34:45+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2020"/>
@ -51,9 +51,9 @@ During the FlywayDB migration I got an error:
"@type": "BlogPosting",
"headline": "October, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
"wordCount": "5014",
"wordCount": "6262",
"datePublished": "2020-10-06T16:55:54+03:00",
"dateModified": "2020-10-24T22:23:06+03:00",
"dateModified": "2020-10-26T16:34:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -975,11 +975,180 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;</li>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
<li>The processing finished with no errors and afterwards I purged 800,000 unmigrated records (all with <code>type: 5</code>):</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
<li>Add new ORCID identifier for <a href="https://orcid.org/0000-0003-3871-6277">Perle LATRE DE LATE</a> to controlled vocabulary</li>
<li>Use <code>move-collections.sh</code> to move a few AgriFood Tools collections on CGSpace into a new <a href="https://hdl.handle.net/10568/109982">sub community</a></li>
</ul>
<!-- raw HTML omitted -->
<h2 id="2020-10-27">2020-10-27</h2>
<ul>
<li>I purged 849,408 unmigrated records from the statistics-2016 core after it finished processing&hellip;</li>
<li>I purged 285,000 unmigrated records from the statistics-2015 core after it finished processing&hellip;</li>
<li>I purged 196,000 unmigrated records from the statistics-2014 core after it finished processing&hellip;</li>
<li>I finally finished processing all the statistics cores with the <code>solr-upgrade-statistics-6x</code> utility on DSpace Test
<ul>
<li>I started the Atmire stats processing:</li>
</ul>
</li>
</ul>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Peter asked me to add the new preferred AGROVOC subject &ldquo;covid-19&rdquo; to all items where we had previously added &ldquo;coronavirus disease&rdquo;, and to make sure all items with the ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; also have the AGROVOC subject &ldquo;zoonoses&rdquo;
<ul>
<li>I exported all the records on CGSpace from the CLI and extracted the columns I needed to process them in OpenRefine:</li>
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
<ul>
<li>They want to do a big push for ILRI and our partners to use it in mid November (around the 16th), so we need to clean up the metadata and try to fix the views/downloads issue by then</li>
<li>I filed <a href="https://github.com/ilri/OpenRXV/issues/45">an issue</a> on OpenRXV for the views/downloads problem</li>
<li>We also talked about harvesting CIMMYT&rsquo;s repository into AReS, perhaps with only a subset of their data, though they seem to have some issues with their data:
<ul>
<li>dc.contributor.author and dcterms.creator</li>
<li>dc.title and dcterms.title</li>
<li>dc.region.focus</li>
<li>dc.coverage.countryfocus</li>
<li>dc.rights.accesslevel (access status)</li>
<li>dc.source.journal (source)</li>
<li>dcterms.type and dc.type</li>
<li>dc.subject.agrovoc</li>
</ul>
</li>
</ul>
</li>
<li>I did some work on my previous <code>create-mappings.py</code> script to process journal titles and sponsors/investors as well as CRPs and affiliations
<ul>
<li>I converted it to use the Elasticsearch scroll API directly rather than consuming a JSON file</li>
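</ul>
That scroll loop might look something like this (a hedged sketch using only the Python standard library; the index name and page size are assumptions, and the HTTP call is injectable so the paging can be exercised without a live Elasticsearch):

```python
# Hedged sketch of the scroll loop in create-mappings.py (the index name
# "openrxv-items" and page size are assumptions, not taken from the script).
import json
import urllib.request

def scroll_hits(es_url, index, page_size=1000, fetch=None):
    """Yield every hit in `index` by paging with the Elasticsearch scroll API."""
    if fetch is None:
        def fetch(url, body):
            req = urllib.request.Request(
                url,
                data=json.dumps(body).encode(),
                headers={"Content-Type": "application/json"},
            )
            return json.load(urllib.request.urlopen(req))
    # Open the scroll context with the first page of results
    resp = fetch(f"{es_url}/{index}/_search?scroll=2m",
                 {"size": page_size, "query": {"match_all": {}}})
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        # Fetch the next page using the scroll id from the previous response
        resp = fetch(f"{es_url}/_search/scroll",
                     {"scroll": "2m", "scroll_id": resp["_scroll_id"]})
```

<ul>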
<li>The result is about 1200 mappings, mostly to remove acronyms at the end of metadata values</li>
<li>I added a few custom mappings using <code>convert-mapping.py</code> and then uploaded them to AReS:</li>
</ul>
</li>
</ul>
<pre><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
</code></pre><ul>
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
</ul>
<pre><code>$ docker-compose up --build -d angular_nginx
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
<ul>
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
</ul>
<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
</code></pre><ul>
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent casing like &ldquo;Burkina faso&rdquo; is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
<ul>
<li>I don&rsquo;t understand the TypeScript syntax, so for now I will just disable that formatter in each repository configuration; that should be fine, as we&rsquo;re all using title case like &ldquo;Kenya&rdquo; and &ldquo;Burkina Faso&rdquo; now anyway</li>
</ul>
</li>
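</ul>
My best guess at the bug, reconstructed in Python (a hypothetical sketch; the actual formatter is TypeScript and may differ): a naive capitalize lowercases every character after the first, while per-word title casing would be safe:

```python
# Hypothetical reconstruction of the bug (the real formatter is TypeScript,
# which I have not read closely): str.capitalize() lowercases everything
# after the first character, which produces values like "Burkina faso".
def capitalize_formatter(value: str) -> str:
    return value.capitalize()

def title_case_formatter(value: str) -> str:
    # Capitalize each word instead, which is safe for multi-word countries
    return " ".join(word.capitalize() for word in value.split())

print(capitalize_formatter("Burkina Faso"))   # Burkina faso (the bug)
print(title_case_formatter("burkina faso"))   # Burkina Faso
```

<ul>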
<li>Also, I fixed a few mappings with WorldFish data</li>
<li>Peter really wants us to move forward with the alignment of our regions to UN M.49, and the CKM web team hasn&rsquo;t responded to any of the emails we&rsquo;ve sent recently, so I will just do it
<ul>
<li>These are the changes that will happen in the input forms:
<ul>
<li>East Africa → Eastern Africa</li>
<li>West Africa → Western Africa</li>
<li>Southeast Asia → South-eastern Asia</li>
<li>South Asia → Southern Asia</li>
<li>Africa South of Sahara → Sub-Saharan Africa</li>
<li>North Africa → Northern Africa</li>
<li>West Asia → Western Asia</li>
</ul>
</li>
<li>There are some regions we use that are not present in UN M.49, for example Sahel, ACP, Middle East, and West and Central Africa. I will advocate for closer alignment later</li>
<li>I ran my <code>fix-metadata-values.py</code> script to update the values in the database:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-10-28-update-regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
Southeast Asia,South-eastern Asia
South Asia,Southern Asia
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m14.294s
user 7m59.840s
sys 2m22.327s
</code></pre><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>&hellip;
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>I should have used this: <code>id:/.+-unmigrated/</code></li>
<li>Or perhaps this (with a check first!): <code>*:* NOT id:/.{36}/</code></li>
<li>We need to make sure to explicitly purge the unmigrated records first, then purge any that do not match the UUID pattern (after inspecting them manually!)</li>
<li>There are still 3.7 million records in our ten years of Solr statistics that are unmigrated (I only noticed because the DSpace Statistics API indexer kept failing)</li>
<li>I don&rsquo;t think this is serious enough to re-start the simulation of the DSpace 6 migration over again, but I definitely want to make sure I use the correct query when I do CGSpace</li>
</ul>
</li>
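</ul>
For next time, the corrected purge sequence could be scripted like this (a hypothetical sketch following the curl examples above, not the command I actually ran; the actual delete calls are commented out so nothing fires without a manual check of the match counts first):

```python
# Hedged sketch of the corrected two-step purge, standard library only.
import urllib.request

def build_delete(query: str) -> bytes:
    """Build the XML body that Solr's update handler expects for delete-by-query."""
    return f"<delete><query>{query}</query></delete>".encode()

def purge(core_url: str, query: str) -> None:
    # POST the delete to the core with a soft commit, like the curl examples above
    req = urllib.request.Request(
        f"{core_url}/update?softCommit=true",
        data=build_delete(query),
        headers={"Content-Type": "text/xml"},
    )
    urllib.request.urlopen(req)

# First purge the records explicitly marked as unmigrated:
# purge("http://localhost:8083/solr/statistics-2017", "id:/.+-unmigrated/")
# Then, only after inspecting the matches manually, purge anything that is
# not a 36-character UUID:
# purge("http://localhost:8083/solr/statistics-2017", "*:* NOT id:/.{36}/")
```

<ul>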
<li>The AReS indexing finished after I removed the country formatting from all the repository configurations and now I see values like &ldquo;SA&rdquo;, &ldquo;CA&rdquo;, etc&hellip;
<ul>
<li>We do still need that formatter to fix MELSpace&rsquo;s countries, though, so I will re-enable it for their repository</li>
</ul>
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
COPY 71748
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.publisher&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
COPY 3882
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.source&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
COPY 3684
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.relation.ispartofseries&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
COPY 5598
</code></pre><ul>
<li>I noticed there are still some mappings for acronyms and other fixes that haven&rsquo;t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
<ul>
<li>Now I&rsquo;m comparing yesterday&rsquo;s mappings with today&rsquo;s and I don&rsquo;t see any duplicates&hellip;</li>
</ul>
</li>
</ul>
<pre><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
1578
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | uniq | wc -l
1578
</code></pre><ul>
<li>I have no idea why they wouldn&rsquo;t have been caught yesterday when I originally ran the script on a clean AReS with no mappings&hellip;
<ul>
<li>In any case, I combined the mappings and then uploaded them to AReS:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><!-- raw HTML omitted -->