Add notes for 2020-10-28

2020-10-28 17:53:19 +03:00
parent 5f76797488
commit a0368b4e52
22 changed files with 352 additions and 29 deletions


@ -23,7 +23,7 @@ During the FlywayDB migration I got an error:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
<meta property="article:modified_time" content="2020-10-24T22:23:06+03:00" />
<meta property="article:modified_time" content="2020-10-26T16:34:45+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2020"/>
@ -51,9 +51,9 @@ During the FlywayDB migration I got an error:
"@type": "BlogPosting",
"headline": "October, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
"wordCount": "5014",
"wordCount": "6262",
"datePublished": "2020-10-06T16:55:54+03:00",
"dateModified": "2020-10-24T22:23:06+03:00",
"dateModified": "2020-10-26T16:34:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -975,11 +975,180 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;</li>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
<li>The processing finished with no errors and afterwards I purged 800,000 unmigrated records (all with <code>type: 5</code>):</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
<li>Add new ORCID identifier for <a href="https://orcid.org/0000-0003-3871-6277">Perle LATRE DE LATE</a> to controlled vocabulary</li>
<li>Use <code>move-collections.sh</code> to move a few AgriFood Tools collections on CGSpace into a new <a href="https://hdl.handle.net/10568/109982">sub community</a></li>
</ul>
<!-- raw HTML omitted -->
<h2 id="2020-10-27">2020-10-27</h2>
<ul>
<li>I purged 849,408 unmigrated records from the statistics-2016 core after it finished processing&hellip;</li>
<li>I purged 285,000 unmigrated records from the statistics-2015 core after it finished processing&hellip;</li>
<li>I purged 196,000 unmigrated records from the statistics-2014 core after it finished processing&hellip;</li>
<li>I finally finished processing all the statistics cores with the <code>solr-upgrade-statistics-6x</code> utility on DSpace Test
<ul>
<li>I started the Atmire stats processing:</li>
</ul>
</li>
</ul>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Peter asked me to add the new preferred AGROVOC subject &ldquo;covid-19&rdquo; to all items where we had previously added &ldquo;coronavirus disease&rdquo;, and to make sure all items with the ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; also have the AGROVOC subject &ldquo;zoonoses&rdquo;
<ul>
<li>I exported all the records on CGSpace from the CLI and extracted the columns I needed to process them in OpenRefine:</li>
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
<ul>
<li>They want to do a big push for ILRI and our partners to use it in mid November (around the 16th), so we need to clean up the metadata and try to fix the views/downloads issue by then</li>
<li>I filed <a href="https://github.com/ilri/OpenRXV/issues/45">an issue</a> on OpenRXV for the views/downloads problem</li>
<li>We also talked about harvesting CIMMYT&rsquo;s repository into AReS, perhaps with only a subset of their data, though they seem to have some issues with their data:
<ul>
<li>dc.contributor.author and dcterms.creator</li>
<li>dc.title and dcterms.title</li>
<li>dc.region.focus</li>
<li>dc.coverage.countryfocus</li>
<li>dc.rights.accesslevel (access status)</li>
<li>dc.source.journal (source)</li>
<li>dcterms.type and dc.type</li>
<li>dc.subject.agrovoc</li>
</ul>
</li>
</ul>
</li>
<li>I did some work on my previous <code>create-mappings.py</code> script to process journal titles and sponsors/investors as well as CRPs and affiliations
<ul>
<li>I converted it to use the Elasticsearch scroll API directly rather than consuming a JSON file</li>
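</ul>
That scroll loop might look something like this (a hedged sketch using only the Python standard library; the index name and page size are assumptions, and the HTTP call is injectable so the paging can be exercised without a live Elasticsearch):

```python
# Hedged sketch of the scroll loop in create-mappings.py (the index name
# "openrxv-items" and page size are assumptions, not taken from the script).
import json
import urllib.request

def scroll_hits(es_url, index, page_size=1000, fetch=None):
    """Yield every hit in `index` by paging with the Elasticsearch scroll API."""
    if fetch is None:
        def fetch(url, body):
            req = urllib.request.Request(
                url,
                data=json.dumps(body).encode(),
                headers={"Content-Type": "application/json"},
            )
            return json.load(urllib.request.urlopen(req))
    # Open the scroll context with the first page of results
    resp = fetch(f"{es_url}/{index}/_search?scroll=2m",
                 {"size": page_size, "query": {"match_all": {}}})
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        # Fetch the next page using the scroll id from the previous response
        resp = fetch(f"{es_url}/_search/scroll",
                     {"scroll": "2m", "scroll_id": resp["_scroll_id"]})
```

<ul>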
<li>The result is about 1200 mappings, mostly to remove acronyms at the end of metadata values</li>
<li>I added a few custom mappings using <code>convert-mapping.py</code> and then uploaded them to AReS:</li>
</ul>
</li>
</ul>
<pre><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
</code></pre><ul>
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
</ul>
<pre><code>$ docker-compose up --build -d angular_nginx
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
<ul>
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
</ul>
<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
</code></pre><ul>
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent casing like &ldquo;Burkina faso&rdquo; is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
<ul>
<li>I don&rsquo;t understand the TypeScript syntax, so for now I will just disable that formatter in each repository configuration; that should be fine, as we&rsquo;re all using title case like &ldquo;Kenya&rdquo; and &ldquo;Burkina Faso&rdquo; now anyway</li>
</ul>
</li>
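</ul>
My best guess at the bug, reconstructed in Python (a hypothetical sketch; the actual formatter is TypeScript and may differ): a naive capitalize lowercases every character after the first, while per-word title casing would be safe:

```python
# Hypothetical reconstruction of the bug (the real formatter is TypeScript,
# which I have not read closely): str.capitalize() lowercases everything
# after the first character, which produces values like "Burkina faso".
def capitalize_formatter(value: str) -> str:
    return value.capitalize()

def title_case_formatter(value: str) -> str:
    # Capitalize each word instead, which is safe for multi-word countries
    return " ".join(word.capitalize() for word in value.split())

print(capitalize_formatter("Burkina Faso"))   # Burkina faso (the bug)
print(title_case_formatter("burkina faso"))   # Burkina Faso
```

<ul>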
<li>Also, I fixed a few mappings with WorldFish data</li>
<li>Peter really wants us to move forward with the alignment of our regions to UN M.49, and the CKM web team hasn&rsquo;t responded to any of the emails we&rsquo;ve sent recently, so I will just do it
<ul>
<li>These are the changes that will happen in the input forms:
<ul>
<li>East Africa → Eastern Africa</li>
<li>West Africa → Western Africa</li>
<li>Southeast Asia → South-eastern Asia</li>
<li>South Asia → Southern Asia</li>
<li>Africa South of Sahara → Sub-Saharan Africa</li>
<li>North Africa → Northern Africa</li>
<li>West Asia → Western Asia</li>
</ul>
</li>
<li>There are some regions we use that are not present in UN M.49, for example Sahel, ACP, Middle East, and West and Central Africa. I will advocate for closer alignment later</li>
<li>I ran my <code>fix-metadata-values.py</code> script to update the values in the database:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-10-28-update-regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
Southeast Asia,South-eastern Asia
South Asia,Southern Asia
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m14.294s
user 7m59.840s
sys 2m22.327s
</code></pre><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>&hellip;
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>I should have used this: <code>id:/.+-unmigrated/</code></li>
<li>Or perhaps this (with a check first!): <code>*:* NOT id:/.{36}/</code></li>
<li>We need to make sure to explicitly purge the unmigrated records first, then purge any that do not match the UUID pattern (after inspecting them manually!)</li>
<li>There are still 3.7 million records in our ten years of Solr statistics that are unmigrated (I only noticed because the DSpace Statistics API indexer kept failing)</li>
<li>I don&rsquo;t think this is serious enough to re-start the simulation of the DSpace 6 migration over again, but I definitely want to make sure I use the correct query when I do CGSpace</li>
</ul>
</li>
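</ul>
For next time, the corrected purge sequence could be scripted like this (a hypothetical sketch following the curl examples above, not the command I actually ran; the actual delete calls are commented out so nothing fires without a manual check of the match counts first):

```python
# Hedged sketch of the corrected two-step purge, standard library only.
import urllib.request

def build_delete(query: str) -> bytes:
    """Build the XML body that Solr's update handler expects for delete-by-query."""
    return f"<delete><query>{query}</query></delete>".encode()

def purge(core_url: str, query: str) -> None:
    # POST the delete to the core with a soft commit, like the curl examples above
    req = urllib.request.Request(
        f"{core_url}/update?softCommit=true",
        data=build_delete(query),
        headers={"Content-Type": "text/xml"},
    )
    urllib.request.urlopen(req)

# First purge the records explicitly marked as unmigrated:
# purge("http://localhost:8083/solr/statistics-2017", "id:/.+-unmigrated/")
# Then, only after inspecting the matches manually, purge anything that is
# not a 36-character UUID:
# purge("http://localhost:8083/solr/statistics-2017", "*:* NOT id:/.{36}/")
```

<ul>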
<li>The AReS indexing finished after I removed the country formatting from all the repository configurations and now I see values like &ldquo;SA&rdquo;, &ldquo;CA&rdquo;, etc&hellip;
<ul>
<li>We do still need that formatter to fix MELSpace&rsquo;s countries, though, so I will re-enable it for their repository</li>
</ul>
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
COPY 71748
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.publisher&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
COPY 3882
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.source&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
COPY 3684
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.relation.ispartofseries&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
COPY 5598
</code></pre><ul>
<li>I noticed there are still some mappings for acronyms and other fixes that haven&rsquo;t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
<ul>
<li>Now I&rsquo;m comparing yesterday&rsquo;s mappings with today&rsquo;s and I don&rsquo;t see any duplicates&hellip;</li>
</ul>
</li>
</ul>
<pre><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
1578
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | uniq | wc -l
1578
</code></pre><ul>
<li>I have no idea why they wouldn&rsquo;t have been caught yesterday when I originally ran the script on a clean AReS with no mappings&hellip;
<ul>
<li>In any case, I combined the mappings and then uploaded them to AReS:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><!-- raw HTML omitted -->