Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -144,10 +144,10 @@ During the FlywayDB migration I got an error:
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description=&#39;Electronic publishing&#39;, internal=&#39;FALSE&#39;, mimetype=&#39;application/epub+zip&#39;, short_description=&#39;EPUB&#39;, support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &#34;bitstreamformatregistry_short_description_key&#34;
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &#34;bitstreamformatregistry_short_description_key&#34;
Detail: Key (short_description)=(EPUB) already exists.
2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
@@ -233,7 +233,7 @@ New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang=&quot;en&quot;&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 Not Found&lt;/title&gt;&lt;style type=&quot;text/css&quot;&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 Not Found&lt;/h1&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang=&#34;en&#34;&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 Not Found&lt;/title&gt;&lt;style type=&#34;text/css&#34;&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 Not Found&lt;/h1&gt;&lt;hr class=&#34;line&#34; /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class=&#34;line&#34; /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@@ -278,7 +278,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com &#39;password=fuuuu&#39;
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
@@ -287,25 +287,25 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
<pre tabindex="0"><code>{ &quot;metadata&quot;: [
<pre tabindex="0"><code>{ &#34;metadata&#34;: [
{
&quot;key&quot;: &quot;dc.title&quot;,
&quot;value&quot;: &quot;Testing REST API post&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.title&#34;,
&#34;value&#34;: &#34;Testing REST API post&#34;,
&#34;language&#34;: &#34;en_US&#34;
},
{
&quot;key&quot;: &quot;dc.contributor.author&quot;,
&quot;value&quot;: &quot;Orth, Alan&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.contributor.author&#34;,
&#34;value&#34;: &#34;Orth, Alan&#34;,
&#34;language&#34;: &#34;en_US&#34;
},
{
&quot;key&quot;: &quot;dc.date.issued&quot;,
&quot;value&quot;: &quot;2020-09-01&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.date.issued&#34;,
&#34;value&#34;: &#34;2020-09-01&#34;,
&#34;language&#34;: &#34;en_US&#34;
}
],
&quot;archived&quot;:&quot;false&quot;,
&quot;withdrawn&quot;:&quot;false&quot;
&#34;archived&#34;:&#34;false&#34;,
&#34;withdrawn&#34;:&#34;false&#34;
}
</code></pre><ul>
<li>What is unclear to me is the <code>archived</code> parameter, it seems to do nothing&hellip; perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
@@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com &#39;password=ddddd&#39;
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 &lt; item-object.json
</code></pre><ul>
@@ -408,10 +408,10 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
</code></pre><ul>
<li>After a few minutes I saw these four hits in Solr&hellip; WTF
<ul>
@@ -483,7 +483,7 @@ dspace=&gt; COMMIT;
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &#34;cg.coverage.country&#34; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
@@ -493,7 +493,7 @@ COPY 195
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre tabindex="0"><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
<pre tabindex="0"><code>:&#39;&lt;,&#39;&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka &ldquo;lookaround&rdquo; in PCRE?) to match words that are <em>not</em> &ldquo;pair&rdquo;, &ldquo;displayed&rdquo;, etc because we don&rsquo;t want to edit the XML tags themselves&hellip;
<ul>
@@ -509,14 +509,14 @@ COPY 195
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &#34;cg.coverage.region&#34; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as the countries in OpenRefine for the database values and in vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -t &#39;correct&#39; -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -t &#39;correct&#39; -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
@@ -583,14 +583,14 @@ sys 2m22.713s
dspace=&gt; UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=&gt; COMMIT;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.subject&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY &quot;dc.subject&quot; ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.subject&#34;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY &#34;dc.subject&#34; ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' &gt; dspace/config/controlled-vocabularies/dc-subject.xml
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed &#39;1d&#39; &gt; dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>
@@ -614,7 +614,7 @@ sys 2m22.713s
<li>They are using the user agent &ldquo;CCAFS Website Publications importer BOT&rdquo; so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H &quot;User-Agent: CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;User-Agent: CCAFS Website Publications importer BOT&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&#34; -d &#39;{&#34;key&#34;:&#34;cg.contributor.crp&#34;, &#34;value&#34;:&#34;Climate Change, Agriculture and Food Security&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
@@ -624,23 +624,23 @@ sys 2m22.713s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_delete_by_query&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;query&quot;: {
&quot;match&quot;: {
&quot;_id&quot;: &quot;64j_THMBiwiQ-PKfCSlI&quot;
&#34;query&#34;: {
&#34;match&#34;: {
&#34;_id&#34;: &#34;64j_THMBiwiQ-PKfCSlI&#34;
}
}
}
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_doc?pretty&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
&#34;find&#34;: &#34;ALAN1&#34;,
&#34;replace&#34;: &#34;ALAN2&#34;,
}
'
&#39;
</code></pre><ul>
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don&rsquo;t see it in OpenRXV&rsquo;s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
@@ -649,12 +649,12 @@ sys 2m22.713s
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_doc?pretty&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
&#34;find&#34;: &#34;ALAN1&#34;,
&#34;replace&#34;: &#34;ALAN2&#34;,
}
'
&#39;
</code></pre><ul>
<li>But I still don&rsquo;t see it in AReS</li>
<li>Interesting! I added a find/replace manually in AReS and now I see the one I POSTed&hellip;</li>
@@ -683,63 +683,63 @@ sys 2m22.713s
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @./mapping.json
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @./mapping.json
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
<pre tabindex="0"><code>{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;CRP on Dryland Systems - DS&quot;, &quot;replace&quot;: &quot;Dryland Systems&quot; }
{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;FISH&quot;, &quot;replace&quot;: &quot;Fish&quot; }
<pre tabindex="0"><code>{&#34;index&#34;:{}}
{ &#34;find&#34;: &#34;CRP on Dryland Systems - DS&#34;, &#34;replace&#34;: &#34;Dryland Systems&#34; }
{&#34;index&#34;:{}}
{ &#34;find&#34;: &#34;FISH&#34;, &#34;replace&#34;: &#34;Fish&#34; }
</code></pre><ul>
<li>Adjust the report templates on AReS based on some of Peter&rsquo;s feedback</li>
<li>I wrote a quick Python script to filter and convert the old AReS mappings to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Elasticsearch&rsquo;s Bulk API</a> format:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#75715e">#!/usr/bin/env python3</span>
<span style="color:#f92672">import</span> json
<span style="color:#f92672">import</span> re
f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">&#39;/tmp/mapping.json&#39;</span>, <span style="color:#e6db74">&#39;r&#39;</span>)
data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
<span style="color:#75715e"># Iterate over old mapping file, which is in format &#34;find&#34;: &#34;replace&#34;, ie:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># &#34;alan&#34;: &#34;ALAN&#34;</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch&#39;s Bulk API:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># { &#34;find&#34;: &#34;alan&#34;, &#34;replace&#34;: &#34;ALAN&#34; }</span>
<span style="color:#75715e">#</span>
<span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
<span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
<span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
<span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
<span style="color:#66d9ef">continue</span>
<span style="color:#75715e"># Skip replacements with acronyms like:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
<span style="color:#75715e">#</span>
acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;[A-Z]+$&#34;</span>)
acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
<span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
<span style="color:#66d9ef">continue</span>
mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">&#34;find&#34;</span>: find, <span style="color:#e6db74">&#34;replace&#34;</span>: replace }
<span style="color:#75715e"># Print command for Elasticsearch</span>
print(<span style="color:#e6db74">&#39;{&#34;index&#34;:</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}&#39;</span>)
print(json<span style="color:#f92672">.</span>dumps(mapping))
f<span style="color:#f92672">.</span>close()
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e">#!/usr/bin/env python3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">&#39;/tmp/mapping.json&#39;</span>, <span style="color:#e6db74">&#39;r&#39;</span>)
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Iterate over old mapping file, which is in format &#34;find&#34;: &#34;replace&#34;, ie:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># &#34;alan&#34;: &#34;ALAN&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch&#39;s Bulk API:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># { &#34;find&#34;: &#34;alan&#34;, &#34;replace&#34;: &#34;ALAN&#34; }</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Skip replacements with acronyms like:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span> acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;[A-Z]+$&#34;</span>)
</span></span><span style="display:flex;"><span> acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">&#34;find&#34;</span>: find, <span style="color:#e6db74">&#34;replace&#34;</span>: replace }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Print command for Elasticsearch</span>
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">&#39;{&#34;index&#34;:</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}&#39;</span>)
</span></span><span style="display:flex;"><span> print(json<span style="color:#f92672">.</span>dumps(mapping))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f<span style="color:#f92672">.</span>close()
</span></span></code></pre></div><ul>
<li>It filters all upper and lower case strings as well as any replacements that end in an acronym like &ldquo;- ILRI&rdquo;, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch core and then POSTed it:</li>
</ul>
<pre tabindex="0"><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elastic-mappings.txt
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
<li>Then in AReS I didn&rsquo;t see the mappings in the dashboard until I added a new one manually, after which they all appeared
<ul>
@@ -762,12 +762,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(192921) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(192921) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);&#39;
UPDATE 1
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
@@ -794,8 +794,8 @@ Total number of bot hits purged: 8174
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ curl -s &#39;http://localhost:8083/solr/statistics/update?softCommit=true&#39;
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
<ul>
@@ -817,9 +817,9 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
$ csvcut -c &#39;id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]&#39; /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>Then I went through all center subjects looking for &ldquo;WOMEN&rdquo; or &ldquo;GENDER&rdquo; and checking if they were missing the associated AGROVOC subject
<ul>
@@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
<pre tabindex="0"><code>$ http &#39;http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*&#39; &gt; /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
<ul>
@@ -897,8 +897,8 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ psql dspacetest -c &#39;CREATE EXTENSION pgcrypto;&#39;
$ psql dspacetest -c &#34;DELETE FROM schema_version WHERE version IN (&#39;5.8.2015.12.03.3&#39;);&#34;
$ exit
$ sudo systemctl stop tomcat7
$ cd dspace/target/dspace-installer
@@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
</code></pre><ul>
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
@ -920,8 +920,8 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
<li>I started processing the statistics-2019 core&hellip;
@ -967,8 +967,8 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
<li>I increased the memory to 4096m and tried again</li>
@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics-2017/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -1011,7 +1011,7 @@ java.lang.OutOfMemoryError: Java heap space
</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
$ csvcut -c &#39;id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]&#39; /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
@ -1043,7 +1043,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<pre tabindex="0"><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @/tmp/elasticsearch-mappings.txt
</code></pre><ul>
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
@ -1088,16 +1088,16 @@ South Asia,Southern Asia
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -t &#39;correct&#39; -m 227 -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
user 7m59.840s
sys 2m22.327s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
</span></span><span style="display:flex;"><span>user 7m59.840s
</span></span><span style="display:flex;"><span>sys 2m22.327s
</span></span></code></pre></div><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>&hellip;
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
@ -1115,17 +1115,17 @@ sys 2m22.327s
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.description.sponsorship&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
COPY 71748
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.publisher&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.publisher&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
COPY 3882
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.source&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.source&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
COPY 3684
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.relation.ispartofseries&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.relation.ispartofseries&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
COPY 5598
</code></pre><ul>
<li>I noticed there are still some mappings for acronyms and other fixes that haven&rsquo;t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
@ -1134,12 +1134,12 @@ COPY 5598
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
<pre tabindex="0"><code>$ grep -c &#39;&#34;find&#34;&#39; /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
$ cat /tmp/elasticsearch-mappings* | grep -v &#39;{&#34;index&#34;:{}}&#39; | wc -l
1578
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | uniq | wc -l
$ cat /tmp/elasticsearch-mappings* | grep -v &#39;{&#34;index&#34;:{}}&#39; | sort | uniq | wc -l
1578
</code></pre><ul>
<li>I have no idea why they wouldn&rsquo;t have been caught yesterday when I originally ran the script on a clean AReS with no mappings&hellip;
@ -1148,10 +1148,10 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | u
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
</span></span><span style="display:flex;"><span>$ curl -XDELETE http://localhost:9200/openrxv-values
</span></span><span style="display:flex;"><span>$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</span></span></code></pre></div><ul>
<li>The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
<ul>
<li>There are still a few acronyms present, some of which are in the value mappings and some which aren&rsquo;t</li>
@ -1160,7 +1160,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="co
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 123
dspace=# COMMIT;
</code></pre><ul>
@ -1198,10 +1198,10 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><ul>
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.source -t &#39;correct&#39; -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.publisher -m 39
</code></pre><ul>
<li>I will wait to apply them on CGSpace when I have all the other corrections from Peter processed</li>
</ul>
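<ul>
<li>For reference, a rough sketch of what a dry run of such a correction could look like. This is not the actual <code>fix-metadata-values.py</code>: the <code>plan_updates</code> helper, the sample rows, and the psycopg2-style <code>%s</code> placeholders are assumptions for illustration. Only the CSV layout (a column named after the field plus a <code>correct</code> column) and the <code>metadatavalue</code> columns come from the commands and queries in these notes.</li>
</ul>

```python
import csv
import io


def plan_updates(csv_text, from_column, to_column, field_id):
    """Return (sql, params) pairs for rows whose value would actually change.

    Hypothetical helper: it only builds parameterized statements and never
    touches a database, so the plan can be reviewed before anything is applied.
    """
    sql = (
        "UPDATE metadatavalue SET text_value = %s "
        "WHERE resource_type_id = 2 AND metadata_field_id = %s AND text_value = %s"
    )
    plans = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        old, new = row[from_column], row[to_column]
        if old == new:
            # Nothing to correct for this row
            continue
        plans.append((sql, (new, field_id, old)))
    return plans


# Made-up sample rows; 55 is the dc.source field ID used in the commands above
sample = "dc.source,correct\nJ. Dairy Sci.,Journal of Dairy Science\nNature,Nature\n"
for statement, params in plan_updates(sample, "dc.source", "correct", 55):
    print(statement, params)
```

<ul>
<li>Printing the planned statements first would make it easy to check a batch of corrections locally before running anything against CGSpace.</li>
</ul>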
@ -1214,8 +1214,8 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
</li>
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -t &#39;correct&#39; -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
</ul>