Add notes for 2021-09-13

commit c05c7213c2
parent 8b487a4a77
Date: 2021-09-13 16:21:16 +03:00
109 changed files with 2627 additions and 2530 deletions


@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -144,7 +144,7 @@ During the FlywayDB migration I got an error:
</ul>
</li>
</ul>
<pre><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
@ -212,7 +212,7 @@ org.hibernate.exception.ConstraintViolationException: could not execute batch
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
Loading @mire database changes for module MQM
Changes have been processed
-----------------------------------------------------------
@ -259,7 +259,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</code></pre><ul>
<li>Also, I tested Listings and Reports and there are still no hits for &ldquo;Orth, Alan&rdquo; as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
</ul>
<pre><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&amp;fq=dateIssued.year:[2013+TO+2021]&amp;rows=500&amp;wt=javabin&amp;version=2} hits=18 status=0 QTime=10
<pre tabindex="0"><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&amp;fq=dateIssued.year:[2013+TO+2021]&amp;rows=500&amp;wt=javabin&amp;version=2} hits=18 status=0 QTime=10
</code></pre><ul>
<li>Solr returns <code>hits=18</code> for the L&amp;R query, but no results are shown in the L&amp;R UI</li>
<li>I sent all this feedback to Atmire&hellip;</li>
@ -278,16 +278,16 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</ul>
</li>
</ul>
<pre><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
</ul>
<pre><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE &lt; item-object.json
<pre tabindex="0"><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE &lt; item-object.json
</code></pre><ul>
<li>The format of the JSON is:</li>
</ul>
<pre><code>{ &quot;metadata&quot;: [
<pre tabindex="0"><code>{ &quot;metadata&quot;: [
{
&quot;key&quot;: &quot;dc.title&quot;,
&quot;value&quot;: &quot;Testing REST API post&quot;,
@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</ul>
</li>
</ul>
<pre><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 &lt; item-object.json
</code></pre><ul>
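<li>(A minimal Python sketch of the same login/status/POST flow using the <code>requests</code> library; it assumes, as in DSpace 5.x, that <code>/rest/login</code> returns the token in the response body)
<pre tabindex="0"><code>import json
import requests

base = 'http://localhost:8080/rest'

# log in and capture the rest-dspace-token from the response body
token = requests.post(base + '/login', json={'email': 'aorth@fuuu.com', 'password': 'ddddd'}).text.strip()

# check the session status
print(requests.get(base + '/status', headers={'rest-dspace-token': token}).json())

# POST the item JSON to the target collection
with open('item-object.json') as f:
    item = json.load(f)

r = requests.post(base + '/collections/1549/items', headers={'rest-dspace-token': token}, json=item)
print(r.status_code, r.text)
</code></pre>
</li>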
@ -408,7 +408,7 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
</ul>
</li>
</ul>
<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
@ -438,7 +438,7 @@ $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-438
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
<li>I added a few of the patterns from above to our local agents list and ran the <code>check-spider-hits.sh</code> on CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
@ -472,7 +472,7 @@ Total number of bot hits purged: 3684
</li>
<li>I can update the country metadata in PostgreSQL like this:</li>
</ul>
<pre><code>dspace=&gt; BEGIN;
<pre tabindex="0"><code>dspace=&gt; BEGIN;
dspace=&gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=&gt; COMMIT;
@ -483,7 +483,7 @@ dspace=&gt; COMMIT;
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then, in OpenRefine, make a new column for corrections and use this GREL expression to convert to title case: <code>value.toTitlecase()</code>
@ -493,7 +493,7 @@ COPY 195
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
<pre tabindex="0"><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka &ldquo;lookaround&rdquo; in PCRE?) to match words that are <em>not</em> &ldquo;pair&rdquo;, &ldquo;displayed&rdquo;, etc., because we don&rsquo;t want to edit the XML tags themselves&hellip;
<ul>
@ -509,18 +509,18 @@ COPY 195
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as the countries in OpenRefine for the database values and in vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 88m21.678s
user 7m59.182s
@ -579,7 +579,7 @@ sys 2m22.713s
<li>I posted a message on Yammer to inform all our users about the changes to countries, regions, and AGROVOC subjects</li>
<li>I modified all AGROVOC subjects to be lower case in PostgreSQL and then exported a list of the top 1500 to update the controlled vocabulary in our submission form:</li>
</ul>
<pre><code>dspace=&gt; BEGIN;
<pre tabindex="0"><code>dspace=&gt; BEGIN;
dspace=&gt; UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=&gt; COMMIT;
@ -588,7 +588,7 @@ COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
<pre><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' &gt; dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
@ -596,7 +596,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</code></pre><ul>
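<li>(A rough Python sketch of the kind of lookup <code>agrovoc-lookup.py</code> performs for each term; the Skosmos REST URL and the output column layout shown here are assumptions, arranged so that the <code>csvgrep -c 4 -m 0 -i</code> above keeps only terms with at least one match)
<pre tabindex="0"><code>import csv
import requests

# assumed AGROVOC Skosmos endpoint; the URL used by the real script may differ
api = 'https://agrovoc.fao.org/browse/rest/v1/search'

with open('/tmp/subjects.txt') as f, open('/tmp/subjects.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['dc.subject', 'language', 'matched', 'number of matches'])
    for subject in (line.strip() for line in f):
        r = requests.get(api, params={'query': subject, 'lang': 'en'})
        results = r.json().get('results', [])
        writer.writerow([subject, 'en', bool(results), len(results)])
</code></pre>
</li>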
<li>Then I started a full re-indexing on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 88m21.678s
user 7m59.182s
@ -614,7 +614,7 @@ sys 2m22.713s
<li>They are using the user agent &ldquo;CCAFS Website Publications importer BOT&rdquo; so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
<pre><code>$ curl -f -H &quot;CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &quot;CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
@ -624,7 +624,7 @@ sys 2m22.713s
</ul>
</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
{
&quot;query&quot;: {
&quot;match&quot;: {
@ -635,7 +635,7 @@ sys 2m22.713s
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
@ -645,11 +645,11 @@ sys 2m22.713s
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don&rsquo;t see it in OpenRXV&rsquo;s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
@ -682,12 +682,12 @@ sys 2m22.713s
<ul>
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @./mapping.json
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
<pre><code>{&quot;index&quot;:{}}
<pre tabindex="0"><code>{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;CRP on Dryland Systems - DS&quot;, &quot;replace&quot;: &quot;Dryland Systems&quot; }
{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;FISH&quot;, &quot;replace&quot;: &quot;Fish&quot; }
@ -737,7 +737,7 @@ f<span style="color:#f92672">.</span>close()
<li>It filters out all upper- and lower-case strings, as well as any replacements that end in an acronym like &ldquo;- ILRI&rdquo;, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch index and then POSTed it:</li>
</ul>
<pre><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
<pre tabindex="0"><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
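<li>(A rough Python sketch of the filtering described above, assuming <code>mapping.json</code> is a list of find/replace objects; the actual rules in <code>convert-mapping.py</code> may differ)
<pre tabindex="0"><code>import json
import re

with open('mapping.json') as f:
    mappings = json.load(f)

for m in mappings:
    find, replace = m['find'], m['replace']
    # drop values that are entirely upper or lower case
    if find.isupper() or find.islower():
        continue
    # drop replacements that end in an acronym, for example '- ILRI'
    if re.search(r'- [A-Z]+$', replace):
        continue
    # emit the Elasticsearch bulk action/document pairs shown above
    print(json.dumps({'index': {}}))
    print(json.dumps({'find': find, 'replace': replace}))
</code></pre>
</li>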
@ -762,17 +762,17 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(192921) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
UPDATE 1
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
Purging 2474 hits from ShortLinkTranslate in statistics
Purging 2568 hits from RI\/1\.0 in statistics
@ -794,7 +794,7 @@ Total number of bot hits purged: 8174
</ul>
</li>
</ul>
<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
@ -817,7 +817,7 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
@ -833,7 +833,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>Bosede was getting this error on CGSpace yesterday:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
</code></pre><ul>
<li>Collection 1072 appears to be <a href="https://cgspace.cgiar.org/handle/10568/69542">IITA Miscellaneous</a>
<ul>
@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach: I adjusted my <code>convert-mapping.py</code> script to reconsider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file, hoping to address some MEL-to-CGSpace mappings
<ul>
@ -893,7 +893,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>I re-installed DSpace Test with a fresh snapshot of CGSpace to test the DSpace 6 upgrade (the last time was in 2020-05, and we&rsquo;ve fixed a lot of issues since then):</li>
</ul>
<pre><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
<pre tabindex="0"><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
</code></pre><ul>
<li>Then I started processing the Solr stats one core at a time, in batches of 1 million records:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
@ -920,7 +920,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
<li>I started processing the statistics-2019 core&hellip;
@ -958,7 +958,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>The statistics processing on the statistics-2018 core errored after 1.8 million records:</li>
</ul>
<pre><code>Exception: Java heap space
<pre tabindex="0"><code>Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
@ -967,7 +967,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -1002,7 +1002,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Peter asked me to add the new preferred AGROVOC subject &ldquo;covid-19&rdquo; to all items to which we had previously added &ldquo;coronavirus disease&rdquo;, and to make sure all items with ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; have the AGROVOC subject &ldquo;zoonoses&rdquo;
<ul>
@ -1010,7 +1010,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
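<li>(The edit itself was done in OpenRefine; this is a rough pandas sketch of the same logic, using only the en_US columns from the csvcut above and DSpace&rsquo;s '||' multi-value separator; the exact column handling is an assumption)
<pre tabindex="0"><code>import pandas as pd

df = pd.read_csv('/tmp/cgspace-subjects.csv', dtype=str)

def append_term(value, term):
    # DSpace metadata CSVs separate multiple values with '||'
    if pd.isna(value):
        return term
    values = value.split('||')
    return value if term in values else value + '||' + term

covid = df['dc.subject[en_US]'].str.contains('coronavirus disease', case=False, na=False)
df.loc[covid, 'dc.subject[en_US]'] = df.loc[covid, 'dc.subject[en_US]'].apply(append_term, term='covid-19')

zoonotic = df['cg.subject.ilri[en_US]'].str.contains('ZOONOTIC DISEASES', na=False)
df.loc[zoonotic, 'dc.subject[en_US]'] = df.loc[zoonotic, 'dc.subject[en_US]'].apply(append_term, term='zoonoses')

df[covid | zoonotic].to_csv('/tmp/cgspace-subjects-updated.csv', index=False)
</code></pre>
</li>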
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
@ -1040,7 +1040,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
<pre tabindex="0"><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
@ -1048,12 +1048,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
</ul>
<pre><code>$ docker-compose up --build -d angular_nginx
<pre tabindex="0"><code>$ docker-compose up --build -d angular_nginx
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
<ul>
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
</ul>
<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
<pre tabindex="0"><code>$ docker-compose up --build -d --force-recreate angular_nginx
</code></pre><ul>
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent casing like &ldquo;Burkina faso&rdquo; is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>; a quick illustration follows below)
<ul>
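<li>(For illustration only, in Python; the real formatter is TypeScript. A plain capitalize-style formatter produces exactly this casing:)
<pre tabindex="0"><code>&gt;&gt;&gt; 'BURKINA FASO'.capitalize()
'Burkina faso'
&gt;&gt;&gt; 'Burkina Faso'.capitalize()
'Burkina faso'
</code></pre>
</li>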
@ -1079,7 +1079,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</ul>
</li>
</ul>
<pre><code>$ cat 2020-10-28-update-regions.csv
<pre tabindex="0"><code>$ cat 2020-10-28-update-regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
@ -1092,7 +1092,7 @@ $ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m14.294s
user 7m59.840s
@ -1115,7 +1115,7 @@ sys 2m22.327s
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
@ -1134,7 +1134,7 @@ COPY 5598
</ul>
</li>
</ul>
<pre><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
<pre tabindex="0"><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
@ -1148,7 +1148,7 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | u
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><ul>
@ -1159,14 +1159,14 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
<pre><code>dspace=# BEGIN;
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 123
dspace=# COMMIT;
</code></pre><ul>
<li>Move some top-level communities to the CGIAR System community for Peter:</li>
</ul>
<pre><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
<pre tabindex="0"><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
$ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><h2 id="2020-10-30">2020-10-30</h2>
<ul>
@ -1187,7 +1187,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</ul>
</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -1198,7 +1198,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><ul>
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
@ -1214,12 +1214,12 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
</li>
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><!-- raw HTML omitted -->