mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-09-13
@@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
-<meta name="generator" content="Hugo 0.87.0" />
+<meta name="generator" content="Hugo 0.88.1" />
@@ -144,7 +144,7 @@ During the FlywayDB migration I got an error:
</ul>
</li>
</ul>
-<pre><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
+<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
@@ -212,7 +212,7 @@ org.hibernate.exception.ConstraintViolationException: could not execute batch
</ul>
</li>
</ul>
-<pre><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
+<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
Loading @mire database changes for module MQM
Changes have been processed
-----------------------------------------------------------
@@ -259,7 +259,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</code></pre><ul>
<li>Also, I tested Listings and Reports and there are still no hits for “Orth, Alan” as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
</ul>
-<pre><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&fq=dateIssued.year:[2013+TO+2021]&rows=500&wt=javabin&version=2} hits=18 status=0 QTime=10
+<pre tabindex="0"><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&fq=dateIssued.year:[2013+TO+2021]&rows=500&wt=javabin&version=2} hits=18 status=0 QTime=10
</code></pre><ul>
<li>Solr returns <code>hits=18</code> for the L&R query, but there are no results shown in the L&R UI</li>
<li>I sent all this feedback to Atmire…</li>
@@ -278,16 +278,16 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</ul>
</li>
</ul>
-<pre><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
+<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
</ul>
-<pre><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE < item-object.json
+<pre tabindex="0"><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE < item-object.json
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
-<pre><code>{ "metadata": [
+<pre tabindex="0"><code>{ "metadata": [
{
"key": "dc.title",
"value": "Testing REST API post",
@@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</ul>
</li>
</ul>
-<pre><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
+<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 < item-object.json
</code></pre><ul>
@@ -408,7 +408,7 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
</ul>
</li>
</ul>
-<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
@@ -438,7 +438,7 @@ $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-438
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
<li>I added a few of the patterns from above to our local agents list and ran the <code>check-spider-hits.sh</code> on CGSpace:</li>
</ul>
-<pre><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
+<pre tabindex="0"><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
@@ -472,7 +472,7 @@ Total number of bot hits purged: 3684
</li>
<li>I can update the country metadata in PostgreSQL like this:</li>
</ul>
-<pre><code>dspace=> BEGIN;
+<pre tabindex="0"><code>dspace=> BEGIN;
dspace=> UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=> COMMIT;
@@ -483,7 +483,7 @@ dspace=> COMMIT;
</ul>
</li>
</ul>
-<pre><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
+<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
@@ -493,7 +493,7 @@ COPY 195
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
-<pre><code>:'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
+<pre tabindex="0"><code>:'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka “lookaround” in PCRE?) to match words that are <em>not</em> “pair”, “displayed”, etc because we don’t want to edit the XML tags themselves…
<ul>
@@ -509,18 +509,18 @@ COPY 195
</ul>
</li>
</ul>
-<pre><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
+<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as the countries in OpenRefine for the database values and in vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
-<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
+<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
-<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 88m21.678s
user 7m59.182s
@@ -579,7 +579,7 @@ sys 2m22.713s
<li>I posted a message on Yammer to inform all our users about the changes to countries, regions, and AGROVOC subjects</li>
<li>I modified all AGROVOC subjects to be lower case in PostgreSQL and then exported a list of the top 1500 to update the controlled vocabulary in our submission form:</li>
</ul>
-<pre><code>dspace=> BEGIN;
+<pre tabindex="0"><code>dspace=> BEGIN;
dspace=> UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=> COMMIT;
@@ -588,7 +588,7 @@ COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
-<pre><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
+<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
@@ -596,7 +596,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</code></pre><ul>
<li>Then I started a full re-indexing on CGSpace:</li>
</ul>
-<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 88m21.678s
user 7m59.182s
@@ -614,7 +614,7 @@ sys 2m22.713s
<li>They are using the user agent “CCAFS Website Publications importer BOT” so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
-<pre><code>$ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
+<pre tabindex="0"><code>$ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
@@ -624,7 +624,7 @@ sys 2m22.713s
</ul>
</li>
</ul>
-<pre><code>$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
+<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
@@ -635,7 +635,7 @@ sys 2m22.713s
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
-<pre><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
"find": "ALAN1",
"replace": "ALAN2",
@@ -645,11 +645,11 @@ sys 2m22.713s
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don’t see it in OpenRXV’s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
</ul>
-<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
+<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
-<pre><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
"find": "ALAN1",
"replace": "ALAN2",
@@ -682,12 +682,12 @@ sys 2m22.713s
<ul>
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
-<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
+<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
-<pre><code>{"index":{}}
+<pre tabindex="0"><code>{"index":{}}
{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
{"index":{}}
{ "find": "FISH", "replace": "Fish" }
@@ -737,7 +737,7 @@ f<span style="color:#f92672">.</span>close()
<li>It filters all upper and lower case strings as well as any replacements that end in an acronym like “- ILRI”, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch core and then POSTed it:</li>
</ul>
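The filtering heuristic described above could be sketched roughly as follows. This is only an illustration, not the actual `convert-mapping.py`: the all-upper/all-lower checks and the trailing-acronym regex are my assumptions about its logic, and the sample mappings are made up.

```python
import json
import re

# Assumed filter: drop replacements that are entirely upper or lower case,
# or that end in an acronym suffix like "- ILRI".
ACRONYM_SUFFIX = re.compile(r"- [A-Z]{2,}$")

def keep_mapping(find: str, replace: str) -> bool:
    if replace.isupper() or replace.islower():
        return False
    if ACRONYM_SUFFIX.search(replace):
        return False
    return True

def to_bulk_ndjson(mappings: list) -> str:
    # Elasticsearch _bulk format: an {"index":{}} action line before each document
    lines = []
    for m in mappings:
        if keep_mapping(m["find"], m["replace"]):
            lines.append('{"index":{}}')
            lines.append(json.dumps({"find": m["find"], "replace": m["replace"]}))
    return "\n".join(lines) + "\n"

mappings = [
    {"find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems"},
    {"find": "FISH", "replace": "Fish"},                     # kept: mixed case
    {"find": "WHEAT", "replace": "WHEAT"},                   # dropped: all upper case
    {"find": "Some Program", "replace": "Some Program - ILRI"},  # dropped: acronym suffix
]
print(to_bulk_ndjson(mappings), end="")
```

The output has the same one-action-line-per-document shape as the `mapping.json` shown earlier, so it can be POSTed to `/_doc/_bulk` the same way.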
-<pre><code>$ ./convert-mapping.py > /tmp/elastic-mappings.txt
+<pre tabindex="0"><code>$ ./convert-mapping.py > /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
@@ -762,17 +762,17 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
-<pre><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
+<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
-<pre><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
+<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
UPDATE 1
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
</ul>
-<pre><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
+<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p

Purging 2474 hits from ShortLinkTranslate in statistics
Purging 2568 hits from RI\/1\.0 in statistics
@@ -794,7 +794,7 @@ Total number of bot hits purged: 8174
</ul>
</li>
</ul>
-<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
@@ -817,7 +817,7 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</ul>
</li>
</ul>
-<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
+<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
</code></pre><ul>
@@ -833,7 +833,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>Bosede was getting this error on CGSpace yesterday:</li>
</ul>
-<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
+<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
</code></pre><ul>
<li>Collection 1072 appears to be <a href="https://cgspace.cgiar.org/handle/10568/69542">IITA Miscellaneous</a>
<ul>
@@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
-<pre><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
+<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
<ul>
@@ -893,7 +893,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>I re-installed DSpace Test with a fresh snapshot of CGSpace’s to test the DSpace 6 upgrade (the last time was in 2020-05, and we’ve fixed a lot of issues since then):</li>
</ul>
-<pre><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
+<pre tabindex="0"><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
@@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
</code></pre><ul>
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
</ul>
-<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
+<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
@@ -920,7 +920,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
-<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
+<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
-<pre><code>$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
+<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
<li>I started processing the statistics-2019 core…
@@ -958,7 +958,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>The statistics processing on the statistics-2018 core errored after 1.8 million records:</li>
</ul>
-<pre><code>Exception: Java heap space
+<pre tabindex="0"><code>Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
@@ -967,7 +967,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
-<pre><code class="language-console" data-lang="console">$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
+<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
</code></pre><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
@@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
-<pre><code>$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
+<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Then I started processing the statistics-2017 core…
<ul>
@@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
-<pre><code>$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
+<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@@ -1002,7 +1002,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
-<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
+<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Peter asked me to add the new preferred AGROVOC subject “covid-19” to all items we had previously added “coronavirus disease”, and to make sure all items with ILRI subject “ZOONOTIC DISEASES” have the AGROVOC subject “zoonoses”
<ul>
@@ -1010,7 +1010,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
-<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
+<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
@@ -1040,7 +1040,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
-<pre><code>$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
+<pre tabindex="0"><code>$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
@@ -1048,12 +1048,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
</ul>
-<pre><code>$ docker-compose up --build -d angular_nginx
+<pre tabindex="0"><code>$ docker-compose up --build -d angular_nginx
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
<ul>
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
</ul>
-<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
+<pre tabindex="0"><code>$ docker-compose up --build -d --force-recreate angular_nginx
</code></pre><ul>
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like “Burkina faso” is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
<ul>
@@ -1079,7 +1079,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
</ul>
</li>
</ul>
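The “Burkina faso” symptom mentioned above is characteristic of naive whole-string capitalization, which lowercases everything after the first character. A quick illustration in Python (an analogy only; the actual formatter in `fetch.consumer.ts` is TypeScript and may behave differently):

```python
# Whole-string capitalization lowercases every character after the first,
# turning "Burkina Faso" into "Burkina faso" -- the symptom seen in AReS.
def naive_country_format(name: str) -> str:
    return name.capitalize()

# Capitalizing per word keeps multi-word country names intact:
def per_word_country_format(name: str) -> str:
    return " ".join(w.capitalize() for w in name.split())

print(naive_country_format("Burkina Faso"))     # Burkina faso
print(per_word_country_format("burkina faso"))  # Burkina Faso
```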
-<pre><code>$ cat 2020-10-28-update-regions.csv
+<pre tabindex="0"><code>$ cat 2020-10-28-update-regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
@@ -1092,7 +1092,7 @@ $ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
-<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 92m14.294s
user 7m59.840s
@@ -1115,7 +1115,7 @@ sys 2m22.327s
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
-<pre><code>dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
+<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
@@ -1134,7 +1134,7 @@ COPY 5598
</ul>
</li>
</ul>
-<pre><code>$ grep -c '"find"' /tmp/elasticsearch-mappings*
+<pre tabindex="0"><code>$ grep -c '"find"' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
@@ -1148,7 +1148,7 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | u
</ul>
</li>
</ul>
-<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
+<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><ul>
@@ -1159,14 +1159,14 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
</li>
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
-<pre><code>dspace=# BEGIN;
+<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 123
dspace=# COMMIT;
</code></pre><ul>
<li>Move some top-level communities to the CGIAR System community for Peter:</li>
</ul>
-<pre><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
+<pre tabindex="0"><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
$ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><h2 id="2020-10-30">2020-10-30</h2>
<ul>
@@ -1187,7 +1187,7 @@
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>or(
|
||||
<pre tabindex="0"><code>or(
|
||||
isNotNull(value.match(/.*\uFFFD.*/)),
|
||||
isNotNull(value.match(/.*\u00A0.*/)),
|
||||
isNotNull(value.match(/.*\u200A.*/)),
|
||||
@ -1198,7 +1198,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
|
||||
</code></pre><ul>
|
||||
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
|
||||
</ul>
|
||||
<pre><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
|
||||
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
|
||||
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
|
||||
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
|
||||
@ -1214,12 +1214,12 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
|
||||
</li>
|
||||
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
|
||||
</ul>
|
||||
<pre><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
|
||||
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
|
||||
</code></pre><ul>
|
||||
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
|
||||
</ul>
|
||||
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||||
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||||
</code></pre><!-- raw HTML omitted -->