<meta name="generator" content="Hugo 0.93.1" />
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
Detail: Key (short_description)=(EPUB) already exists.
2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> The requested resource [/solr/update] is not available</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /><h3>Apache Tomcat/7.0.104</h3></body></html>
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
<pre tabindex="0"><code>{ "metadata": [
    {
      "key": "dc.title",
      "value": "Testing REST API post",
      "language": "en_US"
    },
    {
      "key": "dc.contributor.author",
      "value": "Orth, Alan",
      "language": "en_US"
    },
    {
      "key": "dc.date.issued",
      "value": "2020-09-01",
      "language": "en_US"
    }
  ],
  "archived":"false",
  "withdrawn":"false"
}
</code></pre><ul>
<li>What is unclear to me is the <code>archived</code> parameter; it seems to do nothing… perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
</ul>
</li>
</ul>
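The item payload shown above can also be built programmatically before POSTing it to the <code>/rest/collections/{uuid}/items</code> endpoint. A minimal sketch (the helper name is hypothetical, not from the post; note that the REST API example uses strings for <code>archived</code>/<code>withdrawn</code>):

```python
import json

# Sketch: build the same item payload programmatically. The helper name is
# illustrative; "archived"/"withdrawn" stay strings, as in the example above.
def make_item(metadata: dict) -> dict:
    return {
        "metadata": [
            {"key": key, "value": value, "language": "en_US"}
            for key, value in metadata.items()
        ],
        "archived": "false",
        "withdrawn": "false",
    }

item = make_item({
    "dc.title": "Testing REST API post",
    "dc.contributor.author": "Orth, Alan",
    "dc.date.issued": "2020-09-01",
})
print(json.dumps(item, indent=2))
```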
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 < item-object.json
</code></pre><ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
</code></pre><ul>
<li>After a few minutes I saw these four hits in Solr… WTF
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre tabindex="0"><code>:'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka “lookaround” in PCRE?) to match words that are <em>not</em> “pair”, “displayed”, etc., because we don’t want to edit the XML tags themselves…
<ul>
</ul>
</li>
</ul>
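The same negative-lookahead trick can be sketched in Python's <code>re</code> module: title-case every word except a protected list (the protected words mirror the vim pattern; the sample text is made up):

```python
import re

# Sketch of the vim substitution's logic: a negative lookahead (?!...)
# rejects protected words before the capture groups run. Sample text is
# illustrative only.
PROTECTED = r"pair|displayed|stored|value|AND"
PATTERN = re.compile(rf"\b(?!(?:{PROTECTED})\b)([A-Za-z])(\w*)")

def title_case_except(text: str) -> str:
    # Equivalent of \u\2\L\3 in the vim replacement: first letter upper,
    # the rest lower
    return PATTERN.sub(lambda m: m.group(1).upper() + m.group(2).lower(), text)

print(title_case_except("displayed EAST AFRICA value"))  # → displayed East Africa value
```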
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as the countries in OpenRefine for the database values and in vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
dspace=> UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=> COMMIT;
dspace=> \COPY (SELECT DISTINCT text_value as "dc.subject", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY "dc.subject" ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>
<li>They are using the user agent “CCAFS Website Publications importer BOT” so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match": {
            "_id": "64j_THMBiwiQ-PKfCSlI"
        }
    }
}
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
    "find": "ALAN1",
    "replace": "ALAN2",
}
'
</code></pre><ul>
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don’t see it in OpenRXV’s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
{
    "find": "ALAN1",
    "replace": "ALAN2",
}
'
</code></pre><ul>
<li>But I still don’t see it in AReS</li>
<li>Interesting! I added a find/replace manually in AReS and now I see the one I POSTed…</li>
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
<pre tabindex="0"><code>{"index":{}}
{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
{"index":{}}
{ "find": "FISH", "replace": "Fish" }
</code></pre><ul>
<li>Adjust the report templates on AReS based on some of Peter’s feedback</li>
<li>I wrote a quick Python script to filter and convert the old AReS mappings to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Elasticsearch’s Bulk API</a> format:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e">#!/usr/bin/env python3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">'/tmp/mapping.json'</span>, <span style="color:#e6db74">'r'</span>)
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Iterate over old mapping file, which is in format "find": "replace", ie:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># "alan": "ALAN"</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch's Bulk API:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># { "find": "alan", "replace": "ALAN" }</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Skip replacements with acronyms like:</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span>    acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">"[A-Z]+$"</span>)
</span></span><span style="display:flex;"><span>    acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">"find"</span>: find, <span style="color:#e6db74">"replace"</span>: replace }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Print command for Elasticsearch</span>
</span></span><span style="display:flex;"><span>    print(<span style="color:#e6db74">'{"index":</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}'</span>)
</span></span><span style="display:flex;"><span>    print(json<span style="color:#f92672">.</span>dumps(mapping))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f<span style="color:#f92672">.</span>close()
</span></span></code></pre></div><ul>
<li>It filters all upper and lower case strings as well as any replacements that end in an acronym like “- ILRI”, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch index and then POSTed it:</li>
</ul>
<pre tabindex="0"><code>$ ./convert-mapping.py > /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
<li>Then in AReS I didn’t see the mappings in the dashboard until I added a new one manually, after which they all appeared
<ul>
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
UPDATE 1
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
</ul>
</li>
</ul>
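For checking hits like these in Solr, a hypothetical sketch of the query one could run against the statistics core (the core URL follows the <code>localhost:8083</code> examples in these notes; the <code>userAgent</code> field name is an assumption):

```python
import urllib.parse

# Hypothetical sketch: build a Solr select URL counting hits from a given
# user agent. Host/port and the userAgent field are assumptions based on
# the surrounding notes, not a documented recipe.
def solr_query_url(user_agent: str) -> str:
    base = "http://localhost:8083/solr/statistics/select"
    params = {"q": f'userAgent:"{user_agent}"', "rows": "0"}
    return base + "?" + urllib.parse.urlencode(params)

print(solr_query_url("RTB website BOT"))
```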
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>Then I went through all center subjects looking for “WOMEN” or “GENDER” and checking if they were missing the associated AGROVOC subject
<ul>
</ul>
</li>
</ul>
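A rough sketch of that check (the helper and sample row are hypothetical; column names follow the <code>csvcut</code> call above):

```python
# Rough sketch of the manual check described above: flag rows where a center
# subject column mentions GENDER or WOMEN but the AGROVOC dc.subject column
# does not. Helper name and sample row are illustrative only.
TERMS = ("GENDER", "WOMEN")

def missing_agrovoc(row: dict, center_cols: list, agrovoc_col: str = "dc.subject[en_US]") -> bool:
    center = " ".join(row.get(col, "") or "" for col in center_cols).upper()
    agrovoc = (row.get(agrovoc_col, "") or "").upper()
    return any(t in center for t in TERMS) and not any(t in agrovoc for t in TERMS)

row = {"id": "1", "cg.subject.ilri[en_US]": "GENDER", "dc.subject[en_US]": "LIVESTOCK"}
print(missing_agrovoc(row, ["cg.subject.ilri[en_US]"]))  # → True
```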
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
<ul>
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
$ exit
$ sudo systemctl stop tomcat7
$ cd dspace/target/dspace-installer
@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
|
||||
</code></pre><ul>
|
||||
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
|
||||
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
<li>I started processing the statistics-2019 core…
</ul>
</li>
</ul>
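As an aside, the two regular-expression atoms used in these purge queries are easy to sanity-check outside of Solr: `.{36}` matches the 36-character UUID ids of migrated records, while `.+-unmigrated` matches ids the upgrade tool has explicitly marked. A sketch (Lucene/Solr regexp queries match the whole term, hence the anchors; the sample ids are made up):

```python
import re
import uuid

# Lucene/Solr regexp queries match the entire term, so anchor both patterns
uuid_like = re.compile(r'^.{36}$')
unmigrated = re.compile(r'^.+-unmigrated$')

ids = [
    str(uuid.uuid4()),    # migrated record: 36-character UUID string
    '12345-unmigrated',   # record explicitly marked unmigrated by the upgrade tool
    '10',                 # bare legacy integer id, matches neither pattern
]
labels = [(i, bool(uuid_like.match(i)), bool(unmigrated.match(i))) for i in ids]
for doc_id, is_uuid, is_unmigrated in labels:
    print(doc_id, is_uuid, is_unmigrated)
```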
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">"http://localhost:8083/solr/statistics-2018/update?softCommit=true"</span> -H <span style="color:#e6db74">"Content-Type: text/xml"</span> --data-binary <span style="color:#e6db74">"<delete><query>id:/.+-unmigrated/</query></delete>"</span>
</span></span></code></pre></div><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
<li>I increased the memory to 4096m and tried again</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Then I started processing the statistics-2017 core…
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
</li>
</ul>
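The duplicate values behind that ticket are straightforward to spot in exported Solr documents: the same value repeated inside one document's multi-valued field. A hedged sketch of the check (the document shapes and ids here are invented for illustration):

```python
from collections import Counter

# Invented Solr documents illustrating duplicate owningComm values
docs = [
    {'id': 'a', 'owningComm': ['10568/1', '10568/1', '10568/2']},
    {'id': 'b', 'owningComm': ['10568/3']},
]

def duplicated_values(doc, field):
    """Return the values that occur more than once in a multi-valued field."""
    counts = Counter(doc.get(field, []))
    return sorted(v for v, n in counts.items() if n > 1)

dupes = {d['id']: duplicated_values(d, 'owningComm')
         for d in docs if duplicated_values(d, 'owningComm')}
print(dupes)  # → {'a': ['10568/1']}
```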
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
<pre tabindex="0"><code>$ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
</code></pre><ul>
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
</span></span><span style="display:flex;"><span>user 7m59.840s
</span></span><span style="display:flex;"><span>sys 2m22.327s
</span></span></code></pre></div><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>…
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre tabindex="0"><code>dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
dspace=> \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
COPY 71748
dspace=> \COPY (SELECT DISTINCT text_value as "dc.publisher", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
COPY 3882
dspace=> \COPY (SELECT DISTINCT text_value as "dc.source", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
COPY 3684
dspace=> \COPY (SELECT DISTINCT text_value as "dc.relation.ispartofseries", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
COPY 5598
</code></pre><ul>
<li>I noticed there are still some mappings for acronyms and other fixes that haven’t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ grep -c '"find"' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
1578
$ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | uniq | wc -l
1578
</code></pre><ul>
<li>I have no idea why they wouldn’t have been caught yesterday when I originally ran the script on a clean AReS with no mappings…
</ul>
</li>
</ul>
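For context, the mappings files fed to the bulk endpoint alternate an action line with a document line, which is why the counts above grep for `'"find"'` and filter out `{"index":{}}` lines. A minimal sketch of building such a payload (the find/replace shape of an OpenRXV value mapping is assumed here):

```python
import json

# Build an Elasticsearch bulk payload: one {"index":{}} action line per
# document line. The find/replace fields mirror the OpenRXV value mappings
# (shape assumed for illustration).
mappings = [
    {'find': 'ILRI', 'replace': 'International Livestock Research Institute'},
    {'find': 'IWMI', 'replace': 'International Water Management Institute'},
]
lines = []
for m in mappings:
    lines.append(json.dumps({'index': {}}, separators=(',', ':')))
    lines.append(json.dumps(m, separators=(',', ':')))
payload = '\n'.join(lines) + '\n'  # the bulk API requires a trailing newline

# Same count the grep -v pipeline computes: document lines only
doc_count = sum(1 for l in lines if l != '{"index":{}}')
print(doc_count)  # → 2
```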
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
</span></span><span style="display:flex;"><span>$ curl -XDELETE http://localhost:9200/openrxv-values
</span></span><span style="display:flex;"><span>$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">"Content-Type: application/json"</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</span></span></code></pre></div><ul>
<li>The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
<ul>
<li>There are still a few acronyms present, some of which are in the value mappings and some which aren’t</li>
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 123
dspace=# COMMIT;
</code></pre><ul>
</code></pre><ul>
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
</code></pre><ul>
<li>I will wait to apply them on CGSpace when I have all the other corrections from Peter processed</li>
</ul>
</li>
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
</ul>