Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2025-01-27 05:49:12 +01:00)

Commit: Add notes for 2021-09-13
@@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
 Then I reduced the JVM heap size from 6144m back to 5120m
 Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
 "/>
-<meta name="generator" content="Hugo 0.87.0" />
+<meta name="generator" content="Hugo 0.88.1" />
@@ -175,7 +175,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 <li>There are lots of errors on language, CRP, and even some encoding errors on abstract fields</li>
 <li>I export them and include the hidden metadata fields like <code>dc.date.accessioned</code> so I can filter the ones from 2018-04 and correct them in OpenRefine:</li>
 </ul>
-<pre><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
+<pre tabindex="0"><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
 </code></pre><ul>
 <li>Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script and merge them into our controlled vocabulary</li>
 <li>On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)</li>
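As a rough sketch of the filtering step described above (not the author's actual workflow, and with invented sample data): once the export includes `dc.date.accessioned`, selecting the 2018-04 records is a simple prefix match, whether done in OpenRefine or in a few lines of Python.

```python
import csv
import io

# Invented sample data in DSpace metadata-export CSV shape; the real export
# from /tmp/iita.csv would have many more columns and rows.
sample = """id,dc.date.accessioned,dc.title
1,2018-04-05T10:00:00Z,First item
2,2017-11-20T09:30:00Z,Second item
3,2018-04-30T23:59:59Z,Third item
"""

def rows_from_april_2018(csv_text):
    """Return rows whose accession date falls in 2018-04."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["dc.date.accessioned"].startswith("2018-04")]

matches = rows_from_april_2018(sample)
print([row["id"] for row in matches])  # ['1', '3']
```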
@@ -185,7 +185,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 <li>Fixing the IITA records from Sisay, sixty DOIs have completely invalid format like <code>http:dx.doi.org10.1016j.cropro.2008.07.003</code></li>
 <li>I corrected all the DOIs and then checked them for validity with a quick bash loop:</li>
 </ul>
-<pre><code>$ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
+<pre tabindex="0"><code>$ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
 </code></pre><ul>
 <li>Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher’s site…</li>
 <li>Also, there are some duplicates:
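Before hitting the network with the httpie loop above, an offline shape check can catch malformed values like the one shown. This is a hedged sketch (my own regex, not from the notes): it only verifies the doi.org URL shape, not that the DOI resolves.

```python
import re

# Accept http(s)://doi.org/... or http(s)://dx.doi.org/... followed by a
# 10.NNNN/suffix DOI; everything else (e.g. missing slashes) is rejected.
DOI_URL = re.compile(r"^https?://(dx\.)?doi\.org/10\.\d{4,9}/\S+$")

def looks_like_doi_link(url):
    return bool(DOI_URL.match(url))

print(looks_like_doi_link("https://dx.doi.org/10.1016/j.cropro.2008.07.003"))  # True
print(looks_like_doi_link("http:dx.doi.org10.1016j.cropro.2008.07.003"))       # False
```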
@@ -205,7 +205,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 <li>A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: <code>’</code> (0x2019), <code>·</code> (0x00b7), and <code>€</code> (0x20ac)</li>
 <li>A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:</li>
 </ul>
-<pre><code>or(
+<pre tabindex="0"><code>or(
 isNotNull(value.match(/.*[(|)].*/)),
 isNotNull(value.match(/.*\uFFFD.*/)),
 isNotNull(value.match(/.*\u00A0.*/)),
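The same check can be approximated outside OpenRefine. This is a rough Python equivalent of the visible part of the GREL facet (an approximation, not the exact expression): flag values containing parentheses, pipes, U+FFFD replacement characters, or non-breaking spaces.

```python
import re

# One character class instead of GREL's or(...) of per-pattern matches;
# U+FFFD and U+00A0 are the replacement character and non-breaking space.
SUSPECT = re.compile(r"[(|)\uFFFD\u00A0]")

def has_encoding_problem(value):
    return SUSPECT.search(value) is not None

print(has_encoding_problem("Gonz\uFFFDlez, A."))  # True
print(has_encoding_problem("Smith, J."))          # False
```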
@@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 <li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
 <li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
 </ul>
-<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
+<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
 $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
 # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
 $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
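The cat/grep/sort/uniq pipeline above can be sketched in Python using the same ORCID pattern (toy input; the real inputs are the controlled vocabulary XML and Abenet's list):

```python
import re

# Same pattern the grep -oE command uses.
ORCID = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")

def unique_orcids(text):
    """Extract ORCID identifiers from mixed text and de-duplicate, sorted."""
    return sorted(set(ORCID.findall(text)))

# Invented blob mixing XML attributes and a plain list, with one duplicate.
blob = """<node id="0000-0002-1825-0097">Josiah Carberry</node>
0000-0002-1825-0097
0000-0001-5109-3700
"""
print(unique_orcids(blob))  # ['0000-0001-5109-3700', '0000-0002-1825-0097']
```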
@@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 <li>I could use it with <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a> or to populate a Solr instance for reconciliation</li>
 <li>This XPath expression gets close, but outputs all items on one line:</li>
 </ul>
-<pre><code>$ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml
+<pre tabindex="0"><code>$ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml
 Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
 </code></pre><ul>
 <li>Maybe <code>xmlstarlet</code> is better:</li>
 </ul>
-<pre><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
+<pre tabindex="0"><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
 Agriculture for Nutrition and Health
 Big Data
 Climate Change, Agriculture and Food Security
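The same one-value-per-line extraction works with Python's ElementTree, which supports the attribute predicate used above (shown here against a toy input-forms.xml fragment, not the real file):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for dspace/config/input-forms.xml.
xml = """<form-value-pairs>
  <value-pairs value-pairs-name="crpsubject">
    <pair><stored-value>Agriculture for Nutrition and Health</stored-value></pair>
    <pair><stored-value>Big Data</stored-value></pair>
  </value-pairs>
</form-value-pairs>"""

root = ET.fromstring(xml)
# Same path as the xmlstarlet expression, limited to ElementTree's XPath subset.
values = [
    sv.text
    for sv in root.findall(".//value-pairs[@value-pairs-name='crpsubject']/pair/stored-value")
]
print("\n".join(values))
```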
@@ -275,7 +275,7 @@ Livestock and Fish
 <li>I told them to get all <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_35697">CIAT records via OAI</a></li>
 <li>Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:</li>
 </ul>
-<pre><code>$ lein run /tmp/crps.csv name id
+<pre tabindex="0"><code>$ lein run /tmp/crps.csv name id
 </code></pre><ul>
 <li>I tried to reconcile against a CSV of our countries but reconcile-csv crashes</li>
 </ul>
@@ -310,7 +310,7 @@ Livestock and Fish
 <li>Also, I learned how to do something cool with Jython expressions in OpenRefine</li>
 <li>This will fetch a URL and return its HTTP response code:</li>
 </ul>
-<pre><code>import urllib2
+<pre tabindex="0"><code>import urllib2
 import re
 
 pattern = re.compile('.*10.1016.*')
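A modern-Python sketch of that Jython expression (the original runs inside OpenRefine with urllib2; this version uses urllib.request and escapes the dot in the pattern, which the original left unescaped). Only DOI-looking values are fetched; everything else returns "blank", echoing the snippet's structure.

```python
import re
import urllib.error
import urllib.request

pattern = re.compile(r".*10\.1016.*")

def check_cell(value):
    """Return the HTTP status code for 10.1016 DOI links, else 'blank'."""
    if not pattern.match(value):
        return "blank"
    try:
        return urllib.request.urlopen(value, timeout=10).getcode()
    except urllib.error.URLError:
        return "error"

# Only the no-network path is demonstrated here.
print(check_cell("not a DOI"))  # blank
```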
@@ -329,24 +329,24 @@ return "blank"
 <li>I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…</li>
 <li>I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in <code>dmesg -T</code>:</li>
 </ul>
-<pre><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
+<pre tabindex="0"><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
 [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
 [Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
 </code></pre><ul>
 <li>So the Linux kernel killed Java…</li>
 <li>Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:</li>
 </ul>
-<pre><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
+<pre tabindex="0"><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
 </code></pre><ul>
 <li>Looking in the DSpace log I see something related:</li>
 </ul>
-<pre><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
+<pre tabindex="0"><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
 </code></pre><ul>
 <li>So I’m not sure…</li>
 <li>I finally figured out how to get OpenRefine to reconcile values from Solr via <a href="https://github.com/codeforkjeff/conciliator">conciliator</a>:</li>
 <li>The trick was to use a more appropriate Solr fieldType <code>text_en</code> instead of <code>text_general</code> so that more terms match, for example uppercase and lowercase:</li>
 </ul>
-<pre><code>$ ./bin/solr start
+<pre tabindex="0"><code>$ ./bin/solr start
 $ ./bin/solr create_core -c countries
 $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
 $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
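As a toy illustration of why the analysis chain matters for reconciliation (this is a stand-in for Solr's behavior, not its actual text_en implementation): when both the indexed value and the query pass through the same normalizing step, differently-cased variants resolve to the same entry.

```python
# Stand-in "analyzer": lowercase and tokenize. Illustrative only; Solr's
# text_en fieldType does considerably more (stemming, stop words, etc.).
def analyze(value):
    return tuple(value.lower().split())

# "Index" a canonical country label, then "query" with different casings.
index = {analyze("Kenya"): "Kenya"}

print(index.get(analyze("KENYA")))  # Kenya
print(index.get(analyze("kenya")))  # Kenya
```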
@@ -357,7 +357,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
 <ul>
 <li>I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):</li>
 </ul>
-<pre><code><defaultSearchField>search_text</defaultSearchField>
+<pre tabindex="0"><code><defaultSearchField>search_text</defaultSearchField>
 ...
 <copyField source="*" dest="search_text"/>
 </code></pre><ul>
@@ -381,7 +381,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
 <li>I created and merged a pull request to fix the sorting issue in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/374">#374</a>)</li>
 <li>Regarding the IP address anonymization for GDPR, I amended the Google Analytics snippet in <code>page-structure-alterations.xsl</code> to:</li>
 </ul>
-<pre><code>ga('send', 'pageview', {
+<pre tabindex="0"><code>ga('send', 'pageview', {
 'anonymizeIp': true
 });
 </code></pre><ul>
@@ -439,7 +439,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
 <ul>
 <li>I’m investigating how many non-CGIAR users we have registered on CGSpace:</li>
 </ul>
-<pre><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
+<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
 </code></pre><ul>
 <li>We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers</li>
 <li>I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with “allow” or “dismiss”</li>
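The LIKE filters in that eperson query amount to a simple substring test, restated here in Python with invented sample addresses: keep entries that contain "@" but not "cgiar.org".

```python
# Invented sample data; the real values come from the eperson table.
emails = ["a.orth@cgiar.org", "someone@example.com", "invalid-entry", "x@ilri.org"]

# Mirrors: email not like '%cgiar.org%' and email like '%@%'
non_cgiar = [e for e in emails if "@" in e and "cgiar.org" not in e]
print(non_cgiar)  # ['someone@example.com', 'x@ilri.org']
```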
@@ -460,7 +460,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
 <li>DSpace Test crashed last night, seems to be related to system memory (not JVM heap)</li>
 <li>I see this in <code>dmesg</code>:</li>
 </ul>
-<pre><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
+<pre tabindex="0"><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
 [Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
 [Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
 </code></pre><ul>
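A small helper (my own sketch, not from the notes) to pull the memory figures out of a dmesg OOM-killer line like the one above, which makes comparing the two crashes easier:

```python
import re

# The "Killed process" line from the dmesg output above.
LINE = ("[Wed May 30 00:00:39 2018] Killed process 6082 (java) "
        "total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB")

def parse_oom(line):
    """Extract the name:NNNNkB pairs from an OOM-killer dmesg line."""
    return {key: int(kb) for key, kb in re.findall(r"(\S+?):(\d+)kB", line)}

stats = parse_oom(LINE)
print(stats["total-vm"], stats["anon-rss"])  # 14876264 5683372
```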
@@ -471,7 +471,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
 <li>I generated a list of CIFOR duplicates from the <code>CIFOR_May_9</code> collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika</li>
 <li>I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each “Item1” line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):</li>
 </ul>
-<pre><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
+<pre tabindex="0"><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
 $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
 </code></pre><ul>
 <li>I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection</li>
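The sed step above can be sketched in Python with a multiline regex substitution: insert a blank line before each "Item1" row so each group of duplicates is visually separated (toy data here, not the real grep output).

```python
import re

# Toy stand-in for ~/cifor-duplicates.txt: two duplicate groups.
report = "handle/1 Item1\nhandle/2 Item2\nhandle/3 Item1\nhandle/4 Item2\n"

# Equivalent of: sed 's/.*Item1.*/\n&/g' — prefix each Item1 line with a newline.
grouped = re.sub(r"(?m)^(.*Item1.*)$", r"\n\1", report)
print(grouped)
```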
@@ -482,18 +482,18 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
 <li>Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
 <li>The output isn’t great, but all the handles and IDs are printed in debug mode:</li>
 </ul>
-<pre><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
+<pre tabindex="0"><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
 </code></pre><ul>
 <li>Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):</li>
 </ul>
-<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
+<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
 </code></pre><h2 id="2018-05-31">2018-05-31</h2>
 <ul>
 <li>Clarify CGSpace’s usage of Google Analytics and personally identifiable information during user registration for the Bioversity team, who had been asking about GDPR compliance</li>
 <li>Testing running PostgreSQL in a Docker container on localhost because when I’m on Arch Linux there isn’t an easily installable package for particular PostgreSQL versions</li>
 <li>Now I can just use Docker:</li>
 </ul>
-<pre><code>$ docker pull postgres:9.5-alpine
+<pre tabindex="0"><code>$ docker pull postgres:9.5-alpine
 $ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
 $ createuser -h localhost -U postgres --pwprompt dspacetest
 $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
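The "format the list of handles" step above is just quoting and joining. A small helper (my own sketch, with invented handles beyond the two shown in the query) that turns a handle list into the comma-separated, single-quoted form the SQL IN clause expects:

```python
# First two handles appear in the query above; the third is invented.
handles = ["10568/67236", "10568/67274", "10568/72593"]

# Wrap each handle in single quotes and join with commas for the IN (...) list.
sql_list = ",".join("'{}'".format(h) for h in handles)
print(sql_list)  # '10568/67236','10568/67274','10568/72593'
```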