<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2018" />
<meta property="og:description" content="2018-05-01
I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-05/" />
<meta property="article:published_time" content="2018-05-01T16:43:54+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2018"/>
<meta name="twitter:description" content="2018-05-01
I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.91.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-05/",
"wordCount": "3503",
"datePublished": "2018-05-01T16:43:54+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-05/">
<title>May, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2018-05/">May, 2018</a></h2>
<p class="blog-post-meta">
<time datetime="2018-05-01T16:43:54+03:00">Tue May 01, 2018</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2018-05-01">2018-05-01</h2>
<ul>
<li>I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
<ul>
<li>http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E</li>
<li>http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E</li>
</ul>
</li>
<li>Then I reduced the JVM heap size from 6144m back to 5120m</li>
<li>Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to support hosts choosing which distribution they want to use</li>
</ul>
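<p>For reference, a minimal Python sketch of how the two <code>stream.body</code> URLs above can be built by percent-encoding the XML commands. The helper name <code>stream_body_url</code> is made up for illustration and the base URL is simply the one from these notes; the code only builds the strings and does not contact Solr.</p>

```python
from urllib.parse import quote

# Base URL as it appears in the notes (assumed to be a local tunnel to Solr).
SOLR_UPDATE_URL = "http://localhost:3000/solr/statistics/update"


def stream_body_url(xml_command: str) -> str:
    """Return an update URL that carries the XML command in stream.body."""
    # Keep '*', ':' and '/' literal so the result matches the URLs in the notes.
    encoded = quote(xml_command, safe="*:/")
    return f"{SOLR_UPDATE_URL}?stream.body={encoded}"


# Hypothetical helper applied to the two commands used above.
delete_url = stream_body_url("<delete><query>*:*</query></delete>")
commit_url = stream_body_url("<commit/>")

print(delete_url)
print(commit_url)
```

<p>The delete only becomes visible after the commit, which is why the two URLs are requested in that order.</p>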
<h2 id="2018-05-02">2018-05-02</h2>
<ul>
<li>Advise Fabio Fidanza about integrating CGSpace content in the new CGIAR corporate website</li>
<li>I think they can mostly rely on using the <code>cg.contributor.crp</code> field</li>
<li>Looking over some IITA records for Sisay
<ul>
<li>Other than trimming and collapsing consecutive whitespace, I made some other corrections</li>
<li>I need to check the correct formatting of COTE D&rsquo;IVOIRE vs COTE DIVOIRE</li>
<li>I replaced all DOIs with HTTPS</li>
<li>I checked a few DOIs and found at least one that was missing, so I Googled the title of the paper and found the correct DOI</li>
<li>Also, I found an <a href="https://www.doi.org/factsheets/DOI_PURL.html">FAQ for DOI that says the <code>dx.doi.org</code> syntax is older</a>, so I will replace all the DOIs with <code>doi.org</code> instead</li>
<li>I found five records with &ldquo;ISI Jounal&rdquo; instead of &ldquo;ISI Journal&rdquo;</li>
<li>I found one item with IITA subject &ldquo;.&rdquo;</li>
<li>Need to remember to check the facets for things like this in sponsorship:
<ul>
<li>Deutsche Gesellschaft für Internationale Zusammenarbeit</li>
<li>Deutsche Gesellschaft fur Internationale Zusammenarbeit</li>
</ul>
</li>
<li>Eight records with language &ldquo;fn&rdquo; instead of &ldquo;fr&rdquo;</li>
<li>One incorrect type (lowercase &ldquo;proceedings&rdquo;): Conference proceedings</li>
<li>Found some capitalized CRPs in <code>cg.contributor.crp</code></li>
<li>Found some incorrect author affiliations, e.g. &ldquo;Institut de Recherche pour le Developpement Agricolc&rdquo; should be &ldquo;Institut de Recherche pour le Developpement <em>Agricole</em>&rdquo;</li>
<li>Wow, and for sponsors there are the following:
<ul>
<li>Incorrect: Flemish Agency for Development Cooperation and Technical Assistance</li>
<li>Incorrect: Flemish Organization for Development Cooperation and Technical Assistance</li>
<li>Correct: Flemish <em>Association</em> for Development Cooperation and Technical Assistance</li>
</ul>
</li>
<li>One item had region &ldquo;WEST&rdquo; (I corrected it to &ldquo;WEST AFRICA&rdquo;)</li>
</ul>
</li>
</ul>
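<p>The whitespace and DOI clean-ups listed above could be scripted roughly as follows. This is only a sketch and not necessarily how the corrections were actually made: the function names and the sample DOI are hypothetical.</p>

```python
import re


def collapse_whitespace(value: str) -> str:
    """Trim the value and collapse consecutive whitespace into one space."""
    return re.sub(r"\s+", " ", value).strip()


def normalize_doi_url(value: str) -> str:
    """Rewrite http(s)://dx.doi.org/ and http://doi.org/ links to https://doi.org/."""
    return re.sub(r"^https?://(?:dx\.)?doi\.org/", "https://doi.org/", value.strip())


# Hypothetical sample values.
cleaned_type = collapse_whitespace("  Conference   proceedings ")
cleaned_doi = normalize_doi_url("http://dx.doi.org/10.1234/example")

print(cleaned_type)
print(cleaned_doi)
```

<p>Values that do not start with a DOI resolver URL are returned unchanged, so anything that is not a DOI still has to be reviewed by hand.</p>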
<h2 id="2018-05-03">2018-05-03</h2>
<ul>
<li>It turns out that the IITA records that I was helping Sisay with in March were imported in 2018-04 without a final check by Abenet or me</li>
<li>There are lots of errors on language, CRP, and even some encoding errors on abstract fields</li>
<li>I exported them, including the hidden metadata fields like <code>dc.date.accessioned</code>, so I can filter the ones from 2018-04 and correct them in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><ul>
<li>Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script and merge them into our controlled vocabulary</li>
<li>On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)</li>
</ul>
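<p>The filtering step on <code>dc.date.accessioned</code> described above can also be sketched in Python. The sample rows are invented, and the exact column header in a real <code>metadata-export</code> CSV is an assumption here, so it is passed in as a parameter.</p>

```python
import csv
import io

# Hypothetical two-row export; a real file would come from `dspace metadata-export`.
SAMPLE_CSV = """id,collection,dc.date.accessioned,dc.title
1,10568/68616,2018-04-12T09:15:00Z,First title
2,10568/68616,2017-11-02T10:00:00Z,Second title
"""


def rows_accessioned_in(csv_text, month_prefix, column="dc.date.accessioned"):
    """Return the rows whose accession date starts with month_prefix, e.g. '2018-04'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[column].startswith(month_prefix)]


april_rows = rows_accessioned_in(SAMPLE_CSV, "2018-04")
print([row["id"] for row in april_rows])
```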
<h2 id="2018-05-06">2018-05-06</h2>
<ul>
<li>Fixing the IITA records from Sisay, sixty DOIs have completely invalid format like <code>http:dx.doi.org10.1016j.cropro.2008.07.003</code></li>
<li>I corrected all the DOIs and then checked them for validity with a quick bash loop:</li>
</ul>
<pre tabindex="0"><code>$ for line in $(&lt; /tmp/links.txt); do echo &quot;$line&quot;; http --print h &quot;$line&quot;; done
</code></pre><ul>
<li>Most of the links are good, though one is duplicate and one seems to even be incorrect in the publisher&rsquo;s site so&hellip;</li>
<li>Also, there are some duplicates:
<ul>
<li><code>10568/92241</code> and <code>10568/92230</code> (same DOI)</li>
<li><code>10568/92151</code> and <code>10568/92150</code> (same ISBN)</li>
<li><code>10568/92291</code> and <code>10568/92286</code> (same citation, title, authors, year)</li>
</ul>
</li>
<li>Messed up abstracts:
<ul>
<li><code>10568/92309</code></li>
</ul>
</li>
<li>Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles</li>
<li>Fixed all issues with CRPs</li>
<li>A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: <code>’</code> (0x2019), <code>·</code> (0x00b7), and <code>€</code> (0x20ac)</li>
<li>A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b7.*/)),
isNotNull(value.match(/.*\u20ac.*/))
)
</code></pre><ul>
<li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
<li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>I made a pull request (<a href="https://github.com/ilri/DSpace/pull/373">#373</a>) for this that I&rsquo;ll merge some time next week (I&rsquo;m expecting Atmire to get back to us about DSpace 5.8 soon)</li>
<li>After testing quickly I just decided to merge it, and I noticed that I don&rsquo;t even need to restart Tomcat for the changes to get loaded</li>
</ul>
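<p>Outside OpenRefine, the same character check as the GREL facet earlier in this entry could look something like the following in Python. It is a sketch that tests for the same set of characters (parentheses and pipes, U+FFFD, U+00A0, U+200A, U+2019, U+00B7 and U+20AC); the function name and sample strings are hypothetical.</p>

```python
import re

# Same characters as the GREL facet: "(", "|", ")", the replacement character,
# no-break space, hair space, right single quote, middle dot and euro sign.
SUSPICIOUS = re.compile("[(|)\ufffd\u00a0\u200a\u2019\u00b7\u20ac]")


def has_suspicious_characters(value: str) -> bool:
    """Return True if the value contains any character worth a manual look."""
    return SUSPICIOUS.search(value) is not None


# Hypothetical sample values.
print(has_suspicious_characters("Plain ASCII abstract"))
print(has_suspicious_characters("Smallholders\u2019 yields"))
```

<p>A match only flags the value for review; a character like U+2019 is often legitimate, so this is not a test for invalid data on its own.</p>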
<h2 id="2018-05-07">2018-05-07</h2>
<ul>
<li>I spent a bit of time playing with <a href="https://github.com/codeforkjeff/conciliator">conciliator</a> and Solr, trying to figure out how to reconcile columns in OpenRefine with data in our existing Solr cores (like CRP subjects)</li>
<li>The documentation regarding the Solr stuff is limited, and I cannot figure out what all the fields in <code>conciliator.properties</code> are supposed to be</li>
<li>But then I found <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a>, which allows you to reconcile against values in a CSV file!</li>
<li>That, combined with splitting our multi-value fields on &ldquo;||&rdquo; in OpenRefine is amaaaaazing, because after reconciliation you can just join them again</li>
<li>Oh wow, you can also facet on the individual values once you&rsquo;ve split them! That&rsquo;s going to be amazing for proofing CRPs, subjects, etc.</li>
</ul>
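<p>The split, correct, and re-join idea for multi-value fields described above can be illustrated with a small Python sketch. The function name and the correction table are hypothetical; in practice the corrections would come from the reconciliation step in OpenRefine.</p>

```python
def fix_multi_value(cell, corrections, separator="||"):
    """Split a multi-value cell, correct each value, and join it again."""
    values = [value.strip() for value in cell.split(separator)]
    return separator.join(corrections.get(value, value) for value in values)


# Hypothetical correction table, as might result from reconciliation.
corrections = {"Livestock and fish": "Livestock and Fish"}

fixed = fix_multi_value("Livestock and fish||Maize", corrections)
print(fixed)
```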
<h2 id="2018-05-09">2018-05-09</h2>
<ul>
<li>Udana asked about the Book Chapters we had been proofing on DSpace Test in 2018-04</li>
<li>I told him that there were still some TODO items for him on that data, for example to update the <code>dc.language.iso</code> field for the Spanish items</li>
<li>I was trying to remember how I parsed the <code>input-forms.xml</code> using <code>xmllint</code> to extract subjects neatly</li>
<li>I could use it with <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a> or to populate a Solr instance for reconciliation</li>
<li>This XPath expression gets close, but outputs all items on one line:</li>
</ul>
<pre tabindex="0"><code>$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
</code></pre><ul>
<li>Maybe <code>xmlstarlet</code> is better:</li>
</ul>
<pre tabindex="0"><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/text()' dspace/config/input-forms.xml
Agriculture for Nutrition and Health
Big Data
Climate Change, Agriculture and Food Security
Excellence in Breeding
Fish
Forests, Trees and Agroforestry
Genebanks
Grain Legumes and Dryland Cereals
Livestock
Maize
Policies, Institutions and Markets
Rice
Roots, Tubers and Bananas
Water, Land and Ecosystems
Wheat
Aquatic Agricultural Systems
Dryland Cereals
Dryland Systems
Grain Legumes
Integrated Systems for the Humid Tropics
Livestock and Fish
</code></pre><ul>
<li>Discuss Colombian BNARS harvesting the CIAT data from CGSpace</li>
<li>They are using a system called Primo and the only options for data harvesting in that system are via FTP and OAI</li>
<li>I told them to get all <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_35697">CIAT records via OAI</a></li>
<li>Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:</li>
</ul>
<pre tabindex="0"><code>$ lein run /tmp/crps.csv name id
</code></pre><ul>
<li>I tried to reconcile against a CSV of our countries but reconcile-csv crashes</li>
</ul>
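<p>As an alternative to the <code>xmllint</code> and <code>xmlstarlet</code> commands above, the stored values could be pulled out with Python's standard library, one per list element. This is a sketch: the XML below is a heavily trimmed, hypothetical stand-in for <code>dspace/config/input-forms.xml</code>, and the function name is made up.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for dspace/config/input-forms.xml.
SAMPLE_XML = """<input-forms>
  <form-value-pairs>
    <value-pairs value-pairs-name="crpsubject">
      <pair><displayed-value>Fish</displayed-value><stored-value>Fish</stored-value></pair>
      <pair><displayed-value>Maize</displayed-value><stored-value>Maize</stored-value></pair>
    </value-pairs>
    <value-pairs value-pairs-name="other">
      <pair><displayed-value>X</displayed-value><stored-value>X</stored-value></pair>
    </value-pairs>
  </form-value-pairs>
</input-forms>"""


def stored_values(xml_text, list_name):
    """Return the stored-value texts of the named value-pairs list."""
    root = ET.fromstring(xml_text)
    path = f".//value-pairs[@value-pairs-name='{list_name}']/pair/stored-value"
    return [element.text for element in root.findall(path)]


subjects = stored_values(SAMPLE_XML, "crpsubject")
print("\n".join(subjects))
```

<p>Printing the list joined by newlines gives one value per line, which is the shape needed for a single-column CSV to reconcile against.</p>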
<h2 id="2018-05-13">2018-05-13</h2>
<ul>
<li>It turns out there was a space in my &ldquo;country&rdquo; header that was causing reconcile-csv to crash</li>
<li>After removing that it works fine!</li>
<li>Looking at Sisay&rsquo;s 2,640 CIFOR records on DSpace Test (<a href="https://dspacetest.cgiar.org/handle/10568/92904">10568/92904</a>)
<ul>
<li>Trimmed all leading / trailing white space and condensed multiple spaces into one</li>
<li>Corrected DOIs to use HTTPS and &ldquo;doi.org&rdquo; instead of &ldquo;dx.doi.org&rdquo;
<ul>
<li>There are eight items in <code>cg.identifier.doi</code> that are not DOIs</li>
</ul>
</li>
<li>Corrected <code>cg.identifier.url</code> links to cifor.org to use HTTPS</li>
<li>Corrected <code>dc.language.iso</code> from vt to vi (Vietnamese)</li>
<li>Corrected affiliations to not use acronyms</li>
<li>Reconcile countries against our countries list (removing terms like LATIN AMERICA, CENTRAL AFRICA, etc that are not countries)</li>
<li>Reconcile regions against our list of regions</li>
</ul>
</li>
</ul>
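<p>A stray space in a CSV header, like the one that crashed reconcile-csv above, is easy to guard against before handing the file to another tool. The following Python sketch strips whitespace from the header row; the sample data, the position of the space, and the function name are all assumptions for illustration.</p>

```python
import csv
import io

# Hypothetical CSV with a stray trailing space in the "country" header.
SAMPLE_CSV = "id,country \n1,KENYA\n2,ETHIOPIA\n"


def read_with_clean_headers(csv_text):
    """Read CSV text into dicts, stripping whitespace from the header names."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = [header.strip() for header in next(reader)]
    return [dict(zip(headers, row)) for row in reader]


rows = read_with_clean_headers(SAMPLE_CSV)
print(rows[0])
```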
<h2 id="2018-05-14">2018-05-14</h2>
<ul>
<li>Send a message to the OpenRefine mailing list about the bug with reconciling multi-value cells</li>
<li>Help Silvia Alonso get a list of all her publications since 2013 from Listings and Reports</li>
</ul>
<h2 id="2018-05-15">2018-05-15</h2>
<ul>
<li>Turns out I was doing the OpenRefine reconciliation wrong: I needed to copy the matched values to a new column!</li>
<li>Also, I learned how to do something cool with Jython expressions in OpenRefine</li>
<li>This will fetch a URL and return its HTTP response code:</li>
</ul>
<pre tabindex="0"><code>import urllib2
import re

# Only check DOIs containing 10.1016 to limit the number of requests
pattern = re.compile(r'.*10\.1016.*')

if pattern.match(value):
  try:
    return urllib2.urlopen(value).getcode()
  except urllib2.HTTPError as error:
    # urlopen raises on 4xx/5xx, so take the status code from the exception
    return error.code

return &quot;blank&quot;
</code></pre><ul>
<li>I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs</li>
<li>Here the response code would be 200, 404, etc, or &ldquo;blank&rdquo; if there is no URL for that item</li>
<li>You could use this in a facet or in a new column</li>
<li>More information and good examples here: <a href="https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine">https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine</a></li>
<li>Finish looking at the 2,640 CIFOR records on DSpace Test (<a href="https://dspacetest.cgiar.org/handle/10568/92904">10568/92904</a>), cleaning up authors and adding collection mappings</li>
<li>They can now be moved to CGSpace as far as I&rsquo;m concerned, but I don&rsquo;t know whether Sisay or I will do it</li>
<li>I was checking the CIFOR data for duplicates using Atmire&rsquo;s Metadata Quality Module (and found some duplicates actually), but then DSpace died&hellip;</li>
<li>I didn&rsquo;t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in <code>dmesg -T</code>:</li>
</ul>
<pre tabindex="0"><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
[Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>So the Linux kernel killed Java&hellip;</li>
<li>Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:</li>
</ul>
<pre tabindex="0"><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
</code></pre><ul>
<li>Looking in the DSpace log I see something related:</li>
</ul>
<pre tabindex="0"><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
</code></pre><ul>
<li>So I&rsquo;m not sure&hellip;</li>
<li>I finally figured out how to get OpenRefine to reconcile values from Solr via <a href="https://github.com/codeforkjeff/conciliator">conciliator</a>:</li>
<li>The trick was to use a more appropriate Solr fieldType <code>text_en</code> instead of <code>text_general</code> so that more terms match, for example uppercase and lower case:</li>
</ul>
<pre tabindex="0"><code>$ ./bin/solr start
$ ./bin/solr create_core -c countries
$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-field&quot;: {&quot;name&quot;:&quot;country&quot;, &quot;type&quot;:&quot;text_en&quot;, &quot;multiValued&quot;:false, &quot;stored&quot;:true}}' http://localhost:8983/solr/countries/schema
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
</code></pre><ul>
<li>It still doesn&rsquo;t catch simple mistakes like &ldquo;ALBANI&rdquo; or &ldquo;AL BANIA&rdquo; for &ldquo;ALBANIA&rdquo;, and it doesn&rsquo;t return scores, so I have to select matches manually:</li>
</ul>
<p><img src="/cgspace-notes/2018/05/openrefine-solr-conciliator.png" alt="OpenRefine reconciling countries from local Solr"></p>
<ul>
<li>I should probably make a general copy field and set it to be the default search field, like DSpace&rsquo;s search core does (see schema.xml):</li>
</ul>
<pre tabindex="0"><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
...
&lt;copyField source=&quot;*&quot; dest=&quot;search_text&quot;/&gt;
</code></pre><ul>
<li>Actually, I wonder how much of their schema I could just copy&hellip;</li>
<li>Apparently the default search field is the <code>df</code> parameter and you could technically just add it to the query string, so no need to bother with that in the schema now</li>
<li>I copied over the DSpace <code>search_text</code> field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn&rsquo;t seem to be any better at matching than the <code>text_en</code> type</li>
<li>I think I need to focus on trying to return scores with conciliator</li>
</ul>
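<p>To illustrate the point above about passing the default search field as the <code>df</code> parameter instead of configuring it in the schema, here is a small Python sketch that builds such a query URL. The core name and field come from the commands earlier in this entry; the function name and the <code>wt=json</code> parameter are additions for the example, and nothing is actually sent to Solr.</p>

```python
from urllib.parse import urlencode

# Local Solr core from the notes; "country" is the field added via the schema API.
SOLR_SELECT_URL = "http://localhost:8983/solr/countries/select"


def select_url(query, default_field):
    """Build a select URL that sets the default search field with df."""
    params = {"q": query, "df": default_field, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = select_url("albania", "country")
print(url)
```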
<h2 id="2018-05-16">2018-05-16</h2>
<ul>
<li>Discuss GDPR with James Stapleton
<ul>
<li>As far as I see it, we are &ldquo;Data Controllers&rdquo; on CGSpace because we store people&rsquo;s names, emails, and phone numbers if they register</li>
<li>We set cookies on the user&rsquo;s computer, but these do not contain personally identifiable information (PII) and they are &ldquo;session&rdquo; cookies which are deleted when the user closes their browser</li>
<li>We use Google Analytics to track website usage, which makes Google the &ldquo;Data Processor&rdquo; and in this case we merely need to <em>limit</em> or <em>obfuscate</em> the information we send to them</li>
<li>As the only personally identifiable information we send is the user&rsquo;s IP address, I think we only need to enable <a href="https://support.google.com/analytics/answer/2763052">IP Address Anonymization</a> in our <code>analytics.js</code> code snippets</li>
<li>Then we can add a &ldquo;Privacy&rdquo; page to CGSpace that makes all of this clear</li>
</ul>
</li>
<li>Silvia asked if I could sort the records in her Listings and Reports output, and it turns out that the options are misconfigured in <code>dspace/config/modules/atmire-listings-and-reports.cfg</code></li>
<li>I created and merged a pull request to fix the sorting issue in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/374">#374</a>)</li>
<li>Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in <code>page-structure-alterations.xsl</code> to:</li>
</ul>
<pre tabindex="0"><code>ga('send', 'pageview', {
'anonymizeIp': true
});
</code></pre><ul>
<li>I tested loading a certain page before and after adding this and afterwards I saw that the parameter <code>aip=1</code> was being sent with the analytics request to Google</li>
<li>According to the <a href="https://developers.google.com/analytics/devguides/collection/analyticsjs/field-reference#anonymizeIp">analytics.js protocol parameter documentation</a> this means that IPs are being anonymized</li>
<li>After finding and fixing some duplicates in IITA&rsquo;s <code>IITA_April_27</code> test collection on DSpace Test (10568/92703) I told Sisay that he can move them to IITA&rsquo;s Journal Articles collection on CGSpace</li>
</ul>
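<p>The manual check for <code>aip=1</code> described above could be repeated on a captured hit URL with a few lines of Python. This is a sketch only: the URL below is a hypothetical example with a placeholder tracking ID, and the function name is made up.</p>

```python
from urllib.parse import parse_qs, urlsplit


def ip_is_anonymized(hit_url):
    """Return True if the analytics hit URL carries the parameter aip=1."""
    params = parse_qs(urlsplit(hit_url).query)
    return params.get("aip") == ["1"]


# Hypothetical hit URLs with a placeholder tracking ID.
with_aip = "https://www.google-analytics.com/collect?v=1&t=pageview&tid=UA-XXXXX-Y&aip=1"
without_aip = "https://www.google-analytics.com/collect?v=1&t=pageview&tid=UA-XXXXX-Y"

print(ip_is_anonymized(with_aip))
print(ip_is_anonymized(without_aip))
```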
<h2 id="2018-05-17">2018-05-17</h2>
<ul>
<li>Testing reconciliation of countries against Solr via conciliator, I notice that <code>CÔTE D'IVOIRE</code> doesn&rsquo;t match <code>COTE D'IVOIRE</code>, whereas with reconcile-csv it does</li>
<li>Also, when reconciling regions against Solr via conciliator <code>EASTERN AFRICA</code> doesn&rsquo;t match <code>EAST AFRICA</code>, whereas with reconcile-csv it does</li>
<li>And <code>SOUTH AMERICA</code> matches both <code>SOUTH ASIA</code> and <code>SOUTH AMERICA</code> with the same match score of 2&hellip; WTF.</li>
<li>It could be that I just need to tune the query filter in Solr (currently using the example <code>text_en</code> field type)</li>
<li>Oh sweet, it turns out that the issue with searching for characters with accents is called &ldquo;code folding&rdquo; in Solr</li>
<li>You can use either a <a href="https://lucene.apache.org/solr/guide/7_3/language-analysis.html"><code>solr.ASCIIFoldingFilterFactory</code> filter</a> or a <a href="https://lucene.apache.org/solr/guide/7_3/charfilterfactories.html"><code>solr.MappingCharFilterFactory</code> charFilter</a> mapping against <code>mapping-FoldToASCII.txt</code></li>
<li>Also see: <a href="https://opensourceconnections.com/blog/2017/02/20/solr-utf8/">https://opensourceconnections.com/blog/2017/02/20/solr-utf8/</a></li>
<li>Now <code>CÔTE D'IVOIRE</code> matches <code>COTE D'IVOIRE</code>!</li>
<li>I&rsquo;m not sure which method is better, perhaps the <code>solr.ASCIIFoldingFilterFactory</code> filter because it doesn&rsquo;t require copying the <code>mapping-FoldToASCII.txt</code> file</li>
<li>And actually I&rsquo;m not entirely sure about the order of filtering before tokenizing, etc&hellip;</li>
<li>Ah, I see that <code>charFilter</code> must be before the tokenizer because it works on a stream, whereas <code>filter</code> operates on tokenized input so it must come after the tokenizer</li>
<li>Regarding the use of the <code>charFilter</code> vs the <code>filter</code> class before and after the tokenizer, respectively, I think it&rsquo;s better to use the <code>charFilter</code> to normalize the input stream before tokenizing it as I have no idea what kinda stuff might get removed by the tokenizer</li>
<li>Skype with Geoffrey from IITA in Nairobi, who wants to deposit records to CGSpace via the REST API. I told him that this skips the submission workflows, and because we cannot guarantee the data quality we would not allow anyone to use it this way</li>
<li>I finished making the XMLUI changes for anonymization of IP addresses in Google Analytics and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/375">#375</a>)</li>
<li>Also, I think we might be able to implement <a href="https://developers.google.com/analytics/devguides/collection/analyticsjs/user-opt-out">opt-out functionality for Google Analytics using a window property</a> that could be managed by <a href="https://webgilde.com/en/analytics-opt-out/">storing its status in a cookie</a></li>
<li>This cookie could be set by a user clicking a link in a privacy policy, for example</li>
<li>The additional Javascript could be easily added to our existing <code>googleAnalytics</code> template in each XMLUI theme</li>
</ul>
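<p>The effect of folding accented characters, which made <code>CÔTE D'IVOIRE</code> match <code>COTE D'IVOIRE</code> above, can be approximated in Python with Unicode decomposition. This is an analogy rather than what Solr does internally: it only removes combining marks and is not equivalent to the full <code>mapping-FoldToASCII.txt</code> table. The function name is hypothetical.</p>

```python
import unicodedata


def fold_accents(text):
    """Decompose characters and drop combining marks, e.g. 'Ô' becomes 'O'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_accents("CÔTE D'IVOIRE")
print(folded)
```

<p>Like the <code>charFilter</code> approach, this normalizes the whole string before any splitting into tokens would happen.</p>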
<h2 id="2018-05-18">2018-05-18</h2>
<ul>
<li>Do a final check on the thirty (30) IWMI Book Chapters for Udana and upload them to CGSpace</li>
<li>These were previously on <a href="https://dspacetest.cgiar.org/handle/10568/91679">DSpace Test as &ldquo;IWMI test collection&rdquo;</a> in 2018-04</li>
</ul>
<h2 id="2018-05-20">2018-05-20</h2>
<ul>
<li>Run all system updates on DSpace Test (linode19), re-deploy DSpace with latest <code>5_x-dev</code> branch (including GDPR IP anonymization), and reboot the server</li>
<li>Run all system updates on CGSpace (linode18), re-deploy DSpace with latest <code>5_x-dev</code> branch (including GDPR IP anonymization), and reboot the server</li>
</ul>
<h2 id="2018-05-21">2018-05-21</h2>
<ul>
<li>Geoffrey from IITA got back with more questions about depositing items programmatically into the CGSpace workflow</li>
<li>I pointed out that <a href="http://swordapp.org/">SWORD</a> might be an option, as <a href="https://wiki.lyrasis.org/display/DSDOC5x/SWORDv2+Server">DSpace supports the SWORDv2 protocol</a> (although we have never tested it)</li>
<li>Work on implementing <a href="https://cookieconsent.insites.com">cookie consent</a> popup for all XMLUI themes (SASS theme with primary / secondary branding from Bootstrap)</li>
</ul>
<h2 id="2018-05-22">2018-05-22</h2>
<ul>
<li>Skype with James Stapleton about last minute GDPR wording</li>
<li>After spending yesterday working on integration and theming of the cookieconsent popup, today I cannot get the damn &ldquo;Agree&rdquo; button to dismiss the popup!</li>
<li>I tried calling it several ways, via jQuery, via a function in <code>page-structure-alterations.xsl</code>, via script tags in <code>&lt;head&gt;</code> in <code>page-structure.xsl</code>, and a few others</li>
<li>The only way it actually works is if I paste it into the community or collection HTML</li>
<li>Oh, actually in testing it appears this is not true</li>
<li>This is a waste of TWO full days of work</li>
<li>Marissa Van Epp asked if I could add <code>PII-FP1_PACCA2</code> to the CCAFS phase II project tags on CGSpace so I created a ticket to track it (<a href="https://github.com/ilri/DSpace/issues/376">#376</a>)</li>
</ul>
<h2 id="2018-05-23">2018-05-23</h2>
<ul>
<li>I&rsquo;m investigating how many non-CGIAR users we have registered on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
</code></pre><ul>
<li>We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers</li>
<li>I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with &ldquo;allow&rdquo; or &ldquo;dismiss&rdquo;</li>
<li>I wrote a quick conditional to check if the user has agreed or not before enabling Google Analytics</li>
<li>I made a pull request for the GDPR compliance popup (<a href="https://github.com/ilri/DSpace/pull/377">#377</a>) and merged it to the <code>5_x-prod</code> branch</li>
<li>I will deploy it to CGSpace tonight</li>
</ul>
<h2 id="2018-05-28">2018-05-28</h2>
<ul>
<li>Daniel Haile-Michael sent a message that CGSpace was down (I am currently in Oregon so the time difference is ~10 hours)</li>
<li>I looked in the logs but didn&rsquo;t see anything that would be the cause of the crash</li>
<li>Atmire finalized the DSpace 5.8 testing and sent a pull request: <a href="https://github.com/ilri/DSpace/pull/378">https://github.com/ilri/DSpace/pull/378</a></li>
<li>They have asked if I can test this and get back to them by June 11th</li>
</ul>
<h2 id="2018-05-30">2018-05-30</h2>
<ul>
<li>Talk to Samantha from Bioversity about something related to Google Analytics; I&rsquo;m still not sure what they want</li>
<li>DSpace Test crashed last night, seems to be related to system memory (not JVM heap)</li>
<li>I see this in <code>dmesg</code>:</li>
</ul>
<pre tabindex="0"><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
[Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
[Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
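<li>As an aside, the resident memory reported in the &ldquo;Killed process&rdquo; line can be converted to MiB so it is easier to compare with the configured JVM heap. This is only a minimal Python sketch and was not part of the original troubleshooting; the helper name <code>killed_process_rss_mib</code> is hypothetical, and the pattern assumes nothing beyond the line format shown in the <code>dmesg</code> output above:
<pre tabindex="0"><code>import re

# Matches the kernel's "Killed process" line in the format shown in dmesg above
KILLED = re.compile(r"Killed process (\d+) \((\w+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB")


def killed_process_rss_mib(line):
    """Return (pid, name, anon-rss in MiB) for a "Killed process" line, or None."""
    match = KILLED.search(line)
    if match is None:
        return None
    pid, name, _total_vm_kb, anon_rss_kb = match.groups()
    return int(pid), name, int(anon_rss_kb) / 1024


line = (
    "[Wed May 30 00:00:39 2018] Killed process 6082 (java) "
    "total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB"
)
pid, name, rss_mib = killed_process_rss_mib(line)
print(pid, name, round(rss_mib), "MiB")
</code></pre>
Memory the process holds outside the heap (thread stacks, native buffers, and so on) also counts towards this figure, so it can be larger than the heap setting alone.</li>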
<li>I need to check the Tomcat JVM heap size/usage, command line JVM heap size (for cron jobs), and PostgreSQL memory usage</li>
<li>It might be possible to adjust some things, but eventually we&rsquo;ll need a larger VPS instance</li>
<li>For some reason there are no JVM stats in Munin, ugh</li>
<li>Run all system updates on DSpace Test and reboot it</li>
<li>I generated a list of CIFOR duplicates from the <code>CIFOR_May_9</code> collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika</li>
<li>I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each &ldquo;Item1&rdquo; line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):</li>
</ul>
<pre tabindex="0"><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
</code></pre><ul>
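<li>The same grouping could also be done in Python instead of sed. This is a minimal sketch, assuming each set of duplicates starts with an &ldquo;Item1&rdquo; line as described above; the function name <code>group_duplicates</code> and the sample lines are hypothetical stand-ins, not real Atmire output:
<pre tabindex="0"><code>import re

# "Item1" not followed by another digit, so "Item10" does not start a new group
FIRST_ITEM = re.compile(r"Item1(?!\d)")


def group_duplicates(lines):
    """Split lines into groups, starting a new group at every "Item1" line."""
    groups = []
    for line in lines:
        if FIRST_ITEM.search(line) or not groups:
            groups.append([])
        groups[-1].append(line)
    return groups


# Hypothetical stand-ins for the lines kept by grep
sample = [
    "del_handle_1_Item1 first-item",
    "del_handle_1_Item2 second-item",
    "del_handle_2_Item1 third-item",
    "del_handle_2_Item2 fourth-item",
    "del_handle_2_Item3 fifth-item",
]
groups = group_duplicates(sample)
for group in groups:
    print(len(group), "candidate duplicates")
</code></pre>
One difference from the sed expression: a plain <code>Item1</code> match would also fire on &ldquo;Item10&rdquo; and above, which this sketch avoids with a negative lookahead.</li>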
<li>I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR&rsquo;s collection</li>
<li>A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections</li>
<li>I can use the <code>/communities/{id}/collections</code> endpoint of the REST API but it only takes IDs (not handles) and doesn&rsquo;t seem to descend into sub communities</li>
<li>Shit, so I need the IDs for the top-level ILRI community and all its sub communities (and their sub communities)</li>
<li>There has got to be a better way to do this than going to each community and getting their handles and IDs manually</li>
<li>Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
<li>The output isn&rsquo;t great, but all the handles and IDs are printed in debug mode:</li>
</ul>
<pre tabindex="0"><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2&gt; /tmp/ilri-collections.txt
</code></pre><ul>
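<li>Turning that debug output into a quoted, comma-separated list for an SQL <code>IN (...)</code> clause could look like the following minimal Python sketch. The exact format of the script&rsquo;s debug lines is an assumption here (the sample text is made up); the sketch only relies on handles appearing as <code>10568/NNNN</code> somewhere in the text, and the function name <code>sql_in_list</code> is hypothetical:
<pre tabindex="0"><code>import re

# CGSpace handles use the 10568 prefix, as in the commands above
HANDLE = re.compile(r"\b10568/\d+\b")


def sql_in_list(text):
    """Return the unique handles in text as a quoted, comma-separated string."""
    handles = []
    for handle in HANDLE.findall(text):
        if handle not in handles:  # keep first-seen order, drop repeats
            handles.append(handle)
    return ",".join("'" + handle + "'" for handle in handles)


# Made-up debug lines; the real output format may differ
sample = """DEBUG: collection 10568/67236 (ID: 1)
DEBUG: collection 10568/67274 (ID: 2)
DEBUG: collection 10568/67236 (ID: 1)
"""
print(sql_in_list(sample))
</code></pre>
The sketch does not distinguish community handles from collection handles, so any community handles in the file would need to be filtered out separately.</li>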
<li>Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
</code></pre><h2 id="2018-05-31">2018-05-31</h2>
<ul>
<li>Clarify CGSpace&rsquo;s usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance</li>
<li>Testing running PostgreSQL in a Docker container on localhost because when I&rsquo;m on Arch Linux there isn&rsquo;t an easily installable package for particular PostgreSQL versions</li>
<li>Now I can just use Docker:</li>
</ul>
<pre tabindex="0"><code>$ docker pull postgres:9.5-alpine
$ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -O -U dspacetest -d dspacetest -W ~/Downloads/cgspace_2018-05-30.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-01/">January, 2022</a></li>
<li><a href="/cgspace-notes/2021-12/">December, 2021</a></li>
<li><a href="/cgspace-notes/2021-11/">November, 2021</a></li>
<li><a href="/cgspace-notes/2021-10/">October, 2021</a></li>
<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>