<li>Discuss some OpenRXV issues with Abdullah from CodeObia
<ul>
<li>He’s trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API</li>
<li>Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies</li>
</ul>
</li>
</ul>
<h2 id="2021-03-02">2021-03-02</h2>
<ul>
<li>I fixed three build and runtime issues in OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/pull/80">fix highcharts-angular and ngx-tour-core build</a></li>
<li><a href="https://github.com/ilri/OpenRXV/pull/82">frontend/package.json: Pin @types/ramda at 0.27.34</a></li>
</ul>
</li>
<li>Then I merged a few fixes that Abdullah had worked on last week</li>
</ul>
<h2 id="2021-03-03">2021-03-03</h2>
<ul>
<li>I <a href="https://github.com/ilri/OpenRXV/issues/83">fixed another frontend build warning on OpenRXV</a></li>
<li>Then I <a href="https://github.com/ilri/OpenRXV/pull/84">updated the frontend container to use Node.js 12 and Ubuntu 20.04</a></li>
<li>Also, I <a href="https://github.com/ilri/OpenRXV/pull/85">added a GitHub Actions workflow to build the frontend</a></li>
<li>I did some testing of Abdullah’s patch for the values mapping search on OpenRXV
<ul>
<li>It still doesn’t work with multi-word values, so I recorded a video with wf-recorder and uploaded it to <a href="https://github.com/ilri/OpenRXV/issues/43">the issue</a> for him to investigate</li>
</ul>
</li>
</ul>
<h2 id="2021-03-04">2021-03-04</h2>
<ul>
<li>Peter is having issues with the workflow since yesterday
<ul>
<li>I looked at the Munin stats and see a high number of database locks since yesterday</li>
<li>I looked at the number of connections in PostgreSQL and it’s definitely high again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
</code></pre><ul>
<li>I reported it to Atmire to take a look, on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851">same issue</a> where we had been tracking this before</li>
<li>Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell</li>
<li>I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my <code>add-orcid-identifier.py</code> script:</li>
<li>I still need to do cleanup on the journal articles metadata
<ul>
<li>Peter sent me some cleanups but I can’t use them in the search/replace format he gave</li>
<li>I think it’s better to export the metadata values with IDs and import cleaned up ones as CSV</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
</code></pre><ul>
<li>I used OpenRefine to remove all journal values that didn’t have one of these characters: ; ( )
<ul>
<li>Then I cloned the <code>cg.journal</code> field to <code>cg.volume</code> and <code>cg.issue</code></li>
<li>I used some GREL expressions like these to extract the journal name, volume, and issue:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
</code></pre><ul>
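<li>For comparison, a minimal Python sketch of the same extraction, assuming the values look like <code>Journal name; volume(issue)</code>. The function name <code>split_journal</code> and the sample value are hypothetical and only illustrate the regular expressions; the actual cleanup was done with GREL in OpenRefine.

```python
import re

# Matches a volume immediately followed by an issue in parentheses, e.g. "12(3)"
VOLUME_ISSUE = re.compile(r"(\d+)\((\d+)\)")


def split_journal(value):
    """Split a combined journal string into (name, volume, issue)."""
    # Everything before the first semicolon is treated as the journal name
    name = value.partition(";")[0].strip()
    match = VOLUME_ISSUE.search(value)
    if match is None:
        # No "volume(issue)" fragment found, so only the name is returned
        return name, None, None
    return name, match.group(1), match.group(2)


# Hypothetical value, not taken from CGSpace
print(split_journal("Example Journal of Livestock; 12(3)"))
```

Values without a <code>volume(issue)</code> fragment come back with <code>None</code> for both parts, so they are easy to filter out before importing.</li>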
<li>Then I uploaded the changes to CGSpace using <code>dspace metadata-import</code></li>
<li>Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
<ul>
<li>The error was “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e …”</li>
<li>I searched the DSpace issue tracker and found several issues reporting this:</li>
<li><a href="https://jira.lyrasis.org/browse/DS-4004">DS-4004 Authorization denied Exception when trying to delete permanently an item, collection or community as a non-Admin user</a></li>
<li><a href="https://jira.lyrasis.org/browse/DS-4297">DS-4297 Authorization error when trying to delete item by submitter/administrator</a></li>
</ul>
</li>
<li>The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection…</li>
<li>In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection’s admin and submit groups, exactly as the issue described</li>
<li>I added a comment about our issue to <a href="https://jira.lyrasis.org/browse/DS-4297">DS-4297</a></li>
</ul>
</li>
<li>Yesterday Abenet added me to a WLE collection’s approver/editor steps so we can try to figure out why Niroshini is having issues adding metadata to Udana’s submissions
<ul>
<li>I edited Udana’s submission to CGSpace:
<ul>
<li>corrected the title</li>
<li>added language English</li>
<li>changed the link to the external item page instead of PDF</li>
<li>added SDGs from the external item page</li>
<li>added AGROVOC subjects from the external item page</li>
<li>added pagination (extent)</li>
<li>changed the license to “other” because CC-BY-NC-ND is not printed anywhere in the PDF or external item page</li>
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
</code></pre><ul>
<li>The trick is that when you create a volume like “myvolume” from a <code>docker-compose.yml</code> file, Docker will create it with the name “docker_myvolume”
<ul>
<li>If you create it manually on the command line with <code>docker volume create myvolume</code> then the name is literally “myvolume”</li>
</ul>
</li>
<li>I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit</li>
<li>Delete the <code>openrxv-items-temp</code> index to test a fresh harvesting:</li>
<li>I realized there is something wrong with the Elasticsearch indexes on AReS
<ul>
<li>On a new test environment I see <code>openrxv-items</code> is correctly an alias of <code>openrxv-items-final</code>:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items-final": {
"aliases": {
"openrxv-items": {}
}
},
</code></pre><ul>
<li>But on AReS production <code>openrxv-items</code> has somehow become an index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items": {
"aliases": {}
},
"openrxv-items-final": {
"aliases": {}
},
"openrxv-items-temp": {
"aliases": {}
},
</code></pre><ul>
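<li>The difference between the two responses above can be checked programmatically. This is a minimal sketch that works on the already-parsed JSON from <code>/_alias/</code>; the helper name <code>describe_alias</code> is hypothetical and nothing here talks to Elasticsearch.

```python
def describe_alias(alias_map, name="openrxv-items"):
    """Report how `name` appears in a parsed /_alias/ response."""
    if name in alias_map:
        # Top-level keys of the response are concrete indexes
        return "index"
    targets = sorted(
        index for index, info in alias_map.items()
        if name in info.get("aliases", {})
    )
    if targets:
        return "alias of " + ", ".join(targets)
    return "missing"


# The two shapes shown in the console output above
healthy = {"openrxv-items-final": {"aliases": {"openrxv-items": {}}}}
broken = {
    "openrxv-items": {"aliases": {}},
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {}},
}
print(describe_alias(healthy))
print(describe_alias(broken))
```

A check like this could run before each harvest to catch the case where <code>openrxv-items</code> has turned into a standalone index.</li>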
<li>I fixed the issue on production by cloning the <code>openrxv-items</code> index to <code>openrxv-items-final</code>, deleting <code>openrxv-items</code>, and then re-creating it as an alias:</li>
<li>They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a “normal” user agent
<ul>
<li>Looking in Solr I see they have been using this IP for a while, as they have 100,000 hits going back into 2020</li>
<li>I will add this IP to the list of bots in nginx and purge it from Solr with my <code>check-spider-ip-hits.sh</code> script</li>
</ul>
</li>
<li>I made a few changes to OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/issues/89">Migrated away from links to use networks</a></li>
<li><a href="https://github.com/ilri/OpenRXV/issues/68">Converted the backend container to use a custom image that includes <code>unoconv</code></a> so we don’t have to manually install it anymore</li>
<li>I approved the WLE item that I edited last week, and all the metadata is there: <a href="https://hdl.handle.net/10568/111810">https://hdl.handle.net/10568/111810</a>
<ul>
<li>So I’m not sure what Niroshini’s issue with metadata is…</li>
</ul>
</li>
<li>Peter sent a message yesterday saying that his item finally got committed
<ul>
<li>I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
</code></pre><ul>
<li>On 2021-03-03 the PostgreSQL transactions started rising:</li>
<li>I sent another message to Atmire to ask if they have time to look into this</li>
<li>CIFOR is pressuring me to upload the batch items from last week
<ul>
<li>Vika sent me a final file, with some duplicates that Peter had identified now removed</li>
<li>I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through the <code>csv-metadata-quality</code> checker and uploaded them to CGSpace</li>
<li>In total there are 1,088 items</li>
</ul>
</li>
<li>Udana from IWMI emailed to ask about CGSpace thumbnails</li>
<li>Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
<ul>
<li><a href="https://hdl.handle.net/10568/111794">The item</a> was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item</li>
</ul>
</li>
<li>Abenet got a quote from Atmire to buy 125 credits for 3750€</li>
<li>Maria at Bioversity sent some feedback about duplicate items on AReS</li>
<li>I’m wondering if the issue of the <code>openrxv-items-final</code> index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
<ul>
<li>I will start a fresh harvest on AReS now to check, but first back up the current index just in case:</li>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
</code></pre><ul>
<li>As I saw on my local test instance, even when you cancel a harvesting, it replaces the <code>openrxv-items-final</code> index with whatever is in <code>openrxv-items-temp</code> automatically, so I assume it will do the same now</li>
<li>The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
<ul>
<li>This means that <a href="https://github.com/ilri/OpenRXV/issues/64">the issue</a> we were facing for a few months was due to the <code>openrxv-items</code> index being deleted and re-created as a standalone index instead of an alias of <code>openrxv-items-final</code></li>
</ul>
</li>
<li>Talk to Moayad about OpenRXV development
<ul>
<li>We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API: between starting the harvest on page 0 and finishing it on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift</li>
<li>I proposed a solution in the <a href="https://github.com/ilri/OpenRXV/issues/67">GitHub issue</a>, where we consult the site’s XML sitemap after harvesting to see if we missed any items, and then we harvest them individually</li>
</ul>
</li>
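<li>A toy Python simulation of the page shift and of the proposed sitemap check. It assumes new items sort to the front of the listing (the real ordering of the REST API may differ), and the names <code>harvest</code> and <code>missing_items</code> are hypothetical, not OpenRXV functions.

```python
def harvest(repository, page_size, add_after_page, new_item):
    """Page through `repository`, adding `new_item` midway through the harvest."""
    harvested = []
    page = 0
    while page * page_size < len(repository):
        start = page * page_size
        harvested.extend(repository[start:start + page_size])
        if page == add_after_page:
            # Assumption: new items sort first, pushing older ones down a slot
            repository.insert(0, new_item)
        page += 1
    return harvested


def missing_items(sitemap_items, harvested):
    """Items listed in the sitemap that the harvest never saw."""
    return sorted(set(sitemap_items) - set(harvested))


repository = ["a", "b", "c", "d"]
harvested = harvest(repository, page_size=2, add_after_page=0, new_item="z")
print(harvested)
print(missing_items(repository, harvested))
```

In this toy run one older item is harvested twice and the new item is never seen, which matches the duplicate and missing symptoms; comparing against the full list afterwards recovers the item to harvest individually.</li>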
<li>Peter sent me a list of 356 DOIs from Altmetric that don’t have our Handles, so we need to Tweet them
<ul>
<li>I used my <code>doi-to-handle.py</code> script to generate a list of handles and titles for him:</li>
<li>Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses <code>cg.isijournal</code> and MELSpace uses <code>mel.impact-factor</code>
<ul>
<li>I filed <a href="https://github.com/AgriculturalSemantics/cg-core/issues/39">an issue</a> on the cg-core project to ask colleagues for ideas</li>
<li>Peter said he doesn’t see “Source Code” or “Software” in the <a href="https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type">output type facet on the ILRI community</a>, but I see it on the home page, so I will try to do a full Discovery re-index:</li>
<li>Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code</li>
<li>Added the ability to check <code>dcterms.license</code> values against the SPDX licenses in the csv-metadata-quality tool
<ul>
<li>Also, I made some other minor fixes and released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.6">version 0.4.6</a> on GitHub</li>
</ul>
</li>
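<li>The idea behind the license check can be sketched in a few lines of Python. This is only an illustration and not necessarily how csv-metadata-quality implements it: <code>SPDX_SUBSET</code> is a small hand-picked set of SPDX identifiers, and <code>invalid_licenses</code> is a hypothetical helper.

```python
# Assumption: a tiny hand-picked subset of SPDX identifiers for illustration
SPDX_SUBSET = {"CC-BY-4.0", "CC-BY-NC-ND-4.0", "GPL-3.0-only", "MIT"}


def invalid_licenses(values, known=SPDX_SUBSET):
    """Return the license values that are not recognised SPDX identifiers."""
    return [value for value in values if value.strip() not in known]


# Hypothetical dcterms.license values
print(invalid_licenses(["MIT", "GPL-3.0-only", "Creative Commons"]))
```

A real check would compare against the complete SPDX license list rather than a hard-coded subset.</li>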
<li>Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
<ul>
<li>Mostly Ugandan outputs for CRP Livestock and Livestock and Fish</li>
<li>After the harvesting finished it seems the indexes got messed up again, as <code>openrxv-items</code> is an alias of <code>openrxv-items-temp</code> instead of <code>openrxv-items-final</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items-final": {
"aliases": {}
},
"openrxv-items-temp": {
"aliases": {
"openrxv-items": {}
}
},
</code></pre><ul>
<li>Anyways, the number of items in <code>openrxv-items</code> seems OK and the AReS Explorer UI is working fine
<ul>
<li>I will have to manually fix the indexes before the next harvesting</li>
</ul>
</li>
<li>Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: <a href="https://github.com/ilri/csv-metadata-quality-web">https://github.com/ilri/csv-metadata-quality-web</a>
<ul>
<li>Also, it is deployed on Heroku: <a href="https://fierce-ocean-30836.herokuapp.com/">https://fierce-ocean-30836.herokuapp.com/</a></li>
<li>I was running it on Google App Engine originally, but they have <em>way</em> too aggressive caching of static assets</li>
<li>I added the ability to check for, and fix, “mojibake” characters in csv-metadata-quality</li>
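<li>For context, “mojibake” typically appears when UTF-8 bytes are decoded with the wrong encoding. A minimal standard-library sketch of detecting and reversing the common UTF-8-read-as-Latin-1 case follows; it is not necessarily how csv-metadata-quality does it, and <code>fix_mojibake</code> is a hypothetical name.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a Latin-1/UTF-8 mix-up, so leave the text alone
        return text


# Build a garbled sample by decoding UTF-8 bytes with the wrong codec
garbled = "Publicação".encode("utf-8").decode("latin-1")
print(garbled, "->", fix_mojibake(garbled))
```

Text that is already correct fails the round trip and is returned unchanged, so the function is safe to apply to every value in a column.</li>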
</ul>
<h2 id="2021-03-21">2021-03-21</h2>
<ul>
<li>Last week Atmire asked me which browser I was using to test the duplicate checker, which I had <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">reported</a> as not loading
<ul>
<li>I tried to load it in Chrome and it works… hmmm</li>
</ul>
</li>
<li>Back up the current <code>openrxv-items-final</code> index to start a fresh AReS Harvest:</li>
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items-temp": {
"aliases": {}
},
"openrxv-items-final": {
"aliases": {
"openrxv-items": {}
}
}
</code></pre><ul>
<li>Then I started a new harvesting</li>
<li>I switched the Node.js in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to v12 since v10 will cease to be supported soon
<ul>
<li>I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server</li>
</ul>
</li>
<li>The AReS harvest finally finished, with 1047 pages of items, but the <code>openrxv-items-final</code> index is empty and the <code>openrxv-items-temp</code> index has 103,000 items:</li>
<li>I looked in the Docker logs for Elasticsearch and saw a few memory errors:</li>
</ul>
<pre><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>According to <code>/usr/share/elasticsearch/config/jvm.options</code> in the Elasticsearch container the default JVM heap is 1g
<ul>
<li>I see the running Java process has <code>-Xms 1g -Xmx 1g</code> in its process invocation, so I guess it must indeed be using 1g</li>
<li>We can <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html">change the heap size with the ES_JAVA_OPTS environment variable</a></li>
<li>Or perhaps better, we should <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/jvm-options.html">use a jvm.options.d file</a> because if you use the environment variable it overrides all other JVM options from the default <code>jvm.options</code></li>
<li>I tried to set memory to 1536m by binding an options file and restarting the container, but it didn’t seem to work</li>
<li>Nevertheless, after restarting I see 103,000 items in the Explorer…</li>
<li>But the indexes are still kinda messed up… the <code>openrxv-items</code> index is an alias of the wrong index!</li>
<li>I re-deployed AReS with 1.5GB of heap using the <code>ES_JAVA_OPTS</code> environment variable
<ul>
<li>It turns out that this <em>is</em> the recommended way to set the heap: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html">https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html</a></li>
</ul>
</li>
<li>Then I fixed the aliases to make sure <code>openrxv-items</code> was an alias of <code>openrxv-items-final</code>, similar to what I did a few weeks ago</li>