<li>I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 alpha-2 country codes based on their <code>cg.coverage.country</code> text values
<ul>
<li>It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)</li>
<li>It implements a “force” mode too that will clear existing country codes and re-tag everything</li>
<li>It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…</li>
</ul>
</li>
</ul>
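<p>A minimal sketch of the lookup order and the “force” mode described above. This is not the code from the curation task: the class name, method names, and the tiny in-memory maps are all hypothetical, and I am assuming that without “force” an item that already has codes is left untouched.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: class, method, and map names are illustrative only.
class CountryCodeLookup {
    // Tiny stand-in for the ISO 3166-1 table (lower-cased name -> alpha-2 code).
    static final Map<String, String> ISO_NAMES = Map.of(
            "kenya", "KE",
            "viet nam", "VN",
            "tanzania, united republic of", "TZ");

    // Assumed examples of "display" names that differ from the ISO names.
    static final Map<String, String> DISPLAY_NAMES = Map.of(
            "vietnam", "VN",
            "tanzania", "TZ");

    // Try ISO 3166-1 first, then fall back to the display-name mapping.
    static Optional<String> lookup(String name) {
        String key = name.trim().toLowerCase(Locale.ROOT);
        String code = ISO_NAMES.get(key);
        if (code == null) {
            code = DISPLAY_NAMES.get(key);
        }
        return Optional.ofNullable(code);
    }

    // Return the codes an item should end up with.
    // Assumption: without "force", items that already have codes are left untouched.
    static List<String> tag(List<String> countries, List<String> existingCodes, boolean force) {
        if (!force && !existingCodes.isEmpty()) {
            return existingCodes;
        }
        List<String> codes = new ArrayList<>();
        for (String country : countries) {
            // Unknown names are skipped; duplicate codes are added only once.
            lookup(country).filter(c -> !codes.contains(c)).ifPresent(codes::add);
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(tag(List.of("Kenya", "Vietnam", "Atlantis"), List.of(), false));
        System.out.println(tag(List.of("Kenya"), List.of("XX"), true));
    }
}
```

<p>Keeping the two maps behind a single <code>lookup</code> method is what would make it easy to swap in another vocabulary later.</p>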
<ul>
<li>The code is currently on my personal GitHub: <a href="https://github.com/alanorth/dspace-curation-tasks">https://github.com/alanorth/dspace-curation-tasks</a>
<ul>
<li>I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the <code>dspace/lib</code> directory (not to mention the config)</li>
</ul>
</li>
<li>I forked the <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks to ILRI’s GitHub</a> and <a href="https://issues.sonatype.org/browse/OSSRH-59650">submitted the project to Maven Central</a> so I can integrate it more easily with our DSpace build via dependencies</li>
</ul>
<h2 id="2020-08-03">2020-08-03</h2>
<ul>
<li>Atmire responded to the ticket about the ongoing upgrade issues
<ul>
<li>They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: it now uses classes instead of Unicode hex characters, so our JS + SVG works!</li>
<li>They also said they have never experienced the <code>type: 5</code> site statistics issue, so I need to try to purge those and continue with the stats processing</li>
</ul>
</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
<li>Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API
<ul>
<li>He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: <a href="https://www.rtb.cgiar.org/publications/">https://www.rtb.cgiar.org/publications/</a></li>
</ul>
</li>
</ul>
<h2 id="2020-08-04">2020-08-04</h2>
<ul>
<li>Look into the REST API issues that Macaroni Bros raised last week:
<ul>
<li>The first one was about the <code>collections</code> endpoint returning empty items:
<ul>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2</a> (offset=2 is correct)</li>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3</a> (offset=3 is empty)</li>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4</a> (offset=4 is correct again)</li>
</ul>
</li>
<li>I confirm that the second link returns zero items on CGSpace…
<ul>
<li>I tested on my local development instance and it returns one item correctly…</li>
<li>I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly…</li>
<li>Perhaps an indexing issue?</li>
</ul>
</li>
<li>The second issue is the <code>collections</code> endpoint returning the wrong number of items:
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items">https://cgspace.cgiar.org/rest/collections/1445/items</a> (real number of items: 61)</li>
</ul>
</li>
<li>I confirm that it is indeed happening on CGSpace…
<ul>
<li>And actually I can replicate the same issue on my local CGSpace 5.8 instance:</li>
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
<ul>
<li>I started a full Discovery re-indexing</li>
</ul>
</li>
</ul>
<h2 id="2020-08-05">2020-08-05</h2>
<ul>
<li>Port my <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks</a> to DSpace 6 and tag version <code>6.0-SNAPSHOT</code></li>
<li>I downloaded the <a href="https://unstats.un.org/unsd/methodology/m49/overview/">UN M.49</a> CSV file to start working on updating the CGSpace regions
<ul>
<li>First issue is they don’t version the file so you have no idea when it was released</li>
<li>Second issue is that three rows have errors due to not using quotes around “China, Macao Special Administrative Region”</li>
</ul>
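<p>A rough sketch of how rows like that can be spotted: split each line on commas while respecting double quotes, then flag any row whose field count differs from the header’s. The class name and the simplified three-column layout in <code>main</code> are hypothetical and are not the real layout of the UN file.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags CSV rows whose field count differs from the header.
class M49RowCheck {
    // Split one CSV line on commas, treating double-quoted text as a single field.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                // A doubled quote inside a quoted field is an escaped quote.
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (ch == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Return the 1-based line numbers of rows that do not match the header width.
    static List<Integer> malformedRows(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        if (lines.isEmpty()) {
            return bad;
        }
        int expected = splitCsvLine(lines.get(0)).size();
        for (int i = 1; i < lines.size(); i++) {
            if (splitCsvLine(lines.get(i)).size() != expected) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Simplified, made-up column layout for illustration only.
        List<String> lines = List.of(
                "Region Name,Country or Area,M49 Code",
                "Eastern Asia,\"China, Macao Special Administrative Region\",446",
                "Eastern Asia,China, Macao Special Administrative Region,446");
        System.out.println(malformedRows(lines)); // prints [3]
    }
}
```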
</li>
<li>Bizu said she was having problems approving tasks on CGSpace
<ul>
<li>I looked at the PostgreSQL locks and they have skyrocketed since yesterday:</li>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
<li>But that pattern doesn’t match in the nginx bot list or Tomcat’s crawler session manager valve because we’re only checking for <code>[Bb]ot</code>!</li>
<li>So they have created thousands of Tomcat sessions:</li>
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources
<li>I see another IP, 104.198.96.245, which is also using the “RTB website BOT” user agent, but there are 70,000 hits in Solr from earlier this year, before they started using that user agent
<ul>
<li>I purged all the hits from Solr, including a few thousand from 64.62.202.71</li>
</ul>
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
<li>38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
</code></pre><ul>
<li>64.62.202.71 is using a user agent I’ve never seen before:</li>
<li>So now our “bot” regex can’t even match that…
<ul>
<li>Unless we change it to <code>[Bb]\.?[Oo]\.?[Tt]\.?</code>… which seems to match all variations of “bot” I can think of right now, according to <a href="https://regexr.com/59lpt">regexr.com</a>:</li>
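<li>A quick way to compare the two patterns outside of regexr, as a sketch only: the class name is hypothetical, the <code>b.o.t</code> user agent is a made-up example, and I use a substring search here, which may not be exactly how nginx or Tomcat apply their patterns:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical comparison of the current and proposed "bot" patterns.
class BotPatternCheck {
    static final Pattern OLD = Pattern.compile("[Bb]ot");
    static final Pattern NEW = Pattern.compile("[Bb]\\.?[Oo]\\.?[Tt]\\.?");

    // Substring search: true if the pattern occurs anywhere in the user agent.
    static boolean matches(Pattern pattern, String userAgent) {
        return pattern.matcher(userAgent).find();
    }

    public static void main(String[] args) {
        List<String> agents = List.of(
                "RTB website BOT",
                "Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2",
                "example b.o.t crawler"); // made-up user agent
        for (String agent : agents) {
            System.out.println(agent + " old=" + matches(OLD, agent) + " new=" + matches(NEW, agent));
        }
    }
}
```

Only the proposed pattern catches “RTB website BOT” and the dotted variant, and neither pattern catches the brokenlinkcheck.com user agent.</li>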
<li>I decided to start the 2018 core over again, so I re-synced it from CGSpace and ran the solr-upgrade-statistics-6x tool again, and now I’m having the same Java heap space issues that I had last month
<ul>
<li>The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m…</li>
<li>Also, I decided to try to purge all the <code>-unmigrated</code> docs that it had found so far to see if that helps…</li>
<li>There were about 466,000 records unmigrated so far, most of which were <code>type: 5</code> (SITE statistics)</li>
<li>Now it is processing again…</li>
</ul>
</li>
<li>I developed a small Java class called <code>FixJpgJpgThumbnails</code> to remove “.jpg.jpg” thumbnails from the <code>THUMBNAIL</code> bundle and replace them with their originals from the <code>ORIGINAL</code> bundle
<ul>
<li>The code is based on <a href="https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java">RemovePNGThumbnailsForPDFs.java</a> by Andrea Schweer</li>
<li>I incorporated it into my dspace-curation-tasks repository, then renamed it to <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a></li>
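<li>The selection step can be sketched with plain strings. This is not the real <code>FixJpgJpgThumbnails</code> class, which presumably works on DSpace items, bundles, and bitstreams; the names below are hypothetical, and I am assuming a thumbnail is named after its source bitstream plus <code>.jpg</code>:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, string-only sketch of picking which thumbnails to replace.
class JpgJpgThumbnailCheck {
    // True for names like "cover.jpg.jpg" (a JPEG thumbnail of a JPEG).
    static boolean isJpgJpg(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".jpg.jpg");
    }

    // Assumption: the thumbnail is named after its source bitstream plus ".jpg".
    static String sourceName(String thumbnailName) {
        return thumbnailName.substring(0, thumbnailName.length() - ".jpg".length());
    }

    // Thumbnails ending in ".jpg.jpg" whose source exists in the ORIGINAL bundle.
    static List<String> replaceable(List<String> thumbnailNames, List<String> originalNames) {
        List<String> result = new ArrayList<>();
        for (String thumbnail : thumbnailNames) {
            if (isJpgJpg(thumbnail) && originalNames.contains(sourceName(thumbnail))) {
                result.add(thumbnail);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> thumbnails = List.of("cover.jpg.jpg", "report.pdf.jpg", "orphan.jpg.jpg");
        List<String> originals = List.of("cover.jpg", "report.pdf");
        System.out.println(replaceable(thumbnails, originals)); // prints [cover.jpg.jpg]
    }
}
```

Checking that the source actually exists in <code>ORIGINAL</code> before touching a thumbnail avoids removing one that has nothing to replace it.</li>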