cgspace-notes/docs/2020-08/index.html
2020-12-06 16:53:29 +02:00

853 lines
41 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2020" />
<meta property="og:description" content="2020-08-02
I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their cg.coverage.country text values
It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter&rsquo;s preferred &ldquo;display&rdquo; country names)
It implements a &ldquo;force&rdquo; mode too that will clear existing country codes and re-tag everything
It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa&hellip;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-08/" />
<meta property="article:published_time" content="2020-08-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-09-02T13:39:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2020"/>
<meta name="twitter:description" content="2020-08-02
I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their cg.coverage.country text values
It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter&rsquo;s preferred &ldquo;display&rdquo; country names)
It implements a &ldquo;force&rdquo; mode too that will clear existing country codes and re-tag everything
It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa&hellip;
"/>
<meta name="generator" content="Hugo 0.79.0" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-08/",
"wordCount": "3672",
"datePublished": "2020-08-02T15:35:54+03:00",
"dateModified": "2020-09-02T13:39:11+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-08/">
<title>August, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.16633182cd803b52b9bf9e29ea1ef4b2e3d460deee0ded49466d7e16e449c158.css" rel="stylesheet" integrity="sha256-FmMxgs2AO1K5v54p6h70suPUYN7uDe1JRm1&#43;FuRJwVg=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.4ed405d7c7002b970d34cbe6026ff44a556b0808cb98a9db4008752110ed964b.js" integrity="sha256-TtQF18cAK5cNNMvmAm/0SlVrCAjLmKnbQAh1IRDtlks=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-08/">August, 2020</a></h2>
<p class="blog-post-meta">
<time datetime="2020-08-02T15:35:54+03:00">Sun Aug 02, 2020</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-08-02">2020-08-02</h2>
<ul>
<li>I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their <code>cg.coverage.country</code> text values
<ul>
<li>It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter&rsquo;s preferred &ldquo;display&rdquo; country names)</li>
<li>It implements a &ldquo;force&rdquo; mode too that will clear existing country codes and re-tag everything</li>
<li>It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa&hellip;</li>
</ul>
</li>
</ul>
<ul>
<li>The code is currently on my personal GitHub: <a href="https://github.com/alanorth/dspace-curation-tasks">https://github.com/alanorth/dspace-curation-tasks</a>
<ul>
<li>I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the <code>dspace/lib</code> directory (not to mention the config)</li>
</ul>
</li>
<li>I forked the <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks to ILRI&rsquo;s GitHub</a> and <a href="https://issues.sonatype.org/browse/OSSRH-59650">submitted the project to Maven Central</a> so I can integrate it more easily with our DSpace build via dependencies</li>
</ul>
<h2 id="2020-08-03">2020-08-03</h2>
<ul>
<li>Atmire responded to the ticket about the ongoing upgrade issues
<ul>
<li>They pushed an RC2 version of the CUA module that fixes the FontAwesome issue so that they now use classes instead of Unicode hex characters so our JS + SVG works!</li>
<li>They also said they have never experienced the <code>type: 5</code> site statistics issue, so I need to try to purge those and continue with the stats processing</li>
</ul>
</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Andrea from Macaroni Bros emailed me a few days ago to say he&rsquo;s having issues with the CGSpace REST API
<ul>
<li>He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: <a href="https://www.rtb.cgiar.org/publications/">https://www.rtb.cgiar.org/publications/</a></li>
</ul>
</li>
</ul>
<h2 id="2020-08-04">2020-08-04</h2>
<ul>
<li>Look into the REST API issues that Macaroni Bros raised last week:
<ul>
<li>The first one was about the <code>collections</code> endpoint returning empty items:
<ul>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=2">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=2</a> (offset=2 is correct)</li>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=3">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=3</a> (offset=3 is empty)</li>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=4">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&amp;offset=4</a> (offset=4 is correct again)</li>
</ul>
</li>
<li>I confirm that the second link returns zero items on CGSpace&hellip;
<ul>
<li>I tested on my local development instance and it returns one item correctly&hellip;</li>
<li>I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly&hellip;</li>
<li>Perhaps an indexing issue?</li>
</ul>
</li>
<li>The second issue is the <code>collections</code> endpoint returning the wrong number of items:
<ul>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445">https://cgspace.cgiar.org/rest/collections/1445</a> (numberItems: 63)</li>
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items">https://cgspace.cgiar.org/rest/collections/1445/items</a> (real number of items: 61)</li>
</ul>
</li>
<li>I confirm that it is indeed happening on CGSpace&hellip;
<ul>
<li>And actually I can replicate the same issue on my local CGSpace 5.8 instance:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
&quot;numberItems&quot; : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' jq '. | length'
61
</code></pre><ul>
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
</ul>
<pre><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
&quot;numberItems&quot; : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
</code></pre><ul>
<li>Ah! I exported that collection&rsquo;s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
<ul>
<li>I dealt with this problem in 2017-01 and the solution is to check the <code>collection2item</code> table:</li>
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
id | collection_id | item_id
--------+---------------+---------
133698 | 966 | 107687
134685 | 1445 | 107687
134686 | 1445 | 107687
(3 rows)
</code></pre><ul>
<li>So for each id you can delete one duplicate mapping:</li>
</ul>
<pre><code>dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
</code></pre><ul>
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter&rsquo;s preferred display names</li>
</ul>
<pre><code>$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
&quot;CONGO, DR&quot;,&quot;CONGO, DEMOCRATIC REPUBLIC OF&quot;
COTE D'IVOIRE,CÔTE D'IVOIRE
&quot;KOREA, REPUBLIC&quot;,&quot;KOREA, REPUBLIC OF&quot;
PALESTINE,&quot;PALESTINE, STATE OF&quot;
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
</code></pre><ul>
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
<ul>
<li>I started a full Discovery re-indexing</li>
</ul>
</li>
</ul>
<h2 id="2020-08-05">2020-08-05</h2>
<ul>
<li>Port my <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks</a> to DSpace 6 and tag version <code>6.0-SNAPSHOT</code></li>
<li>I downloaded the <a href="https://unstats.un.org/unsd/methodology/m49/overview/">UN M.49</a> CSV file to start working on updating the CGSpace regions
<ul>
<li>First issue is they don&rsquo;t version the file so you have no idea when it was released</li>
<li>Second issue is that three rows have errors due to not using quotes around &ldquo;China, Macao Special Administrative Region&rdquo;</li>
</ul>
</li>
<li>Bizu said she was having problems approving tasks on CGSpace
<ul>
<li>I looked at the PostgreSQL locks and they have skyrocketed since yesterday:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/08/postgres_locks_ALL-day.png" alt="PostgreSQL locks day"></p>
<p><img src="/cgspace-notes/2020/08/postgres_querylength_ALL-day.png" alt="PostgreSQL query length day"></p>
<ul>
<li>Seems that something happened yesterday afternoon at around 5PM&hellip;
<ul>
<li>For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue</li>
<li>I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly</li>
</ul>
</li>
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
<li>But that pattern doesn&rsquo;t match in the nginx bot list or Tomcat&rsquo;s crawler session manager valve because we&rsquo;re only checking for <code>[Bb]ot</code>!</li>
<li>So they have created thousands of Tomcat sessions:</li>
</ul>
</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep -E &quot;(63.32.242.35|64.62.202.71)&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
</code></pre><ul>
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don&rsquo;t misuse the resources
<ul>
<li>Perhaps <code>[Bb][Oo][Tt]</code>&hellip;</li>
</ul>
</li>
<li>I see another IP 104.198.96.245, which is also using the &ldquo;RTB website BOT&rdquo; but there are 70,000 hits in Solr from earlier this year before they started using the user agent
<ul>
<li>I purged all the hits from Solr, including a few thousand from 64.62.202.71</li>
</ul>
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep &quot;38.128.66.10&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep &quot;64.62.202.71&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
</code></pre><ul>
<li>38.128.66.10 isn&rsquo;t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
</code></pre><ul>
<li>64.62.202.71 is using a user agent I&rsquo;ve never seen before:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
</code></pre><ul>
<li>So now our &ldquo;bot&rdquo; regex can&rsquo;t even match that&hellip;
<ul>
<li>Unless we change it to <code>[Bb]\.?[Oo]\.?[Tt]\.?</code>&hellip; which seems to match all variations of &ldquo;bot&rdquo; I can think of right now, according to <a href="https://regexr.com/59lpt">regexr.com</a>:</li>
</ul>
</li>
</ul>
<pre><code>RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep &quot;199.47.87.145&quot; | grep -E 'sessi
on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
</code></pre><ul>
<li>I will add <code>Turnitin</code> to the Tomcat Crawler Session Manager Valve regex as well&hellip;</li>
</ul>
<h2 id="2020-08-06">2020-08-06</h2>
<ul>
<li>I have been working on processing the Solr statistics with the Atmire tool on DSpace Test the last few days:
<ul>
<li>statistics:
<ul>
<li>2,040,385 docs: 2h 28m 49s</li>
</ul>
</li>
<li>statistics-2019:
<ul>
<li>8,960,000 docs: 12h 7s</li>
<li>1,780,575 docs: 2h 7m 29s</li>
</ul>
</li>
<li>statistics-2018:
<ul>
<li>1,970,000 docs: 12h 1m 28s</li>
<li>360,000 docs: 2h 54m 56s (Linode rebooted)</li>
<li>1,110,000 docs: 7h 1m 44s (Restarted Tomcat, oops)</li>
</ul>
</li>
</ul>
</li>
<li>I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again from the solr-upgrade-statistics-6x tool and now I&rsquo;m having the same issues with Java heap space that I had last month
<ul>
<li>The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m&hellip;</li>
<li>Also, I decided to try to purge all the <code>-unmigrated</code> docs that it had found so far to see if that helps&hellip;</li>
<li>There were about 466,000 records unmigrated so far, most of which were <code>type: 5</code> (SITE statistics)</li>
<li>Now it is processing again&hellip;</li>
</ul>
</li>
<li>I developed a small Java class called <code>FixJpgJpgThumbnails</code> to remove &ldquo;.jpg.jpg&rdquo; thumbnails from the <code>THUMBNAIL</code> bundle and replace them with their originals from the <code>ORIGINAL</code> bundle
<ul>
<li>The code is based on <a href="https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java">RemovePNGThumbnailsForPDFs.java</a> by Andrea Schweer</li>
<li>I incorporated it into my dspace-curation-tasks repository, then renamed it to <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a></li>
<li>In testing I found that I can replace ~4,000 thumbnails on CGSpace!</li>
</ul>
</li>
</ul>
<h2 id="2020-08-07">2020-08-07</h2>
<ul>
<li>I improved the <code>RemovePNGThumbnailsForPDFs.java</code> a bit more to exclude infographics and original bitstreams larger than 100KiB
<ul>
<li>I ran it on CGSpace and it cleaned up 3,769 thumbnails!</li>
<li>Afterwards I ran <code>dspace cleanup -v</code> to remove the deleted thumbnails</li>
</ul>
</li>
</ul>
<h2 id="2020-08-08">2020-08-08</h2>
<ul>
<li>The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:</li>
</ul>
<pre><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>It lists a few of the records that it is having issues with and they all have integer IDs
<ul>
<li>When I checked Solr I see 8,000 of them, some of which have type 0 and some with no type&hellip;</li>
<li>I purged them and then the process continues:</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-09">2020-08-09</h2>
<ul>
<li>The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space&hellip;</li>
<li>I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:</li>
</ul>
<pre><code># grep -oE &quot;Record uid: ([a-f0-9\\-]*){1} couldn't be processed&quot; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
# wc -l /tmp/not-processed-errors.txt
2202973 /tmp/not-processed-errors.txt
# sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
</code></pre><ul>
<li>I looked at some of those records and saw strange objects in their <code>containerCommunity</code>, <code>containerCollection</code>, etc&hellip;</li>
</ul>
<pre><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;q&quot;: &quot;uid:fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;indent&quot;: &quot;true&quot;,
&quot;wt&quot;: &quot;json&quot;,
&quot;_&quot;: &quot;1596957629970&quot;
}
},
&quot;response&quot;: {
&quot;numFound&quot;: 1,
&quot;start&quot;: 0,
&quot;docs&quot;: [
{
&quot;containerCommunity&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
],
&quot;uid&quot;: &quot;fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;containerCollection&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
],
&quot;owningComm&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
],
&quot;isInternal&quot;: false,
&quot;isBot&quot;: false,
&quot;statistics_type&quot;: &quot;view&quot;,
&quot;time&quot;: &quot;2018-05-08T23:17:00.157Z&quot;,
&quot;owningColl&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
],
&quot;_version_&quot;: 1621500445042147300
}
]
}
}
</code></pre><ul>
<li>I deleted those 11,724 records with the strange &ldquo;set&rdquo; object in the collections and communities, as well as 360,000 records with <code>id: -1</code></li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the solr cores didn&rsquo;t all come back up OK
<ul>
<li>I had to restart Tomcat five times before they all came up!</li>
<li>After that I generated a report for 2018 and 2019 on each server and found that the difference is about 10,00020,000 per month, which is much less than I was expecting</li>
</ul>
</li>
<li>I noticed some authors that should have ORCID identifiers, but didn&rsquo;t (perhaps older items before we were tagging ORCID metadata)
<ul>
<li>With the simple list below I added 1,341 identifiers!</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-08-09-add-ILRI-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Grace, Delia&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Delia Grace&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Baker, Derek&quot;,&quot;Derek Baker: 0000-0001-6020-6973&quot;
&quot;Ngan Tran Thi&quot;,&quot;Tran Thi Ngan: 0000-0002-7184-3086&quot;
&quot;Dang Xuan Sinh&quot;,&quot;Sinh Dang-Xuan: 0000-0002-0522-7808&quot;
&quot;Hung Nguyen-Viet&quot;,&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;
&quot;Pham Van Hung&quot;,&quot;Pham Anh Hung: 0000-0001-9366-0259&quot;
&quot;Lindahl, Johanna F.&quot;,&quot;Johanna Lindahl: 0000-0002-1175-0398&quot;
&quot;Teufel, Nils&quot;,&quot;Nils Teufel: 0000-0001-5305-6620&quot;
&quot;Duncan, Alan J.&quot;,Alan Duncan: 0000-0002-3954-3067&quot;
&quot;Moodley, Arshnee&quot;,&quot;Arshnee Moodley: 0000-0002-6469-3948&quot;
</code></pre><ul>
<li>That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
COPY 2095
dspace=# \q
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
$ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
</code></pre><ul>
<li>I looked into the strange Solr record above that had &ldquo;{set=830}&rdquo; in the communities and collections
<ul>
<li>There are exactly 11724 records like this in the current CGSpace (DSpace 5.8) statistics-2018 Solr core</li>
<li>None of them have an <code>id</code> or <code>type</code> field!</li>
<li>I see 242,000 of them in the statistics-2017 core, 185,063 in the statistics-2016 core&hellip; all the way to 2010, but not in 2019 or the current statistics core</li>
<li>I decided to purge all of these records from CGSpace right now so they don&rsquo;t even have a chance at being an issue on the real migration:</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>I added <code>Googlebot</code> and <code>Twitterbot</code> to the list of explicitly allowed bots
<ul>
<li>In Google&rsquo;s case, they were getting lumped in with all the other bad bots and then important links like the sitemaps were returning HTTP 503, but they generally respect <code>robots.txt</code> so we should just allow them (perhaps we can control the crawl rate in the webmaster console)</li>
<li>In Twitter&rsquo;s case they were also getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter</li>
</ul>
</li>
<li>I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my <a href="https://github.com/ilri/cgspace-java-helpers">CountryCodeTagger</a> curation task
<ul>
<li>I still need to set up a cron job for it&hellip;</li>
<li>This tagged 50,000 countries!</li>
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
count
-------
50812
(1 row)
</code></pre><h2 id="2020-08-11">2020-08-11</h2>
<ul>
<li>I noticed some more hits from Macaroni&rsquo;s WordPress harvestor that I hadn&rsquo;t caught last week
<ul>
<li>104.198.13.34 made many requests without a user agent, with a &ldquo;WordPress&rdquo; user agent, and with their new &ldquo;RTB website BOT&rdquo; user agent, about 100,000 in total in 2020, and maybe another 70,000 in the other years</li>
<li>I will purge them an add them to the Tomcat Crawler Session Manager and the DSpace bots list so they don&rsquo;t get logged in Solr</li>
</ul>
</li>
<li>I noticed a bunch of user agents with &ldquo;Crawl&rdquo; in the Solr stats, which is strange because the DSpace spider agents file has had &ldquo;crawl&rdquo; for a long time (and it is case insensitive)
<ul>
<li>In any case I will purge them and add them to the Tomcat Crawler Session Manager Valve so that at least their sessions get re-used</li>
</ul>
</li>
</ul>
<h2 id="2020-08-13">2020-08-13</h2>
<ul>
<li>Linode keeps sending mails that the load and outgoing bandwidth is above the threshold
<ul>
<li>I took a look briefly and found two IPs with the &ldquo;Delphi 2009&rdquo; user agent</li>
<li>Then there is 88.99.115.53 which made 82,000 requests in 2020 so far with no user agent</li>
<li>64.62.202.73 has made 7,000 requests with this user agent <code>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)</code></li>
<li>I had added it to the Tomcat Crawler Session Manager Valve last week but never purged the hits from Solr</li>
<li>195.54.160.163 is making thousands of requests with user agents liket this:</li>
</ul>
</li>
</ul>
<p><code>(CASE WHEN 2850=9474 THEN 2850 ELSE NULL END)</code></p>
<ul>
<li>I purged 150,000 hits from 2020 and 2020 from these user agents and hosts</li>
</ul>
<h2 id="2020-08-14">2020-08-14</h2>
<ul>
<li>Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:</li>
</ul>
<pre><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.NullPointerException
</code></pre><ul>
<li>I see it has <code>id: 980-unmigrated</code> and <code>type: 0</code>&hellip;</li>
<li>The 2016 core has 629,983 unmigrated docs, mostly:
<ul>
<li><code>type: 5</code>: 620311</li>
<li><code>type: 0</code>: 7255</li>
<li><code>type: 3</code>: 1333</li>
</ul>
</li>
<li>I purged the unmigrated docs and continued processing:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
<li>Altmetric asked for a dump of CGSpace&rsquo;s OAI &ldquo;sets&rdquo; so they can update their affiliation mappings
<ul>
<li>I did it in a kinda ghetto way:</li>
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &quot;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&quot; &gt; /tmp/$num.xml; sleep 2; done
$ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets.xml; done
</code></pre><ul>
<li>This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that&rsquo;s how theirs was in the first place&hellip;</li>
<li>Help Bizu with a restricted item for CIAT</li>
</ul>
<h2 id="2020-08-16">2020-08-16</h2>
<ul>
<li>The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs&hellip;</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones&hellip; so I purged them in 2015 and all the rest of the statistics cores</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
<ul>
<li>I tested the DSpace 5 and DSpace 6 versions of the <a href="https://github.com/ilri/cgspace-java-helpers">country code tagger curation task</a> and noticed a few things
<ul>
<li>The DSpace 5.8 version finishes in 2 hours and 1 minute</li>
<li>The DSpace 6.3 version ran for over 12 hours and didn&rsquo;t even finish (I killed it)</li>
<li>Furthermore, it seems that each item is curated once for each collection it appears in, causing about 115,000 items to be processed, even though we only have about 87,000</li>
</ul>
</li>
<li>I had been running the tasks on the entire repository with <code>-i 10568/0</code>, but I think I might need to try again with the special <code>all</code> option before writing to the dspace-tech mailing list for help
<ul>
<li>Actually I just tested the <code>all</code> option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings</li>
<li>I sent a message to the dspace-tech mailing list</li>
</ul>
</li>
<li>I finished the Atmire stats processing on all cores on DSpace Test:
<ul>
<li>statistics:
<ul>
<li>2,040,385 docs: 2h 28m 49s</li>
</ul>
</li>
<li>statistics-2019:
<ul>
<li>8,960,000 docs: 12h 7s</li>
<li>1,780,575 docs: 2h 7m 29s</li>
</ul>
</li>
<li>statistics-2018:
<ul>
<li>2,200,000 docs: 12h 1m 11s</li>
<li>2,100,000 docs: 12h 4m 19s</li>
<li>?</li>
</ul>
</li>
<li>statistics-2017:
<ul>
<li>1,970,000 docs: 12h 5m 45s</li>
<li>2,000,000 docs: 12h 5m 38s</li>
<li>1,312,674 docs: 4h 14m 23s</li>
</ul>
</li>
<li>statistics-2016:
<ul>
<li>1,669,020 docs: 12h 4m 3s</li>
<li>1,650,000 docs: 12h 7m 40s</li>
<li>850,611 docs: 44m 52s</li>
</ul>
</li>
<li>statistics-2014:
<ul>
<li>4,832,334 docs: 3h 53m 41s</li>
</ul>
</li>
<li>statistics-2013:
<ul>
<li>4,509,891 docs: 3h 18m 44s</li>
</ul>
</li>
<li>statistics-2012:
<ul>
<li>3,716,857 docs: 2h 36m 21s</li>
</ul>
</li>
<li>statistics-2011:
<ul>
<li>1,645,426 docs: 1h 11m 41s</li>
</ul>
</li>
</ul>
</li>
<li>As far as I can tell, the processing became much faster once I purged all the unmigrated records
<ul>
<li>It took about six days for the processing according to the times above, though 2015 is missing&hellip; hmm</li>
</ul>
</li>
<li>Now I am testing the Atmire Listings and Reports
<ul>
<li>On both my local test and DSpace Test I get no results when searching for &ldquo;Orth, A.&rdquo; and &ldquo;Orth, Alan&rdquo; or even Delia Grace, but the Discovery index is up to date and I have eighteen items&hellip;</li>
<li>I sent a message to Atmire&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-08-20">2020-08-20</h2>
<ul>
<li>Natalia from CIAT was asking how she can download all the PDFs for the items in a search result
<ul>
<li>The search result is for the keyword &ldquo;trade off&rdquo; in the WLE community</li>
<li>I converted the Discovery search to an open-search query to extract the XML, but we can&rsquo;t get all the results on one page so I had to change the <code>rpp</code> to 100 and request a few times to get them all:</li>
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0' User-Agent:'curl' &gt; /tmp/wle-trade-off-page1.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100' User-Agent:'curl' &gt; /tmp/wle-trade-off-page2.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200' User-Agent:'curl' &gt; /tmp/wle-trade-off-page3.xml
</code></pre><ul>
<li>Ugh, and to extract the <code>&lt;id&gt;</code> from each <code>&lt;entry&gt;</code> we have to use an XPath query, but use a <a href="http://blog.powered-up-games.com/wordpress/archives/70">hack to ignore the default namespace by setting each element&rsquo;s local name</a>:</li>
</ul>
<pre><code>$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
$ sort -u /tmp/ids.txt &gt; /tmp/ids-sorted.txt
$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt &gt; /tmp/handles.txt
</code></pre><ul>
<li>Now I have all the handles for the matching items and I can use the REST API to get each item&rsquo;s PDFs&hellip;
<ul>
<li>I wrote <code>get-wle-pdfs.py</code> to read the handles from a text file and get all PDFs: <a href="https://github.com/ilri/DSpace/blob/5_x-prod/get-wle-pdfs.py">https://github.com/ilri/DSpace/blob/5_x-prod/get-wle-pdfs.py</a></li>
</ul>
</li>
<li>Add <code>Foreign, Commonwealth and Development Office, United Kingdom</code> to the controlled vocabulary for sponsors on CGSpace
<ul>
<li>This is the new name for DFID as of 2020-09-01</li>
<li>We will continue using DFID for older items</li>
</ul>
</li>
</ul>
<h2 id="2020-08-22">2020-08-22</h2>
<ul>
<li>Peter noticed that the AReS data was out dated, and I see in the admin dashboard that it hasn&rsquo;t been updated since 2020-07-21
<ul>
<li>I initiated a re-indexing and I see from the CGSpace logs that it is indeed running</li>
</ul>
</li>
<li>Margarita from CCAFS asked for help adding a new user to their submission and approvers groups
<ul>
<li>I told them to log in using the LDAP login first so that the e-person gets created</li>
</ul>
</li>
<li>I manually renamed a few dozen of the stupid &ldquo;a-ILRI submitters&rdquo; groups that had the &ldquo;a-&rdquo; prefix on CGSpace
<ul>
<li>For what it&rsquo;s worth, we had asked Sisay to do this over a year ago and he never did</li>
<li>Also, we have two CCAFS approvers groups: <code>CCAFS approvers</code> and <code>CCAFS approvers1</code>, with each added to about half of the CCAFS collections</li>
<li>The group members are the same so I went through and replaced the <code>CCAFS approvers1</code> group everywhere manually&hellip;</li>
<li>I also removed some old CCAFS users from the groups</li>
</ul>
</li>
</ul>
<h2 id="2020-08-27">2020-08-27</h2>
<ul>
<li>I ran the CountryCodeTagger on CGSpace and it was very fast:</li>
</ul>
<pre><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-08-27-countrycodetagger.log
real 2m7.643s
user 1m48.740s
sys 0m14.518s
$ grep -c added /tmp/2020-08-27-countrycodetagger.log
46
</code></pre><ul>
<li>I still haven&rsquo;t created a cron job for it&hellip; but it&rsquo;s good to know that when it doesn&rsquo;t need to add very many country codes that it is very fast (original run a few weeks ago added 50,000 country codes)
<ul>
<li>I wonder how DSpace 6 will perform when it doesn&rsquo;t need to add all the codes, like after the initial run</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-12/">December, 2020</a></li>
<li><a href="/cgspace-notes/cgspace-dspace6-upgrade/">CGSpace DSpace 6 Upgrade</a></li>
<li><a href="/cgspace-notes/2020-11/">November, 2020</a></li>
<li><a href="/cgspace-notes/2020-10/">October, 2020</a></li>
<li><a href="/cgspace-notes/2020-09/">September, 2020</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>