Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions


</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
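<li>A quick way to confirm the purge worked would be to count any remaining unmigrated docs in the core (a hypothetical check against the same local Solr instance):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=id:/.*unmigrated.*/&amp;rows=0&amp;wt=json' | jq '.response.numFound'
</code></pre><ul>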
<li>Andrea from Macaroni Bros emailed me a few days ago to say he&rsquo;s having issues with the CGSpace REST API
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
&quot;numberItems&quot; : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
<pre tabindex="0"><code>$ http &#39;http://localhost:8080/rest/collections/1445&#39; | json_pp | grep numberItems
&#34;numberItems&#34; : 63,
$ http &#39;http://localhost:8080/rest/collections/1445/items&#39; jq &#39;. | length&#39;
61
</code></pre><ul>
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
</ul>
<pre tabindex="0"><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
&quot;numberItems&quot; : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
<pre tabindex="0"><code>$ http &#39;https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708&#39; | json_pp | grep numberItems
&#34;numberItems&#34; : 61,
$ http &#39;https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items&#39; | jq &#39;. | length&#39;
59
</code></pre><ul>
<li>Ah! I exported that collection&rsquo;s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = &#39;107687&#39;;
id | collection_id | item_id
--------+---------------+---------
133698 | 966 | 107687
</code></pre><ul>
<li>So for each id you can delete one duplicate mapping:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id=&#39;134686&#39;;
dspace=# DELETE FROM collection2item WHERE id=&#39;128819&#39;;
</code></pre><ul>
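<li>To see whether any other items have the same problem, one could look for duplicate rows in the same table (a hypothetical query, assuming the same dspace database and user as above):</li>
</ul>
<pre tabindex="0"><code>$ psql -U dspace dspace -c 'SELECT item_id, collection_id, COUNT(*) FROM collection2item GROUP BY item_id, collection_id HAVING COUNT(*) &gt; 1;'
</code></pre><ul>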
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter&rsquo;s preferred display names</li>
</ul>
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
&quot;CONGO, DR&quot;,&quot;CONGO, DEMOCRATIC REPUBLIC OF&quot;
COTE D'IVOIRE,CÔTE D'IVOIRE
&quot;KOREA, REPUBLIC&quot;,&quot;KOREA, REPUBLIC OF&quot;
PALESTINE,&quot;PALESTINE, STATE OF&quot;
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
</code></pre><ul>
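<li>A quick spot check that the replacements took effect, which should return zero for an old value (hypothetical, assuming metadata field 228 is cg.coverage.country as in the command above):</li>
</ul>
<pre tabindex="0"><code>$ psql -U dspace dspace -c &quot;SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value='CAPE VERDE';&quot;
</code></pre><ul>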
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
<ul>
</li>
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#39;04/Aug/2020:(17|18)&#39; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E &quot;(63.32.242.35|64.62.202.71)&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E &#34;(63.32.242.35|64.62.202.71)&#34; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
5693
</code></pre><ul>
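<li>A rough way to see which IPs are sending that user agent is to tally them from the nginx logs (a sketch, using the same log locations as above):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 'RTB website BOT' | awk '{print $1}' | sort | uniq -c | sort -rn | head
</code></pre><ul>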
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don&rsquo;t misuse the resources
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;38.128.66.10&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &#34;38.128.66.10&#34; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep &quot;64.62.202.71&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
</code></pre><ul>
<li>38.128.66.10 isn&rsquo;t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;199.47.87.145&quot; | grep -E 'sessi
on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &#34;199.47.87.145&#34; | grep -E &#39;sessi
on_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2777
</code></pre><ul>
<li>I will add <code>Turnitin</code> to the Tomcat Crawler Session Manager Valve regex as well&hellip;</li>
<ul>
<li>The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:</li>
</ul>
<pre tabindex="0"><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
<pre tabindex="0"><code>Exception: 50 consecutive records couldn&#39;t be saved. There&#39;s most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn&#39;t be saved. There&#39;s most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><h2 id="2020-08-09">2020-08-09</h2>
<ul>
<li>The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space&hellip;</li>
<li>I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:</li>
</ul>
<pre tabindex="0"><code># grep -oE &quot;Record uid: ([a-f0-9\\-]*){1} couldn't be processed&quot; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
<pre tabindex="0"><code># grep -oE &#34;Record uid: ([a-f0-9\\-]*){1} couldn&#39;t be processed&#34; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
# wc -l /tmp/not-processed-errors.txt
2202973 /tmp/not-processed-errors.txt
# sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
</code></pre><ul>
<li>I looked at some of those records and saw strange objects in their <code>containerCommunity</code>, <code>containerCollection</code>, etc&hellip;</li>
</ul>
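<ul>
<li>Such a record can be fetched from the statistics-2018 core by its uid, presumably with something like the following query (a sketch; the response below is the kind of document that comes back):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics-2018/select?q=uid:fff1349d-79d5-4ceb-89a1-ce78107d982d&amp;indent=true&amp;wt=json'
</code></pre>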
<pre tabindex="0"><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;q&quot;: &quot;uid:fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;indent&quot;: &quot;true&quot;,
&quot;wt&quot;: &quot;json&quot;,
&quot;_&quot;: &quot;1596957629970&quot;
}
},
&quot;response&quot;: {
&quot;numFound&quot;: 1,
&quot;start&quot;: 0,
&quot;docs&quot;: [
{
&quot;containerCommunity&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
],
&quot;uid&quot;: &quot;fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;containerCollection&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
],
&quot;owningComm&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
],
&quot;isInternal&quot;: false,
&quot;isBot&quot;: false,
&quot;statistics_type&quot;: &quot;view&quot;,
&quot;time&quot;: &quot;2018-05-08T23:17:00.157Z&quot;,
&quot;owningColl&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
],
&quot;_version_&quot;: 1621500445042147300
}
]
}
</code></pre><ul>
<li>I deleted those 11,724 records with the strange &ldquo;set&rdquo; object in the collections and communities, as well as 360,000 records with <code>id: -1</code></li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the solr cores didn&rsquo;t all come back up OK
<ul>
</ul>
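<ul>
<li>One way to see which statistics cores actually loaded after such a restart is to ask the Solr cores admin API (a hypothetical check on the same instance):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json' | jq '.status | keys'
</code></pre>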
<pre tabindex="0"><code>$ cat 2020-08-09-add-ILRI-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Grace, Delia&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Delia Grace&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Baker, Derek&quot;,&quot;Derek Baker: 0000-0001-6020-6973&quot;
&quot;Ngan Tran Thi&quot;,&quot;Tran Thi Ngan: 0000-0002-7184-3086&quot;
&quot;Dang Xuan Sinh&quot;,&quot;Sinh Dang-Xuan: 0000-0002-0522-7808&quot;
&quot;Hung Nguyen-Viet&quot;,&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;
&quot;Pham Van Hung&quot;,&quot;Pham Anh Hung: 0000-0001-9366-0259&quot;
&quot;Lindahl, Johanna F.&quot;,&quot;Johanna Lindahl: 0000-0002-1175-0398&quot;
&quot;Teufel, Nils&quot;,&quot;Nils Teufel: 0000-0001-5305-6620&quot;
&quot;Duncan, Alan J.&quot;,&quot;Alan Duncan: 0000-0002-3954-3067&quot;
&quot;Moodley, Arshnee&quot;,&quot;Arshnee Moodley: 0000-0002-6469-3948&quot;
</code></pre><ul>
<li>That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
COPY 2095
dspace=# \q
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
$ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
</code></pre><ul>
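<li>Those bare identifiers could be resolved to names against the public ORCID API if needed (a hypothetical lookup using one iD from the list above; the v3.0 endpoint and JSON structure are assumptions):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'https://pub.orcid.org/v3.0/0000-0002-0195-9489/person' -H 'Accept: application/json' | jq -r '.name.&quot;given-names&quot;.value, .name.&quot;family-name&quot;.value'
</code></pre>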
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;&#39;
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
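<li>Rather than issuing these one core at a time, the same purge could be looped over all the yearly statistics cores (a sketch, assuming cores statistics-2010 through statistics-2018 exist):</li>
</ul>
<pre tabindex="0"><code>$ for year in {2010..2018}; do curl -s &quot;http://localhost:8081/solr/statistics-$year/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'; done
</code></pre><ul>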
<li>I added <code>Googlebot</code> and <code>Twitterbot</code> to the list of explicitly allowed bots
<ul>
<ul>
<li>Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:</li>
</ul>
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
...
Caused by: java.lang.NullPointerException
</li>
<li>I purged the unmigrated docs and continued processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
<li>Altmetric asked for a dump of CGSpace&rsquo;s OAI &ldquo;sets&rdquo; so they can update their affiliation mappings
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &quot;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&quot; &gt; /tmp/$num.xml; sleep 2; done
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/oai/request?verb=ListSets&#39; &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &#34;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&#34; &gt; /tmp/$num.xml; sleep 2; done
$ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets.xml; done
</code></pre><ul>
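<li>A rough sanity check on how many sets were fetched across all the pages (a sketch; grep is used because the concatenated responses are not one well-formed XML document):</li>
</ul>
<pre tabindex="0"><code>$ grep -o '&lt;setSpec&gt;' /tmp/cgspace-oai-sets.xml | wc -l
</code></pre><ul>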
<li>This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that&rsquo;s how theirs was in the first place&hellip;</li>
<li>The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs&hellip;</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones&hellip; so I purged them in 2015 and all the rest of the statistics cores</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
<ul>
<li>I tested the DSpace 5 and DSpace 6 versions of the <a href="https://github.com/ilri/cgspace-java-helpers">country code tagger curation task</a> and noticed a few things
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0' User-Agent:'curl' &gt; /tmp/wle-trade-off-page1.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100' User-Agent:'curl' &gt; /tmp/wle-trade-off-page2.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200' User-Agent:'curl' &gt; /tmp/wle-trade-off-page3.xml
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page1.xml
$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page2.xml
$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page3.xml
</code></pre><ul>
<li>Ugh, and to extract the <code>&lt;id&gt;</code> from each <code>&lt;entry&gt;</code> we have to use an XPath query, but use a <a href="http://blog.powered-up-games.com/wordpress/archives/70">hack to ignore the default namespace by setting each element&rsquo;s local name</a>:</li>
</ul>
<pre tabindex="0"><code>$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
<pre tabindex="0"><code>$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
$ sort -u /tmp/ids.txt &gt; /tmp/ids-sorted.txt
$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt &gt; /tmp/handles.txt
</code></pre><ul>
<li>Now I have all the handles for the matching items and I can use the REST API to get each item&rsquo;s PDFs&hellip;
<ul>
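<li>For example, something like this hypothetical loop could pull the PDF bitstream links for each handle (the <code>expand=bitstreams</code> parameter and the <code>mimeType</code> and <code>retrieveLink</code> fields are assumptions about the DSpace 5 REST API):</li>
</ul>
<pre tabindex="0"><code>$ while read -r handle; do http &quot;https://cgspace.cgiar.org/rest/handle/$handle?expand=bitstreams&quot; User-Agent:'curl' | jq -r '.bitstreams[] | select(.mimeType==&quot;application/pdf&quot;) | .retrieveLink'; done &lt; /tmp/handles.txt
</code></pre>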