Add notes for 2021-09-13

2021-09-13 16:21:16 +03:00
parent 8b487a4a77
commit c05c7213c2
109 changed files with 2627 additions and 2530 deletions


@ -29,4 +29,46 @@ $ docker-compose build
- Then run system updates and reboot the server
- After the system came back up I started a fresh re-harvesting
## 2021-09-07
- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
- 78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
- It's a fixed-line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be someone from the CGIAR SMO harvesting the web application from their browser
- 130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 35.174.144.154 is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
- 192.121.135.6 is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 185.38.40.66 is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4`
- 3.225.28.105 is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
- I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
- I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
- While looking at the MSN requests I noticed tons of requests from other strange hosts with reverse DNS names like malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
- They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- So these startdedicated.com hosts must be some Bing-related bot as well...
- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script
- In total I purged 225,000 hits...
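- As a rough sketch of that workflow (assuming Solr is listening on localhost:8983 with the default `statistics` core; the IP below is just the one embedded in the msnbot hostname above), the top client IPs for August can be faceted and a candidate host confirmed by reverse DNS before purging:
```console
$ # Facet the top client IPs recorded in Solr statistics for August 2021
$ curl -s 'http://localhost:8983/solr/statistics/select' \
    --data-urlencode 'q=time:[2021-08-01T00:00:00Z TO 2021-09-01T00:00:00Z]' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'facet.limit=20' \
    --data-urlencode 'wt=json'
$ # Confirm a suspicious IP by reverse DNS before adding it to the purge list
$ host 40.77.167.105
105.167.77.40.in-addr.arpa domain name pointer msnbot-40-77-167-105.search.msn.com.
```
- Anything resolving to msnbot or startdedicated.com hosts can then go into the list of IPs that gets purged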
## 2021-09-12
- Start a harvest on AReS
## 2021-09-13
- Mishell Portilla asked me about thumbnails on CGSpace being small
- For example, [10568/114576](https://cgspace.cgiar.org/handle/10568/114576) has a lot of white space on the left side
- I created a new thumbnail with vipsthumbnail:
```console
$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- Looking at the PDF's metadata I see:
- Producer: iLovePDF
- Creator: Adobe InDesign 15.0 (Windows)
- Format: PDF-1.7
- Eventually I should do more tests on this and perhaps file a bug with DSpace...
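- A minimal sketch of checking that metadata from the command line, assuming `pdfinfo` (from poppler-utils) is installed and using the same file as above:
```console
$ pdfinfo ARRTB2020ST.pdf | grep -E 'Creator|Producer|PDF version'
```
- `exiftool` would show the same Creator/Producer fields if poppler is not available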
- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
- I told them I could give them access to DSpace Test and that we should have a meeting soon
- We need to figure out what controlled vocabularies they should use
<!-- vim: set sw=2 ts=2: -->


@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -126,7 +126,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
</code></pre><ul>
<li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li>
@ -137,7 +137,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li>
<li>Looks like there are still a bunch of idle PostgreSQL connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
96
</code></pre><ul>
<li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li>
@ -147,7 +147,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<li>Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config</li>
<li>The OAI application requests stylesheets and javascript files with the path <code>/oai/static/css</code>, which gets matched here:</li>
</ul>
<pre><code># static assets we can load from the file system directly with nginx
<pre tabindex="0"><code># static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
@ -158,21 +158,21 @@ location ~ /(themes|static|aspects/ReportingSuite) {
<li>We simply need to add <code>include extra-security.conf;</code> to the above location block (but research and test first)</li>
<li>We should add WOFF assets to the list of things to set expires for:</li>
</ul>
<pre><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
<pre tabindex="0"><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
</code></pre><ul>
<li>We should also add <code>aspects/Statistics</code> to the location block for static assets (minus <code>static</code> from above):</li>
</ul>
<pre><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
<pre tabindex="0"><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
</code></pre><ul>
<li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li>
<li>CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
93
</code></pre><ul>
<li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
@ -191,13 +191,13 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
<li>CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item</li>
<li>Not as bad for me, but still unsustainable if you have to get many:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415
</code></pre><ul>
<li>Monitoring e-mailed in the evening to say CGSpace was down</li>
<li>Idle connections in PostgreSQL again:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
66
</code></pre><ul>
<li>At the time, the current DSpace pool size was 50&hellip;</li>
@ -208,14 +208,14 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
<li>Still more alerts that CGSpace has been up and down all day</li>
<li>Current database settings for DSpace:</li>
</ul>
<pre><code>db.maxconnections = 30
<pre tabindex="0"><code>db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
</code></pre><ul>
<li>And idle connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
49
</code></pre><ul>
<li>Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace&rsquo;s thirst can ever be quenched</li>


@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -126,7 +126,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<ul>
<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
</ul>
<pre><code># cd /home/dspacetest.cgiar.org/log
<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@ -137,20 +137,20 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<li>CGSpace went down again (due to PostgreSQL idle connections of course)</li>
<li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
39
</code></pre><ul>
<li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li>
<li>On a related note of why CGSpace is so slow, I decided to finally try the <code>pgtune</code> script to tune the postgres settings:</li>
</ul>
<pre><code># apt-get install pgtune
<pre tabindex="0"><code># apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
</code></pre><ul>
<li>It introduced the following new settings:</li>
</ul>
<pre><code>default_statistics_target = 50
<pre tabindex="0"><code>default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
@ -164,7 +164,7 @@ max_connections = 80
<li>Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc</li>
<li>For what it&rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
2.141
@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace very slow, and monitoring emailing me to say its down, even though I can load the page (very slowly)</li>
<li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
29
</code></pre><ul>
<li>I restarted Tomcat and postgres&hellip;</li>
@ -197,7 +197,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>We weren&rsquo;t out of heap yet, but it&rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&rsquo;s ok</li>
<li>A possible side effect is that I see that the REST API is twice as fast for the request above now:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.368
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.968
@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>CGSpace has been up and down all day and REST API is completely unresponsive</li>
<li>PostgreSQL idle connections are currently:</li>
</ul>
<pre><code>postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
<pre tabindex="0"><code>postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
28
</code></pre><ul>
<li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li>
@ -229,7 +229,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
<li>Atmire sent <a href="https://github.com/ilri/DSpace/pull/161">some fixes</a> to DSpace&rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)</li>
<li>After deploying the fix to CGSpace the REST API is consistently faster:</li>
</ul>
<pre><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.675
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.599


@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />


@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -140,20 +140,20 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre><code>dspacetest=# select * from metadatafieldregistry;
<pre tabindex="0"><code>dspacetest=# select * from metadatafieldregistry;
</code></pre><ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
</code></pre><ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
</code></pre><ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
DELETE 25
</code></pre><ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
@ -171,7 +171,7 @@ DELETE 25
<li>I need to start running DSpace in Mac OS X instead of a Linux VM</li>
<li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li>
</ul>
<pre><code>$ postgres -D /opt/brew/var/postgres
<pre tabindex="0"><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
@ -187,7 +187,7 @@ $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sq
</code></pre><ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>
<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
<pre tabindex="0"><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
@ -198,11 +198,11 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&quot;
</code></pre><ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre><h2 id="2016-02-08">2016-02-08</h2>
<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
@ -216,7 +216,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>
<pre><code>$ cd ~/src/git
<pre tabindex="0"><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
@ -231,15 +231,15 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>or</li>
</ul>
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
<pre tabindex="0"><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre><code># free -m
<pre tabindex="0"><code># free -m
total used free shared buffers cached
Mem: 3950 3902 48 9 37 1311
-/+ buffers/cache: 2552 1397
@ -253,11 +253,11 @@ Swap: 255 57 198
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
<pre tabindex="0"><code>value.split('/')[-1]
</code></pre><ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
<pre tabindex="0"><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
@ -278,13 +278,13 @@ Processing 64195.pdf
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre><code>$ ls | grep -c -E &quot;%&quot;
<pre tabindex="0"><code>$ ls | grep -c -E &quot;%&quot;
265
</code></pre><ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
<pre tabindex="0"><code>$ python -c &quot;import urllib, sys; print urllib.unquote(sys.argv[1])&quot; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre><ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>
<pre><code>value.unescape(&quot;url&quot;)
<pre tabindex="0"><code>value.unescape(&quot;url&quot;)
</code></pre><ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
@ -316,7 +316,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destintion bundle in the filename</li>
<li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li>
</ul>
<pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
<pre tabindex="0"><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
</code></pre><ul>
<li>Need to rename files to have no accents or umlauts, etc&hellip;</li>
<li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li>
@ -325,12 +325,12 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
<pre tabindex="0"><code>value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
</code></pre><ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
</ul>
<pre><code>Bitstream: tést.pdf
<pre tabindex="0"><code>Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
</code></pre><ul>
@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
<pre tabindex="0"><code>value.replace(&quot;'&quot;,'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
</code></pre><ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>


@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<li>I identified one commit that causes the issue and let them know</li>
<li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li>
</ul>
<pre><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
<pre tabindex="0"><code>Exception in thread &quot;Lucene Merge Thread #19&quot; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre><h2 id="2016-03-08">2016-03-08</h2>
<ul>
<li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li>
@ -175,7 +175,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li>
<li>I noticed that we have some weird values in <code>dc.language</code>:</li>
</ul>
<pre><code># select * from metadatavalue where metadata_field_id=37;
<pre tabindex="0"><code># select * from metadatavalue where metadata_field_id=37;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
1942571 | 35342 | 37 | hi | | 1 | | -1 | 2
@ -215,7 +215,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ul>
<li>Command used:</li>
</ul>
<pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
<pre tabindex="0"><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
</code></pre><ul>
<li>Also, it looks like adding <code>-sharpen 0x1.0</code> really improves the quality of the image for only a few KB</li>
</ul>
@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<ul>
<li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li>
</ul>
<pre><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
<pre tabindex="0"><code>Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
</code></pre><ul>
<li>I can reproduce the same error on DSpace Test and on my Mac</li>
<li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li>


@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -126,7 +126,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
<pre><code>Run start time: 03/06/2016 04:00:22
<pre tabindex="0"><code>Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
at java.io.FileInputStream.open(Native Method)
@ -158,7 +158,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
<ul>
<li>Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don&rsquo;t need!</li>
</ul>
<pre><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
<pre tabindex="0"><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
@ -171,7 +171,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
<ul>
<li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
</code></pre><ul>
<li>After that an <code>index-discovery -bf</code> is required</li>
@ -182,7 +182,7 @@ UPDATE 40852
<li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li>
<li>Testing with a few fields it seems to work well:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
@ -199,7 +199,7 @@ UPDATE 51258
<li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li>
<li>It seems the <code>dx.doi.org</code> URLs are much more proper in our repository!</li>
</ul>
<pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
count
-------
5638
@ -221,7 +221,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
<ul>
<li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) in SQL:</li>
</ul>
<pre><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
<pre tabindex="0"><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
</code></pre><ul>
<li>Listings and Reports is still not returning reliable data for <code>dc.type</code></li>
<li>I think we need to ask Atmire, as their documentation isn&rsquo;t too clear on the format of the filter configs</li>
@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
<li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as it appears to have been requested to have been added, and might be the newer list</li>
<li>I found 226 blank metadatavalues:</li>
</ul>
<pre><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
</code></pre><ul>
<li>I think we should delete them and do a full re-index:</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
</code></pre><ul>
<li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li>
@ -281,7 +281,7 @@ DELETE 226
</li>
<li>Test metadata migration on local instance again:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40885
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@ -298,7 +298,7 @@ $ JAVA_OPTS=&quot;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dsp
</code></pre><ul>
<li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li>
</ul>
<pre><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
<pre tabindex="0"><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at org.dspace.rest.Resource.processFinally(Resource.java:163)
@ -328,7 +328,7 @@ javax.ws.rs.WebApplicationException
<ul>
<li>Get handles for items that are using a given metadata field, ie <code>dc.Species.animal</code> (105):</li>
</ul>
<pre><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
<pre tabindex="0"><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
handle
-------------
10568/10298
@ -338,26 +338,26 @@ javax.ws.rs.WebApplicationException
</code></pre><ul>
<li>Delete metadata values for <code>dc.GRP</code> and <code>dc.icsubject.icrafsubject</code>:</li>
</ul>
<pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
</code></pre><ul>
<li>They are old ICRAF fields and we haven&rsquo;t used them since 2011 or so</li>
<li>Also delete them from the metadata registry</li>
<li>CGSpace went down again, <code>dspace.log</code> had this:</li>
</ul>
<pre><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>I restarted Tomcat and PostgreSQL and now it&rsquo;s back up</li>
<li>I bet this is the same crash as yesterday, but I only saw the errors in <code>catalina.out</code></li>
<li>Looks to be related to this, from <code>dspace.log</code>:</li>
</ul>
<pre><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
<pre tabindex="0"><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
</code></pre><ul>
<li>We have 18,000 of these errors right now&hellip;</li>
<li>Delete a few more old metadata values: <code>dc.Species.animal</code>, <code>dc.type.journal</code>, and <code>dc.publicationcategory</code>:</li>
</ul>
<pre><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;
</code></pre><ul>
@ -369,7 +369,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server</li>
<li>Field migration went well:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40909
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
@ -387,7 +387,7 @@ UPDATE 46075
<li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li>
<li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li>
</ul>
<pre><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20
<pre tabindex="0"><code>$ grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-20
21252
</code></pre><ul>
<li>I found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li>
@ -423,7 +423,7 @@ UPDATE 46075
<li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li>
<li>I think there must be something with this REST stuff:</li>
</ul>
<pre><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-*
<pre tabindex="0"><code># grep -c &quot;Aborting context in finally statement&quot; dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0
@ -468,7 +468,7 @@ dspace.log.2016-04-27:7271
<ul>
<li>Logs for today and yesterday have zero references to this REST error, so I&rsquo;m going to open back up the REST API but log all requests</li>
</ul>
<pre><code>location /rest {
<pre tabindex="0"><code>location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
}


@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -126,13 +126,13 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre><ul>
<li>The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li>
<li>100% of the requests coming from Ethiopia are like this and result in an HTTP 500:</li>
</ul>
<pre><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
<pre tabindex="0"><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
</code></pre><ul>
<li>For now I&rsquo;ll block just the Ethiopian IP</li>
<li>The owner of that application has said that the <code>NaN</code> (not a number) is an error in his code and he&rsquo;ll fix it</li>
@ -152,7 +152,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<li>I will re-generate the Discovery indexes after re-deploying</li>
<li>Testing <code>renew-letsencrypt.sh</code> script for nginx</li>
</ul>
<pre><code>#!/usr/bin/env bash
<pre tabindex="0"><code>#!/usr/bin/env bash
readonly SERVICE_BIN=/usr/sbin/service
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
@ -214,7 +214,7 @@ fi
<p>After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:</p>
</li>
</ul>
<pre><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1]
<pre tabindex="0"><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1]
</code></pre><ul>
<li>I&rsquo;ve sent them a question about it</li>
<li>A user mentioned having problems with uploading a 33 MB PDF</li>
@ -240,7 +240,7 @@ fi
</li>
<li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li>
</ul>
<pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;;
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &quot;% %&quot;;
</code></pre><h2 id="2016-05-13">2016-05-13</h2>
<ul>
<li>More theorizing about CGcore</li>
@ -259,7 +259,7 @@ fi
<li>They have thumbnails on Flickr and elsewhere</li>
<li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li>
</ul>
<pre><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
<pre tabindex="0"><code>if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
</code></pre><ul>
<li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li>
<li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li>
@ -269,7 +269,7 @@ fi
<ul>
<li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li>
</ul>
<pre><code>value.replace('_','').replace('-','')
<pre tabindex="0"><code>value.replace('_','').replace('-','')
</code></pre><ul>
<li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li>
<li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things
@ -281,17 +281,17 @@ fi
</ul>
</li>
</ul>
<pre><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
</code></pre><h2 id="2016-05-20">2016-05-20</h2>
<ul>
<li>More work on CCAFS Video and Images records</li>
<li>For SAFBuilder we need to modify filename column to have the thumbnail bundle:</li>
</ul>
<pre><code>value + &quot;__bundle:THUMBNAIL&quot;
<pre tabindex="0"><code>value + &quot;__bundle:THUMBNAIL&quot;
</code></pre><ul>
<li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li>
</ul>
<pre><code>value.replace(/\u0081/,'')
<pre tabindex="0"><code>value.replace(/\u0081/,'')
</code></pre><ul>
<li>Write shell script to resize thumbnails with height larger than 400: <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a></li>
<li>Upload 707 CCAFS records to DSpace Test</li>
@ -309,12 +309,12 @@ fi
<ul>
<li>Export CCAFS video and image records from DSpace Test using the migrate option (<code>-m</code>):</li>
</ul>
<pre><code>$ mkdir ~/ccafs-images
<pre tabindex="0"><code>$ mkdir ~/ccafs-images
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
</code></pre><ul>
<li>And then import to CGSpace:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
</code></pre><ul>
<li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li>
<li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li>
@ -322,19 +322,19 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
<li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li>
<li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log
</code></pre><ul>
<li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre><h2 id="2016-05-31">2016-05-31</h2>
<ul>
<li>The <code>index-authority</code> script ran over night and was finished in the morning</li>
<li>Hopefully this was because we haven&rsquo;t been running it regularly and it will speed up next time</li>
<li>I am running it again with a timer to see:</li>
</ul>
<pre><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Cleaning the old index


@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
<li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest to <code>dc.description.sponsorship</code></li>
</ul>
<pre><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
@ -141,7 +141,7 @@ UPDATE 14
<li>Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with <code>cg.coverage.admin-unit</code></li>
<li>Seems that the Browse configuration in <code>dspace.cfg</code> can&rsquo;t handle the &lsquo;-&rsquo; in the field name:</li>
</ul>
<pre><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
<pre tabindex="0"><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
</code></pre><ul>
<li>But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error</li>
<li>I&rsquo;ve sent a message to the DSpace mailing list to ask about the Browse index definition</li>
@ -154,13 +154,13 @@ UPDATE 14
<li>Investigating the CCAFS authority issue, I exported the metadata for the Videos collection</li>
<li>The top two authors are:</li>
</ul>
<pre><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
<pre tabindex="0"><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
</code></pre><ul>
<li>So the only difference is the &ldquo;confidence&rdquo;</li>
<li>Ok, well THAT is interesting:</li>
</ul>
<pre><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
@ -180,7 +180,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
</code></pre><ul>
<li>And now an actually relevent example:</li>
</ul>
<pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
count
-------
707
@ -194,14 +194,14 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
</code></pre><ul>
<li>Trying something experimental:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
</code></pre><ul>
<li>And then re-indexing authority and Discovery&hellip;?</li>
<li>After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet</li>
<li>The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:</li>
</ul>
<pre><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
<pre tabindex="0"><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
</code></pre><ul>
<li>That would only be for the &ldquo;Browse by&rdquo; function&hellip; so we&rsquo;ll have to see what effect that has later</li>
</ul>
@ -215,7 +215,7 @@ UPDATE 960
<ul>
<li>Figured out how to export a list of the unique values from a metadata field ordered by count:</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>
<p>Identified the next round of fields to migrate:</p>
@ -244,7 +244,7 @@ UPDATE 960
<li>Looks like this is all we need: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li>
<li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (from the xmlstarlet package):</li>
</ul>
<pre><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
</code></pre><ul>
<li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li>
<li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li>
@ -263,7 +263,7 @@ UPDATE 960
<li>It looks like the values are documented in <code>Choices.java</code></li>
<li>Experiment with setting all 960 CCAFS author values to be 500:</li>
</ul>
<pre><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
<pre tabindex="0"><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
UPDATE 960
@ -320,7 +320,7 @@ UPDATE 960
<ul>
<li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li>
</ul>
<pre><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot;
<pre tabindex="0"><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &quot;/usr/bin/service nginx stop&quot; --post-hook &quot;/usr/bin/service nginx start&quot;
</code></pre><ul>
<li>I really need to fix that cron job&hellip;</li>
</ul>
@ -328,7 +328,7 @@ UPDATE 960
<ul>
<li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
</code></pre><ul>
<li>The scripts for this are here:
@ -346,7 +346,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
<li>There are still ~97 values for which no action was indicated</li>
<li>After the above deletions and replacements I regenerated a CSV and sent it to Peter <em>et al</em> to have a look</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
</code></pre><ul>
<li>Re-evaluate <code>dc.contributor.corporate</code> and it seems we will move it to <code>dc.contributor.author</code> as this is more in line with how editors are actually using it</li>
</ul>
@ -354,7 +354,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
<ul>
<li>Test run of <code>migrate-fields.sh</code> with the following re-mappings:</li>
</ul>
<pre><code>72 55 #dc.source
<pre tabindex="0"><code>72 55 #dc.source
86 230 #cg.contributor.crp
91 211 #cg.contributor.affiliation
94 212 #cg.species
@ -367,7 +367,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
</code></pre><ul>
<li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
</code></pre><ul>
@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
<ul>
<li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li>
</ul>
<pre><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
<pre tabindex="0"><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like '%,';
</code></pre><ul>
<li>We need to use something like this to fix them, need to write a proper regex later:</li>
</ul>
<pre><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
<pre tabindex="0"><code># update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
</code></pre>

View File

@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -135,7 +135,7 @@ In this case the select query was showing 95 results before the update
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
text_value
@ -158,7 +158,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li>
<li>Fix metadata for species on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
</code></pre><ul>
<li>Will run later on CGSpace</li>
<li>A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li>
@ -169,7 +169,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<ul>
<li>Delete 23 blank metadata values from CGSpace:</li>
</ul>
<pre><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 23
</code></pre><ul>
<li>Complete phase three of metadata migration, for the following fields:
@ -188,7 +188,7 @@ DELETE 23
</li>
<li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
</code></pre><ul>
@ -198,7 +198,7 @@ $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Dele
<ul>
<li>Doing some author cleanups from Peter and Abenet:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu
</code></pre><h2 id="2016-07-13">2016-07-13</h2>
<ul>
@ -215,20 +215,20 @@ $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UT
<li>Add species and breed to the XMLUI item display</li>
<li>CGSpace crashed late at night and the DSpace logs were showing:</li>
</ul>
<pre><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
<li>I suspect it&rsquo;s someone hitting REST too much:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
1781 181.118.144.29
24904 70.32.99.142
</code></pre><ul>
<li>I just blocked access to <code>/rest</code> for that last IP for now:</li>
</ul>
<pre><code> # log rest requests
<pre tabindex="0"><code> # log rest requests
location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
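    # a sketch of the per-IP block mentioned above (hypothetical; the actual
    # rule used is outside this diff hunk)
    deny 70.32.99.142;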
@ -248,23 +248,23 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>We might need to use <code>index.authority.ignore-prefered=true</code> to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.</li>
<li>Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:</li>
</ul>
<pre><code>index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants.dc.contributor.author=false
</code></pre><ul>
<li>After reindexing I don&rsquo;t see any change in Discovery&rsquo;s display of authors, and still have entries like:</li>
</ul>
<pre><code>Grace, D. (464)
<pre tabindex="0"><code>Grace, D. (464)
Grace, D. (62)
</code></pre><ul>
<li>I asked for clarification of the following options on the DSpace mailing list:</li>
</ul>
<pre><code>index.authority.ignore
<pre tabindex="0"><code>index.authority.ignore
index.authority.ignore-prefered
index.authority.ignore-variants
</code></pre><ul>
<li>In the mean time, I will try these on DSpace Test (plus a reindex):</li>
</ul>
<pre><code>index.authority.ignore=true
<pre tabindex="0"><code>index.authority.ignore=true
index.authority.ignore-prefered=true
index.authority.ignore-variants=true
</code></pre><ul>
@ -272,7 +272,7 @@ index.authority.ignore-variants=true
<li>It was misconfigured and disabled, but already working for some reason <em>sigh</em></li>
<li>&hellip; no luck. Trying with just:</li>
</ul>
<pre><code>index.authority.ignore=true
<pre tabindex="0"><code>index.authority.ignore=true
</code></pre><ul>
<li>After re-indexing and clearing the XMLUI cache nothing has changed</li>
</ul>
@ -280,7 +280,7 @@ index.authority.ignore-variants=true
<ul>
<li>Trying a few more settings (plus reindex) for Discovery on DSpace Test:</li>
</ul>
<pre><code>index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants=true
</code></pre><ul>
<li>Run all OS updates and reboot DSpace Test server</li>
@ -291,7 +291,7 @@ index.authority.ignore-variants=true
<ul>
<li>The DSpace source code mentions the configuration key <code>discovery.index.authority.ignore-prefered.*</code> (with prefix of discovery, despite the docs saying otherwise), so I&rsquo;m trying the following on DSpace Test:</li>
</ul>
<pre><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true
<pre tabindex="0"><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true
discovery.index.authority.ignore-variants=true
</code></pre><ul>
<li>Still no change!</li>

View File

@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -137,7 +137,7 @@ $ git rebase -i dspace-5.5
<li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
<li>Start working on the DSpace 5.1 to 5.5 port:</li>
</ul>
<pre><code>$ git checkout -b 55new 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
@ -166,7 +166,7 @@ $ git rebase -i dspace-5.5
<li>Fix item display incorrectly displaying Species when Breeds were present (<a href="https://github.com/ilri/DSpace/pull/260">#260</a>)</li>
<li>Experiment with fixing more authors, like Delia Grace:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
</code></pre><h2 id="2016-08-06">2016-08-06</h2>
<ul>
<li>Finally figured out how to remove &ldquo;View/Open&rdquo; and &ldquo;Bitstreams&rdquo; from the item view</li>
@ -184,7 +184,7 @@ $ git rebase -i dspace-5.5
<li>Install latest Oracle Java 8 JDK</li>
<li>Create <code>setenv.sh</code> in Tomcat 8 <code>libexec/bin</code> directory:</li>
</ul>
<pre><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>CATALINA_OPTS=&quot;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&quot;
CATALINA_OPTS=&quot;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&quot;
JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
@ -192,7 +192,7 @@ JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
<li>Edit Tomcat 8 <code>server.xml</code> to add regular HTTP listener for solr</li>
<li>Symlink webapps:</li>
</ul>
<pre><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
<pre tabindex="0"><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai
$ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui
@ -246,7 +246,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
<li>Fix &ldquo;CONGO,DR&rdquo; country name in <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/264">#264</a>)</li>
<li>Also need to fix existing records using the incorrect form in the database:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
</code></pre><ul>
<li>I asked a question on the DSpace mailing list about updating &ldquo;preferred&rdquo; forms of author names from ORCID</li>
</ul>
@ -262,7 +262,7 @@ $ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/sol
<ul>
<li>Database migrations are fine on DSpace 5.1:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database info
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database info
Database URL: jdbc:postgresql://localhost:5432/dspacetest
Database Schema: public
@ -300,12 +300,12 @@ Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 90
<li>Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB</li>
<li>They said I should delete the Atmire migrations</li>
</ul>
<pre><code>dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
<pre tabindex="0"><code>dspacetest=# delete from schema_version where description = 'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
dspacetest=# delete from schema_version where description = 'Atmire MQM migration' and version='5.1.2015.12.03.3';
</code></pre><ul>
<li>After that DSpace starts up, but XMLUI now has unrelated issues that I need to solve!</li>
</ul>
<pre><code>org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
<pre tabindex="0"><code>org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
</code></pre><ul>
<li>Looks like we&rsquo;re missing some stuff in the XMLUI module&rsquo;s <code>sitemap.xmap</code>, as well as in each of our XMLUI themes</li>
@ -324,18 +324,18 @@ context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
<li>Clean up and import 48 CCAFS records into DSpace Test</li>
<li>SQL to get all journal titles from dc.source (55), since it&rsquo;s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
</code></pre><h2 id="2016-08-25">2016-08-25</h2>
<ul>
<li>Atmire suggested adding a missing bean to <code>dspace/config/spring/api/atmire-cua.xml</code> but it doesn&rsquo;t help:</li>
</ul>
<pre><code>...
<pre tabindex="0"><code>...
Error creating bean with name 'MetadataStorageInfoService'
...
</code></pre><ul>
<li>Atmire sent an updated version of <code>dspace/config/spring/api/atmire-cua.xml</code> and now XMLUI starts but gives a null pointer exception:</li>
</ul>
<pre><code>Java stacktrace: java.lang.NullPointerException
<pre tabindex="0"><code>Java stacktrace: java.lang.NullPointerException
at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
@ -350,7 +350,7 @@ Error creating bean with name 'MetadataStorageInfoService'
</code></pre><ul>
<li>Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
<pre tabindex="0"><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
</code></pre><ul>
<li>Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs</li>
@ -360,7 +360,7 @@ $ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/b
<li>CGSpace had issues tonight, not entirely crashing, but becoming unresponsive</li>
<li>The dspace log had this:</li>
</ul>
<pre><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Related to /rest no doubt</li>
</ul>

View File

@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -127,11 +127,11 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
</code></pre><ul>
<li>User who has been migrated to the root vs user still in the hierarchical structure:</li>
</ul>
<pre><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
<pre tabindex="0"><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
</code></pre><ul>
<li>Changing the DSpace LDAP config to use <code>OU=ILRIHUB</code> seems to work:</li>
@ -140,7 +140,7 @@ distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Eth
<ul>
<li>Notes for local PostgreSQL database recreation from production snapshot:</li>
</ul>
<pre><code>$ dropdb dspacetest
<pre tabindex="0"><code>$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest createuser;'
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
@ -150,7 +150,7 @@ $ vacuumdb dspacetest
</code></pre><ul>
<li>Some names that I thought I fixed in July seem not to be:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
text_value | authority | confidence
-----------------------+--------------------------------------+------------
Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600
@ -163,12 +163,12 @@ $ vacuumdb dspacetest
</code></pre><ul>
<li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
UPDATE 69
</code></pre><ul>
<li>And for Peter Ballantyne:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
text_value | authority | confidence
-------------------+--------------------------------------+------------
Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600
@ -180,12 +180,12 @@ UPDATE 69
</code></pre><ul>
<li>Again, a few have the correct ORCID, but there should only be one authority&hellip;</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
UPDATE 58
</code></pre><ul>
<li>And for me:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
@ -197,7 +197,7 @@ UPDATE 11
</code></pre><ul>
<li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
UPDATE 166
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
text_value | authority | confidence
@ -215,7 +215,7 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
<ul>
<li>After one week of logging TLS connections on CGSpace:</li>
</ul>
<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
<pre tabindex="0"><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
@ -226,7 +226,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre><ul>
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
</ul>
@ -251,7 +251,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
</ul>
<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
<pre tabindex="0"><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
</code></pre><ul>
<li>I need to write a Python script to match that for renaming files in the file system</li>
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
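<li>In the meantime, a rough shell equivalent of that renaming step (a sketch only, assuming the files are in the current directory; a proper Python version would handle more edge cases):</li>
</ul>
<pre tabindex="0"><code>$ for f in *; do g=$(echo &quot;$f&quot; | tr -d &quot;,'\&quot;&quot;); [ &quot;$f&quot; = &quot;$g&quot; ] || mv -v &quot;$f&quot; &quot;$g&quot;; done
</code></pre>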
@ -263,7 +263,7 @@ TLSv1/EDH-RSA-DES-CBC3-SHA
</li>
<li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection&rsquo;s items:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
<pre tabindex="0"><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
</code></pre><h2 id="2016-09-07">2016-09-07</h2>
@ -274,7 +274,7 @@ $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
<li>See: <a href="https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html">https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html</a></li>
<li>CGSpace went down and the error seems to be the same as always (lately):</li>
</ul>
<pre><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
@ -284,7 +284,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ul>
<li>CGSpace crashed twice today, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I enabled logging of requests to <code>/rest</code> again</li>
@ -293,29 +293,29 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<ul>
<li>CGSpace crashed again, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I restarted Tomcat and it was ok again</li>
<li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>We haven&rsquo;t seen that in quite a while&hellip;</li>
<li>Indeed, in a month of logs it only occurs 15 times:</li>
</ul>
<pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
15
</code></pre><ul>
<li>I also see a bunch of errors from dspace.log:</li>
</ul>
<pre><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
820 50.87.54.15
12872 70.32.99.142
25744 70.32.83.92
@ -328,19 +328,19 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>I think the stability issues are definitely from REST</li>
<li>Crashed AGAIN, errors from dspace.log:</li>
</ul>
<pre><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>And more heap space errors:</li>
</ul>
<pre><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
<pre tabindex="0"><code># grep -rsI &quot;OutOfMemoryError&quot; /var/log/tomcat7/catalina.* | wc -l
19
</code></pre><ul>
<li>There are no more rest requests since the last crash, so maybe there are other things causing this.</li>
<li>Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)</li>
<li>They seem to be coming from Baidu, and so far during today alone account for 1/6 of every connection:</li>
</ul>
<pre><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
<pre tabindex="0"><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
29084
# grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
5192
@ -349,16 +349,16 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
<li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li>
<li>A list of all 2000 unique IPs from CGSpace logs today:</li>
</ul>
<pre><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
<pre tabindex="0"><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
</code></pre><ul>
<li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc&hellip; do we have any real users?</li>
<li>Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:</li>
</ul>
<pre><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
</code></pre><ul>
<li>Looking into the Catalina logs again around the time of the first crash, I see:</li>
</ul>
<pre><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
<pre tabindex="0"><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
@ -368,7 +368,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-193&quot; java.lang.OutOf
<li>And after that I see a bunch of &ldquo;pool error Timeout waiting for idle object&rdquo;</li>
<li>Later, near the time of the next crash I see:</li>
</ul>
<pre><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
<pre tabindex="0"><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2
Wed Sep 14 11:30:20 UTC 2016 | Updating : 6/6 docs.
Commit
@ -389,7 +389,7 @@ java.util.Map does not have a no-arg default constructor.
</code></pre><ul>
<li>Then 20 minutes later another outOfMemoryError:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-25&quot; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>Perhaps these particular issues <em>are</em> memory issues, the munin graphs definitely show some weird purging/allocating behavior starting this week</li>
@ -402,7 +402,7 @@ java.util.Map does not have a no-arg default constructor.
<li>Oh great, the configuration on the actual server is different than in configuration management!</li>
<li>Seems we added a bunch of settings to the <code>/etc/default/tomcat7</code> in December, 2015 and never updated our ansible repository:</li>
</ul>
<pre><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&quot;
<pre tabindex="0"><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&quot;
</code></pre><ul>
<li>So I&rsquo;m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)</li>
<li>Increased JVM heap to 4096m on CGSpace (linode01)</li>
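<li>The simplified line would look roughly like this (a sketch; exactly which of the old GC flags survived is not recorded here):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms4096m -Xmx4096m -Dfile.encoding=UTF-8&quot;
</code></pre>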
@ -416,7 +416,7 @@ java.util.Map does not have a no-arg default constructor.
<ul>
<li>CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren&rsquo;t on those lines so I&rsquo;m not sure if they were yesterday:</li>
</ul>
<pre><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
<pre tabindex="0"><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
Thu Sep 15 18:45:26 UTC 2016 | Updating : 100/218 docs.
Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
@ -443,7 +443,7 @@ Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.H
<li>I bumped the heap space from 4096m to 5120m to see if this is <em>really</em> about heap space or not.</li>
<li>Looking into some of these errors that I&rsquo;ve seen this week but haven&rsquo;t noticed before:</li>
</ul>
<pre><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
<pre tabindex="0"><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
113
</code></pre><ul>
<li>I&rsquo;ve sent a message to Atmire about the Solr error to see if it&rsquo;s related to their batch update module</li>
@ -452,7 +452,7 @@ Exception in thread &quot;Thread-54216&quot; org.apache.solr.client.solrj.impl.H
<ul>
<li>Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>After that we need to take the top ~300 and make a controlled vocabulary for it</li>
@ -474,7 +474,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<li>Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: <a href="https://jira.duraspace.org/browse/DS-2809">https://jira.duraspace.org/browse/DS-2809</a></li>
<li>We just need to set this in <code>dspace/solr/search/conf/schema.xml</code>:</li>
</ul>
<pre><code>&lt;solrQueryParser defaultOperator=&quot;AND&quot;/&gt;
<pre tabindex="0"><code>&lt;solrQueryParser defaultOperator=&quot;AND&quot;/&gt;
</code></pre><ul>
<li>It actually works really well, and search results return far fewer hits now (before, after):</li>
</ul>
@ -483,7 +483,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<ul>
<li>Found a way to improve the configuration of Atmire&rsquo;s Content and Usage Analysis (CUA) module for date fields</li>
</ul>
<pre><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
<pre tabindex="0"><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
+content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
</code></pre><ul>
<li>This allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently</li>
@ -492,7 +492,7 @@ $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2
<li>45 minutes of downtime!</li>
<li>Start processing the fixes to <code>dc.description.sponsorship</code> from Peter Ballantyne:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>I need to run these and the others from a few days ago on CGSpace the next time we run updates</li>
@ -511,14 +511,14 @@ $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsor
<li>Not sure if it&rsquo;s something like we already have too many filters there (30), or the filter name is reserved, etc&hellip;</li>
<li>Generate a list of ILRI subjects for Peter and Abenet to look through/fix:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
</code></pre><ul>
<li>Regenerate Discovery indexes a few times after playing with <code>discovery.xml</code> index definitions (syntax, parameters, etc).</li>
<li>Merge changes to boolean logic in Solr search (<a href="https://github.com/ilri/DSpace/pull/274">#274</a>)</li>
<li>Run all sponsorship and affiliation fixes on CGSpace, deploy latest <code>5_x-prod</code> branch, and re-index Discovery on CGSpace</li>
<li>Tested OCSP stapling on DSpace Test&rsquo;s nginx and it works:</li>
</ul>
<pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response:
======================================
@ -533,12 +533,12 @@ OCSP Response Data:
<li>Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman</li>
<li>This author has a few variations:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
len, S%';
</code></pre><ul>
<li>And it looks like <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code> is the authority with the correct ORCID linked</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
UPDATE 101
</code></pre><ul>
<li>Hmm, now her name is missing from the authors facet and only shows the authority ID</li>
@ -547,7 +547,7 @@ UPDATE 101
<li>On a clean snapshot of the database I see the correct authority should be <code>f01f7b7b-be3f-4df7-a61d-b73c067de88d</code>, not <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code></li>
<li>Updating her authorities again and reindexing:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
UPDATE 101
</code></pre><ul>
<li>Use GitHub icon from Font Awesome instead of a PNG to save one extra network request</li>
@ -564,14 +564,14 @@ UPDATE 101
<li>Minor fix to a string in Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/pull/280">#280</a>)</li>
<li>This seems to be what I&rsquo;ll need to do for Sonja Vermeulen (but with <code>2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0</code> instead on the live site):</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
</code></pre><ul>
<li>And then update Discovery and Authority indexes</li>
<li>Minor fix for &ldquo;Subject&rdquo; string in Discovery search and Atmire modules (<a href="https://github.com/ilri/DSpace/pull/281">#281</a>)</li>
<li>Start testing batch fixes for ILRI subject from Peter:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2016-09-29">2016-09-29</h2>
<ul>
@ -580,7 +580,7 @@ $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -
<li>DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console</li>
<li>People on DSpace mailing list gave me a query to get authors from certain collections:</li>
</ul>
<pre><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
</code></pre><h2 id="2016-09-30">2016-09-30</h2>
<ul>
<li>Deny access to REST API&rsquo;s <code>find-by-metadata-field</code> endpoint to protect against an upstream security issue (DS-3250)</li>

View File

@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -139,7 +139,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
</li>
<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre><ul>
<li>Hmm, with the <code>dc.contributor.author</code> column removed, DSpace doesn&rsquo;t detect any changes</li>
<li>With a blank <code>dc.contributor.author</code> column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors</li>
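<li>For reference, the whole test CSV was shaped roughly like this (the id and collection values here are hypothetical):</li>
</ul>
<pre tabindex="0"><code>id,collection,ORCID:dc.contributor.author
84590,10568/51999,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre>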
@ -161,14 +161,14 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
</li>
<li>That left us with 3,180 valid corrections and 3 deletions:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
</code></pre><ul>
<li>Remove old about page (<a href="https://github.com/ilri/DSpace/pull/284">#284</a>)</li>
<li>CGSpace crashed a few times today</li>
<li>Generate list of unique authors in CCAFS collections:</li>
</ul>
<pre><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
<pre tabindex="0"><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
</code></pre><h2 id="2016-10-05">2016-10-05</h2>
<ul>
<li>Work on more infrastructure cleanups for Ansible DSpace role</li>
@ -190,13 +190,13 @@ $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -
<li>Re-deploy CGSpace with latest changes from late September and early October</li>
<li>Run fixes for ILRI subjects and delete blank metadata values:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 11
</code></pre><ul>
<li>Run all system updates and reboot CGSpace</li>
<li>Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):</li>
</ul>
<pre><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
<pre tabindex="0"><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
47
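# and the cleanup itself would presumably be something like this (a sketch,
# not the exact command that was recorded):
root@linode01:~# rm -v /var/log/tomcat7/localhost_access_log.2015*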
</code></pre><ul>
<li>Delete 2GB <code>cron-filter-media.log</code> file, as it is just a log from a cron job and it doesn&rsquo;t get rotated like normal log files (almost a year now maybe)</li>
@ -211,7 +211,7 @@ DELETE 11
<ul>
<li>A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)</li>
</ul>
@ -219,7 +219,7 @@ DELETE 11
<ul>
<li>Start working on DSpace 5.5 porting work again:</li>
</ul>
<pre><code>$ git checkout -b 5_x-55 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 5_x-55 5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
<li>Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme</li>
@ -248,25 +248,25 @@ $ git rebase -i dspace-5.5
<ul>
<li>Move the LIVES community from the top level to the ILRI projects community</li>
</ul>
<pre><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
</code></pre><ul>
<li>Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA</li>
<li>Start looking at batch fixing of &ldquo;old&rdquo; ILRI website links without www or https, for example:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
</code></pre><ul>
<li>Also CCAFS has HTTPS and their links should use it where possible:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
</code></pre><ul>
<li>And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
</code></pre><ul>
<li>Turns out there are shit tons of varieties of this, like with http, https, www, separate <code>&lt;/img&gt;</code> tags, alignments, etc</li>
<li>Had to find all variations and replace them individually:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;','&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;','&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;https://www.ilri.org/images/email.jpg&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;', '&lt;span class=&quot;fa fa-rss fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/Iconrss2.png&quot;/&gt;%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;', '&lt;span class=&quot;fa fa-at fa-2x&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt;') where resource_type_id in (3,4) and text_value like '%&lt;img align=&quot;left&quot; src=&quot;http://www.ilri.org/images/email.jpg&quot;/&gt;%';
@ -291,7 +291,7 @@ dspace=# update metadatavalue set text_value = regexp_replace(text_value, '&lt;i
<ul>
<li>Run Font Awesome fixes on DSpace Test:</li>
</ul>
<pre><code>dspace=# \i /tmp/font-awesome-text-replace.sql
<pre tabindex="0"><code>dspace=# \i /tmp/font-awesome-text-replace.sql
UPDATE 17
UPDATE 17
UPDATE 3
@ -321,7 +321,7 @@ UPDATE 0
<ul>
<li>Fix some messed up authors on CGSpace:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
UPDATE 10
dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
UPDATE 36
@ -332,20 +332,20 @@ UPDATE 36
<li>Talk to Carlos Quiros about CG Core metadata in CGSpace</li>
<li>Get a list of countries from CGSpace so I can do some batch corrections:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
</code></pre><ul>
<li>Fix a bunch of countries in Open Refine and run the corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Run a shit ton of author fixes from Peter Ballantyne that we&rsquo;ve been cleaning up for two months:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>Run a few URL corrections for ilri.org and doi.org, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);

File diff suppressed because one or more lines are too long

View File

@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it&rsquo;s not r
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -137,7 +137,7 @@ Another worrying error from dspace.log is:
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
@ -147,7 +147,7 @@ Another worrying error from dspace.log is:
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
<pre><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
<pre tabindex="0"><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
</code></pre><ul>
<li>The first error I see in dspace.log this morning is:</li>
</ul>
<pre><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&quot;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&quot;
<pre tabindex="0"><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&quot;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&quot;
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
</code></pre><ul>
<li>Looking through DSpace&rsquo;s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries</li>
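<li>One way to spot those long queries is to filter the log by line length, for example (the log file name here is an assumption):</li>
</ul>
<pre tabindex="0"><code>$ awk 'length($0) &gt; 30000' [dspace]/log/solr.log.2016-12-02
</code></pre><ul>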
<li>The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:</li>
</ul>
<pre><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&quot;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&quot;++AND+subject_mtdt:&quot;PARTNERSHIPS&quot;+AND+subject_mtdt:&quot;RESEARCH&quot;+AND+subject_mtdt:&quot;AGRICULTURE&quot;+AND+subject_mtdt:&quot;DEVELOPMENT&quot;++AND+iso_mtdt:&quot;en&quot;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
<pre tabindex="0"><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&quot;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&quot;++AND+subject_mtdt:&quot;PARTNERSHIPS&quot;+AND+subject_mtdt:&quot;RESEARCH&quot;+AND+subject_mtdt:&quot;AGRICULTURE&quot;+AND+subject_mtdt:&quot;DEVELOPMENT&quot;++AND+iso_mtdt:&quot;en&quot;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
</code></pre><ul>
<li>DSpace&rsquo;s own Solr logs don&rsquo;t give IP addresses, so I will have to enable Nginx&rsquo;s logging of <code>/solr</code> so I can see where this request came from</li>
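<li>Once that is enabled, finding the top clients hitting <code>/solr</code> would be something like this (the access log path is an assumption):</li>
</ul>
<pre tabindex="0"><code># grep 'GET /solr' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head
</code></pre><ul>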
@ -255,7 +255,7 @@ org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
<li>I got a weird report from the CGSpace checksum checker this morning</li>
<li>It says 732 bitstreams have potential issues, for example:</li>
</ul>
<pre><code>------------------------------------------------
<pre tabindex="0"><code>------------------------------------------------
Bitstream Id = 6
Process Start Date = Dec 4, 2016
Process End Date = Dec 4, 2016
@ -278,7 +278,7 @@ Result = The bitstream could not be found
<li>For what it&rsquo;s worth, there is no item on DSpace Test or S3 backups with that checksum either&hellip;</li>
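<li>To see which item a flagged bitstream belongs to, a join like this should work (a sketch, assuming the DSpace 5 table layout):</li>
</ul>
<pre tabindex="0"><code>dspace=# select i2b.item_id from item2bundle i2b join bundle2bitstream b2b on i2b.bundle_id = b2b.bundle_id where b2b.bitstream_id = 6;
</code></pre><ul>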
<li>In other news, I&rsquo;m looking at JVM settings from the Solr 4.10.2 release, from <code>bin/solr.in.sh</code>:</li>
</ul>
<pre><code># These GC settings have shown to work well for a number of common Solr workloads
<pre tabindex="0"><code># These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE=&quot;-XX:-UseSuperWord \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
@ -311,7 +311,7 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
<li>Atmire responded about the MQM warnings in the DSpace logs</li>
<li>Apparently we need to change the batch edit consumers in <code>dspace/config/dspace.cfg</code>:</li>
</ul>
<pre><code>event.consumer.batchedit.filters = Community|Collection+Create
<pre tabindex="0"><code>event.consumer.batchedit.filters = Community|Collection+Create
</code></pre><ul>
<li>I haven&rsquo;t tested it yet, but I created a pull request: <a href="https://github.com/ilri/DSpace/pull/289">#289</a></li>
</ul>
@ -319,7 +319,7 @@ GC_TUNE=&quot;-XX:-UseSuperWord \
<ul>
<li>Some author authority corrections and name standardizations for Peter:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
UPDATE 11
dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
UPDATE 36
@ -343,7 +343,7 @@ UPDATE 561
<li>The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn&rsquo;t dedicated (also runs Solr, which can benefit from OS cache) so let&rsquo;s try 1024MB</li>
<li>In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):</li>
</ul>
<pre><code>$ time JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
@ -376,7 +376,7 @@ sys 0m22.647s
<li>For example, do a Solr query for &ldquo;first_name:Grace&rdquo; and look at the results</li>
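<li>That query is something like this against the local authority core (same style as the other <code>curl</code> calls here):</li>
</ul>
<pre tabindex="0"><code>$ curl 'localhost:8081/solr/authority/select?q=first_name%3AGrace&amp;wt=json&amp;indent=true'
</code></pre><ul>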
<li>Querying that ID shows the fields that need to be changed:</li>
</ul>
<pre><code>{
<pre tabindex="0"><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 1,
@ -409,7 +409,7 @@ sys 0m22.647s
<li>I think I can just update the <code>value</code>, <code>first_name</code>, and <code>last_name</code> fields&hellip;</li>
<li>The update syntax should be something like this, but I&rsquo;m getting errors from Solr:</li>
</ul>
<pre><code>$ curl 'localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true' -H 'Content-type:application/json' -d '[{&quot;id&quot;:&quot;1&quot;,&quot;price&quot;:{&quot;set&quot;:100}}]'
<pre tabindex="0"><code>$ curl 'localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true' -H 'Content-type:application/json' -d '[{&quot;id&quot;:&quot;1&quot;,&quot;price&quot;:{&quot;set&quot;:100}}]'
{
&quot;responseHeader&quot;:{
&quot;status&quot;:400,
@ -421,13 +421,13 @@ sys 0m22.647s
<li>When I try using the XML format I get an error that the <code>updateLog</code> needs to be configured for that core</li>
<li>Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
</code></pre><ul>
<li>Then I&rsquo;ll reindex discovery and authority and see how the authority Solr core looks</li>
<li>After this, there are now authorities for some of the &ldquo;Grace, D.&rdquo; and &ldquo;Grace, Delia&rdquo; text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):</li>
</ul>
<pre><code>$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true'
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
@ -453,7 +453,7 @@ UPDATE 561
<li>In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!</li>
<li>Better to use:</li>
</ul>
<pre><code>dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
</code></pre><ul>
<li>This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!</li>
<li>Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID</li>
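<li>That approach would look roughly like this (a sketch, with a made-up UUID):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'Grace, Delia';
$ [dspace]/bin/dspace index-authority
dspace=# update metadatavalue set authority='aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
</code></pre><ul>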
@ -461,7 +461,7 @@ UPDATE 561
<li>Deploy &ldquo;take task&rdquo; hack/fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/290">#290</a>)</li>
<li>I ran the following author corrections and then reindexed discovery:</li>
</ul>
<pre><code>update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
<pre tabindex="0"><code>update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
@ -471,7 +471,7 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
<ul>
<li>Something weird happened and Peter Thorne&rsquo;s names all ended up as &ldquo;Thorne&rdquo;, I guess because the original authority had that as its name value:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
text_value | authority | confidence
------------------+--------------------------------------+------------
Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
</code></pre><ul>
<li>I generated a new UUID using <code>uuidgen | tr [A-Z] [a-z]</code> and set it along with the correct name variation for all records:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
UPDATE 43
</code></pre><ul>
<li>Apparently we also need to normalize Phil Thornton&rsquo;s names to <code>Thornton, Philip K.</code>:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
text_value | authority | confidence
---------------------+--------------------------------------+------------
Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
@ -506,7 +506,7 @@ UPDATE 43
</code></pre><ul>
<li>Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
UPDATE 362
</code></pre><ul>
<li>It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)</li>
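<li>In other words, something like this, in that order:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace index-authority
$ [dspace]/bin/dspace index-discovery -b
</code></pre><ul>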
@ -520,7 +520,7 @@ UPDATE 362
<li>Set PostgreSQL&rsquo;s <code>shared_buffers</code> on CGSpace to 10% of system RAM (1200MB)</li>
<li>Run the following author corrections on CGSpace:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
</code></pre><ul>
<li>The authority IDs are different now from when I was looking a few days ago, so I had to adjust them here</li>
@ -534,7 +534,7 @@ dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab76
<ul>
<li>Looking at CIAT records from last week again, they have a lot of double authors like:</li>
</ul>
<pre><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
<pre tabindex="0"><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
</code></pre><ul>
@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
<li>Removing the duplicates in OpenRefine and uploading a CSV to DSpace says &ldquo;no changes detected&rdquo;</li>
<li>Seems like the only way to sort of clean these up would be to start in SQL:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
text_value | authority | confidence
-----------------------------------------------+--------------------------------------+------------
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1
@ -577,14 +577,14 @@ UPDATE 35
<li>So basically, new cron jobs for logs should look something like this:</li>
<li>Find any file named <code>*.log*</code> that isn&rsquo;t <code>dspace.log*</code>, isn&rsquo;t already zipped, and is older than one day, and zip it:</li>
</ul>
<pre><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*&quot; ! -iregex &quot;.*dspace\.log.*&quot; ! -iregex &quot;.*\.(gz|lrz|lzo|xz)&quot; ! -newermt &quot;Yesterday&quot; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*&quot; ! -iregex &quot;.*dspace\.log.*&quot; ! -iregex &quot;.*\.(gz|lrz|lzo|xz)&quot; ! -newermt &quot;Yesterday&quot; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
</code></pre><ul>
<li>Since <code>xzgrep</code> and <code>xzless</code> exist, we can actually just zip them after one day, why not?!</li>
<li>We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that (sketched below)</li>
<li>I use <code>schedtool -B</code> and <code>ionice -c2 -n7</code> to set the CPU scheduling to <code>SCHED_BATCH</code> and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less</li>
<li>When the tasks are running you can see that the policies do apply:</li>
</ul>
<pre><code>$ schedtool $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') &amp;&amp; ionice -p $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}')
<pre tabindex="0"><code>$ schedtool $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}') &amp;&amp; ionice -p $(ps aux | grep &quot;xz /home&quot; | grep -v grep | awk '{print $2}')
PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf
best-effort: prio 7
</code></pre><ul>
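<li>For the two-week retention mentioned above, a second job could delete the compressed logs once they are old enough, something like this (a sketch, not yet added to the cron jobs):</li>
</ul>
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &quot;.*\.log.*\.(gz|lrz|lzo|xz)&quot; ! -iregex &quot;.*dspace\.log.*&quot; -mtime +14 -delete
</code></pre><ul>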
@ -594,7 +594,7 @@ best-effort: prio 7
<li>Some users pointed out issues with the &ldquo;most popular&rdquo; stats on a community or collection</li>
<li>This error appears in the logs when you try to view them:</li>
</ul>
<pre><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
@ -679,7 +679,7 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
<li>None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then</li>
<li>Update some names and authorities in the database:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
UPDATE 204
dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
UPDATE 89
@ -692,7 +692,7 @@ UPDATE 140
<li>Enable OCSP stapling for hosts &gt;= Ubuntu 16.04 in our Ansible playbooks (<a href="https://github.com/ilri/rmg-ansible-public/pull/76">#76</a>)</li>
<li>Working for DSpace Test on the second response:</li>
</ul>
<pre><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response: no response sent
$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
@ -704,12 +704,12 @@ OCSP Response Data:
<li>Migrate CGSpace to new server, roughly following these steps:</li>
<li>On old server:</li>
</ul>
<pre><code># service tomcat7 stop
<pre tabindex="0"><code># service tomcat7 stop
# /home/backup/scripts/postgres_backup.sh
</code></pre><ul>
<li>On new server:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
# rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/solr/ /home/cgspace.cgiar.org/solr
@ -750,7 +750,7 @@ $ exit
<li>Abenet wanted a CSV of the IITA community, but the web export doesn&rsquo;t include the <code>dc.date.accessioned</code> field</li>
<li>I had to export it from the command line using the <code>-a</code> flag:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
<pre tabindex="0"><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><h2 id="2016-12-28">2016-12-28</h2>
<ul>
<li>We&rsquo;ve been getting two alerts per day about CPU usage on the new server from Linode</li>

View File

@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<ul>
<li>I tried to shard my local dev instance and it fails the same way:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace stats-util -s
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
@ -171,7 +171,7 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
</code></pre><ul>
<li>And the DSpace log shows:</li>
</ul>
<pre><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
<pre tabindex="0"><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-&gt;http://localhost:8081: Broken pipe (Write failed)
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
@ -179,7 +179,7 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
<li>The Tomcat access logs show more:</li>
</ul>
<pre><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80596';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80596';
</code></pre><ul>
<li>Then I deleted the others:</li>
</ul>
<pre><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
</code></pre><ul>
<li>And in the item view it now shows the correct mappings</li>
<li>I will have to ask the DSpace people if this is a valid approach</li>
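<li>To find other items with a suspicious number of mappings, a query like this should work (the threshold is arbitrary):</li>
</ul>
<pre tabindex="0"><code>dspace=# select item_id, count(*) from collection2item group by item_id having count(*) &gt; 5 order by count(*) desc;
</code></pre><ul>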
@ -223,24 +223,24 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
<li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li>
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li>
</ul>
<pre><code>Traceback (most recent call last):
<pre tabindex="0"><code>Traceback (most recent call last):
File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt;
print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
</code></pre><ul>
<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li>
</ul>
<pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8')))
<pre tabindex="0"><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8')))
</code></pre><ul>
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li>
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Now get the top 500 journal titles:</li>
</ul>
<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li>
<li>I will have to go through these and fix some more before making the controlled vocabulary</li>
@ -254,7 +254,7 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
<ul>
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
</ul>
<pre><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
<pre tabindex="0"><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = '91082';
@ -266,20 +266,20 @@ delete from collection2item where id = '91082';
<li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
</ul>
<pre><code>value.replace(&quot;'&quot;,'%27')
<pre tabindex="0"><code>value.replace(&quot;'&quot;,'%27')
</code></pre><ul>
<li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre><ul>
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><ul>
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
<li>These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without compromising the quality too much:</li>
</ul>
<pre><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
<pre tabindex="0"><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
</code></pre><ul>
<li>Someone on the Internet suggested using a DPI of 144</li>
@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are</li>
<li>Import 232 CIAT records into CGSpace:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
<ul>
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace and carriage return characters from Excel&rsquo;s CSV exporter)</li>
@ -300,22 +300,22 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
<li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li>
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
</ul>
<pre><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&quot;$community&quot; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&quot;$community&quot; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&quot;$community&quot;; done
</code></pre><ul>
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
</ul>
<pre><code>10568/42161 10568/171 10568/79341
<pre tabindex="0"><code>10568/42161 10568/171 10568/79341
10568/41914 10568/171 10568/79340
</code></pre><h2 id="2017-01-24">2017-01-24</h2>
<ul>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Run fixes for Journal titles on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
</code></pre><ul>
<li>Create a new list of the top 500 journal titles from the database:</li>
</ul>
<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/69">#69</a>)</li>

View File

@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -140,7 +140,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -166,7 +166,7 @@ DELETE 1
<li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li>
<li>Start testing some nearly 500 author corrections that CCAFS sent me:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-09">2017-02-09</h2>
<ul>
<li>More work on CCAFS Phase II stuff</li>
@ -175,7 +175,7 @@ DELETE 1
<li>It&rsquo;s not a very good way to manage the registry, though, as removing one there doesn&rsquo;t cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created</li>
<li>Testing some corrections on CCAFS Phase II flagships (<code>cg.subject.ccafs</code>):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-10">2017-02-10</h2>
<ul>
<li>CCAFS said they want to wait on the flagship updates (<code>cg.subject.ccafs</code>) on CGSpace, perhaps for a month or so</li>
@ -215,46 +215,46 @@ DELETE 1
<li>Fix an issue with a duplicate declaration in the atmire-dspace-xmlui <code>pom.xml</code> (causing non-fatal warnings during the Maven build)</li>
<li>Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site&rsquo;s properties file:</li>
</ul>
<pre><code>handle.canonical.prefix = https://hdl.handle.net/
<pre tabindex="0"><code>handle.canonical.prefix = https://hdl.handle.net/
</code></pre><ul>
<li>And then a SQL command to update existing records:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
UPDATE 58193
</code></pre><ul>
<li>Seems to work fine!</li>
<li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li>
</ul>
<pre><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
</code></pre><ul>
<li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
</code></pre><ul>
<li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.x</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
</code></pre><ul>
<li>Fix DOIs like <code>http//</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org./</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
</code></pre><ul>
<li>Delete some invalid DOIs:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
</code></pre><ul>
<li>Fix some other random outliers:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
@ -263,13 +263,13 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
</code></pre><ul>
<li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
</code></pre><ul>
<li>Run all DOI corrections on CGSpace</li>
<li>Something to think about here is to write a <a href="https://wiki.lyrasis.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li>
<li>Then we could add a cron job for them and run them from the command line like:</li>
</ul>
<pre><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
<pre tabindex="0"><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
</code></pre><h2 id="2017-02-20">2017-02-20</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot the server</li>
@ -280,7 +280,7 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
<li>Testing the <code>fix-metadata-values.py</code> script on macOS and it seems like we don&rsquo;t need to use <code>.encode('utf-8')</code> anymore when printing strings to the screen</li>
<li>It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string &ldquo;Entwicklung &amp; Ländlicher Raum&rdquo; without the <code>encode()</code> call, but print it as a bytes object when it <em>is</em> used:</li>
</ul>
<pre><code>$ python
<pre tabindex="0"><code>$ python
Python 3.6.0 (default, Dec 25 2016, 17:30:53)
&gt;&gt;&gt; print('Entwicklung &amp; Ländlicher Raum')
Entwicklung &amp; Ländlicher Raum
@ -294,7 +294,7 @@ b'Entwicklung &amp; L\xc3\xa4ndlicher Raum'
<li>Testing regenerating PDF thumbnails, like I started in 2016-11</li>
<li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &quot;ImageMagick PDF Thumbnail&quot;
File: earlywinproposal_esa_postharvest.pdf.jpg
FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
File: postHarvest.jpg.jpg
@ -302,7 +302,7 @@ FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
</code></pre><ul>
<li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li>
</ul>
<pre><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
<pre tabindex="0"><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
</code></pre><ul>
<li>I&rsquo;ve sent a message to the mailing list and might file a Jira issue</li>
@ -317,7 +317,7 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A
<ul>
<li>Find all fields with &ldquo;<a href="http://hdl.handle.net">http://hdl.handle.net</a>&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li>
</ul>
<pre><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
UPDATE 58633
</code></pre><ul>
@ -328,7 +328,7 @@ UPDATE 58633
<ul>
<li>LDAP users cannot log in today, looks to be an issue with CGIAR&rsquo;s LDAP server:</li>
</ul>
<pre><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269
<pre tabindex="0"><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269
CONNECTED(00000003)
depth=0 CN = SVCGROOT2.CGIARAD.ORG
verify error:num=20:unable to get local issuer certificate
@ -345,7 +345,7 @@ Certificate chain
<li>For some reason it is now signed by a private certificate authority</li>
<li>This error seems to have started on 2017-02-25:</li>
</ul>
<pre><code>$ grep -c &quot;unable to find valid certification path&quot; [dspace]/log/dspace.log.2017-02-*
<pre tabindex="0"><code>$ grep -c &quot;unable to find valid certification path&quot; [dspace]/log/dspace.log.2017-02-*
[dspace]/log/dspace.log.2017-02-01:0
[dspace]/log/dspace.log.2017-02-02:0
[dspace]/log/dspace.log.2017-02-03:0
@ -381,7 +381,7 @@ Certificate chain
<li>The problem likely lies in the logic of <code>ImageMagickThumbnailFilter.java</code>, as <code>ImageMagickPdfThumbnailFilter.java</code> extends it</li>
<li>Run CIAT corrections on CGSpace</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
</code></pre><ul>
<li>CGNET has fixed the certificate chain on their LDAP server</li>
<li>Redeploy CGSpace and DSpace Test to on latest <code>5_x-prod</code> branch with fixes for LDAP bind user</li>
@ -393,16 +393,16 @@ Certificate chain
<li>Ah, this is probably because some items have the <code>International Center for Tropical Agriculture</code> author twice, which I first noticed in 2016-12 but couldn&rsquo;t figure out how to fix</li>
<li>I think I can do it by first exporting all metadatavalues that have the author <code>International Center for Tropical Agriculture</code></li>
</ul>
<pre><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
COPY 1968
</code></pre><ul>
<li>And then use awk to print the duplicate lines to a separate file:</li>
</ul>
<pre><code>$ awk -F',' 'seen[$1]++' /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
<pre tabindex="0"><code>$ awk -F',' 'seen[$1]++' /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
</code></pre><ul>
<li>From that file I can create a list of 279 deletes and put them in a batch script like:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
</code></pre>
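<ul>
<li>For reference, a sketch of how that batch script could be generated from the dupes CSV with awk (per the earlier <code>\copy</code> the second CSV column is the <code>metadata_value_id</code>):</li>
</ul>
<pre tabindex="0"><code>$ awk -F',' '{print &quot;delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=&quot; $2 &quot;;&quot;}' /tmp/ciat-dupes.csv &gt; /tmp/ciat-deletes.sql
</code></pre>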

View File

@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -156,7 +156,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
</code></pre><ul>
<li>This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:</li>
@ -171,7 +171,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>I created a patch for DS-3517 and made a pull request against upstream <code>dspace-5_x</code>: <a href="https://github.com/DSpace/DSpace/pull/1669">https://github.com/DSpace/DSpace/pull/1669</a></li>
<li>Looks like <code>-colorspace sRGB</code> alone isn&rsquo;t enough, we need to use profiles:</li>
</ul>
<pre><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
<pre tabindex="0"><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
</code></pre><ul>
<li>This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file</li>
<li>Note that you should set the first profile immediately after the input file</li>
@ -180,7 +180,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<li>Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)</li>
<li>This is trivial with <code>identify</code> (even by the <a href="http://im4java.sourceforge.net/api/org/im4java/core/IMOps.html#identify">Java ImageMagick API</a>):</li>
</ul>
<pre><code>$ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
<pre tabindex="0"><code>$ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
DirectClass CMYK
$ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
DirectClass sRGB Alpha
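# a rough sketch (not the actual Java logic) of how that detection could drive the profile handling:
$ if identify -format '%r\n' alc_contrastes_desafios.pdf\[0\] | grep -q CMYK; then echo &quot;apply CMYK/RGB profiles&quot;; else echo &quot;no profile conversion needed&quot;; fi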
@ -196,7 +196,7 @@ DirectClass sRGB Alpha
<li>They want something like the items that are returned by the general &ldquo;LAND&rdquo; query in the search interface, but we cannot do that</li>
<li>We can only return specific results for metadata fields, like:</li>
</ul>
<pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;LAND REFORM&quot;, &quot;language&quot;: null}' | json_pp
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;LAND REFORM&quot;, &quot;language&quot;: null}' | json_pp
</code></pre><ul>
<li>But there are hundreds of combinations of fields and values (like <code>dc.subject</code> and all the center subjects), and we can&rsquo;t use wildcards in REST!</li>
<li>Reading about enabling multiple handle prefixes in DSpace</li>
@ -204,7 +204,7 @@ DirectClass sRGB Alpha
<li>And a comment from Atmire&rsquo;s Bram about it on the DSpace wiki: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li>
<li>Bram mentions an undocumented configuration option <code>handle.plugin.checknameauthority</code>, but I noticed another one in <code>dspace.cfg</code>:</li>
</ul>
<pre><code># List any additional prefixes that need to be managed by this handle server
<pre tabindex="0"><code># List any additional prefixes that need to be managed by this handle server
# (as for examle handle prefix coming from old dspace repository merged in
# that repository)
# handle.additional.prefixes = prefix1[, prefix2]
@ -212,20 +212,20 @@ DirectClass sRGB Alpha
<li>Because of this I noticed that our Handle server&rsquo;s <code>config.dct</code> was potentially misconfigured!</li>
<li>We had some default values still present:</li>
</ul>
<pre><code>&quot;300:0.NA/YOUR_NAMING_AUTHORITY&quot;
<pre tabindex="0"><code>&quot;300:0.NA/YOUR_NAMING_AUTHORITY&quot;
</code></pre><ul>
<li>I&rsquo;ve changed them to the following and restarted the handle server:</li>
</ul>
<pre><code>&quot;300:0.NA/10568&quot;
<pre tabindex="0"><code>&quot;300:0.NA/10568&quot;
</code></pre><ul>
<li>In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk</li>
<li>From <code>dspace/config/crosswalks/google-metadata.properties</code>:</li>
</ul>
<pre><code>google.citation_doi = cg.identifier.doi
<pre tabindex="0"><code>google.citation_doi = cg.identifier.doi
</code></pre><ul>
<li>This works, and makes DSpace output the following metadata on the item view page:</li>
</ul>
<pre><code>&lt;meta content=&quot;https://dx.doi.org/10.1186/s13059-017-1153-y&quot; name=&quot;citation_doi&quot;&gt;
<pre tabindex="0"><code>&lt;meta content=&quot;https://dx.doi.org/10.1186/s13059-017-1153-y&quot; name=&quot;citation_doi&quot;&gt;
</code></pre><ul>
<li>Submitted and merged pull request for this: <a href="https://github.com/ilri/DSpace/pull/305">https://github.com/ilri/DSpace/pull/305</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&rdquo;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
@ -260,18 +260,18 @@ DirectClass sRGB Alpha
<ul>
<li>Export list of sponsors so Peter can clean it up:</li>
</ul>
<pre><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
COPY 285
</code></pre><h2 id="2017-03-12">2017-03-12</h2>
<ul>
<li>Test the sponsorship fixes and deletes from Peter:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Generate a new list of unique sponsors so we can update the controlled vocabulary:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>Pull request for controlled vocabulary if Peter approves: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li>
<li>Review Sisay&rsquo;s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: <a href="https://github.com/ilri/DSpace/pull/307">https://github.com/ilri/DSpace/pull/307</a></li>
@ -311,12 +311,12 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ul>
<li>CCAFS said they are ready for the flagship updates for Phase II to be run (<code>cg.subject.ccafs</code>), so I ran them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>We&rsquo;ve been waiting since February to run these</li>
<li>Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
</code></pre><ul>
<li>I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc</li>
<li>Test, squash, and merge Sisay&rsquo;s RTB theme into <code>5_x-prod</code>: <a href="https://github.com/ilri/DSpace/pull/316">https://github.com/ilri/DSpace/pull/316</a></li>
@ -325,11 +325,11 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
<ul>
<li>Dump a list of fields in the DC and CG schemas to compare with CG Core:</li>
</ul>
<pre><code>dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><ul>
<li>Ooh, a better one!</li>
</ul>
<pre><code>dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
<pre tabindex="0"><code>dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><h2 id="2017-03-30">2017-03-30</h2>
<ul>
<li>Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as generally the nightly Solr indexing causes a usage around 150–190%, so this should make the alerts less regular</li>

View File

@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -136,16 +136,16 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
<ul>
<li>Continue testing the CMYK patch on more communities:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
</code></pre><ul>
<li>So far there are almost 500:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
484
</code></pre><ul>
<li>Looking at the CG Core document again, I&rsquo;ll send some feedback to Peter and Abenet:
@ -157,39 +157,39 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
</li>
<li>Also, I&rsquo;m noticing some weird outliers in <code>cg.coverage.region</code>, need to remember to go correct these later:</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
</code></pre><h2 id="2017-04-04">2017-04-04</h2>
<ul>
<li>The <code>filter-media</code> script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
1584
</code></pre><ul>
<li>Trying to find a way to get the number of items submitted by a certain user in 2016</li>
<li>It&rsquo;s not possible in the DSpace search / module interfaces, but it might be derivable from <code>dc.description.provenance</code>, as that field contains the name and email of the submitter/approver, ie:</li>
</ul>
<pre><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
<pre tabindex="0"><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
</code></pre><ul>
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a &ldquo;checksum&rdquo; (ie, there was a bitstream in the submission):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
</code></pre><ul>
<li>Then this one does the same, but for fields that don&rsquo;t contain checksums (ie, there was no bitstream in the submission):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
</code></pre><ul>
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled&hellip;</li>
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
<ul>
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
</ul>
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
2505
</code></pre><h2 id="2017-04-06">2017-04-06</h2>
<ul>
@ -260,7 +260,7 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
<li>I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace</li>
<li>Running <code>dspace oai import</code> and <code>dspace oai clean-cache</code> has zero effect, but this seems to rebuild the cache from scratch:</li>
</ul>
<pre><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
...
63900 items imported so far...
64000 items imported so far...
@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
<li>The import command should theoretically catch situations like this where an item&rsquo;s metadata was updated, but in this case we changed the metadata schema and it doesn&rsquo;t seem to catch it (could be a bug!)</li>
<li>Attempting a full rebuild of OAI on CGSpace:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
@ -326,14 +326,14 @@ sys 1m29.310s
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(435) is still referenced from table &quot;bundle&quot;.
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
<ul>
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
<li>Setup and run with:</li>
</ul>
<pre><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
<pre tabindex="0"><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
$ cd ckm-cgspace-rest-api/app
$ gem install bundler
$ bundle
@ -342,12 +342,12 @@ $ rails -s
</code></pre><ul>
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
</ul>
<pre><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
</code></pre><ul>
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
</ul>
<pre><code>$ bundle binstubs puma --path ./sbin
<pre tabindex="0"><code>$ bundle binstubs puma --path ./sbin
</code></pre><h2 id="2017-04-19">2017-04-19</h2>
<ul>
<li>Usman sent another link to their OAI interface, where the country names are now capitalized: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
@ -360,15 +360,15 @@ $ rails -s
<li>Looking at 933 CIAT records from Sisay, he&rsquo;s having problems creating a SAF bundle to import to DSpace Test</li>
<li>I started by looking at his CSV in OpenRefine, and I see there are a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
</ul>
<pre><code>value.replace(&quot; ||&quot;,&quot;||&quot;).replace(&quot;|| &quot;,&quot;||&quot;).replace(&quot; || &quot;,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(&quot; ||&quot;,&quot;||&quot;).replace(&quot;|| &quot;,&quot;||&quot;).replace(&quot; || &quot;,&quot;||&quot;)
</code></pre><ul>
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
</ul>
<pre><code>unescape(value,&quot;url&quot;)
<pre tabindex="0"><code>unescape(value,&quot;url&quot;)
</code></pre><ul>
<li>Then create the filename column using the following transform from URL:</li>
</ul>
<pre><code>value.split('/')[-1].replace(/#.*$/,&quot;&quot;)
<pre tabindex="0"><code>value.split('/')[-1].replace(/#.*$/,&quot;&quot;)
</code></pre><ul>
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don&rsquo;t want on the filename</li>
<li>Also, we need to only use the PDF on the item corresponding with page 1, so we don&rsquo;t end up with literally hundreds of duplicate PDFs</li>
@ -381,7 +381,7 @@ $ rails -s
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre><code>value.replace(/\|\|$/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/\|\|$/,&quot;&quot;)
</code></pre><ul>
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
@ -391,15 +391,15 @@ $ rails -s
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>
<pre><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
<pre tabindex="0"><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre><ul>
<li>Add a description to the file names using:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre><ul>
<li>Test import of 933 records:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
<pre tabindex="0"><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre><ul>
@ -409,7 +409,7 @@ $ wc -l /tmp/ciat
<li>More work on Ansible infrastructure stuff for Tsega&rsquo;s CKM DSpace REST API</li>
<li>I&rsquo;m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
<ul>
@ -417,13 +417,13 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media
<li>The solution is to remove the ID (ie set to NULL) from the <code>primary_bitstream_id</code> column in the <code>bundle</code> table</li>
<li>After doing that and running the <code>cleanup</code> task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
</code></pre><h2 id="2017-04-24">2017-04-24</h2>
<ul>
<li>Two users mentioned some items they recently approved not showing up in the search / XMLUI</li>
<li>I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:</li>
</ul>
<pre><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
<pre tabindex="0"><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
2017-04-24 00:00:15,586 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
</code></pre><ul>
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
</ul>
<pre><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
<pre tabindex="0"><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
[dspace]/log/dspace.log.2017-04-01:0
[dspace]/log/dspace.log.2017-04-02:0
[dspace]/log/dspace.log.2017-04-03:0
@ -475,12 +475,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
</code></pre><ul>
<li>I restarted Tomcat and re-ran the discovery process manually:</li>
</ul>
<pre><code>[dspace]/bin/dspace index-discovery
<pre tabindex="0"><code>[dspace]/bin/dspace index-discovery
</code></pre><ul>
<li>Now everything is ok</li>
<li>Finally finished manually running the cleanup task over and over and null&rsquo;ing the conflicting IDs:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
</code></pre><ul>
<li>Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it&rsquo;s likely we haven&rsquo;t had a cleanup task complete successfully in years&hellip;</li>
</ul>
@ -489,12 +489,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
<li>Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751</li>
<li>Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:</li>
</ul>
<pre><code># find [dspace]/assetstore/ -type f | wc -l
<pre tabindex="0"><code># find [dspace]/assetstore/ -type f | wc -l
113104
</code></pre><ul>
<li>Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning; after finishing at 100% it has this error:</li>
</ul>
<pre><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
<pre tabindex="0"><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
@ -557,7 +557,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
<li>The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though</li>
<li>Update RVM&rsquo;s Ruby from 2.3.0 to 2.4.0 on DSpace Test:</li>
</ul>
<pre><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
<pre tabindex="0"><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
... reload shell to get new Ruby
$ gem install sass -v 3.3.14

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -131,7 +131,7 @@
<li>Discovered that CGSpace has ~700 items that are missing the <code>cg.identifier.status</code> field</li>
<li>Need to perhaps try using the &ldquo;required metadata&rdquo; curation task to find fields missing these items:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - &gt; /tmp/curation.out
<pre tabindex="0"><code>$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - &gt; /tmp/curation.out
</code></pre><ul>
<li>It seems the curation task dies when it finds an item which has missing metadata</li>
</ul>
@ -145,7 +145,7 @@
<ul>
<li>Testing one replacement for CCAFS Flagships (<code>cg.subject.ccafs</code>), first changed in the submission forms, and then in the database:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones</li>
<li>Waiting for feedback from CCAFS, then I can merge <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
@ -159,7 +159,7 @@
<li>This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using <code>dspace cleanup -v</code>, or else you&rsquo;ll run out of disk space</li>
<li>In the end I realized it&rsquo;s better to use submission mode (<code>-s</code>) to ingest the community object as a single AIP without its children, followed by each of the collections:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&quot;
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
@ -184,13 +184,13 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The CGIAR Library metadata has some blank metadata values, which leads to <code>|||</code> in the Discovery facets</li>
<li>Clean these up in the database using:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
</code></pre><ul>
<li>I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up</li>
<li>Hours into the re-ingestion I ran into more errors, and had to erase everything and start over <em>again</em>!</li>
<li>Now, no matter what I do I keep getting foreign key errors&hellip;</li>
</ul>
<pre><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
<pre tabindex="0"><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
Detail: Key (handle_id)=(80928) already exists.
</code></pre><ul>
<li>I think those errors actually come from me running the <code>update-sequences.sql</code> script while Tomcat/DSpace are running</li>
@ -202,7 +202,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields</li>
<li>Finally finished importing all the CGIAR Library content, final method was:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&quot;
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
@ -215,7 +215,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>The <code>-XX:-UseGCOverheadLimit</code> JVM option helps with some issues in large imports</li>
<li>After this I ran the <code>update-sequences.sql</code> script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
</code></pre><h2 id="2017-05-13">2017-05-13</h2>
<ul>
<li>After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually <a href="https://en.wikipedia.org/wiki/Null_character">NUL</a> characters in the <code>dc.description.abstract</code> field (at least) on the lines where CSV importing was failing</li>
@ -230,7 +230,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Merge changes to CCAFS project identifiers and flagships: <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
<li>Run updates for CCAFS flagships on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>
<p>These include:</p>
@ -258,19 +258,19 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<ul>
<li>Looking into the error I get when trying to create a new collection on DSpace Test:</li>
</ul>
<pre><code>ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot; Detail: Key (handle_id)=(84834) already exists.
<pre tabindex="0"><code>ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot; Detail: Key (handle_id)=(84834) already exists.
</code></pre><ul>
<li>I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn&rsquo;t helped</li>
<li>It appears the item with <code>handle_id</code> 84834 is one of the imported CGIAR Library items:</li>
</ul>
<pre><code>dspace=# select * from handle where handle_id=84834;
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=84834;
handle_id | handle | resource_type_id | resource_id
-----------+------------+------------------+-------------
84834 | 10947/1332 | 2 | 87113
</code></pre><ul>
<li>Looks like the max <code>handle_id</code> is actually much higher:</li>
</ul>
<pre><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
handle_id | handle | resource_type_id | resource_id
-----------+----------+------------------+-------------
86873 | 10947/99 | 2 | 89153
@ -279,7 +279,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>I&rsquo;ve posted on the dspace-tech mailing list to see if I can just manually set the <code>handle_seq</code> to that value</li>
<li>Actually, it seems I can manually set the handle sequence using:</li>
</ul>
<pre><code>dspace=# select setval('handle_seq',86873);
<pre tabindex="0"><code>dspace=# select setval('handle_seq',86873);
</code></pre><ul>
<li>After that I can create collections just fine, though I&rsquo;m not sure if it has other side effects</li>
</ul>
@ -294,11 +294,11 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
<li>Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested</li>
<li>Peter wanted a list of authors in here, so I generated a list of collections using the &ldquo;View Source&rdquo; on each community and this hacky awk:</li>
</ul>
<pre><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3&quot;/&quot;$4}' | awk -F\&quot; '{print $1}' | vim -
<pre tabindex="0"><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3&quot;/&quot;$4}' | awk -F\&quot; '{print $1}' | vim -
</code></pre><ul>
<li>Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:</li>
</ul>
<pre><code>dspace=# select distinct text_value
<pre tabindex="0"><code>dspace=# select distinct text_value
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
AND resource_type_id = 2
@ -314,7 +314,7 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
</code></pre><ul>
<li>To get a CSV (with counts) from that:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*)
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*)
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
AND resource_type_id = 2
@ -326,7 +326,7 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
<li>For now I&rsquo;ve suggested that they just change the collection names and that we fix their metadata manually afterwards</li>
<li>Also, they have a lot of messed up values in their <code>cg.subject.wle</code> field so I will clean up some of those first:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
COPY 111
</code></pre><ul>
<li>Respond to Atmire message about ORCIDs, saying that right now we&rsquo;d prefer to just have them available via REST API like any other metadata field, and that I&rsquo;m available for a Skype</li>
@ -343,21 +343,21 @@ COPY 111
<li>Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Find all of Amos Omore&rsquo;s author name variations so I can link them to his authority entry that has an ORCID:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
</code></pre><ul>
<li>Set the authority for all variations to one containing an ORCID:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
UPDATE 187
</code></pre><ul>
<li>Next I need to do Edgar Twine:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
</code></pre><ul>
<li>But it doesn&rsquo;t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via &ldquo;Edit this Item&rdquo; and looked up his ORCID and linked it there</li>
<li>Now I should be able to set his name variations to the new authority:</li>
</ul>
<pre><code>dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
<pre tabindex="0"><code>dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
</code></pre><ul>
<li>Run the corrections on CGSpace and then update discovery / authority</li>
<li>I notice that there are a handful of <code>java.lang.OutOfMemoryError: Java heap space</code> errors in the Catalina logs on CGSpace, I should go look into that&hellip;</li>

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -153,7 +153,7 @@
<li>17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF</li>
<li>I&rsquo;ve flagged them and proceeded without them (752 total) on DSpace Test:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
</code></pre><ul>
<li>I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)</li>
<li>Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT</li>
@ -167,7 +167,7 @@
<li>Created a new branch with just the relevant changes, so I can send it to them</li>
<li>One thing I noticed is that there is a failed database migration related to CUA:</li>
</ul>
<pre><code>+----------------+----------------------------+---------------------+---------+
<pre tabindex="0"><code>+----------------+----------------------------+---------------------+---------+
| Version | Description | Installed on | State |
+----------------+----------------------------+---------------------+---------+
| 1.1 | Initial DSpace 1.1 databas | | PreInit |
@ -213,7 +213,7 @@
</li>
<li>Finally import 914 CIAT Book Chapters to CGSpace in two batches:</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
</code></pre><h2 id="2017-06-25">2017-06-25</h2>
<ul>
@ -221,7 +221,7 @@ $ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace impo
<li>Pull request with the changes to <code>input-forms.xml</code>: <a href="https://github.com/ilri/DSpace/pull/329">#329</a></li>
<li>As of now it doesn&rsquo;t look like there are any items using this research theme so we don&rsquo;t need to do any updates:</li>
</ul>
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
text_value
------------
(0 rows)
@ -233,7 +233,7 @@ $ JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot; [dspace]/bin/dspace impo
<ul>
<li>CGSpace went down briefly, I see lots of these errors in the dspace logs:</li>
</ul>
<pre><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
<pre tabindex="0"><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load</li>
<li>Might be a good time to adjust DSpace&rsquo;s database connection settings, like I first mentioned in April, 2017 after reading the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">2017-04 DCAT comments</a></li>

View File

@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li>
<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
<pre><code>$ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::'
<pre tabindex="0"><code>$ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::'
</code></pre><ul>
<li>The <code>sed</code> script is from a post on the <a href="https://www.postgresql.org/message-id/437E44A5.508%40ultimeth.com">PostgreSQL mailing list</a></li>
<li>Abenet says the ILRI board wants to be able to have &ldquo;lead author&rdquo; for every item, so I&rsquo;ve whipped up a WIP test in the <code>5_x-lead-author</code> branch</li>
@ -151,11 +151,11 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<li>Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (<a href="https://github.com/ilri/DSpace/pull/330">#330</a>)</li>
<li>Generate list of fields in the current CGSpace <code>cg</code> scheme so we can record them properly in the metadata registry:</li>
</ul>
<pre><code>$ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::' &gt; cg-types.xml
<pre tabindex="0"><code>$ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::' &gt; cg-types.xml
</code></pre><ul>
<li>CGSpace was unavailable briefly, and I saw this error in the DSpace log file:</li>
</ul>
<pre><code>2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
</code></pre><ul>
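<li>For reference, the live connection count is easy to check directly in <code>pg_stat_activity</code>; just a sketch, using the same database name as the other psql commands in these notes:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;'
</code></pre><ul>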
<li>Looking at the <code>pg_stat_activity</code> table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense</li>
@ -163,7 +163,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<li>Abenet said she was generating a report with Atmire&rsquo;s CUA module, so it could be due to that?</li>
<li>Looking in the logs I see this random error again that I should report to DSpace:</li>
</ul>
<pre><code>2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
<pre tabindex="0"><code>2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
</code></pre><ul>
<li>Seems to come from <code>dspace-api/src/main/java/org/dspace/statistics/SolrLogger.java</code></li>
</ul>
@ -211,7 +211,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Move two top-level communities to be sub-communities of ILRI Projects</li>
</ul>
<pre><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&quot;$community&quot;; done
<pre tabindex="0"><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&quot;$community&quot;; done
</code></pre><ul>
<li>Discuss CGIAR Library data cleanup with Sisay and Abenet</li>
</ul>
@ -241,7 +241,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
<ul>
<li>Looks like the final list of metadata corrections for CCAFS project tags will be:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
@ -250,7 +250,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
<li>Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations</li>
<li>Looking at CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grep it)!</li>
</ul>
<pre><code>$ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
52
</code></pre><ul>
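<li>As a quick sanity check (only a sketch) I could reverse-resolve a few of those IPs from the saved activity page to see whether they really belong to Baidu&rsquo;s crawler:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E '180\.76\.[0-9]+\.[0-9]+' /tmp/status | sort -u | head -n 5 | xargs -n1 host
</code></pre><ul>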
<li>From looking at the <code>dspace.log</code> I see they are all using the same session, which means our Crawler Session Manager Valve is working</li>


@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -215,7 +215,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I need to get an author list from the database for only the CGIAR Library community to send to Peter</li>
<li>It turns out that I had already used this SQL query in <a href="/cgspace-notes/2017-05">May, 2017</a> to get the authors from CGIAR Library:</li>
</ul>
<pre><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
</code></pre><ul>
<li>Meeting with Peter and CGSpace team
<ul>
@ -242,7 +242,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>I sent a message to the mailing list about the duplicate content issue with <code>/rest</code> and <code>/bitstream</code> URLs</li>
<li>Looking at the logs for the REST API on <code>/rest</code>, it looks like someone is hammering it, doing testing or something&hellip;</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
140 66.249.66.91
404 66.249.66.90
1479 50.116.102.77
@ -252,7 +252,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<li>The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead</li>
<li>I&rsquo;ve enabled logging of <code>/oai</code> requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)</li>
</ul>
<pre><code> # log oai requests
<pre tabindex="0"><code> # log oai requests
location /oai {
access_log /var/log/nginx/oai.log;
proxy_pass http://tomcat_http;
@ -266,11 +266,11 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
<ul>
<li>Run author corrections on CGIAR Library community from Peter</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
</code></pre><ul>
<li>There were only three deletions so I just did them manually:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
DELETE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
</code></pre><ul>
@ -279,7 +279,7 @@ dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_i
<li>In that thread Chris Wilper suggests a new default of 35 max connections for <code>db.maxconnections</code> (from the current default of 30), knowing that <em>each DSpace web application</em> gets to use up to this many on its own</li>
<li>It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:</li>
</ul>
<pre><code>$ grep -rsI SQLException dspace-jspui | wc -l
<pre tabindex="0"><code>$ grep -rsI SQLException dspace-jspui | wc -l
473
$ grep -rsI SQLException dspace-oai | wc -l
63
@ -320,37 +320,37 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
<ul>
<li>I wanted to merge the various field variations like <code>cg.subject.system</code> and <code>cg.subject.system[en_US]</code> in OpenRefine but I realized it would be easier in PostgreSQL:</li>
</ul>
<pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
</code></pre><ul>
<li>And actually, we can do it for other generic fields for items in those collections, for example <code>dc.description.abstract</code>:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
</code></pre><ul>
<li>And on others like <code>dc.language.iso</code>, <code>dc.relation.ispartofseries</code>, <code>dc.type</code>, <code>dc.title</code>, etc&hellip;</li>
<li>Also, to move fields from <code>dc.identifier.url</code> to <code>cg.identifier.url[en_US]</code> (because we don&rsquo;t use the Dublin Core one for some reason):</li>
</ul>
<pre><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
<pre tabindex="0"><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
</code></pre><ul>
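<li>A quick check (just a sketch) that nothing was left behind in the old field after a move like that:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=237; -- the old field id from the update above
</code></pre><ul>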
<li>Set the text_lang of all <code>dc.identifier.uri</code> (Handle) fields to be NULL, just like default DSpace does:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
UPDATE 4248
</code></pre><ul>
<li>Also update the text_lang of <code>dc.contributor.author</code> fields for metadata in these collections:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
UPDATE 4899
</code></pre><ul>
<li>Wow, I just wrote this baller regex facet to find duplicate authors:</li>
</ul>
<pre><code>isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
<pre tabindex="0"><code>isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
</code></pre><ul>
<li>This would be true if the authors were like <code>CGIAR System Management Office||CGIAR System Management Office</code>, which some of the CGIAR Library&rsquo;s were</li>
<li>Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn&rsquo;t detect any changes, so you have to edit them all manually via DSpace&rsquo;s &ldquo;Edit Item&rdquo;</li>
<li>Ooh! And an even more interesting regex would match <em>any</em> duplicated author:</li>
</ul>
<pre><code>isNotNull(value.match(/(.+?)\|\|\1/))
<pre tabindex="0"><code>isNotNull(value.match(/(.+?)\|\|\1/))
</code></pre><ul>
<li>Which means it can also be used to find items with duplicate <code>dc.subject</code> fields&hellip;</li>
<li>Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it</li>
@ -365,12 +365,12 @@ UPDATE 4899
<li>Uptime Robot said CGSpace went down for 1 minute, not sure why</li>
<li>Looking in <code>dspace.log.2017-08-17</code> I see some weird errors that might be related?</li>
</ul>
<pre><code>2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
<pre tabindex="0"><code>2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
java.io.StreamCorruptedException: invalid stream header: 00000000
</code></pre><ul>
<li>Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:</li>
</ul>
<pre><code># grep -c &quot;ERROR net.sf.ehcache.store.DiskStore&quot; dspace.log.2017-08-*
<pre tabindex="0"><code># grep -c &quot;ERROR net.sf.ehcache.store.DiskStore&quot; dspace.log.2017-08-*
dspace.log.2017-08-01:0
dspace.log.2017-08-02:0
dspace.log.2017-08-03:0
@ -412,7 +412,7 @@ dspace.log.2017-08-17:584
<li>More information about authority framework: <a href="https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values">https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values</a></li>
<li>Wow, I&rsquo;m playing with the AGROVOC SPARQL endpoint using the <a href="https://github.com/tialaramex/sparql-query">sparql-query tool</a>:</li>
</ul>
<pre><code>$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
<pre tabindex="0"><code>$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
sparql$ PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
SELECT
?label
@ -452,7 +452,7 @@ WHERE {
<li>Since I cleared the XMLUI cache on 2017-08-17 there haven&rsquo;t been any more <code>ERROR net.sf.ehcache.store.DiskStore</code> errors</li>
<li>Look at the CGIAR Library to see if I can find the items that have been submitted since May:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z';
metadata_value_id | item_id | metadata_field_id | text_value | text_lang | place | authority | confidence
-------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
123117 | 5872 | 11 | 2017-06-28T13:05:18Z | | 1 | | -1
@ -465,7 +465,7 @@ WHERE {
<li>According to <code>dc.date.accessioned</code> (metadata field id 11) there have only been five items submitted since May</li>
<li>These are their handles:</li>
</ul>
<pre><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
handle
------------
10947/4658
@ -490,7 +490,7 @@ WHERE {
<li>I asked Sisay about this and hinted that he should go back and fix these things, but let&rsquo;s see what he says</li>
<li>Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:</li>
</ul>
<pre><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08</li>


@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account
<ul>
<li>Delete 58 blank metadata values from the CGSpace database:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 58
</code></pre><ul>
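<li>Counting them first is a cheap sanity check before repeating the delete on another instance (just a sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and text_value='';
</code></pre><ul>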
<li>I also ran it on DSpace Test because we&rsquo;ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate</li>
@ -145,7 +145,7 @@ DELETE 58
<li>There will need to be some metadata updates for that as well (though if I recall correctly it is only about seven records); I had made some notes about it in <a href="/cgspace-notes/2017-07">2017-07</a>, but I&rsquo;ve asked for more clarification from Lili just in case</li>
<li>Looking at the DSpace logs to see if we&rsquo;ve had a change in the &ldquo;Cannot get a connection&rdquo; errors since last month when we adjusted the <code>db.maxconnections</code> parameter on CGSpace:</li>
</ul>
<pre><code># grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-09-*
<pre tabindex="0"><code># grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
@ -174,14 +174,14 @@ dspace.log.2017-09-10:0
<li>The import process takes the same amount of time with and without the caching</li>
<li>Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):</li>
</ul>
<pre><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
</code></pre><ul>
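<li>The <code>0x47455420</code> in that capture filter is just the ASCII bytes for <code>GET </code>, which is easy to confirm (assuming <code>xxd</code> is installed):</li>
</ul>
<pre tabindex="0"><code>$ printf 'GET ' | xxd -p
47455420
</code></pre><ul>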
<li>Great TCP dump guide here: <a href="https://danielmiessler.com/study/tcpdump">https://danielmiessler.com/study/tcpdump</a></li>
<li>The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation</li>
<li>I sent a message to the mailing list to see if anyone knows more about this</li>
<li>In looking at the tcpdump results I notice that there is an update check to the ehcache server on <em>every</em> iteration of the ingest loop, for example:</li>
</ul>
<pre><code>09:39:36.008956 IP 192.168.8.124.50515 &gt; 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&amp;pageID=update.properties&amp;id=2130706433&amp;os-name=Mac+OS+X&amp;jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&amp;jvm-version=1.8.0_144&amp;platform=x86_64&amp;tc-version=UNKNOWN&amp;tc-product=Ehcache+Core+1.7.2&amp;source=Ehcache+Core&amp;uptime-secs=0&amp;patch=UNKNOWN HTTP/1.1
<pre tabindex="0"><code>09:39:36.008956 IP 192.168.8.124.50515 &gt; 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&amp;pageID=update.properties&amp;id=2130706433&amp;os-name=Mac+OS+X&amp;jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&amp;jvm-version=1.8.0_144&amp;platform=x86_64&amp;tc-version=UNKNOWN&amp;tc-product=Ehcache+Core+1.7.2&amp;source=Ehcache+Core&amp;uptime-secs=0&amp;patch=UNKNOWN HTTP/1.1
</code></pre><ul>
<li>Turns out this is a known issue and Ehcache has refused to make it opt-in: <a href="https://jira.terracotta.org/jira/browse/EHC-461">https://jira.terracotta.org/jira/browse/EHC-461</a></li>
<li>But we can disable it by adding an <code>updateCheck=&quot;false&quot;</code> attribute to the main <code>&lt;ehcache &gt;</code> tag in <code>dspace-services/src/main/resources/caching/ehcache-config.xml</code></li>
@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
<li>I wonder what was going on, and looking into the nginx logs I think maybe it&rsquo;s OAI&hellip;</li>
<li>Here is yesterday&rsquo;s top ten IP addresses making requests to <code>/oai</code>:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
</code></pre><ul>
<li>Compared to the previous day&rsquo;s logs it looks VERY high:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
@ -234,7 +234,7 @@ dspace.log.2017-09-10:0
</li>
<li>And this user agent has never been seen before today (or at least recently!):</li>
</ul>
<pre><code># grep -c &quot;API scraper&quot; /var/log/nginx/oai.log
<pre tabindex="0"><code># grep -c &quot;API scraper&quot; /var/log/nginx/oai.log
62088
# zgrep -c &quot;API scraper&quot; /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
@ -270,19 +270,19 @@ dspace.log.2017-09-10:0
<li>Some of these heavy users are also using XMLUI, and their user agent isn&rsquo;t matched by the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158">Tomcat Session Crawler valve</a>, so each request uses a different session</li>
<li>Yesterday alone the IP addresses using the <code>API scraper</code> user agent were responsible for 16,000 sessions in XMLUI:</li>
</ul>
<pre><code># grep -a -E &quot;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&quot; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -a -E &quot;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&quot; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
15924
</code></pre><ul>
<li>If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex</li>
<li>A search for &ldquo;API scraper&rdquo; user agent on Google returns a <code>robots.txt</code> with a comment that this is the Yewno bot: <a href="http://www.escholarship.org/robots.txt">http://www.escholarship.org/robots.txt</a></li>
<li>Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:</li>
</ul>
<pre><code>WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>Looking at the spreadsheet with deletions and corrections that CCAFS sent last week</li>
<li>It appears they want to delete a lot of metadata, which I&rsquo;m not sure they realize the implications of:</li>
</ul>
<pre><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;
text_value | count
--------------------------+-------
FP4_ClimateModels | 6
@ -309,14 +309,14 @@ dspace.log.2017-09-10:0
<li>I sent CCAFS people an email to ask if they really want to remove these 200+ tags</li>
<li>She responded yes, so I&rsquo;ll at least need to do these deletes in PostgreSQL:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
DELETE 207
</code></pre><ul>
<li>When we discussed this in late July there were some other renames they had requested, but I don&rsquo;t see them in the current spreadsheet so I will have to follow that up</li>
<li>I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, since their spreadsheet evolved organically rather than systematically!</li>
<li>The final list of corrections and deletes should therefore be:</li>
</ul>
<pre><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<li>Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database</li>
<li>Here are all my distinct authority combinations in the database before:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>And then after adding a new item and selecting an existing &ldquo;Orth, Alan&rdquo; with an ORCID in the author lookup:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>It created a new authority&hellip; let&rsquo;s try to add another item and select the same existing author and see what happens in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>No new one&hellip; so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
</code></pre><ul>
<li>Shit, it created another authority! Let&rsquo;s try it again!</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
@ -427,7 +427,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
<ul>
<li>Apply CCAFS project tag corrections on CGSpace:</li>
</ul>
<pre><code>dspace=# \i /tmp/ccafs-projects.sql
<pre tabindex="0"><code>dspace=# \i /tmp/ccafs-projects.sql
DELETE 5
UPDATE 4
UPDATE 1
@ -439,7 +439,7 @@ DELETE 207
<li>We still need to do the changes to <code>config.dct</code> and regenerate the <code>sitebndl.zip</code> to send to the Handle.net admins</li>
<li>According to this <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">dspace-tech mailing list entry from 2011</a>, we need to add the extra handle prefixes to <code>config.dct</code> like this:</li>
</ul>
<pre><code>&quot;server_admins&quot; = (
<pre tabindex="0"><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
)
@ -458,7 +458,7 @@ DELETE 207
<li>The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection was using 10568</li>
<li>I ended up having to read the <a href="https://wiki.lyrasis.org/display/DSDOC5x/AIP+Backup+and+Restore#AIPBackupandRestore-ForceReplaceMode">AIP Backup and Restore</a> closely a few times and then explicitly preserve handles and ignore parents:</li>
</ul>
<pre><code>$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
</code></pre><ul>
<li>Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!</li>
<li>I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing <code>Timeout waiting for idle object</code> errors</li>
@ -478,7 +478,7 @@ DELETE 207
<ul>
<li>Nightly Solr indexing is working again, and it appears to be pretty quick actually:</li>
</ul>
<pre><code>2017-09-19 00:00:14,953 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
<pre tabindex="0"><code>2017-09-19 00:00:14,953 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
...
2017-09-19 00:04:18,017 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (65807 of 65808): 83753
</code></pre><ul>
@ -494,7 +494,7 @@ DELETE 207
<li>Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite</li>
<li>Force thumbnail regeneration for the CGIAR System Organization&rsquo;s Historic Archive community (2000 items):</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &quot;ImageMagick PDF Thumbnail&quot;
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &quot;ImageMagick PDF Thumbnail&quot;
</code></pre><ul>
<li>I&rsquo;m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org</li>
</ul>
@ -540,7 +540,7 @@ DELETE 207
<li>Turns out he had already mapped some, but requested that I finish the rest</li>
<li>With this GREL in OpenRefine I can find items that are mapped, ie they have <code>10568/3||</code> or <code>10568/3$</code> in their <code>collection</code> field:</li>
</ul>
<pre><code>isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
<pre tabindex="0"><code>isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
</code></pre><ul>
<li>Peter also made a lot of changes to the data in the Archives collections while I was attempting to import the changes, so we were essentially competing for PostgreSQL and Solr connections</li>
<li>I ended up having to kill the import and wait until he was done</li>
@ -552,7 +552,7 @@ DELETE 207
<li>Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org</li>
<li>Peter wants me to clean up the text values for Delia Grace&rsquo;s metadata, as the authorities are all messed up again since we cleaned them up in <a href="/cgspace-notes/2016-12">2016-12</a>:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
text_value | authority | confidence
--------------+--------------------------------------+------------
Grace, Delia | | 600
@ -563,12 +563,12 @@ DELETE 207
<li>Strangely, none of her authority entries have ORCIDs anymore&hellip;</li>
<li>I&rsquo;ll just fix the text values and forget about it for now:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 610
</code></pre><ul>
<li>After this we have to reindex the Discovery and Authority cores (as <code>tomcat7</code> user):</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 83m56.895s
@ -603,7 +603,7 @@ sys 0m12.113s
<li>The <code>index-authority</code> script always seems to fail, I think it&rsquo;s the same old bug</li>
<li>Something interesting for my notes about JNDI database pool—since I couldn&rsquo;t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:</li>
</ul>
<pre><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
...
INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
@ -627,13 +627,13 @@ INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
<li>Now the redirects work</li>
<li>I quickly registered a Let&rsquo;s Encrypt certificate for the domain:</li>
</ul>
<pre><code># systemctl stop nginx
<pre tabindex="0"><code># systemctl stop nginx
# /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
# systemctl start nginx
</code></pre><ul>
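<li>Since the certificate was issued with the standalone method, renewals will also need nginx stopped; something like this might work for the renewal cron job, though it is only a sketch and I have not tested it:</li>
</ul>
<pre tabindex="0"><code># /opt/certbot-auto renew --pre-hook 'systemctl stop nginx' --post-hook 'systemctl start nginx'
</code></pre><ul>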
<li>I modified the nginx configuration of the ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working:</li>
</ul>
<pre><code>$ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org -tls1_2 -tlsextdebug -status
<pre tabindex="0"><code>$ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP Response Data:
...


@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -124,7 +124,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
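<li>A quick way to see how widespread this is (just a sketch, reusing the <code>dc.identifier.uri</code> field id from earlier queries) would be to look for items with more than one Handle URI:</li>
</ul>
<pre tabindex="0"><code>dspace=# select resource_id, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=25 and text_value like 'http://hdl.handle.net/%' group by resource_id having count(*) &gt; 1;
</code></pre><ul>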
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
@ -134,13 +134,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Peter Ballantyne said he was having problems logging into CGSpace with &ldquo;both&rdquo; of his accounts (CGIAR LDAP and personal, apparently)</li>
<li>I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a &ldquo;no DN found&rdquo; error:</li>
</ul>
<pre><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
<pre tabindex="0"><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
</code></pre><ul>
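<li>If it happens again I could try to reproduce the lookup outside of DSpace with <code>ldapsearch</code>; a rough sketch, where the bind DN, search base, and attribute are all guesses:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269 -D 'binduser@cgiarad.org' -W -b 'dc=cgiarad,dc=org' '(sAMAccountName=pballantyne)' # bind DN, base, and attribute are guesses
</code></pre><ul>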
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre><code>$ grep -c &quot;ldap_authentication:type=failed_auth&quot; dspace.log.2017-10-01
<pre tabindex="0"><code>$ grep -c &quot;ldap_authentication:type=failed_auth&quot; dspace.log.2017-10-01
14
</code></pre><ul>
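<li>To confirm it was isolated to that day I can run the same grep across the recent logs (a sketch):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'ldap_authentication:type=failed_auth' dspace.log.2017-09-* dspace.log.2017-10-*
</code></pre><ul>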
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
@ -152,7 +152,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace</li>
<li>The first is a link to a browse page that should be handled better in nginx:</li>
</ul>
<pre><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
<pre tabindex="0"><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
</code></pre><ul>
<li>We&rsquo;ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn&rsquo;t exist in Discovery yet, but we&rsquo;ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
<li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
@ -225,7 +225,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
<li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
</ul>
<pre><code>10568/1637 10568/174 10568/27629
<pre tabindex="0"><code>10568/1637 10568/174 10568/27629
10568/1642 10568/174 10568/27629
10568/1614 10568/174 10568/27629
10568/75561 10568/150 10568/27629
@ -270,12 +270,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly the past week</li>
<li>I don&rsquo;t see any telltale signs in the REST or OAI logs, so I&rsquo;m trying to do some rudimentary analysis in the DSpace logs:</li>
</ul>
<pre><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &quot;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&quot; 200 7776 &quot;-&quot; &quot;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&quot;
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &quot;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&quot; 200 7776 &quot;-&quot; &quot;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&quot;
</code></pre><ul>
<li>CORE seems to be some bot that is &ldquo;Aggregating the world&rsquo;s open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect, correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
@ -323,39 +323,39 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Like clock work, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)</li>
<li>Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:</li>
</ul>
<pre><code>dspace=# SELECT * FROM pg_stat_activity;
<pre tabindex="0"><code>dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre><code># grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log
26475
# grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Session Crawler Valve but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
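<li>For reference, adding their agent to the Crawler Session Manager Valve in Tomcat&rsquo;s <code>server.xml</code> looks something like this (a sketch; the regex we actually use is longer):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*CORE.*&quot; /&gt;
</code></pre><ul>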
<li>&hellip; and most of their requests are for dynamic discover pages:</li>
</ul>
<pre><code># grep -c 137.108.70 /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &quot;GET /discover&quot;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
@ -371,7 +371,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2811
@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
</code></pre><ul>
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
@ -408,7 +408,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of package to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
</ul>
<pre><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
<pre tabindex="0"><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
</code></pre><ul>
<li>According to Uptime Robot CGSpace went down and up a few times</li>
<li>I had a look at goaccess and I saw that CORE was actively indexing</li>
@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>: we explicitly disallow <code>/discover</code> and <code>/search-filter</code> URLs (sketched below), but they are hitting those massively:</li>
</ul>
<pre><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | grep -o -E &quot;GET /(discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | grep -o -E &quot;GET /(discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
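<li>The relevant lines of our <code>robots.txt</code> look roughly like this (a sketch; the real file has more rules):</li>
</ul>
<pre tabindex="0"><code>User-agent: *
Disallow: /discover
Disallow: /search-filter
</code></pre>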

View File

@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -142,12 +142,12 @@ COPY 54701
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre><ul>
<li>Abenet asked if it would be possible to generate a report of items in Listing and Reports that had &ldquo;International Fund for Agricultural Development&rdquo; as the <em>only</em> investor</li>
@ -155,7 +155,7 @@ COPY 54701
<li>Work on making the thumbnails in the item view clickable</li>
<li>Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link</li>
</ul>
<pre><code>//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
<pre tabindex="0"><code>//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
</code></pre><ul>
<li>METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml</li>
<li>I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!</li>
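<li>A quick way to test that XPath against an item&rsquo;s METS from the command line (a sketch using <code>curl</code> and <code>xmllint</code>, with namespace handling simplified via <code>local-name()</code>):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'https://cgspace.cgiar.org/metadata/handle/10568/95947/mets.xml' \
  | xmllint --xpath &quot;//*[local-name()='fileGrp'][@USE='CONTENT']//*[local-name()='FLocat']/@*[local-name()='href']&quot; -
</code></pre>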
@ -177,7 +177,7 @@ COPY 54701
<li>It&rsquo;s the first time in a few days that this has happened</li>
<li>I had a look to see what was going on, but it isn&rsquo;t the CORE bot:</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
@ -191,7 +191,7 @@ COPY 54701
</code></pre><ul>
<li>138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:</li>
</ul>
<pre><code># zgrep -c 138.201.52.218 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 138.201.52.218 /var/log/nginx/access.log*
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
@ -202,7 +202,7 @@ COPY 54701
</code></pre><ul>
<li>It&rsquo;s clearly a bot as it&rsquo;s making tens of thousands of requests, but it&rsquo;s using a &ldquo;normal&rdquo; user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre><ul>
<li>For now I don&rsquo;t know what this user is!</li>
</ul>
@ -216,7 +216,7 @@ COPY 54701
<ul>
<li>But in the database the authors are correct (none with weird <code>, /</code> characters):</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
text_value | authority | confidence
--------------------------------------------+--------------------------------------+------------
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
@ -240,7 +240,7 @@ COPY 54701
<li>Tsega had to restart Tomcat 7 to fix it temporarily</li>
<li>I will start by looking at bot usage (access.log.1 includes usage until 6AM today):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
@ -254,7 +254,7 @@ COPY 54701
</code></pre><ul>
<li>104.196.152.243 seems to be a top scraper for a few weeks now:</li>
</ul>
<pre><code># zgrep -c 104.196.152.243 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 104.196.152.243 /var/log/nginx/access.log*
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
@ -268,7 +268,7 @@ COPY 54701
</code></pre><ul>
<li>This user is responsible for hundreds and sometimes thousands of Tomcat sessions:</li>
</ul>
<pre><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
6199
@ -278,7 +278,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>The worst thing is that this user never specifies a user agent string so we can&rsquo;t lump it in with the other bots using the Tomcat Session Crawler Manager Valve</li>
<li>They don&rsquo;t request dynamic URLs like &ldquo;/discover&rdquo; but they seem to be fetching handles from XMLUI instead of REST (and some with <code>//handle</code>, note the regex below):</li>
</ul>
<pre><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
4618
@ -286,19 +286,19 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I just realized that <code>ciat.cgiar.org</code> points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior</li>
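<li>One way to double check that mapping (a sketch; the answer shown is the address observed above):</li>
</ul>
<pre tabindex="0"><code>$ dig +short ciat.cgiar.org
104.196.152.243
</code></pre><ul>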
<li>The next IP (207.46.13.36) seems to be Microsoft&rsquo;s bingbot, but all its requests specify the &ldquo;bingbot&rdquo; user agent and there are no requests for dynamic URLs that are forbidden, like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
<pre tabindex="0"><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
0
</code></pre><ul>
<li>The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:</li>
</ul>
<pre><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
<pre tabindex="0"><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
0
</code></pre><ul>
<li>The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
5997
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c &quot;bingbot&quot;
5988
@ -307,7 +307,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
3048
@ -316,14 +316,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
0
</code></pre><ul>
<li>The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:</li>
</ul>
<pre><code># grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
2950
# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c &quot;GET /discover&quot;
330
@ -338,7 +338,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I&rsquo;ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs</li>
<li>While it&rsquo;s not in the top ten, Baidu is one bot that seems to not give a fuck:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep -c Baiduspider
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;7/Nov/2017&quot; | grep Baiduspider | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
2521
@ -349,7 +349,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>I should look in nginx access.log, rest.log, oai.log, and DSpace&rsquo;s dspace.log.2017-11-07</li>
<li>Here are the top IPs making requests to XMLUI from 2 to 8 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
@ -364,7 +364,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot</li>
<li>Here are the top IPs making requests to REST from 2 to 8 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
8 207.241.229.237
10 66.249.66.90
16 104.196.152.243
@ -377,14 +377,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The OAI requests during that same time period are nothing to worry about:</li>
</ul>
<pre><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
1 66.249.66.92
4 66.249.66.90
6 68.180.229.254
</code></pre><ul>
<li>The top IPs from dspace.log during the 2 to 8 AM period:</li>
</ul>
<pre><code>$ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code>$ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
143 ip_addr=213.55.99.121
181 ip_addr=66.249.66.91
223 ip_addr=157.55.39.161
@ -400,7 +400,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>The number of requests isn&rsquo;t even that high to be honest</li>
<li>As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:</li>
</ul>
<pre><code># zgrep -c 124.17.34.59 /var/log/nginx/access.log*
<pre tabindex="0"><code># zgrep -c 124.17.34.59 /var/log/nginx/access.log*
/var/log/nginx/access.log:22581
/var/log/nginx/access.log.1:0
/var/log/nginx/access.log.2.gz:14
@ -414,7 +414,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>The whois data shows the IP is from China, but the user agent doesn&rsquo;t really give any clues:</li>
</ul>
<pre><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F'&quot; ' '{print $3}' | sort | uniq -c | sort -h
<pre tabindex="0"><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F'&quot; ' '{print $3}' | sort | uniq -c | sort -h
210 &quot;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&quot;
22610 &quot;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&quot;
</code></pre><ul>
@ -424,7 +424,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (roughly 12:00 to 14:00)</li>
<li>At least for now it seems to be that new Chinese IP (124.17.34.59):</li>
</ul>
<pre><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
198 207.46.13.103
203 207.46.13.80
205 207.46.13.36
@ -438,7 +438,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>Seems 124.17.34.59 is really downloading all our PDFs, compared to the next most active IPs during this time!</li>
</ul>
<pre><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
<pre tabindex="0"><code># grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
5948
# grep -E &quot;07/Nov/2017:1[234]:&quot; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
0
@ -446,7 +446,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
<li>About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day</li>
<li>All CIAT requests vs unique ones:</li>
</ul>
<pre><code>$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
<pre tabindex="0"><code>$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
3506
$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
3506
@ -459,18 +459,18 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<ul>
<li>But they literally just made this request today:</li>
</ul>
<pre><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &quot;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&quot; 200 82265 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot;
<pre tabindex="0"><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &quot;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&quot; 200 82265 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot;
</code></pre><ul>
<li>Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:</li>
</ul>
<pre><code># grep -c Baiduspider /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
1085
</code></pre><ul>
<li>I will think about blocking their IPs but they have 164 of them!</li>
</ul>
<pre><code># grep &quot;Baiduspider/2.0&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &quot;Baiduspider/2.0&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
164
</code></pre><h2 id="2017-11-08">2017-11-08</h2>
<ul>
@ -478,12 +478,12 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;0[78]/Nov/2017:&quot; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre><ul>
<li>This is about 20,000 Tomcat sessions:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
20733
</code></pre><ul>
<li>I&rsquo;m getting really sick of this</li>
@ -496,7 +496,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre><code>map $remote_addr $ua {
<pre tabindex="0"><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
default $http_user_agent;
@ -505,7 +505,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
<li>If the client&rsquo;s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server&rsquo;s <code>/</code> block we pass this header to Tomcat:</li>
</ul>
<pre><code>proxy_pass http://tomcat_http;
<pre tabindex="0"><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre><ul>
<li>Note to self: the <code>$ua</code> variable won&rsquo;t show up in nginx access logs because the default <code>combined</code> log format doesn&rsquo;t show it, so don&rsquo;t run around pulling your hair out wondering why the modified user agents aren&rsquo;t showing in the logs!</li>
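<li>If I ever need to see the mapped value, a custom log format could expose it (a sketch, not something we deploy right now):</li>
</ul>
<pre tabindex="0"><code># log the mapped user agent so we can verify what actually gets sent to Tomcat
log_format ua_debug '$remote_addr - [$time_local] &quot;$request&quot; $status &quot;$ua&quot;';
access_log /var/log/nginx/ua-debug.log ua_debug;
</code></pre>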
@ -516,14 +516,14 @@ proxy_set_header User-Agent $ua;
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in <code>robots.txt</code>:</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &quot;GET /(browse|discover|search-filter)&quot;
0
</code></pre><ul>
<li>It seems that they rarely even bother checking <code>robots.txt</code>, but Google does multiple times per day!</li>
</ul>
<pre><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
@ -538,14 +538,14 @@ proxy_set_header User-Agent $ua;
<ul>
<li>Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
8956
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
223
</code></pre><ul>
<li>Versus the same stats for yesterday and the day before:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243
10216
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2592
@ -569,7 +569,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
<li>Update the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure templates</a> to be a little more modular and flexible</li>
<li>Looking at the top client IPs on CGSpace so far this morning, even though it&rsquo;s only been eight hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;12/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;12/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
243 5.83.120.111
335 40.77.167.103
424 66.249.66.91
@ -583,12 +583,12 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
</code></pre><ul>
<li>5.9.6.51 seems to be a Russian bot:</li>
</ul>
<pre><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
<pre tabindex="0"><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &quot;GET /handle/10568/16515/recent-submissions HTTP/1.1&quot; 200 5097 &quot;-&quot; &quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;
</code></pre><ul>
<li>What&rsquo;s amazing is that it seems to reuse its Java session across all requests:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
1558
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
@ -596,14 +596,14 @@ $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | s
<li>Bravo to MegaIndex.ru!</li>
<li>The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat&rsquo;s Crawler Session Manager valve regex should match &lsquo;YandexBot&rsquo;:</li>
</ul>
<pre><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
<pre tabindex="0"><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &quot;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&quot; 200 972019 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
991
</code></pre><ul>
<li>Move some items and collections on CGSpace for Peter Ballantyne, running <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move_collections.sh</code></a> with the following configuration:</li>
</ul>
<pre><code>10947/6 10947/1 10568/83389
<pre tabindex="0"><code>10947/6 10947/1 10568/83389
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
</code></pre><ul>
@ -612,7 +612,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
<li>The solution <a href="https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156">I came up with</a> uses tricks from both of those</li>
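<li>The general shape of the rate limit is something like this (illustrative zone names; the real config is in that Ansible commit):</li>
</ul>
<pre tabindex="0"><code># only requests whose user agent matches Baiduspider get a non-empty key,
# so only they are subject to the limit
map $http_user_agent $baidu_limit_key {
    default       &quot;&quot;;
    ~*Baiduspider $binary_remote_addr;
}
limit_req_zone $baidu_limit_key zone=baiduspider:10m rate=1r/s;

# then inside the XMLUI location block:
limit_req zone=baiduspider burst=5;
</code></pre><ul>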
<li>I deployed the limit on CGSpace and DSpace Test and it seems to work well:</li>
</ul>
<pre><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
<pre tabindex="0"><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@ -642,7 +642,7 @@ Server: nginx
<ul>
<li>At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:</li>
</ul>
<pre><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 200 &quot;
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 200 &quot;
1132
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;13/Nov/2017&quot; | grep &quot;Baiduspider&quot; | grep -c &quot; 503 &quot;
10105
@ -675,7 +675,7 @@ Server: nginx
<li>Started testing DSpace 6.2 and a few things have changed</li>
<li>Now PostgreSQL needs <code>pgcrypto</code>:</li>
</ul>
<pre><code>$ psql dspace6
<pre tabindex="0"><code>$ psql dspace6
dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>Also, local settings are no longer in <code>build.properties</code>, they are now in <code>local.cfg</code></li>
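<li>A minimal <code>local.cfg</code> looks something like this (a sketch based on <code>local.cfg.EXAMPLE</code>; the values are placeholders for my local test environment):</li>
</ul>
<pre tabindex="0"><code># values below are placeholders
dspace.dir = /home/aorth/dspace6
dspace.hostname = localhost
db.url = jdbc:postgresql://localhost:5432/dspace6
db.username = dspace6
db.password = dspace6
</code></pre>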
@ -695,7 +695,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat</li>
<li>Looking at the REST and XMLUI log files, I don&rsquo;t see anything too crazy:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &quot;17/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
13 66.249.66.223
14 207.46.13.36
17 207.46.13.137
@ -721,7 +721,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>I need to look into using JMX to analyze active sessions I think, rather than looking at log files</li>
<li>After adding appropriate <a href="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">JMX listener options to Tomcat&rsquo;s JAVA_OPTS</a> and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:</li>
</ul>
<pre><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
<pre tabindex="0"><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
</code></pre><ul>
<li>Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc</li>
<li>I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example</li>
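<li>The JMX listener options in question are roughly these, for example in Tomcat&rsquo;s <code>bin/setenv.sh</code> (a sketch; port 9000 matches the jconsole command above, and the SSL/authentication settings need more thought before production):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&quot;$JAVA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9000 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false&quot;
</code></pre>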
@ -737,7 +737,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
<li>Linode sent an alert that CGSpace was using a lot of CPU around 4 to 6 AM</li>
<li>Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;19/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;19/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
111 66.249.66.155
171 5.9.6.51
188 54.162.241.40
@ -751,12 +751,12 @@ dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>66.249.66.153 appears to be Googlebot:</li>
</ul>
<pre><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &quot;GET /handle/10568/2203 HTTP/1.1&quot; 200 6309 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
<pre tabindex="0"><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &quot;GET /handle/10568/2203 HTTP/1.1&quot; 200 6309 &quot;-&quot; &quot;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot;
</code></pre><ul>
<li>We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity</li>
<li>In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)</li>
</ul>
<pre><code>$ wc -l dspace.log.2017-11-19
<pre tabindex="0"><code>$ wc -l dspace.log.2017-11-19
388472 dspace.log.2017-11-19
$ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
267494
@ -764,7 +764,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>WTF is this process doing every day, and for so many hours?</li>
<li>In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:</li>
</ul>
<pre><code>2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
<pre tabindex="0"><code>2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
</code></pre><ul>
<li>It&rsquo;s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:</li>
@ -780,13 +780,13 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<ul>
<li>Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:</li>
</ul>
<pre><code>2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
<pre tabindex="0"><code>2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
</code></pre><h2 id="2017-11-22">2017-11-22</h2>
<ul>
<li>Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM</li>
<li>The logs don&rsquo;t show anything particularly abnormal between those hours:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;22/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;22/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
136 31.6.77.23
174 68.180.229.254
217 66.249.66.91
@ -807,7 +807,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM</li>
<li>I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
88 66.249.66.91
140 68.180.229.254
155 54.196.2.131
@ -821,7 +821,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
</code></pre><ul>
<li>&hellip; and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;23/Nov/2017:0[456]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
5 190.120.6.219
6 104.198.9.108
14 104.196.152.243
@ -836,7 +836,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>These IPs crawling the REST API don&rsquo;t specify user agents and I&rsquo;d assume they are creating many Tomcat sessions</li>
<li>I would catch them in nginx to assign a &ldquo;bot&rdquo; user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don&rsquo;t seem to create many sessions really, at least not in the dspace.log:</li>
</ul>
<pre><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>I&rsquo;m wondering if REST works differently, or just doesn&rsquo;t log these sessions?</li>
@ -861,7 +861,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)</li>
<li>I also noticed that CGNET appears to be monitoring the old domain every few minutes:</li>
</ul>
<pre><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &quot;HEAD / HTTP/1.1&quot; 301 0 &quot;-&quot; &quot;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&quot;
<pre tabindex="0"><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &quot;HEAD / HTTP/1.1&quot; 301 0 &quot;-&quot; &quot;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&quot;
</code></pre><ul>
<li>I should probably tell CGIAR people to have CGNET stop that</li>
</ul>
@ -870,7 +870,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM</li>
<li>Yet another mystery because the load for all domains looks fine at that time:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;26/Nov/2017:0[567]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;26/Nov/2017:0[567]&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
190 66.249.66.83
195 104.196.152.243
220 40.77.167.82
@ -887,7 +887,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>About an hour later Uptime Robot said that the server was down</li>
<li>Here are all the top XMLUI and REST users from today:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;29/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;29/Nov/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
540 66.249.66.83
659 40.77.167.36
663 157.55.39.214
@ -905,12 +905,12 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
<li>I don&rsquo;t see much activity in the logs but there are 87 PostgreSQL connections</li>
<li>But shit, there were 10,000 unique Tomcat sessions today:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
10037
</code></pre><ul>
<li>Although maybe that&rsquo;s not much, as the previous two days had more:</li>
</ul>
<pre><code>$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
16984

View File

@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -123,7 +123,7 @@ The list of connections to XMLUI and REST API for today:
<li>PostgreSQL activity says there are 115 connections currently</li>
<li>The list of connections to XMLUI and REST API for today:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
763 2.86.122.76
907 207.46.13.94
1018 157.55.39.206
@ -137,12 +137,12 @@ The list of connections to XMLUI and REST API for today:
</code></pre><ul>
<li>The number of DSpace sessions isn&rsquo;t even that high:</li>
</ul>
<pre><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
5815
</code></pre><ul>
<li>Connections in the last two hours:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017:(09|10)&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017:(09|10)&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
78 93.160.60.22
101 40.77.167.122
113 66.249.66.70
@ -157,18 +157,18 @@ The list of connections to XMLUI and REST API for today:
<li>What the fuck is going on?</li>
<li>I&rsquo;ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:</li>
</ul>
<pre><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
822
</code></pre><ul>
<li>Appears to be some new bot:</li>
</ul>
<pre><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &quot;GET /handle/10568/78444?show=full HTTP/1.1&quot; 200 29307 &quot;-&quot; &quot;Mozilla/3.0 (compatible; Indy Library)&quot;
<pre tabindex="0"><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &quot;GET /handle/10568/78444?show=full HTTP/1.1&quot; 200 29307 &quot;-&quot; &quot;Mozilla/3.0 (compatible; Indy Library)&quot;
</code></pre><ul>
<li>I restarted Tomcat and everything came back up</li>
<li>I can add Indy Library to the Tomcat crawler session manager valve, but it would be nice if I could simply remap the user agent in nginx</li>
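<li>Something like this user agent map might do it (an untested sketch, along the lines of the IP mapping from last month):</li>
</ul>
<pre tabindex="0"><code>map $http_user_agent $ua {
    # hypothetical: tag the Indy Library client explicitly as a bot
    ~*indy  'Indy Library Bot';
    default $http_user_agent;
}
</code></pre><ul>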
<li>I will also add &lsquo;Drupal&rsquo; to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;1/Dec/2017&quot; | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
3 54.75.205.145
6 70.32.83.92
14 2a01:7e00::f03c:91ff:fe18:7396
@ -206,7 +206,7 @@ The list of connections to XMLUI and REST API for today:
<li>I don&rsquo;t see any errors in the DSpace logs, but in nginx&rsquo;s access.log I see that UptimeRobot&rsquo;s requests were returned HTTP 499 (Client Closed Request)</li>
<li>Looking at the REST API logs I see some new client IP I haven&rsquo;t noticed before:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;6/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;6/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
18 95.108.181.88
19 68.180.229.254
30 207.46.13.151
@ -228,7 +228,7 @@ The list of connections to XMLUI and REST API for today:
<li>I looked just now and see that there are 121 PostgreSQL connections!</li>
<li>The top users right now are:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;7/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;7/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
838 40.77.167.11
939 66.249.66.223
1149 66.249.66.206
@ -243,24 +243,24 @@ The list of connections to XMLUI and REST API for today:
<li>We&rsquo;ve never seen 124.17.34.60 before, but it&rsquo;s really hammering us!</li>
<li>Apparently it is from China, and here is one of its user agents:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
</code></pre><ul>
<li>It is responsible for 4,500 Tomcat sessions today alone:</li>
</ul>
<pre><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
4574
</code></pre><ul>
<li>I&rsquo;ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it&rsquo;s the same bot on the same subnet</li>
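<li>The adjusted mapping looks roughly like this (a sketch of the relevant lines):</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
    # 2017-11/12: the same Chinese bot grabbing tens of thousands of PDFs
    ~^124\.17\.34\.(59|60) 'ChineseBot';
    default $http_user_agent;
}
</code></pre><ul>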
<li>I was running the DSpace cleanup task manually and it hit an error:</li>
</ul>
<pre><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(144666) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is like I discovered in <a href="/cgspace-notes/2017-04">2017-04</a>, to set the <code>primary_bitstream_id</code> to null:</li>
</ul>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
UPDATE 1
</code></pre><h2 id="2017-12-13">2017-12-13</h2>
<ul>
@ -294,11 +294,11 @@ UPDATE 1
</li>
<li>I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the <code>collection</code> field)</li>
</ul>
<pre><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
<pre tabindex="0"><code>$ JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
</code></pre><ul>
<li>It&rsquo;s the same on DSpace Test, I can&rsquo;t import the SAF bundle without specifying the collection:</li>
</ul>
<pre><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
<pre tabindex="0"><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming 'collections' file inside item directory
Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
Generating mapfile: /tmp/ccafs.map
@ -321,14 +321,14 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>I even tried to debug it by adding verbose logging to the <code>JAVA_OPTS</code>:</li>
</ul>
<pre><code>-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
<pre tabindex="0"><code>-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
</code></pre><ul>
<li>&hellip; but the error message was the same, just with more INFO noise around it</li>
<li>For now I&rsquo;ll import into a collection in DSpace Test but I&rsquo;m really not sure what&rsquo;s up with this!</li>
<li>Linode alerted that CGSpace was using high CPU from 4 to 6 PM</li>
<li>The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
671 66.249.66.70
885 95.108.181.88
904 157.55.39.96
@ -342,7 +342,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;17/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
33 68.180.229.254
48 157.55.39.96
51 157.55.39.179
@ -371,7 +371,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted this morning that there was high outbound traffic from 6 to 8 AM</li>
<li>The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
190 207.46.13.146
191 197.210.168.174
202 86.101.203.216
@ -385,7 +385,7 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
7 104.198.9.108
8 185.29.8.111
8 40.77.167.176
@ -402,7 +402,7 @@ Elapsed time: 2 secs (2559 msecs)
<li>Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM</li>
<li>The REST and OAI API logs look pretty much the same as earlier this morning, but there&rsquo;s a new IP harvesting XMLUI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;18/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
360 95.108.181.88
477 66.249.66.90
526 86.101.203.216
@ -416,17 +416,17 @@ Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>2.86.72.181 appears to be from Greece, and has the following user agent:</li>
</ul>
<pre><code>Mozilla/3.0 (compatible; Indy Library)
<pre tabindex="0"><code>Mozilla/3.0 (compatible; Indy Library)
</code></pre><ul>
<li>Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:</li>
</ul>
<pre><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>I guess there&rsquo;s nothing I can do to them for now</li>
<li>In other news, I am curious how many PostgreSQL connection pool errors we&rsquo;ve had in the last month:</li>
</ul>
<pre><code>$ grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-1* | grep -v :0
<pre tabindex="0"><code>$ grep -c &quot;Cannot get a connection, pool error Timeout waiting for idle object&quot; dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
dspace.log.2017-11-08:135
dspace.log.2017-11-17:1298
@ -456,7 +456,7 @@ dspace.log.2017-12-07:2769
<li>So I restarted Tomcat 7 and restarted the imports</li>
<li>I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:</li>
</ul>
<pre><code>$ dspace index-discovery -r 10568/42211
<pre tabindex="0"><code>$ dspace index-discovery -r 10568/42211
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
</code></pre><ul>
<li>The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!</li>
@ -476,7 +476,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>I re-deployed the <code>5_x-prod</code> branch on CGSpace, applied all system updates, and restarted the server</li>
<li>Looking through the dspace.log I see this error:</li>
</ul>
<pre><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I don&rsquo;t have time now to look into this but the Solr sharding has long been an issue!</li>
<li>Looking into using JDBC / JNDI to provide a database pool to DSpace</li>
@ -484,7 +484,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>First, I uncomment <code>db.jndi</code> in <em>dspace/config/dspace.cfg</em></li>
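<li>For reference, the uncommented setting is just the JNDI name that the global <code>Resource</code> below gets bound to:</li>
</ul>
<pre tabindex="0"><code>db.jndi = jdbc/dspace
</code></pre><ul>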
<li>Then I create a global <code>Resource</code> in the main Tomcat <em>server.xml</em> (inside <code>GlobalNamingResources</code>):</li>
</ul>
<pre><code>&lt;Resource name=&quot;jdbc/dspace&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
<pre tabindex="0"><code>&lt;Resource name=&quot;jdbc/dspace&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
driverClassName=&quot;org.postgresql.Driver&quot;
url=&quot;jdbc:postgresql://localhost:5432/dspace&quot;
username=&quot;dspace&quot;
@ -500,12 +500,12 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>Most of the parameters are from comments by Mark Wood about his JNDI setup: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>Then I add a <code>ResourceLink</code> to each web application context:</li>
</ul>
<pre><code>&lt;ResourceLink global=&quot;jdbc/dspace&quot; name=&quot;jdbc/dspace&quot; type=&quot;javax.sql.DataSource&quot;/&gt;
<pre tabindex="0"><code>&lt;ResourceLink global=&quot;jdbc/dspace&quot; name=&quot;jdbc/dspace&quot; type=&quot;javax.sql.DataSource&quot;/&gt;
</code></pre><ul>
<li>I am not sure why several guides show configuration snippets for <em>server.xml</em> and web application contexts that use a Local and Global jdbc&hellip;</li>
<li>When DSpace can&rsquo;t find the JNDI context (for whatever reason) you will see this in the dspace logs:</li>
</ul>
<pre><code>2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
<pre tabindex="0"><code>2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
@ -535,11 +535,11 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>And indeed the Catalina logs show that it failed to set up the JDBC driver:</li>
</ul>
<pre><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
<pre tabindex="0"><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
</code></pre><ul>
<li>There are several copies of the PostgreSQL driver installed by DSpace:</li>
</ul>
<pre><code>$ find ~/dspace/ -iname &quot;postgresql*jdbc*.jar&quot;
<pre tabindex="0"><code>$ find ~/dspace/ -iname &quot;postgresql*jdbc*.jar&quot;
/Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
@ -548,7 +548,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>These apparently come from the main DSpace <code>pom.xml</code>:</li>
</ul>
<pre><code>&lt;dependency&gt;
<pre tabindex="0"><code>&lt;dependency&gt;
&lt;groupId&gt;postgresql&lt;/groupId&gt;
&lt;artifactId&gt;postgresql&lt;/artifactId&gt;
&lt;version&gt;9.1-901-1.jdbc4&lt;/version&gt;
@ -556,12 +556,12 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</code></pre><ul>
<li>So WTF? Let&rsquo;s try copying one to Tomcat&rsquo;s lib folder and restarting Tomcat:</li>
</ul>
<pre><code>$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
<pre tabindex="0"><code>$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
</code></pre><ul>
<li>Oh that&rsquo;s fantastic, now at least Tomcat doesn&rsquo;t print an error during startup, so I guess it succeeds in creating the JNDI pool</li>
<li>DSpace starts up but I have no idea if it&rsquo;s using the JNDI configuration because I see this in the logs:</li>
</ul>
<pre><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
<pre tabindex="0"><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
2017-12-19 13:26:54,293 INFO org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
@ -580,7 +580,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
</li>
<li>After adding the <code>Resource</code> to <em>server.xml</em> on Ubuntu I get this in Catalina&rsquo;s logs:</li>
</ul>
<pre><code>SEVERE: Unable to create initial connections of pool.
<pre tabindex="0"><code>SEVERE: Unable to create initial connections of pool.
java.sql.SQLException: org.postgresql.Driver
...
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
@ -589,17 +589,17 @@ Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
<li>I tried installing Ubuntu&rsquo;s <code>libpostgresql-jdbc-java</code> package but Tomcat still can&rsquo;t find the class</li>
<li>Let me try to symlink the lib into Tomcat&rsquo;s libs:</li>
</ul>
<pre><code># ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
<pre tabindex="0"><code># ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
</code></pre><ul>
<li>Now Tomcat starts but the localhost container has errors:</li>
</ul>
<pre><code>SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
<pre tabindex="0"><code>SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
</code></pre><ul>
<li>Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace&rsquo;s are 9.1&hellip;</li>
<li>Let me try to remove it and copy in DSpace&rsquo;s:</li>
</ul>
<pre><code># rm /usr/share/tomcat7/lib/postgresql.jar
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql.jar
# cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
</code></pre><ul>
<li>Wow, I think that actually works&hellip;</li>
@ -608,12 +608,12 @@ java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClos
<li>Also, since I commented out all the db parameters in DSpace.cfg, how does the command line <code>dspace</code> tool work?</li>
<li>Let&rsquo;s try the upstream JDBC driver first:</li>
</ul>
<pre><code># rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
# wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
</code></pre><ul>
<li>DSpace command line fails unless db settings are present in dspace.cfg:</li>
</ul>
<pre><code>$ dspace database info
<pre tabindex="0"><code>$ dspace database info
Caught exception:
java.sql.SQLException: java.lang.ClassNotFoundException:
at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
@ -633,7 +633,7 @@ Caused by: java.lang.ClassNotFoundException:
</code></pre><ul>
<li>And in the logs:</li>
</ul>
<pre><code>2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
<pre tabindex="0"><code>2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
@ -669,7 +669,7 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
<li>There are short bursts of connections up to 10, but it generally stays around 5</li>
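<li>A simple count from <code>pg_stat_activity</code> is enough to watch this (a quick sketch of the kind of query I mean):</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from pg_stat_activity;
</code></pre><ul>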
<li>Test and import 13 records to CGSpace for Abenet:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
</code></pre><ul>
<li>The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.</li>
@ -677,7 +677,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchi
<li>There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7</li>
<li>After that CGSpace came up fine and I was able to import the 13 items just fine:</li>
</ul>
<pre><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
</code></pre><ul>
<li>The final code for the JNDI work in the Ansible infrastructure scripts is here: <a href="https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b">https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b</a></li>
@ -687,7 +687,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<li>Linode alerted that CGSpace was using high CPU this morning around 6 AM</li>
<li>I&rsquo;m playing with reading all of a month&rsquo;s nginx logs into goaccess:</li>
</ul>
<pre><code># find /var/log/nginx -type f -newermt &quot;2017-12-01&quot; | xargs zcat --force | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># find /var/log/nginx -type f -newermt &quot;2017-12-01&quot; | xargs zcat --force | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I can see interesting things using this approach, for example:
<ul>
@ -708,7 +708,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
<ul>
<li>Looking at some old notes for metadata to clean up, I found a few hundred corrections in <code>cg.fulltextstatus</code> and <code>dc.language.iso</code>:</li>
</ul>
<pre><code># update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
<pre tabindex="0"><code># update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
DELETE 17
@ -735,7 +735,7 @@ DELETE 20
<li>Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM</li>
<li>Here&rsquo;s the XMLUI logs:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;30/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;30/Dec/2017&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
637 207.46.13.106
641 157.55.39.186
715 68.180.229.254
@ -751,7 +751,7 @@ DELETE 20
<li>They identify as &ldquo;com.plumanalytics&rdquo;, which Google says is associated with Elsevier</li>
<li>They only seem to have used one Tomcat session so that&rsquo;s good, I guess I don&rsquo;t need to add them to the Tomcat Crawler Session Manager valve:</li>
</ul>
<pre><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>216.244.66.245 seems to be moz.com&rsquo;s DotBot</li>
@ -761,7 +761,7 @@ DELETE 20
<li>I finished working on the 42 records for CCAFS after Magdalena sent the remaining corrections</li>
<li>After that I uploaded them to CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &amp;&gt; ccafs.log
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &amp;&gt; ccafs.log
</code></pre>


@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -244,19 +244,19 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
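<li>Since this is the Tomcat JDBC pool from the JNDI setup, the limit to raise is the <code>maxActive</code> attribute on the <code>Resource</code> in Tomcat&rsquo;s <em>server.xml</em> (a sketch showing only the relevant attribute, with the value discussed above):</li>
</ul>
<pre tabindex="0"><code>&lt;!-- on the existing Resource definition in server.xml (sketch) --&gt;
maxActive=&quot;75&quot;
</code></pre><ul>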
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -308,7 +308,7 @@ dspace.log.2018-01-02:34
<li>I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM</li>
<li>Looks like I need to increase the database pool size again:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -319,7 +319,7 @@ dspace.log.2018-01-03:1909
<ul>
<li>The active IPs in XMLUI are:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
607 40.77.167.141
611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
663 188.226.169.37
@ -336,12 +336,12 @@ dspace.log.2018-01-03:1909
<li>This appears to be the <a href="https://github.com/internetarchive/heritrix3">Internet Archive&rsquo;s open source bot</a></li>
<li>They seem to be re-using their Tomcat session so I don&rsquo;t need to do anything to them just yet:</li>
</ul>
<pre><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>The API logs show the normal users:</li>
</ul>
<pre><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;3/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
32 207.46.13.182
38 40.77.167.132
38 68.180.229.254
@ -356,12 +356,12 @@ dspace.log.2018-01-03:1909
<li>In other related news I see a sizeable amount of requests coming from python-requests</li>
<li>For example, just in the last day there were 1700!</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
1773
</code></pre><ul>
<li>But they come from hundreds of IPs, many of which are 54.x.x.x:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
9 54.144.87.92
9 54.146.222.143
9 54.146.249.249
@ -402,7 +402,7 @@ dspace.log.2018-01-03:1909
<li>CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot last night</li>
<li>The XMLUI logs show this activity:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;4/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;4/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
968 197.211.63.81
981 213.55.99.121
1039 66.249.64.93
@ -416,12 +416,12 @@ dspace.log.2018-01-03:1909
</code></pre><ul>
<li>Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!</li>
</ul>
<pre><code>2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
</code></pre><ul>
<li>So for this week that is the number one problem!</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -436,7 +436,7 @@ dspace.log.2018-01-04:1559
<li>Peter said that CGSpace was down last night and Tsega restarted Tomcat</li>
<li>I don&rsquo;t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-*
dspace.log.2018-01-01:0
dspace.log.2018-01-02:1972
dspace.log.2018-01-03:1909
@ -446,13 +446,13 @@ dspace.log.2018-01-05:0
<li>Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space</li>
<li>I had a look and there is one Apache 2 log file that is 73GB, with lots of this:</li>
</ul>
<pre><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &quot;9-16-1-RV.doc&quot; in &quot;/home/files/journals/6//articles/9/&quot;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
<pre tabindex="0"><code>[Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for &quot;9-16-1-RV.doc&quot; in &quot;/home/files/journals/6//articles/9/&quot;. Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
</code></pre><ul>
<li>I will delete the log file for now and tell Danny</li>
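<li>For example, emptying it in place frees the space without invalidating Apache&rsquo;s open file handle (the path here is illustrative):</li>
</ul>
<pre tabindex="0"><code># truncate -s 0 /var/log/apache2/error.log
</code></pre><ul>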
<li>Also, I&rsquo;m still seeing a hundred or so of the &ldquo;ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer&rdquo; errors in the dspace logs, so I need to search the dspace-tech mailing list to see what the cause is</li>
<li>I will run a full Discovery reindex in the mean time to see if it&rsquo;s something wrong with the Discovery Solr core</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 110m43.985s
@ -465,7 +465,7 @@ sys 3m14.890s
<ul>
<li>I&rsquo;m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:</li>
</ul>
<pre><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre><ul>
<li>I posted a message to the dspace-tech mailing list to see if anyone can help</li>
</ul>
@ -474,13 +474,13 @@ sys 3m14.890s
<li>Advise Sisay about blank lines in some IITA records</li>
<li>Generate a list of author affiliations for Peter to clean up:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4515
</code></pre><h2 id="2018-01-10">2018-01-10</h2>
<ul>
<li>I looked to see what happened to this year&rsquo;s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:</li>
</ul>
<pre><code>Moving: 81742 into core statistics-2010
<pre tabindex="0"><code>Moving: 81742 into core statistics-2010
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
@ -526,7 +526,7 @@ Caused by: java.net.SocketException: Connection reset
</code></pre><ul>
<li>DSpace Test has the same error but with creating the 2017 core:</li>
</ul>
<pre><code>Moving: 2243021 into core statistics-2017
<pre tabindex="0"><code>Moving: 2243021 into core statistics-2017
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
@ -553,7 +553,7 @@ Caused by: org.apache.http.client.ClientProtocolException
<li>I can apparently search for records in the Solr stats core that have an empty <code>owningColl</code> field using this in the Solr admin query: <code>-owningColl:*</code></li>
<li>On CGSpace I see 48,000,000 records that have an <code>owningColl</code> field and 34,000,000 that don&rsquo;t:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:48476327,&quot;start&quot;:0,&quot;docs&quot;:[
$ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:34879872,&quot;start&quot;:0,&quot;docs&quot;:[
@ -561,19 +561,19 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
<li>I tested the <code>dspace stats-util -s</code> process on my local machine and it failed the same way</li>
<li>It doesn&rsquo;t seem to be helpful, but the dspace log shows this:</li>
</ul>
<pre><code>2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
<pre tabindex="0"><code>2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2018-01-10 10:51:19,301 INFO org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
</code></pre><ul>
<li>Terry Brady has written some notes on the DSpace Wiki about Solr sharding issues: <a href="https://wiki.lyrasis.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues">https://wiki.lyrasis.org/display/%7Eterrywbrady/Statistics+Import+Export+Issues</a></li>
<li>Uptime Robot said that CGSpace went down at around 9:43 AM</li>
<li>I looked at PostgreSQL&rsquo;s <code>pg_stat_activity</code> table and saw 161 active connections, but no pool errors in the DSpace logs:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-10
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-10
0
</code></pre><ul>
<li>The XMLUI logs show quite a bit of activity today:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
951 207.46.13.159
954 157.55.39.123
1217 95.108.181.88
@ -587,17 +587,17 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
</code></pre><ul>
<li>The user agent for the top six or so IPs are all the same:</li>
</ul>
<pre><code>&quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot;
<pre tabindex="0"><code>&quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot;
</code></pre><ul>
<li><code>whois</code> says they come from <a href="http://www.perfectip.net/">Perfect IP</a></li>
<li>I&rsquo;ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
49096
</code></pre><ul>
<li>Rather than blocking their IPs, I think I might just add their user agent to the &ldquo;badbots&rdquo; zone with Baidu, because they seem to be the only ones using that user agent:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
/537.36&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
6796 70.36.107.50
11870 70.36.107.190
@ -608,13 +608,13 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&amp;wt=js
</code></pre><ul>
<li>I added the user agent to nginx&rsquo;s badbots limit req zone but upon testing the config I got an error:</li>
</ul>
<pre><code># nginx -t
<pre tabindex="0"><code># nginx -t
nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
nginx: configuration file /etc/nginx/nginx.conf test failed
</code></pre><ul>
<li>According to nginx docs the <a href="https://nginx.org/en/docs/hash.html">bucket size should be a multiple of the CPU&rsquo;s cache alignment</a>, which is 64 for us:</li>
</ul>
<pre><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
<pre tabindex="0"><code># cat /proc/cpuinfo | grep cache_alignment | head -n1
cache_alignment : 64
</code></pre><ul>
<li>On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx</li>
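<li>Roughly, the relevant nginx pieces end up looking something like this (a sketch only; the variable name, zone size, and rate are illustrative, and the real config handles more bots):</li>
</ul>
<pre tabindex="0"><code>map_hash_bucket_size 128;

map $http_user_agent $limit_bots {
    default '';
    ~Baiduspider 'bot';
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36' 'bot';
}

limit_req_zone $limit_bots zone=badbots:10m rate=1r/s;
</code></pre><ul>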
@ -637,7 +637,7 @@ cache_alignment : 64
<li>Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates</li>
<li>Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat <code>localhost_access_log</code> at the time of my sharding attempt on my test machine:</li>
</ul>
<pre><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
<pre tabindex="0"><code>127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 107
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-18YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 447
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 76
127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] &quot;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&quot; 200 63
@ -649,7 +649,7 @@ cache_alignment : 64
<li>This is apparently a common Solr error code that means &ldquo;version conflict&rdquo;: <a href="http://yonik.com/solr/optimistic-concurrency/">http://yonik.com/solr/optimistic-concurrency/</a></li>
<li>Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot; | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep &quot;Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36&quot; | grep &quot;10/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
21572 70.36.107.50
30722 70.36.107.190
34566 70.36.107.49
@ -659,7 +659,7 @@ cache_alignment : 64
</code></pre><ul>
<li>Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat&rsquo;s <code>server.xml</code>:</li>
</ul>
<pre><code>&lt;Resource name=&quot;jdbc/dspaceWeb&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
<pre tabindex="0"><code>&lt;Resource name=&quot;jdbc/dspaceWeb&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
driverClassName=&quot;org.postgresql.Driver&quot;
url=&quot;jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb&quot;
username=&quot;dspace&quot;
@ -677,7 +677,7 @@ cache_alignment : 64
<li>Also, I realized that the <code>db.jndi</code> parameter in dspace.cfg needs to match the <code>name</code> value in your application&rsquo;s context—not the <code>global</code> one</li>
<li>Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:</li>
</ul>
<pre><code>db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
<pre tabindex="0"><code>db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
</code></pre><ul>
<li>With that it is super easy to see where PostgreSQL connections are coming from in <code>pg_stat_activity</code></li>
</ul>
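<ul>
<li>For example, a quick grouped count in <code>pg_stat_activity</code> shows which pool each connection belongs to (a sketch of the kind of query I mean):</li>
</ul>
<pre tabindex="0"><code>dspace=# select application_name, count(*) as count from pg_stat_activity group by application_name order by count desc;
</code></pre>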
@ -685,7 +685,7 @@ cache_alignment : 64
<ul>
<li>I&rsquo;m looking at the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Installing+DSpace#InstallingDSpace-ServletEngine(ApacheTomcat7orlater,Jetty,CauchoResinorequivalent)">DSpace 6.0 Install docs</a> and notice they tweak the number of threads in their Tomcat connector:</li>
</ul>
<pre><code>&lt;!-- Define a non-SSL HTTP/1.1 Connector on port 8080 --&gt;
<pre tabindex="0"><code>&lt;!-- Define a non-SSL HTTP/1.1 Connector on port 8080 --&gt;
&lt;Connector port=&quot;8080&quot;
maxThreads=&quot;150&quot;
minSpareThreads=&quot;25&quot;
@ -702,7 +702,7 @@ cache_alignment : 64
<li>Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don&rsquo;t need to specify that manually anymore: <a href="https://tomcat.apache.org/tomcat-8.5-doc/config/http.html">https://tomcat.apache.org/tomcat-8.5-doc/config/http.html</a></li>
<li>Ooh, I just saw the <code>acceptorThreadCount</code> setting (in Tomcat 7 and 8.5):</li>
</ul>
<pre><code>The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
<pre tabindex="0"><code>The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
</code></pre><ul>
<li>That could be very interesting</li>
</ul>
@ -711,7 +711,7 @@ cache_alignment : 64
<li>Still testing DSpace 6.2 on Tomcat 8.5.24</li>
<li>Catalina errors at Tomcat 8.5 startup:</li>
</ul>
<pre><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of &quot;35&quot; for &quot;maxActive&quot; property, which is being ignored.
<pre tabindex="0"><code>13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of &quot;35&quot; for &quot;maxActive&quot; property, which is being ignored.
13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of &quot;5000&quot; for &quot;maxWait&quot; property, which is being ignored.
</code></pre><ul>
<li>I looked in my Tomcat 7.0.82 logs and I don&rsquo;t see anything about DBCP2 errors, so I guess this is a Tomcat 8.0.x or 8.5.x thing</li>
@ -719,7 +719,7 @@ cache_alignment : 64
<li>I have updated our <a href="https://github.com/ilri/rmg-ansible-public/commit/246f9d7b06d53794f189f0cc57ad5ddd80f0b014">Ansible infrastructure scripts</a> so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)</li>
<li>When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:</li>
</ul>
<pre><code>13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
<pre tabindex="0"><code>13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
java.lang.ExceptionInInitializerError
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
@ -761,7 +761,7 @@ Caused by: java.lang.NullPointerException
<li>Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload</li>
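<li>That kind of export can be done with DSpace&rsquo;s metadata export tool against the relevant collection (the handle here is just a placeholder):</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/XXXXX -f /tmp/iwmi.csv
</code></pre><ul>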
<li>I&rsquo;m going to apply these ~130 corrections on CGSpace:</li>
</ul>
<pre><code>update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
<pre tabindex="0"><code>update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
@ -777,11 +777,11 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and
<ul>
<li>Apply corrections using <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a>:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>In looking at some of the values to delete or check I found some metadata values that I could not resolve their handle via SQL:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
2757936 | 4369 | 3 | Tarawali | | 9 | | 600 | 2
@ -796,7 +796,7 @@ dspace=# select handle from item, handle where handle.resource_id = item.item_id
<li>Otherwise, the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL Helper Functions</a> provide <code>ds5_item2itemhandle()</code>, which is much easier than my long query above that I always have to go search for</li>
<li>For example, to find the Handle for an item that has the author &ldquo;Erni&rdquo;:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
2612150 | 70308 | 3 | Erni | | 9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 | -1 | 2
@ -809,16 +809,16 @@ dspace=# select ds5_item2itemhandle(70308);
</code></pre><ul>
<li>Next I apply the author deletions:</li>
</ul>
<pre><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Now working on the affiliation corrections from Peter:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Now I made a new list of affiliations for Peter to look through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
</code></pre><ul>
<li>Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)</li>
@ -828,11 +828,11 @@ COPY 4552
<li>Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder</li>
<li>CGSpace users were having problems logging in; I think something&rsquo;s wrong with LDAP because I see this in the logs:</li>
</ul>
<pre><code>2018-01-15 12:53:15,810 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
<pre tabindex="0"><code>2018-01-15 12:53:15,810 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
</code></pre><ul>
<li>Looks like we processed 2.9 million requests on CGSpace in 2017-12:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Dec/2017&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Dec/2017&quot;
2890041
real 0m25.756s
@ -864,14 +864,14 @@ sys 0m2.210s
<li>Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses</li>
<li>In any case, importing them like this:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &amp;&gt; lives.log
</code></pre><ul>
<li>And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload</li>
<li>When I looked there were 210 PostgreSQL connections!</li>
<li>I don&rsquo;t see any high load in XMLUI or REST/OAI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &quot;17/Jan/2018&quot; | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
381 40.77.167.124
403 213.55.99.121
431 207.46.13.60
@ -896,13 +896,13 @@ $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFor
</code></pre><ul>
<li>But I do see this strange message in the dspace log:</li>
</ul>
<pre><code>2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}-&gt;http://localhost:8081: The target server failed to respond
<pre tabindex="0"><code>2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}-&gt;http://localhost:8081: The target server failed to respond
2018-01-17 07:59:25,856 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
</code></pre><ul>
<li>I have NEVER seen this error before, and there is no error before or after that in DSpace&rsquo;s solr.log</li>
<li>Tomcat&rsquo;s catalina.out does show something interesting, though, right at that time:</li>
</ul>
<pre><code>[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
<pre tabindex="0"><code>[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
[====================&gt; ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
[====================&gt; ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
@ -943,7 +943,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-627&quot; java.lang.OutOf
<li>You can see the timestamp above, which is some Atmire nightly task I think, but I can&rsquo;t figure out which one</li>
<li>So I restarted Tomcat and tried the import again, which finished very quickly and without errors!</li>
</ul>
<pre><code>$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &amp;&gt; lives2.log
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &amp;&gt; lives2.log
</code></pre><ul>
<li>Looking at the JVM graphs from Munin it does look like the heap ran out of memory (see the blue dip just before the green spike when I restarted Tomcat):</li>
</ul>
@ -951,7 +951,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-627&quot; java.lang.OutOf
<ul>
<li>I&rsquo;m playing with maven repository caching using Artifactory in a Docker instance: <a href="https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker">https://www.jfrog.com/confluence/display/RTF/Installing+with+Docker</a></li>
</ul>
<pre><code>$ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
<pre tabindex="0"><code>$ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
$ docker volume create --name artifactory5_data
$ docker network create dspace-build
$ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest
@ -961,11 +961,11 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
<li>Wow, I even managed to add the Atmire repository as a remote and map it into the <code>libs-release</code> virtual repository, then tell maven to use it for <code>atmire.com-releases</code> in settings.xml (see the sketch below)!</li>
<li>Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:</li>
</ul>
<pre><code>[ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -&gt; org.apache.abdera:abdera-client:jar:1.1.1 -&gt; org.apache.abdera:abdera-core:jar:1.1.1 -&gt; org.apache.abdera:abdera-i18n:jar:1.1.1 -&gt; org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -&gt; [Help 1]
<pre tabindex="0"><code>[ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -&gt; org.apache.abdera:abdera-client:jar:1.1.1 -&gt; org.apache.abdera:abdera-core:jar:1.1.1 -&gt; org.apache.abdera:abdera-i18n:jar:1.1.1 -&gt; org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -&gt; [Help 1]
</code></pre><ul>
<li>I never noticed because I build with that web application disabled:</li>
</ul>
<pre><code>$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
<pre tabindex="0"><code>$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
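<li>For reference, pointing Maven at the local Artifactory like that could look roughly like this in <code>~/.m2/settings.xml</code> (a minimal sketch; the <code>mirrorOf</code> mapping is an assumption, and the URL is the one from the error above):</li>
</ul>
<pre tabindex="0"><code>&lt;settings&gt;
  &lt;mirrors&gt;
    &lt;mirror&gt;
      &lt;id&gt;artifactory&lt;/id&gt;
      &lt;mirrorOf&gt;atmire.com-releases&lt;/mirrorOf&gt;
      &lt;url&gt;http://localhost:8081/artifactory/libs-release&lt;/url&gt;
    &lt;/mirror&gt;
  &lt;/mirrors&gt;
&lt;/settings&gt;
</code></pre><ul>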
<li>UptimeRobot said CGSpace went down for a few minutes</li>
<li>I didn&rsquo;t do anything but it came back up on its own</li>
@ -973,7 +973,7 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
<li>Now Linode alert says the CPU load is high, <em>sigh</em></li>
<li>Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I&rsquo;m not sure how far these logs go back, as they are not strictly daily):</li>
</ul>
<pre><code># zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
<pre tabindex="0"><code># zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
/var/log/tomcat7/catalina.out:2
/var/log/tomcat7/catalina.out.10.gz:7
/var/log/tomcat7/catalina.out.11.gz:1
@ -1004,7 +1004,7 @@ $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/
<li>I don&rsquo;t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499</li>
<li>I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the <a href="https://cgspace.cgiar.org/handle/10568/35501">Bioversity Journal Articles</a> collection</li>
@ -1012,7 +1012,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspa
<li>Use this GREL in OpenRefine after isolating all the Limited Access items: <code>value.startsWith(&quot;10568/35501&quot;)</code></li>
<li>UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me <em>or</em> each other!</li>
</ul>
<pre><code>Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
<pre tabindex="0"><code>Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
@ -1026,14 +1026,14 @@ Jan 18 07:01:22 linode18 sudo[10812]: pam_unix(sudo:session): session opened for
<li>Linode alerted and said that the CPU load was 264.1% on CGSpace</li>
<li>Start the Discovery indexing again:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre><ul>
<li>Linode alerted again and said that CGSpace was using 301% CPU</li>
<li>Peter emailed to ask why <a href="https://cgspace.cgiar.org/handle/10568/88090">this item</a> doesn&rsquo;t have an Altmetric badge on CGSpace but does have one on the <a href="https://www.altmetric.com/details/26709041">Altmetric dashboard</a></li>
<li>Looks like our badge code calls the <code>handle</code> endpoint which doesn&rsquo;t exist:</li>
</ul>
<pre><code>https://api.altmetric.com/v1/handle/10568/88090
<pre tabindex="0"><code>https://api.altmetric.com/v1/handle/10568/88090
</code></pre><ul>
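<li>One way to double-check what that endpoint actually returns would be a quick <code>curl</code> against it (just a sketch of the check):</li>
</ul>
<pre tabindex="0"><code>$ curl -sI 'https://api.altmetric.com/v1/handle/10568/88090'
</code></pre><ul>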
<li>I told Peter we should keep an eye out and try again next week</li>
</ul>
@ -1041,7 +1041,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspa
<ul>
<li>Run the authority indexing script on CGSpace and of course it died:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
@ -1071,7 +1071,7 @@ sys 0m12.317s
<li>In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts</li>
<li>I want to document the workflow of adding a production PostgreSQL database to a development instance of <a href="https://github.com/alanorth/docker-dspace">DSpace in Docker</a>:</li>
</ul>
<pre><code>$ docker exec dspace_db dropdb -U postgres dspace
<pre tabindex="0"><code>$ docker exec dspace_db dropdb -U postgres dspace
$ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
$ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
$ docker cp test.dump dspace_db:/tmp/test.dump
@ -1099,7 +1099,7 @@ $ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
<li>The source code is here: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
<li>Peter had said that he found a bunch of ILRI collections that were called &ldquo;untitled&rdquo;, but I don&rsquo;t see any:</li>
</ul>
<pre><code>$ ./rest-find-collections.py 10568/1 | wc -l
<pre tabindex="0"><code>$ ./rest-find-collections.py 10568/1 | wc -l
308
$ ./rest-find-collections.py 10568/1 | grep -i untitled
</code></pre><ul>
@ -1119,12 +1119,12 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
<li>Thinking about generating a jmeter test plan for DSpace, along the lines of <a href="https://github.com/Georgetown-University-Libraries/dspace-performance-test">Georgetown&rsquo;s dspace-performance-test</a></li>
<li>I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -c -v &quot;/admin&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -c -v &quot;/admin&quot;
56405
</code></pre><ul>
<li>Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo &quot;^/(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo &quot;^/(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
38 /oai/
14406 /bitstream/
15179 /rest/
@ -1132,14 +1132,14 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
</code></pre><ul>
<li>And 3% were to the homepage or search:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
1050 /
413 /discover
170 /open-search
</code></pre><ul>
<li>The last 10% or so seem to be for static assets that would be served by nginx anyways:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
2 .gif
7 .css
84 .js
@ -1153,7 +1153,7 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
<ul>
<li>Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -E &quot;^/rest&quot; | grep -Eo &quot;(retrieve|expand=[a-z].*)&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep &quot;21/Jan/2018&quot; | grep &quot;GET &quot; | grep -v &quot;/admin&quot; | awk '{print $7}' | grep -E &quot;^/rest&quot; | grep -Eo &quot;(retrieve|expand=[a-z].*)&quot; | sort | uniq -c | sort -n
1 expand=collections
16 expand=all&amp;limit=1
45 expand=items
@ -1163,12 +1163,12 @@ $ ./rest-find-collections.py 10568/1 | grep -i untitled
</code></pre><ul>
<li>I finished creating the test plan for DSpace Test and ran it from my Linode with:</li>
</ul>
<pre><code>$ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
<pre tabindex="0"><code>$ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
</code></pre><ul>
<li>Atmire responded to <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">my issue from two weeks ago</a> and said they will start looking into DSpace 5.8 compatibility for CGSpace</li>
<li>I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:</li>
</ul>
<pre><code># lscpu
<pre tabindex="0"><code># lscpu
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
@ -1212,19 +1212,19 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
</code></pre><ul>
<li>Then I generated reports for these runs like this:</li>
</ul>
<pre><code>$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
<pre tabindex="0"><code>$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
</code></pre><h2 id="2018-01-25">2018-01-25</h2>
<ul>
<li>Run another round of tests on DSpace Test with jmeter after changing Tomcat&rsquo;s <code>minSpareThreads</code> to 20 (default is 10) and <code>acceptorThreadCount</code> to 2 (default is 1; see the connector sketch below):</li>
</ul>
<pre><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
<pre tabindex="0"><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log
</code></pre><ul>
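<li>For reference, those two settings live on the Tomcat connector in <code>server.xml</code>; a minimal sketch (the address, port, and protocol are assumed from the <code>http-bio-127.0.0.1-8443</code> connector name seen in the logs, not copied from the deployed config):</li>
</ul>
<pre tabindex="0"><code>&lt;Connector port=&quot;8443&quot; address=&quot;127.0.0.1&quot;
           protocol=&quot;org.apache.coyote.http11.Http11Protocol&quot;
           minSpareThreads=&quot;20&quot;
           acceptorThreadCount=&quot;2&quot; /&gt;
</code></pre><ul>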
<li>I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests</li>
<li>JVM options for Tomcat changed from <code>-Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC</code> to <code>-Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem</code></li>
</ul>
<pre><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
<pre tabindex="0"><code>$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
</code></pre><ul>
@ -1242,7 +1242,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
<li>The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn&rsquo;t possible (?)</li>
<li>So I used some creativity and made several fields display values, but not store any, i.e.:</li>
</ul>
<pre><code>&lt;pair&gt;
<pre tabindex="0"><code>&lt;pair&gt;
&lt;displayed-value&gt;For products published by another party:&lt;/displayed-value&gt;
&lt;stored-value&gt;&lt;/stored-value&gt;
&lt;/pair&gt;
@ -1267,7 +1267,7 @@ $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.j
<li>CGSpace went down this morning for a few minutes, according to UptimeRobot</li>
<li>Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:</li>
</ul>
<pre><code>2018-01-29 05:30:22,226 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
<pre tabindex="0"><code>2018-01-29 05:30:22,226 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
Was expecting one of:
&quot;TO&quot; ...
@ -1284,12 +1284,12 @@ Was expecting one of:
<li>I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early</li>
<li>Perhaps this from the nginx error log is relevant?</li>
</ul>
<pre><code>2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: &quot;GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1&quot;, upstream: &quot;http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12&quot;, host: &quot;cgspace.cgiar.org&quot;
<pre tabindex="0"><code>2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: &quot;GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1&quot;, upstream: &quot;http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12&quot;, host: &quot;cgspace.cgiar.org&quot;
</code></pre><ul>
<li>I think that must be unrelated, probably the client closed the request to nginx because DSpace (Tomcat) was taking too long</li>
<li>An interesting <a href="https://gist.github.com/magnetikonline/11312172">snippet to get the maximum and average nginx responses</a>:</li>
</ul>
<pre><code># awk '($9 ~ /200/) { i++;sum+=$10;max=$10&gt;max?$10:max; } END { printf(&quot;Maximum: %d\nAverage: %d\n&quot;,max,i?sum/i:0); }' /var/log/nginx/access.log
<pre tabindex="0"><code># awk '($9 ~ /200/) { i++;sum+=$10;max=$10&gt;max?$10:max; } END { printf(&quot;Maximum: %d\nAverage: %d\n&quot;,max,i?sum/i:0); }' /var/log/nginx/access.log
Maximum: 2771268
Average: 210483
</code></pre><ul>
@ -1297,7 +1297,7 @@ Average: 210483
<li>My best guess is that the Solr search error is related somehow but I can&rsquo;t figure it out</li>
<li>We definitely have enough database connections, as I haven&rsquo;t seen a pool error in weeks:</li>
</ul>
<pre><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-2*
<pre tabindex="0"><code>$ grep -c &quot;Timeout: Pool empty.&quot; dspace.log.2018-01-2*
dspace.log.2018-01-20:0
dspace.log.2018-01-21:0
dspace.log.2018-01-22:0
@ -1326,7 +1326,7 @@ dspace.log.2018-01-29:0
<li>Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides &ldquo;tomcat_jvm&rdquo; to work (the connector name can be seen in the Catalina logs)</li>
<li>I modified <em>/etc/munin/plugin-conf.d/tomcat</em> to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):</li>
</ul>
<pre><code>[tomcat_*]
<pre tabindex="0"><code>[tomcat_*]
env.host 127.0.0.1
env.port 8081
env.connector &quot;http-bio-127.0.0.1-8443&quot;
@ -1335,7 +1335,7 @@ dspace.log.2018-01-29:0
</code></pre><ul>
<li>For example, I can see the threads:</li>
</ul>
<pre><code># munin-run tomcat_threads
<pre tabindex="0"><code># munin-run tomcat_threads
busy.value 0
idle.value 20
max.value 400
@ -1345,18 +1345,18 @@ max.value 400
<li>Although following the logic of <em>/usr/share/munin/plugins/jmx_tomcat_dbpools</em> could be useful for getting the active Tomcat sessions</li>
<li>From debugging the <code>jmx_tomcat_db_pools</code> script from the <code>munin-plugins-java</code> package, I see that this is how you call arbitrary mbeans:</li>
</ul>
<pre><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
<pre tabindex="0"><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
Catalina:type=DataSource,class=javax.sql.DataSource,name=&quot;jdbc/dspace&quot; maxActive 300
</code></pre><ul>
<li>More notes here: <a href="https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx">https://github.com/munin-monitoring/contrib/tree/master/plugins/jmx</a></li>
<li>Looking at the Munin graphs, I see that the load is 200% every morning from 03:00 to almost 08:00</li>
<li>Tomcat&rsquo;s catalina.out log file is full of spam from this thing too, with lines like this:</li>
</ul>
<pre><code>[===================&gt; ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
<pre tabindex="0"><code>[===================&gt; ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
</code></pre><ul>
<li>There are millions of these status lines, for example in just this one log file:</li>
</ul>
<pre><code># zgrep -c &quot;time remaining&quot; /var/log/tomcat7/catalina.out.1.gz
<pre tabindex="0"><code># zgrep -c &quot;time remaining&quot; /var/log/tomcat7/catalina.out.1.gz
1084741
</code></pre><ul>
<li>I filed a ticket with Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566</a></li>
@ -1370,26 +1370,26 @@ Catalina:type=DataSource,class=javax.sql.DataSource,name=&quot;jdbc/dspace&quot;
<li>Now PostgreSQL activity shows 308 connections!</li>
<li>Well this is interesting, there are 400 Tomcat threads busy:</li>
</ul>
<pre><code># munin-run tomcat_threads
<pre tabindex="0"><code># munin-run tomcat_threads
busy.value 400
idle.value 0
max.value 400
</code></pre><ul>
<li>And wow, we finally exhausted the database connections, from dspace.log:</li>
</ul>
<pre><code>2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
</code></pre><ul>
<li>Now even the nightly Atmire background thing is getting an HTTP 500 error:</li>
</ul>
<pre><code>Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
<pre tabindex="0"><code>Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
</code></pre><ul>
<li>For now I will restart Tomcat to clear this shit and bring the site back up</li>
<li>The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;31/Jan/2018:(07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;31/Jan/2018:(07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
67 66.249.66.70
70 207.46.13.12
71 197.210.168.174
@ -1426,7 +1426,7 @@ javax.ws.rs.WebApplicationException
<li>I should make separate database pools for the web applications and the API applications like REST and OAI (see the sketch below)</li>
<li>Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat&rsquo;s activeSessions from JMX (using <code>munin-plugins-java</code>):</li>
</ul>
<pre><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
<pre tabindex="0"><code># port=5400 ip=&quot;127.0.0.1&quot; /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
Catalina:type=Manager,context=/,host=localhost activeSessions 8
</code></pre><ul>
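<li>Going back to the separate pools idea: a rough sketch of how two JNDI pools could be defined in Tomcat&rsquo;s <code>server.xml</code> (the pool names match what later shows up in <code>pg_stat_activity</code>; the attributes, limits, and the <code>ApplicationName</code> URL parameter are assumptions, not the deployed config):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&quot;jdbc/dspaceWeb&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
          driverClassName=&quot;org.postgresql.Driver&quot;
          url=&quot;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceWeb&quot;
          username=&quot;dspace&quot; password=&quot;xxxx&quot;
          maxActive=&quot;250&quot; maxIdle=&quot;20&quot; validationQuery=&quot;SELECT 1&quot; testOnBorrow=&quot;true&quot;/&gt;
&lt;Resource name=&quot;jdbc/dspaceApi&quot; auth=&quot;Container&quot; type=&quot;javax.sql.DataSource&quot;
          driverClassName=&quot;org.postgresql.Driver&quot;
          url=&quot;jdbc:postgresql://localhost:5432/dspace?ApplicationName=dspaceApi&quot;
          username=&quot;dspace&quot; password=&quot;xxxx&quot;
          maxActive=&quot;50&quot; maxIdle=&quot;10&quot; validationQuery=&quot;SELECT 1&quot; testOnBorrow=&quot;true&quot;/&gt;
</code></pre><ul>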
<li>If you connect to Tomcat in <code>jvisualvm</code> it&rsquo;s pretty obvious when you hover over the elements</li>

View File

@ -30,7 +30,7 @@ We don&rsquo;t need to distinguish between internal and external works, so that
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -128,7 +128,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-pl
<li>Run all system updates and reboot DSpace Test</li>
<li>Wow, I packaged up the <code>jmx_dspace_sessions</code> stuff in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> and deployed it on CGSpace and it totally works:</li>
</ul>
<pre><code># munin-run jmx_dspace_sessions
<pre tabindex="0"><code># munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
@ -139,12 +139,12 @@ v_oai.value 0
<li>I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January</li>
<li>After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:</li>
</ul>
<pre><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Then I started a full Discovery reindex:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 96m39.823s
user 14m10.975s
@ -152,12 +152,12 @@ sys 2m29.088s
</code></pre><ul>
<li>Generate a new list of affiliations for Peter to sort through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
</code></pre><ul>
<li>Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in <a href="/cgspace-notes/2017-12/">December</a>:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2018&quot;
3126109
real 0m23.839s
@ -167,14 +167,14 @@ sys 0m1.905s
<ul>
<li>Toying with correcting authors with trailing spaces via PostgreSQL:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
</code></pre><ul>
<li>I tried the <code>TRIM(TRAILING from text_value)</code> function and it said it changed 20 items but the spaces didn&rsquo;t go away</li>
<li>This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.</li>
<li>Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
</code></pre><h2 id="2018-02-06">2018-02-06</h2>
<ul>
@ -182,7 +182,7 @@ COPY 55630
<li>I see 308 PostgreSQL connections in <code>pg_stat_activity</code></li>
<li>The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:</li>
</ul>
<pre><code># date
<pre tabindex="0"><code># date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;6/Feb/2018:(08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 223.185.41.40
@ -232,7 +232,7 @@ Tue Feb 6 09:30:32 UTC 2018
<li>CGSpace crashed again, this time around <code>Wed Feb 7 11:20:28 UTC 2018</code></li>
<li>I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' &gt; /tmp/pg_stat_activity.txt
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' &gt; /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
@ -242,7 +242,7 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
</code></pre><ul>
<li>Interestingly, all of those 751 connections were idle!</li>
</ul>
<pre><code>$ grep &quot;PostgreSQL JDBC&quot; /tmp/pg_stat_activity* | grep -c idle
<pre tabindex="0"><code>$ grep &quot;PostgreSQL JDBC&quot; /tmp/pg_stat_activity* | grep -c idle
751
</code></pre><ul>
<li>Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps</li>
@ -252,17 +252,17 @@ $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
<ul>
<li>Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
</code></pre><ul>
<li>CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)</li>
<li>What&rsquo;s interesting is that the DSpace log says the connections are all busy:</li>
</ul>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
<pre tabindex="0"><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>&hellip; but in PostgreSQL I see them <code>idle</code> or <code>idle in transaction</code>:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
@ -274,13 +274,13 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle
<li>I will try <code>testOnReturn='true'</code> too, just to add more validation, because I&rsquo;m fucking grasping at straws</li>
<li>Also, WTF, there was a heap space error randomly in catalina.out:</li>
</ul>
<pre><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
<pre tabindex="0"><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-58&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!</li>
<li>Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
34 ip_addr=46.229.168.67
34 ip_addr=46.229.168.73
37 ip_addr=46.229.168.76
@ -304,7 +304,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-58&quot; java.lang.OutOfM
</code></pre><ul>
<li>These IPs made thousands of sessions today:</li>
</ul>
<pre><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
@ -342,11 +342,11 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>What in the actual fuck, why is our load doing this? It&rsquo;s gotta be something fucked up with the database pool being &ldquo;busy&rdquo; but everything is fucking idle</li>
<li>One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:</li>
</ul>
<pre><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
<pre tabindex="0"><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
</code></pre><ul>
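<li>One way to tag (or later block) that crawler in nginx would be a <code>map</code> on the user agent; a minimal sketch, not the actual config, and the <code>$ua_is_bot</code> variable name is just an assumption:</li>
</ul>
<pre tabindex="0"><code>map $http_user_agent $ua_is_bot {
    default    0;
    ~*bubing   1;
}
</code></pre><ul>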
<li>This one makes two thousand requests per day or so recently:</li>
</ul>
<pre><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
</code></pre><ul>
@ -355,13 +355,13 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker</li>
<li>This is how the connections looked when it crashed this afternoon:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
290 dspaceWeb
</code></pre><ul>
<li>This is how it is right now:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
5 dspaceWeb
</code></pre><ul>
@ -378,11 +378,11 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>Switch authority.controlled off and change the presentation from authorLookup to lookup, and the ORCID badge doesn&rsquo;t show up on the item</li>
<li>Leave all settings but change choices.presentation to lookup, and the ORCID badge is there and item submission uses LC Name Authority, but it breaks with this error:</li>
</ul>
<pre><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
<pre tabindex="0"><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
</code></pre><ul>
<li>If I change choices.presentation to suggest it gives this error:</li>
</ul>
<pre><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
<pre tabindex="0"><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
</code></pre><ul>
<li>So I don&rsquo;t think we can disable the ORCID lookup function and keep the ORCID badges</li>
</ul>
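<ul>
<li>For context, these are the relevant authority/choices keys in <code>dspace.cfg</code> being toggled above (a rough sketch; the <code>SolrAuthorAuthority</code> plugin name here is an assumption):</li>
</ul>
<pre tabindex="0"><code>authority.controlled.dc.contributor.author = true
choices.plugin.dc.contributor.author = SolrAuthorAuthority
choices.presentation.dc.contributor.author = authorLookup
</code></pre>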
@ -394,12 +394,12 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<ul>
<li>I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:</li>
</ul>
<pre><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
<pre tabindex="0"><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
</code></pre><p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.jpg" alt="Manual thumbnail"></p>
<ul>
<li>Peter sent me corrected author names last week but the file encoding is messed up:</li>
</ul>
<pre><code>$ isutf8 authors-2018-02-05.csv
<pre tabindex="0"><code>$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
</code></pre><ul>
<li>The <code>isutf8</code> program comes from <code>moreutils</code></li>
@ -409,18 +409,18 @@ authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan&rsquo;s names in December</li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&rdquo;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
@ -464,7 +464,7 @@ dspace=# commit;
<li>I see that in <a href="/cgspace-notes/2017-04/">April, 2017</a> I just used a SQL query to get a user&rsquo;s submissions by checking the <code>dc.description.provenance</code> field</li>
<li>So for Abenet, I can check her submissions in December, 2017 with:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
</code></pre><ul>
<li>I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it</li>
<li>This would be using <a href="https://www.linode.com/blockstorage">Linode&rsquo;s new block storage volumes</a></li>
@ -477,14 +477,14 @@ dspace=# commit;
<li>Peter said he was getting a &ldquo;socket closed&rdquo; error on CGSpace</li>
<li>I looked in the dspace.log.2018-02-13 and saw one recent one:</li>
</ul>
<pre><code>2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
...
Caused by: java.net.SocketException: Socket closed
</code></pre><ul>
<li>Could be because of the <code>removeAbandoned=&quot;true&quot;</code> that I enabled in the JDBC connection pool last week?</li>
</ul>
<pre><code>$ grep -c &quot;java.net.SocketException: Socket closed&quot; dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c &quot;java.net.SocketException: Socket closed&quot; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
@ -503,7 +503,7 @@ dspace.log.2018-02-13:4
<li>I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned</li>
<li>Peter hit this issue one more time, and this is apparently what Tomcat&rsquo;s catalina.out log says when an abandoned connection is removed:</li>
</ul>
<pre><code>Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
<pre tabindex="0"><code>Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
</code></pre><h2 id="2018-02-14">2018-02-14</h2>
<ul>
@ -521,21 +521,21 @@ WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgCo
<li>Atmire responded on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 compatibility ticket</a> and said they will let me know if they want me to give them a clean 5.8 branch</li>
<li>I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:</li>
</ul>
<pre><code>$ sort cgspace-orcids.txt &gt; dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ sort cgspace-orcids.txt &gt; dspace/config/controlled-vocabularies/cg-creator-id.xml
$ add XML formatting...
$ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>It seems the tidy fucks up accents, for example it turns <code>Adriana Tofiño (0000-0001-7115-7169)</code> into <code>Adriana TofiÃ±o (0000-0001-7115-7169)</code></li>
<li>We need to force UTF-8:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>This preserves special accent characters</li>
<li>I tested the display and store of these in the XMLUI and PostgreSQL and it looks good</li>
<li>Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+</li>
<li>Peter combined it with mine and we have 1204 unique ORCIDs!</li>
</ul>
<pre><code>$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
<pre tabindex="0"><code>$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
1204
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
1204
@ -543,19 +543,19 @@ $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_c
<li>Also, save that regex for the future because it will be very useful!</li>
<li>CIAT sent a list of their authors' ORCIDs and combined with ours there are now 1227:</li>
</ul>
<pre><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1227
</code></pre><ul>
<li>There are some formatting issues with names in Peter&rsquo;s list, so I should remember to re-generate the list of names from ORCID&rsquo;s API once we&rsquo;re done</li>
<li>The <code>dspace cleanup -v</code> currently fails on CGSpace with the following:</li>
</ul>
<pre><code> - Deleting bitstream record from database (ID: 149473)
<pre tabindex="0"><code> - Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(149473) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is to update the bundle table, as I&rsquo;ve discovered several other times in 2016 and 2017:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
UPDATE 1
</code></pre><ul>
<li>Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all</li>
@ -575,25 +575,25 @@ UPDATE 1
<li>I only looked quickly in the logs but saw a bunch of database errors</li>
<li>PostgreSQL connections are currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
2 dspaceApi
1 dspaceWeb
3 dspaceApi
</code></pre><ul>
<li>I see shitloads of memory errors in Tomcat&rsquo;s logs:</li>
</ul>
<pre><code># grep -c &quot;Java heap space&quot; /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c &quot;Java heap space&quot; /var/log/tomcat7/catalina.out
56
</code></pre><ul>
<li>And shit tons of database connections abandoned:</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
612
</code></pre><ul>
<li>I have no fucking idea why it crashed</li>
<li>The XMLUI activity looks like:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;15/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;15/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
715 63.143.42.244
746 213.55.99.121
886 68.180.228.157
@ -610,7 +610,7 @@ UPDATE 1
<li>I made a pull request to fix it (<a href="https://github.com/ilri/DSpace/pull/354">#354</a>)</li>
<li>I should remember to update existing values in PostgreSQL too:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
UPDATE 2
</code></pre><h2 id="2018-02-18">2018-02-18</h2>
<ul>
@ -624,7 +624,7 @@ UPDATE 2
<li>Run system updates on DSpace Test (linode02) and reboot the server</li>
<li>Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:</li>
</ul>
<pre><code>$ wc -l dspace.log.2018-02-1{0..8}
<pre tabindex="0"><code>$ wc -l dspace.log.2018-02-1{0..8}
383483 dspace.log.2018-02-10
275022 dspace.log.2018-02-11
249557 dspace.log.2018-02-12
@ -638,13 +638,13 @@ UPDATE 2
<li>From an average of a few hundred thousand to over four million lines in the DSpace log?</li>
<li>Using grep&rsquo;s <code>-B1</code> I can see the line before the heap space error, which has the time, i.e.:</li>
</ul>
<pre><code>2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So these errors happened at hours 16, 18, 19, and 20</li>
<li>Let&rsquo;s see what was going on in nginx then:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
168571
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | wc -l
8188
@ -652,7 +652,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<li>Only 8,000 requests during those four hours, out of 170,000 the whole day!</li>
<li>And the usage of XMLUI, REST, and OAI looks SUPER boring:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E &quot;15/Feb/2018:(16|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
111 95.108.181.88
158 45.5.184.221
201 104.196.152.243
@ -677,20 +677,20 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<ul>
<li>Combined list of CGIAR author ORCID iDs is up to 1,500:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1571
</code></pre><ul>
<li>I updated my <code>resolve-orcids-from-solr.py</code> script to be able to resolve ORCID identifiers from a text file so I renamed it to <code>resolve-orcids.py</code></li>
<li>Also, I updated it so it uses several new options:</li>
</ul>
<pre><code>$ ./resolve-orcids.py -i input.txt -o output.txt
<pre tabindex="0"><code>$ ./resolve-orcids.py -i input.txt -o output.txt
$ cat output.txt
Ali Ramadhan: 0000-0001-5019-1368
Ahmad Maryudi: 0000-0001-5051-7217
</code></pre><ul>
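<li>For reference, the heart of that lookup is roughly the following (a minimal sketch, assuming the public ORCID API v2.1 and the Python <code>requests</code> library; the real script handles more options and errors):</li>
</ul>
<pre tabindex="0"><code>import requests

def resolve_orcid(orcid_id):
    # fetch the public /person record for this ORCID iD
    url = 'https://pub.orcid.org/v2.1/{0}/person'.format(orcid_id)
    response = requests.get(url, headers={'Accept': 'application/json'})
    name = response.json()['name']
    given = name['given-names']['value']
    family = name['family-name']['value']
    return '{0} {1}: {2}'.format(given, family, orcid_id)

print(resolve_orcid('0000-0001-5019-1368'))  # Ali Ramadhan: 0000-0001-5019-1368
</code></pre><ul>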
<li>I was running this on the new list of 1571 and found an error:</li>
</ul>
<pre><code>Looking up the name associated with ORCID iD: 0000-0001-9634-1958
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0001-9634-1958
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 111, in &lt;module&gt;
read_identifiers_from_file()
@ -704,7 +704,7 @@ TypeError: 'NoneType' object is not subscriptable
<li>I fixed the script so that it checks if the family name is null</li>
<li>Now another:</li>
</ul>
<pre><code>Looking up the name associated with ORCID iD: 0000-0002-1300-3636
<pre tabindex="0"><code>Looking up the name associated with ORCID iD: 0000-0002-1300-3636
Traceback (most recent call last):
File &quot;./resolve-orcids.py&quot;, line 117, in &lt;module&gt;
read_identifiers_from_file()
@ -722,13 +722,13 @@ TypeError: 'NoneType' object is not subscriptable
<li>Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet, and I think we&rsquo;ll now only use ORCID iDs that have been sent to us by partners, not those extracted via keyword searches on orcid.org</li>
<li>This should be the version we use (the existing controlled vocabulary generated from CGSpace&rsquo;s Solr authority core plus the IDs sent to us so far by partners):</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; 2018-02-20-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; 2018-02-20-combined.txt
</code></pre><ul>
<li>I updated the <code>resolve-orcids.py</code> to use the &ldquo;credit-name&rdquo; if it exists in a profile, falling back to &ldquo;given-names&rdquo; + &ldquo;family-name&rdquo;</li>
<li>Also, I added color-coded output to the debug messages and added a &ldquo;quiet&rdquo; mode that suppresses the normal behavior of printing results to the screen</li>
<li>I&rsquo;m using this as the test input for <code>resolve-orcids.py</code>:</li>
</ul>
<pre><code>$ cat orcid-test-values.txt
<pre tabindex="0"><code>$ cat orcid-test-values.txt
# valid identifier with 'given-names' and 'family-name'
0000-0001-5019-1368
@ -770,7 +770,7 @@ TypeError: 'NoneType' object is not subscriptable
<li>It looks like Sisay restarted Tomcat because I was offline</li>
<li>There was absolutely nothing interesting going on at 13:00 on the server, WTF?</li>
</ul>
<pre><code># cat /var/log/nginx/*.log | grep -E &quot;22/Feb/2018:13&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># cat /var/log/nginx/*.log | grep -E &quot;22/Feb/2018:13&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
55 192.99.39.235
60 207.46.13.26
62 40.77.167.38
@ -784,7 +784,7 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>Otherwise there was pretty normal traffic the rest of the day:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Feb/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
839 216.244.66.245
1074 68.180.228.117
1114 157.55.39.100
@ -798,7 +798,7 @@ TypeError: 'NoneType' object is not subscriptable
</code></pre><ul>
<li>So I don&rsquo;t see any definite cause for this crash, I see a shit ton of abandoned PostgreSQL connections today around 1PM!</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
729
# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
519
@ -807,7 +807,7 @@ TypeError: 'NoneType' object is not subscriptable
<li>Abandoned connections are not a cause but a symptom, though perhaps something more like a few minutes is better?</li>
<li>Also, while looking at the logs I see some new bot:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
</code></pre><ul>
<li>It seems to re-use its user agent but makes tons of useless requests, and I wonder if I should add &ldquo;.*spider.*&rdquo; to the Tomcat Crawler Session Manager valve (see the example configuration below)?</li>
</ul>
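<ul>
<li>For reference, that valve is configured in Tomcat&rsquo;s <code>server.xml</code> with a <code>crawlerUserAgents</code> regex, so the change would be something along these lines (a sketch, not our exact configuration):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*spider.*&quot; /&gt;
</code></pre>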
@ -820,19 +820,19 @@ TypeError: 'NoneType' object is not subscriptable
<li>A few days ago Abenet sent me the list of ORCID iDs from CCAFS</li>
<li>We currently have 988 unique identifiers:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
988
</code></pre><ul>
<li>After adding the ones from CCAFS we now have 1004:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1004
</code></pre><ul>
<li>I will add them to DSpace Test but Abenet says she&rsquo;s still waiting to send us ILRI&rsquo;s list</li>
<li>I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later</li>
<li>While regenerating the names for these ORCID identifiers I saw <a href="https://pub.orcid.org/v2.1/0000-0002-2614-426X/person">one that has a weird value for its names</a>:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0002-2614-426X
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0002-2614-426X
Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
</code></pre><ul>
<li>I don&rsquo;t know if the user accidentally entered this as their name or if that&rsquo;s how ORCID behaves when the name is private?</li>
@ -843,7 +843,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace</li>
<li>We have over 60,000 unique author + authority combinations on CGSpace:</li>
</ul>
<pre><code>dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
<pre tabindex="0"><code>dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
count
-------
62464
@ -853,7 +853,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>The query in Solr would simply be <code>orcid_id:*</code></li>
<li>Assuming I know that authority record with <code>id:d7ef744b-bbd4-4171-b449-00e37e1b776f</code>, then I could query PostgreSQL for all metadata records using that authority:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
2726830 | 77710 | 3 | Rodríguez Chalarca, Jairo | | 2 | d7ef744b-bbd4-4171-b449-00e37e1b776f | 600 | 2
@ -862,13 +862,13 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
<li>Then I suppose I can use the <code>resource_id</code> to identify the item?</li>
<li>Actually, <code>resource_id</code> is the same id we use in CSV, so I could simply build something like this for a metadata import!</li>
</ul>
<pre><code>id,cg.creator.id
<pre tabindex="0"><code>id,cg.creator.id
93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
</code></pre><ul>
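<li>The Solr query above can be tried against the authority core directly, for example with httpie (assuming the core is at <code>localhost:8081/solr/authority</code> like on CGSpace):</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/authority/select?q=orcid_id:*&amp;wt=json&amp;indent=true&amp;rows=2'
</code></pre><ul>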
<li>I just discovered that <a href="https://requests-cache.readthedocs.io">requests-cache</a> can transparently cache HTTP requests</li>
<li>Running <code>resolve-orcids.py</code> with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!</li>
</ul>
<pre><code>$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
<pre tabindex="0"><code>$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
Ali Ramadhan: 0000-0001-5019-1368
Alan S. Orth: 0000-0002-1735-7458
Ibrahim Mohammed: 0000-0001-5199-5528
@ -896,7 +896,7 @@ Nor Azwadi: 0000-0001-9634-1958
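<ul>
<li>Enabling requests-cache is just a couple of lines at the top of the script (a sketch; the cache name here is arbitrary):</li>
</ul>
<pre tabindex="0"><code>import requests
import requests_cache

# subsequent requests.get() calls are cached transparently in a local SQLite file
requests_cache.install_cache('requests-cache')
</code></pre><ul>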
<li>I need to see which SQL queries are run during that time</li>
<li>And only a few hours after I disabled the <code>removeAbandoned</code> thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
279 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle in transaction&quot;
@ -905,7 +905,7 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle
<li>So I&rsquo;m re-enabling the <code>removeAbandoned</code> setting</li>
<li>I grabbed a snapshot of the active connections in <code>pg_stat_activity</code> for all queries running longer than 2 minutes:</li>
</ul>
<pre><code>dspace=# \copy (SELECT now() - query_start as &quot;runtime&quot;, application_name, usename, datname, waiting, state, query
<pre tabindex="0"><code>dspace=# \copy (SELECT now() - query_start as &quot;runtime&quot;, application_name, usename, datname, waiting, state, query
FROM pg_stat_activity
WHERE now() - query_start &gt; '2 minutes'::interval
ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
@ -913,11 +913,11 @@ COPY 263
</code></pre><ul>
<li>100 of these idle in transaction connections are the following query:</li>
</ul>
<pre><code>SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
<pre tabindex="0"><code>SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
</code></pre><ul>
<li>&hellip; but according to the <a href="https://www.postgresql.org/docs/9.5/static/view-pg-locks.html">pg_locks documentation</a> I should have done this to correlate the locks with the activity:</li>
</ul>
<pre><code>SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
<pre tabindex="0"><code>SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
</code></pre><ul>
<li>Tom Desair from Atmire shared some extra JDBC pool parameters that might be useful on my thread on the dspace-tech mailing list:
<ul>
@ -936,7 +936,7 @@ COPY 263
<li>CGSpace crashed today, the first HTTP 499 in nginx&rsquo;s access.log was around 09:12</li>
<li>There&rsquo;s nothing interesting going on in nginx&rsquo;s logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Feb/2018:09:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Feb/2018:09:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
65 197.210.168.174
74 213.55.99.121
74 66.249.66.90
@ -950,12 +950,12 @@ COPY 263
</code></pre><ul>
<li>Looking in dspace.log-2018-02-28 I see this, though:</li>
</ul>
<pre><code>2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Memory issues seem to be common this month:</li>
</ul>
<pre><code>$ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-*
<pre tabindex="0"><code>$ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
@ -987,7 +987,7 @@ dspace.log.2018-02-28:1
</code></pre><ul>
<li>Top ten users by session during the first twenty minutes of 9AM:</li>
</ul>
<pre><code>$ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code>$ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
@ -1006,13 +1006,13 @@ dspace.log.2018-02-28:1
<li>I think I&rsquo;ll increase the JVM heap size on CGSpace from 6144m to 8192m because I&rsquo;m sick of this random crashing shit and the server has memory and I&rsquo;d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work</li>
<li>Run the few corrections from earlier this month for sponsor on CGSpace:</li>
</ul>
<pre><code>cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
<pre tabindex="0"><code>cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
UPDATE 3
</code></pre><ul>
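<li>For the record, the heap size change above is just the <code>-Xms</code>/<code>-Xmx</code> values in Tomcat&rsquo;s <code>JAVA_OPTS</code> (on Ubuntu&rsquo;s tomcat7 package that lives in <code>/etc/default/tomcat7</code>; the other flags shown here are only illustrative):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&quot;-Djava.awt.headless=true -Xms8192m -Xmx8192m -Dfile.encoding=UTF-8&quot;
</code></pre><ul>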
<li>I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)</li>
<li>Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:</li>
</ul>
<pre><code>dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
<pre tabindex="0"><code>dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
</code></pre><ul>
<li>I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process</li>
<li>Afterwards I looked a few times and saw only 150 or 200 locks</li>

View File

@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -122,7 +122,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
<li>There were some records using a non-breaking space in their AGROVOC subject field</li>
<li>I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
</code></pre><ul>
<li>This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character</li>
@ -132,16 +132,16 @@ $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u d
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
</ul>
<pre><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
<pre tabindex="0"><code>$ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
</code></pre><ul>
<li>I ran the DSpace cleanup script on CGSpace and it threw an error (as always):</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(150659) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
UPDATE 1
</code></pre><ul>
<li>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</li>
@ -159,7 +159,7 @@ UPDATE 1
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix, or at least normalize, them in the database:</li>
</ul>
<pre><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
<pre tabindex="0"><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
@ -199,7 +199,7 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang &ldquo;en&rdquo; so that&rsquo;s probably why there are over 100,000 fields changed&hellip;</li>
<li>If I skip that, there are about 2,000, which seems like a more reasonable number of fields for users to have edited manually or fucked up during CSV import, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
</code></pre><ul>
<li>I will apply this on CGSpace right now</li>
@ -207,18 +207,18 @@ UPDATE 2309
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>
<pre><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
<pre tabindex="0"><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
</code></pre><ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre><code>if(isBlank(value), &quot;Hernan Ceballos: 0000-0002-8744-7918&quot;, value + &quot;||Hernan Ceballos: 0000-0002-8744-7918&quot;)
<pre tabindex="0"><code>if(isBlank(value), &quot;Hernan Ceballos: 0000-0002-8744-7918&quot;, value + &quot;||Hernan Ceballos: 0000-0002-8744-7918&quot;)
</code></pre><ul>
<li>One thing that bothers me is that this won&rsquo;t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py </a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Orth, Alan&quot;,Alan S. Orth: 0000-0002-1735-7458
&quot;Orth, A.&quot;,Alan S. Orth: 0000-0002-1735-7458
</code></pre><ul>
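<li>The idea behind the script is roughly the following (a sketch only, not the actual code, assuming <code>psycopg2</code> and the metadata field IDs used elsewhere in these notes: <code>dc.contributor.author</code> is 3 and <code>cg.creator.id</code> is 240):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Rough sketch of the idea, not the actual add-orcid-identifiers-csv.py:
# for each author name in the CSV, add a matching cg.creator.id value to the
# items that have that author, re-using the author row's place to keep order.
import csv
import psycopg2

conn = psycopg2.connect('dbname=dspace user=dspace password=fuuu host=localhost')
cursor = conn.cursor()

with open('orcid-ids.csv') as csvfile:  # hypothetical input file
    for row in csv.DictReader(csvfile):
        author = row['dc.contributor.author']
        creator = row['cg.creator.id']
        # items (resource_id) with this exact author name (field 3)
        cursor.execute('SELECT resource_id, place FROM metadatavalue '
                       'WHERE resource_type_id=2 AND metadata_field_id=3 '
                       'AND text_value=%s', (author,))
        for resource_id, place in cursor.fetchall():
            # assumes the metadatavalue_seq sequence provides the primary key
            cursor.execute('INSERT INTO metadatavalue (metadata_value_id, '
                           'resource_id, metadata_field_id, text_value, '
                           'place, resource_type_id) VALUES '
                           '(nextval(\'metadatavalue_seq\'), %s, 240, %s, %s, 2)',
                           (resource_id, creator, place))

conn.commit()
</code></pre><ul>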
@ -236,7 +236,7 @@ UPDATE 2309
<li>Peter also wrote to say he is having issues with the Atmire Listings and Reports module</li>
<li>When I logged in to try it I get a blank white page after continuing and I see this in dspace.log.2018-03-11:</li>
</ul>
<pre><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
<pre tabindex="0"><code>2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
g/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
@ -282,7 +282,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
<ul>
<li>The error in the DSpace log is:</li>
</ul>
<pre><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
<pre tabindex="0"><code>org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
</code></pre><ul>
<li>The full error is here: <a href="https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca">https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca</a></li>
<li>If I do a report for &ldquo;Orth, Alan&rdquo; with the same custom layout it works!</li>
@ -295,16 +295,16 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I&rsquo;ll just fix it:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
</code></pre><ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
</code></pre><ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
</code></pre><ul>
<li>Create a pull request to update the input forms for the new CRP subject style (<a href="https://github.com/ilri/DSpace/pull/366">#366</a>)</li>
</ul>
@ -316,13 +316,13 @@ COPY 21
<li>CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat</li>
<li>Around that time there were an increase of SQL errors:</li>
</ul>
<pre><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
</code></pre><ul>
<li>But these errors, I don&rsquo;t even know what they mean, because a handful of them happen every day:</li>
</ul>
<pre><code>$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
<pre tabindex="0"><code>$ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
@ -336,7 +336,7 @@ dspace.log.2018-03-19:90
</code></pre><ul>
<li>There wasn&rsquo;t even a lot of traffic at the time (8–9 AM):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Mar/2018:0[89]:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Mar/2018:0[89]:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
@ -350,7 +350,7 @@ dspace.log.2018-03-19:90
</code></pre><ul>
<li>Well there is a hint in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
<pre tabindex="0"><code>Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>So someone was doing something heavy somehow&hellip; my guess is content and usage stats!</li>
@ -367,7 +367,7 @@ Exception in thread &quot;http-bio-127.0.0.1-8081-exec-280&quot; java.lang.OutOf
<ul>
<li>DSpace Test has been down for a few hours with SQL and memory errors starting this morning:</li>
</ul>
<pre><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
@ -377,20 +377,20 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
<li>Abenet told me that one of Lance Robinson&rsquo;s ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
UPDATE 1
</code></pre><ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Run all system updates on CGSpace (linode18) and reboot the server</li>
<li>I started a full Discovery re-index on CGSpace because of the updated CRPs</li>
<li>I see this error in the DSpace log:</li>
</ul>
<pre><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &quot;dc_contributor_author&quot;.
<pre tabindex="0"><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field &quot;dc_contributor_author&quot;.
java.lang.IllegalArgumentException: No choices plugin was configured for field &quot;dc_contributor_author&quot;.
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
@ -427,28 +427,28 @@ java.lang.IllegalArgumentException: No choices plugin was configured for field
<li>Afterwards we&rsquo;ll want to do some batch tagging of ORCID identifiers to these names</li>
<li>CGSpace crashed again this afternoon, I&rsquo;m not sure of the cause but there are a lot of SQL errors in the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
</code></pre><ul>
<li>I have no idea why so many connections were abandoned this afternoon:</li>
</ul>
<pre><code># grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
268
</code></pre><ul>
<li>DSpace Test crashed again due to Java heap space, this is from the DSpace log:</li>
</ul>
<pre><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And this is from the Tomcat Catalina log:</li>
</ul>
<pre><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
<pre tabindex="0"><code>Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>But there are tons of heap space errors on DSpace Test actually:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
319
</code></pre><ul>
<li>I guess we need to give it more RAM because it now has CGSpace&rsquo;s large Solr core</li>
@ -457,7 +457,7 @@ java.lang.OutOfMemoryError: Java heap space
<li>Deploy the new JDBC driver on DSpace Test</li>
<li>I&rsquo;m also curious to see how long the <code>dspace index-discovery -b</code> takes on DSpace Test where the DSpace installation directory is on one of Linode&rsquo;s new block storage volumes</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 208m19.155s
user 8m39.138s
@ -470,7 +470,7 @@ sys 2m45.135s
<li>For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields</li>
<li>I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:</li>
</ul>
<pre><code>isNotNull(value.match(/.*\ufffd.*/))
<pre tabindex="0"><code>isNotNull(value.match(/.*\ufffd.*/))
</code></pre><ul>
<li>I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues</li>
</ul>
@ -489,11 +489,11 @@ sys 2m45.135s
<li>Looking at Peter&rsquo;s author corrections and trying to work out a way to find errors in OpenRefine easily</li>
<li>I can find all names that have acceptable characters using a GREL expression like:</li>
</ul>
<pre><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
<pre tabindex="0"><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
</code></pre><ul>
<li>But it&rsquo;s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
@ -502,7 +502,7 @@ sys 2m45.135s
</code></pre><ul>
<li>And here&rsquo;s one combined GREL expression to check for items marked to delete or check so I can flag them and export them to a separate CSV (though perhaps it&rsquo;s time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
@ -521,7 +521,7 @@ sys 2m45.135s
<p>Test the corrections and deletions locally, then run them on CGSpace:</p>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
</code></pre><ul>
<li>Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test</li>
@ -542,12 +542,12 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
<li>DSpace Test crashed due to heap space so I&rsquo;ve increased it from 4096m to 5120m</li>
<li>The error in Tomcat&rsquo;s <code>catalina.out</code> was:</li>
</ul>
<pre><code>Exception in thread &quot;RMI TCP Connection(idle)&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &quot;RMI TCP Connection(idle)&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Add ISI Journal (cg.isijournal) as an option in Atmire&rsquo;s Listing and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
<li>I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH

View File

@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -117,7 +117,7 @@ Catalina logs at least show some memory errors yesterday:
<li>I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
<pre><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
<pre tabindex="0"><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
@ -134,12 +134,12 @@ Exception in thread &quot;ContainerBackgroundProcessor[StandardEngine[Catalina]]
<li>Peter noticed that there were still some old CRP names on CGSpace, because I hadn&rsquo;t forced the Discovery index to be updated after I fixed the others last week</li>
<li>For completeness I re-ran the CRP corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
</code></pre><ul>
<li>Then started a full Discovery index:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 76m13.841s
@ -149,18 +149,18 @@ sys 2m2.498s
<li>Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>The CSV format of <code>jtohme-2018-04-04.csv</code> was:</li>
</ul>
<pre><code class="language-csv" data-lang="csv">dc.contributor.author,cg.creator.id
<pre tabindex="0"><code class="language-csv" data-lang="csv">dc.contributor.author,cg.creator.id
&quot;Tohme, Joseph M.&quot;,Joe Tohme: 0000-0003-2765-7101
</code></pre><ul>
<li>There was a quoting error in my CRP CSV and the replacements for <code>Forests, Trees and Agroforestry</code> got messed up</li>
<li>So I fixed them and had to re-index again!</li>
<li>I started preparing the git branch for the DSpace 5.5→5.8 upgrade:</li>
</ul>
<pre><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
</code></pre><ul>
@ -181,7 +181,7 @@ $ git rebase -i dspace-5.8
<li>Fix Sisay&rsquo;s sudo access on the new DSpace Test server (linode19)</li>
<li>The reindexing process on DSpace Test took <em>forever</em> yesterday:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 599m32.961s
user 9m3.947s
@ -193,7 +193,7 @@ sys 2m52.585s
<li>Help Peter with the GDPR compliance / reporting form for CGSpace</li>
<li>DSpace Test crashed due to memory issues again:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
</code></pre><ul>
<li>I ran all system updates on DSpace Test and rebooted it</li>
@ -205,7 +205,7 @@ sys 2m52.585s
<li>I got a notice that CGSpace CPU usage was very high this morning</li>
<li>Looking at the nginx logs, here are the top users today so far:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
@ -220,24 +220,24 @@ sys 2m52.585s
<li>45.5.186.2 is of course CIAT</li>
<li>95.108.181.88 appears to be Yandex:</li>
</ul>
<pre><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &quot;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&quot; 200 2638 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
<pre tabindex="0"><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] &quot;GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1&quot; 200 2638 &quot;-&quot; &quot;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot;
</code></pre><ul>
<li>And for some reason Yandex created a lot of Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
</code></pre><ul>
<li>70.32.83.92 appears to be some harvester we&rsquo;ve seen before, but on a new IP</li>
<li>They are not creating new Tomcat sessions so there is no problem there</li>
<li>178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
</code></pre><ul>
<li>I&rsquo;m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve</li>
<li>Let&rsquo;s try a manual request with and without their user agent:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -294,7 +294,7 @@ X-XSS-Protection: 1; mode=block
<ul>
<li>In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2018&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2018&quot;
2266594
real 0m13.658s
@ -303,25 +303,25 @@ sys 0m1.087s
</code></pre><ul>
<li>In other other news, the database cleanup script has an issue again:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(151626) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
UPDATE 1
</code></pre><ul>
<li>Looking at abandoned connections in Tomcat:</li>
</ul>
<pre><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
2115
</code></pre><ul>
<li>Apparently from these stacktraces we should be able to see which code is not closing connections properly</li>
<li>Here&rsquo;s a pretty good overview of days where we had database issues recently:</li>
</ul>
<pre><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
1 Feb 18, 2018
1 Feb 19, 2018
1 Feb 20, 2018
@ -356,7 +356,7 @@ UPDATE 1
<ul>
<li>DSpace Test (linode19) crashed again some time since yesterday:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
168
</code></pre><ul>
<li>I ran all system updates and rebooted the server</li>
@ -374,12 +374,12 @@ UPDATE 1
<ul>
<li>While testing an XMLUI patch for <a href="https://jira.duraspace.org/browse/DS-3883">DS-3883</a> I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:</li>
</ul>
<pre><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &quot;solr.authority.server&quot; property in the dspace.cfg
<pre tabindex="0"><code>2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check &quot;solr.authority.server&quot; property in the dspace.cfg
java.lang.NullPointerException
</code></pre><ul>
<li>I assume we need to remove <code>authority</code> from the consumers in <code>dspace/config/dspace.cfg</code>:</li>
</ul>
<pre><code>event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
<pre tabindex="0"><code>event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
</code></pre><ul>
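<li>Presumably the line then becomes something like this (the same consumers minus <code>authority</code>):</li>
</ul>
<pre tabindex="0"><code>event.dispatcher.default.consumers = versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
</code></pre><ul>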
<li>I see the same error on DSpace Test so this is definitely a problem</li>
<li>After disabling the authority consumer I no longer see the error</li>
@ -387,7 +387,7 @@ java.lang.NullPointerException
<li>File a ticket on DSpace&rsquo;s Jira for the <code>target=&quot;_blank&quot;</code> security and performance issue (<a href="https://jira.duraspace.org/browse/DS-3891">DS-3891</a>)</li>
<li>I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:</li>
</ul>
<pre><code>BUILD SUCCESSFUL
<pre tabindex="0"><code>BUILD SUCCESSFUL
Total time: 4 minutes 12 seconds
</code></pre><ul>
<li>The Linode block storage is much slower than the instance storage</li>
@ -404,7 +404,7 @@ Total time: 4 minutes 12 seconds
<li>They will need to use OpenSearch, but I can&rsquo;t remember all the parameters</li>
<li>Apparently search sort options for OpenSearch are in <code>dspace.cfg</code>:</li>
</ul>
<pre><code>webui.itemlist.sort-option.1 = title:dc.title:title
<pre tabindex="0"><code>webui.itemlist.sort-option.1 = title:dc.title:title
webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
webui.itemlist.sort-option.4 = type:dc.type:text
@ -422,27 +422,27 @@ webui.itemlist.sort-option.4 = type:dc.type:text
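<ul>
<li>So a query with an explicit sort might look something like this (a sketch with a hypothetical query; the <code>sort_by</code> number should correspond to the sort options above):</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/open-search/discover?query=subject:livestock&amp;sort_by=2&amp;order=DESC&amp;rpp=100&amp;format=atom'
</code></pre><ul>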
<li>They are missing the <code>order</code> parameter (ASC vs DESC)</li>
<li>I notice that DSpace Test has crashed again, due to memory:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
<pre tabindex="0"><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
178
</code></pre><ul>
<li>I will increase the JVM heap size from 5120M to 6144M, though we don&rsquo;t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace</li>
<li>Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats</li>
<li>I got a list of all the CIP collections manually and use the same query that I used in <a href="/cgspace-notes/2017-08">August, 2017</a>:</li>
</ul>
<pre><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
</code></pre><h2 id="2018-04-19">2018-04-19</h2>
<ul>
<li>Run updates on DSpace Test (linode19) and reboot the server</li>
<li>Also try deploying updated GeoLite database during ant update while re-deploying code:</li>
</ul>
<pre><code>$ ant update update_geolite clean_backups
<pre tabindex="0"><code>$ ant update update_geolite clean_backups
</code></pre><ul>
<li>I also re-deployed CGSpace (linode18) to make the ORCID search, authority cleanup, CCAFS project tag <code>PII-LAM_CSAGender</code> live</li>
<li>When re-deploying I also updated the GeoLite databases so I hope the country stats become more accurate&hellip;</li>
<li>After re-deployment I ran all system updates on the server and rebooted it</li>
<li>After the reboot I forced a reïndexing of the Discovery to populate the new ORCID index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 73m42.635s
user 8m15.885s
@ -456,21 +456,21 @@ sys 2m2.687s
<li>I confirm that it&rsquo;s just giving a white page around 4:16</li>
<li>The DSpace logs show that there are no database connections:</li>
</ul>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
<pre tabindex="0"><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
</code></pre><ul>
<li>And there have been shit tons of errors in the DSpace log (starting only 20 minutes ago, luckily):</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
<pre tabindex="0"><code># grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
32147
</code></pre><ul>
<li>I can&rsquo;t even log into PostgreSQL as the <code>postgres</code> user, WTF?</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
^C
</code></pre><ul>
<li>Here are the most active IPs today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
917 207.46.13.182
935 213.55.99.121
970 40.77.167.134
@ -484,7 +484,7 @@ sys 2m2.687s
</code></pre><ul>
<li>It doesn&rsquo;t even seem like there is a lot of traffic compared to the previous days:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Apr/2018&quot; | wc -l
74931
# zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E &quot;19/Apr/2018&quot; | wc -l
91073
@ -499,7 +499,7 @@ sys 2m2.687s
<li>Everything is back but I have no idea what caused this—I suspect something with the hosting provider</li>
<li>Also super weird, the last entry in the DSpace log file is from <code>2018-04-20 16:35:09</code>, and then immediately it goes to <code>2018-04-20 19:15:04</code> (three hours later!):</li>
</ul>
<pre><code>2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
<pre tabindex="0"><code>2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle
:0; lastwait:5000].
at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
@ -543,12 +543,12 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
<li>One other new thing I notice is that PostgreSQL 9.6 no longer uses <code>createuser</code> and <code>nocreateuser</code>, as those have actually meant <code>superuser</code> and <code>nosuperuser</code> and have been deprecated for <em>ten years</em></li>
<li>So when I&rsquo;m importing a CGSpace database dump I need to amend my notes to grant superuser permission to the user, rather than createuser:</li>
</ul>
<pre><code>$ psql dspacetest -c 'alter user dspacetest superuser;'
<pre tabindex="0"><code>$ psql dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
</code></pre><ul>
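<li>And presumably I should revoke the superuser bit again once the restore finishes:</li>
</ul>
<pre tabindex="0"><code>$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
</code></pre><ul>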
<li>There&rsquo;s another issue with Tomcat in Ubuntu 18.04:</li>
</ul>
<pre><code>25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
<pre tabindex="0"><code>25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)

View File

@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -175,7 +175,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>There are lots of errors on language, CRP, and even some encoding errors on abstract fields</li>
<li>I export them and include the hidden metadata fields like <code>dc.date.accessioned</code> so I can filter the ones from 2018-04 and correct them in Open Refine:</li>
</ul>
<pre><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
<pre tabindex="0"><code>$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><ul>
<li>Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script and merge them into our controlled vocabulary</li>
<li>On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)</li>
@ -185,7 +185,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>Fixing the IITA records from Sisay: sixty DOIs have a completely invalid format like <code>http:dx.doi.org10.1016j.cropro.2008.07.003</code></li>
<li>I corrected all the DOIs and then checked them for validity with a quick bash loop:</li>
</ul>
<pre><code>$ for line in $(&lt; /tmp/links.txt); do echo $line; http --print h $line; done
<pre tabindex="0"><code>$ for line in $(&lt; /tmp/links.txt); do echo $line; http --print h $line; done
</code></pre><ul>
<li>Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher&rsquo;s site, so&hellip;</li>
<li>Also, there are some duplicates:
@ -205,7 +205,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: <code>’</code> (0x2019), <code>·</code> (0x00b7), and <code>€</code> (0x20ac)</li>
<li>A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc.:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
<li>I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!</li>
<li>Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I could use it with <a href="https://github.com/okfn/reconcile-csv">reconcile-csv</a> or to populate a Solr instance for reconciliation</li>
<li>This XPath expression gets close, but outputs all items on one line:</li>
</ul>
<pre><code>$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
</code></pre><ul>
<li>Maybe <code>xmlstarlet</code> is better:</li>
</ul>
<pre><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/text()' dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name=&quot;crpsubject&quot;]/pair/stored-value/text()' dspace/config/input-forms.xml
Agriculture for Nutrition and Health
Big Data
Climate Change, Agriculture and Food Security
@ -275,7 +275,7 @@ Livestock and Fish
<li>I told them to get all <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_35697">CIAT records via OAI</a></li>
<li>Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:</li>
</ul>
<pre><code>$ lein run /tmp/crps.csv name id
<pre tabindex="0"><code>$ lein run /tmp/crps.csv name id
</code></pre><ul>
<li>I tried to reconcile against a CSV of our countries but reconcile-csv crashes</li>
</ul>
@ -310,7 +310,7 @@ Livestock and Fish
<li>Also, I learned how to do something cool with Jython expressions in OpenRefine</li>
<li>This will fetch a URL and return its HTTP response code:</li>
</ul>
<pre><code>import urllib2
<pre tabindex="0"><code>import urllib2
import re
pattern = re.compile('.*10.1016.*')
@ -329,24 +329,24 @@ return &quot;blank&quot;
<li>I was checking the CIFOR data for duplicates using Atmire&rsquo;s Metadata Quality Module (and found some duplicates actually), but then DSpace died&hellip;</li>
<li>I didn&rsquo;t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in <code>dmesg -T</code>:</li>
</ul>
<pre><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
<pre tabindex="0"><code>[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
[Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>So the Linux kernel killed Java&hellip;</li>
<li>Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:</li>
</ul>
<pre><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
<pre tabindex="0"><code>Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
</code></pre><ul>
<li>Looking in the DSpace log I see something related:</li>
</ul>
<pre><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
<pre tabindex="0"><code>2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
</code></pre><ul>
<li>So I&rsquo;m not sure&hellip;</li>
<li>I finally figured out how to get OpenRefine to reconcile values from Solr via <a href="https://github.com/codeforkjeff/conciliator">conciliator</a>:</li>
<li>The trick was to use a more appropriate Solr fieldType <code>text_en</code> instead of <code>text_general</code> so that more terms match, for example uppercase and lower case:</li>
</ul>
<pre><code>$ ./bin/solr start
<pre tabindex="0"><code>$ ./bin/solr start
$ ./bin/solr create_core -c countries
$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-field&quot;: {&quot;name&quot;:&quot;country&quot;, &quot;type&quot;:&quot;text_en&quot;, &quot;multiValued&quot;:false, &quot;stored&quot;:true}}' http://localhost:8983/solr/countries/schema
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
@ -357,7 +357,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>I should probably make a general copy field and set it to be the default search field, like DSpace&rsquo;s search core does (see schema.xml):</li>
</ul>
<pre><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
<pre tabindex="0"><code>&lt;defaultSearchField&gt;search_text&lt;/defaultSearchField&gt;
...
&lt;copyField source=&quot;*&quot; dest=&quot;search_text&quot;/&gt;
</code></pre><ul>
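<li>If I go the Schema API route like I did for the country field above, I could presumably add the copy field with something like this (the default search field is a legacy <code>schema.xml</code> setting, so that part would still live in the schema or be handled with the <code>df</code> parameter):</li>
</ul>
<pre tabindex="0"><code>$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-field&quot;: {&quot;name&quot;:&quot;search_text&quot;, &quot;type&quot;:&quot;text_en&quot;, &quot;multiValued&quot;:true, &quot;stored&quot;:false}}' http://localhost:8983/solr/countries/schema
$ curl -X POST -H 'Content-type:application/json' --data-binary '{&quot;add-copy-field&quot;: {&quot;source&quot;:&quot;*&quot;, &quot;dest&quot;:&quot;search_text&quot;}}' http://localhost:8983/solr/countries/schema
</code></pre>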
@ -381,7 +381,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>I created and merged a pull request to fix the sorting issue in Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/374">#374</a>)</li>
<li>Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in <code>page-structure-alterations.xsl</code> to:</li>
</ul>
<pre><code>ga('send', 'pageview', {
<pre tabindex="0"><code>ga('send', 'pageview', {
'anonymizeIp': true
});
</code></pre><ul>
@ -439,7 +439,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<ul>
<li>I&rsquo;m investigating how many non-CGIAR users we have registered on CGSpace:</li>
</ul>
<pre><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
<pre tabindex="0"><code>dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
</code></pre><ul>
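<li>Since the question is really <em>how many</em>, the same query can also just count them (a quick sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(*) from eperson where email not like '%cgiar.org%' and email like '%@%';
</code></pre><ul>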
<li>We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers</li>
<li>I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with &ldquo;allow&rdquo; or &ldquo;dismiss&rdquo;</li>
@ -460,7 +460,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>DSpace Test crashed last night, seems to be related to system memory (not JVM heap)</li>
<li>I see this in <code>dmesg</code>:</li>
</ul>
<pre><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
<pre tabindex="0"><code>[Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
[Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
[Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
@ -471,7 +471,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
<li>I generated a list of CIFOR duplicates from the <code>CIFOR_May_9</code> collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika</li>
<li>I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each &ldquo;Item1&rdquo; line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):</li>
</ul>
<pre><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
<pre tabindex="0"><code>$ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html &gt; ~/cifor-duplicates.txt
$ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cleaned.txt
</code></pre><ul>
<li>I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR&rsquo;s collection</li>
@ -482,18 +482,18 @@ $ sed 's/.*Item1.*/\n&amp;/g' ~/cifor-duplicates.txt &gt; ~/cifor-duplicates-cle
<li>Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a></li>
<li>The output isn&rsquo;t great, but all the handles and IDs are printed in debug mode:</li>
</ul>
<pre><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2&gt; /tmp/ilri-collections.txt
<pre tabindex="0"><code>$ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2&gt; /tmp/ilri-collections.txt
</code></pre><ul>
<li>Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
</code></pre><h2 id="2018-05-31">2018-05-31</h2>
<ul>
<li>Clarify CGSpace&rsquo;s usage of Google Analytics and personally identifiable information during user registration for the Bioversity team, who had been asking about GDPR compliance</li>
<li>Testing running PostgreSQL in a Docker container on localhost because when I&rsquo;m on Arch Linux there isn&rsquo;t an easily installable package for particular PostgreSQL versions</li>
<li>Now I can just use Docker:</li>
</ul>
<pre><code>$ docker pull postgres:9.5-alpine
<pre tabindex="0"><code>$ docker pull postgres:9.5-alpine
$ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest

View File

@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -154,12 +154,12 @@ sys 2m7.289s
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -193,19 +193,19 @@ sys 2m7.289s
<li>I uploaded fixes for all those now, but I will continue with the rest of the data later</li>
<li>Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:</li>
</ul>
<pre><code>delete from schema_version where version = '5.6.2015.12.03.2';
<pre tabindex="0"><code>delete from schema_version where version = '5.6.2015.12.03.2';
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
</code></pre><ul>
<li>And then I need to ignore the ignored ones:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database migrate ignored
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
</code></pre><ul>
<li>Now DSpace starts up properly!</li>
<li>Gabriela from CIP got back to me about the author names we were correcting on CGSpace</li>
<li>I did a quick sanity check on them and then did a test import with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-value.py</code></a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>I will apply them on CGSpace tomorrow I think&hellip;</li>
</ul>
@ -220,7 +220,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
<li>I spent some time removing the Atmire Metadata Quality Module (MQM) from the proposed DSpace 5.8 changes</li>
<li>After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:</li>
</ul>
<pre><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
<pre tabindex="0"><code> INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
</code></pre><ul>
<li>I can fix this by commenting out the <code>ItemCollectionPlugin</code> line of <code>discovery.xml</code>, but from looking at the git log I&rsquo;m not actually sure if that is related to MQM or not</li>
@ -335,7 +335,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
</ul>
</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
value.contains('€'),
value.contains('6g'),
value.contains('6m'),
@ -357,24 +357,24 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara&rsquo;s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>The contents of <code>2018-06-13-Robin-Buruchara.csv</code> were:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Buruchara, Robin&quot;,Robin Buruchara: 0000-0003-0934-1218
&quot;Buruchara, Robin A.&quot;,Robin Buruchara: 0000-0003-0934-1218
</code></pre><ul>
<li>On a hunch I checked to see if CGSpace&rsquo;s bitstream cleanup was working properly and of course it&rsquo;s broken:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(152402) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>As always, the solution is to delete that ID manually in PostgreSQL:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
UPDATE 1
</code></pre><h2 id="2018-06-14">2018-06-14</h2>
<ul>
@ -387,7 +387,7 @@ UPDATE 1
<ul>
<li>I was restoring a PostgreSQL dump on my test machine and found a way to restore the CGSpace dump as the <code>postgres</code> user, but have the owner of the schema be the <code>dspacetest</code> user:</li>
</ul>
<pre><code>$ dropdb -h localhost -U postgres dspacetest
<pre tabindex="0"><code>$ dropdb -h localhost -U postgres dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
@ -407,12 +407,12 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
<li>There is already a search filter for this field defined in <code>discovery.xml</code> but we aren&rsquo;t using it, so I quickly enabled and tested it, then merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/380">#380</a>)</li>
<li>Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:</li>
</ul>
<pre><code>Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
<pre tabindex="0"><code>Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
</code></pre><ul>
<li>It took me a while to figure out that this migration is for MQM, which I removed after Atmire&rsquo;s original advice about the migrations, so we actually need to delete this migration instead of updating it</li>
<li>So I need to make sure to run the following during the DSpace 5.8 upgrade:</li>
</ul>
<pre><code>-- Delete existing CUA 4 migration if it exists
<pre tabindex="0"><code>-- Delete existing CUA 4 migration if it exists
delete from schema_version where version = '5.6.2015.12.03.2';
-- Update version of CUA 4 migration
@ -423,18 +423,18 @@ delete from schema_version where version = '5.5.2015.12.03.3';
</code></pre><ul>
<li>After that you can run the migrations manually and then DSpace should work fine:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace database migrate ignored
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database migrate ignored
...
Done.
</code></pre><ul>
<li>Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis' items on CGSpace</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
</code></pre><ul>
<li>The contents of <code>2018-06-24-andy-jarvis-orcid.csv</code> were:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Jarvis, A.&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andy&quot;,Andy Jarvis: 0000-0001-6543-0798
&quot;Jarvis, Andrew&quot;,Andy Jarvis: 0000-0001-6543-0798
@ -444,7 +444,7 @@ Done.
<li>I removed both of those beans and did some simple tests to check item submission, media filtering of PDFs, and the REST API, but got a &ldquo;No matches for the query&rdquo; error when listing records in OAI</li>
<li>This warning appears in the DSpace log:</li>
</ul>
<pre><code>2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>2018-06-26 16:58:12,052 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>It&rsquo;s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting</li>
<li>Ah, I think I just need to run <code>dspace oai import</code></li>
@ -455,7 +455,7 @@ Done.
<li>I&rsquo;ll have to figure out how to separate those we&rsquo;re keeping, deleting, and mapping into CIFOR&rsquo;s archive collection</li>
<li>First, get the 62 deletes from Vika&rsquo;s file and remove them from the collection:</li>
</ul>
<pre><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-delete.txt
<pre tabindex="0"><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
@ -467,14 +467,14 @@ $ wc -l 10568-92904.csv
<li>This iterates over the handles for deletion and uses <code>sed</code> with an alternative pattern delimiter of &lsquo;#&rsquo; (which must be escaped), because the pattern itself contains a &lsquo;/&rsquo;</li>
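<li>That loop might look something like this (a sketch, re-using the handle list and the collection export from above):</li>
</ul>
<pre tabindex="0"><code>$ while read line; do sed -i &quot;\#$line#d&quot; 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
</code></pre><ul>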
<li>The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:</li>
</ul>
<pre><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-map.txt
<pre tabindex="0"><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
</code></pre><ul>
<li>I can either get them from the database, or programmatically export the metadata using <code>dspace metadata-export -i 10568/xxxxx</code>&hellip;</li>
<li>Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the <code>id</code> and <code>collection</code> columns using <a href="https://csvkit.readthedocs.io/">csvkit</a>:</li>
</ul>
<pre><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
<pre tabindex="0"><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
</code></pre><ul>
<li>Then I can use Open Refine to add the &ldquo;CIFOR Archive&rdquo; collection to the mappings</li>
@ -487,7 +487,7 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
<li>DSpace Test appears to have crashed last night</li>
<li>There is nothing in the Tomcat or DSpace logs, but I see the following in <code>dmesg -T</code>:</li>
</ul>
<pre><code>[Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
<pre tabindex="0"><code>[Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
[Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>

View File

@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -126,20 +126,20 @@ There is insufficient memory for the Java Runtime Environment to continue.
<ul>
<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre><ul>
<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
</ul>
<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre><ul>
<li>As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
<li>Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:</li>
</ul>
<pre><code>$ sudo su - postgres
<pre tabindex="0"><code>$ sudo su - postgres
$ psql dspace
...
dspace=# begin;
@ -171,13 +171,13 @@ $ dspace database migrate ignored
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
</code></pre><ul>
<li>I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:</li>
</ul>
<pre><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
<pre tabindex="0"><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
count
-------
785
@ -188,7 +188,7 @@ dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadat
</code></pre><ul>
<li>I think I should fix that as well as some other garbage values like &ldquo;test&rdquo; and &ldquo;dspace.ilri.org&rdquo; etc:</li>
</ul>
<pre><code>dspace=# begin;
<pre tabindex="0"><code>dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
UPDATE 785
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
@ -201,7 +201,7 @@ dspace=# commit;
</code></pre><ul>
<li>Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:</li>
</ul>
<pre><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
<pre tabindex="0"><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
@ -241,7 +241,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>It looks like I added Solr to the <code>backup_to_s3.sh</code> script, but that script is not even being used (<code>s3cmd</code> is run directly from root&rsquo;s crontab)</li>
<li>For now I have just initiated a manual S3 backup of the Solr data:</li>
</ul>
<pre><code># s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
<pre tabindex="0"><code># s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
</code></pre><ul>
<li>But I need to add this to cron!</li>
<li>I wonder if I should convert some of the cron jobs to systemd services / timers&hellip;</li>
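<li>For the record, the crontab entry would presumably look something like the other s3cmd jobs (the schedule and output redirection here are just placeholders):</li>
</ul>
<pre tabindex="0"><code># root's crontab, hypothetical schedule
0 4 * * * s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/ &gt; /dev/null 2&gt;&amp;1
</code></pre>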
@ -249,7 +249,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (<a href="https://github.com/ilri/DSpace/issues/384">#384</a>)</li>
<li>I regenerated the list of names for all our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
<pre tabindex="0"><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
</code></pre><ul>
<li>But after comparing to the existing list of names I didn&rsquo;t see much change, so I just ignored it</li>
@ -259,22 +259,22 @@ $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt
<li>Uptime Robot said that CGSpace was down for two minutes early this morning but I don&rsquo;t see anything in Tomcat logs or dmesg</li>
<li>Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat&rsquo;s <code>catalina.out</code>:</li>
</ul>
<pre><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-557&quot; java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>Exception in thread &quot;http-bio-127.0.0.1-8081-exec-557&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I&rsquo;m not sure if it&rsquo;s the same error, but I see this in DSpace&rsquo;s <code>solr.log</code>:</li>
</ul>
<pre><code>2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code>2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I see a strange error around that time in <code>dspace.log.2018-07-08</code>:</li>
</ul>
<pre><code>2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
<pre tabindex="0"><code>2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
</code></pre><ul>
<li>But not sure what caused that&hellip;</li>
<li>I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT</li>
<li>Looking in the nginx logs I see the top ten IP addresses active today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;09/Jul/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;09/Jul/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1691 40.77.167.84
1701 40.77.167.69
1718 50.116.102.77
@ -288,7 +288,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</code></pre><ul>
<li>Of those, <em>all</em> except <code>70.32.83.92</code> and <code>50.116.102.77</code> are <em>NOT</em> re-using their Tomcat sessions, for example from the XMLUI logs:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
4435
</code></pre><ul>
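<li>To check the rest of the top IPs in one go I could presumably just wrap that grep in a quick loop, something like:</li>
</ul>
<pre tabindex="0"><code>$ for ip in 50.116.102.77 70.32.83.92 95.108.181.88; do echo $ip; grep -c -E &quot;session_id=[A-Z0-9]{32}:ip_addr=$ip&quot; dspace.log.2018-07-09; done
</code></pre><ul>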
<li><code>95.108.181.88</code> appears to be Yandex, so I dunno why it&rsquo;s creating so many sessions, as its user agent should match Tomcat&rsquo;s Crawler Session Manager Valve</li>
@ -314,7 +314,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC</li>
<li>These are the top ten users in the last two hours:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Jul/2018:(11|12|13)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Jul/2018:(11|12|13)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
81 193.95.22.113
82 50.116.102.77
112 40.77.167.90
@ -328,7 +328,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</code></pre><ul>
<li>Looks like <code>213.139.52.250</code> is Moayad testing his new CGSpace visualization thing:</li>
</ul>
<pre><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &quot;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&quot; 200 53750 &quot;http://localhost:4200/&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&quot;
<pre tabindex="0"><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] &quot;GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0&quot; 200 53750 &quot;http://localhost:4200/&quot; &quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36&quot;
</code></pre><ul>
<li>He said there was a bug that caused his app to request a bunch of invalid URLs</li>
<li>I&rsquo;ll have to keep an eye on this and see how their platform evolves</li>
@ -349,7 +349,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM</li>
<li>Here are the top ten IPs from last night and this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;11/Jul/2018:22&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;11/Jul/2018:22&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
48 66.249.64.91
50 35.227.26.162
57 157.55.39.234
@ -377,7 +377,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>A brief Google search doesn&rsquo;t turn up any information about what this bot is, but lots of users are complaining about it</li>
<li>This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
@ -386,7 +386,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
</code></pre><ul>
<li>I think the problem is that, despite the bot requesting <code>robots.txt</code>, it almost exclusively requests dynamic pages from <code>/discover</code>:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep -o -E &quot;GET /(browse|discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;Pcore-HTTP&quot; | grep -o -E &quot;GET /(browse|discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
13364 GET /discover
993 GET /search-filter
804 GET /browse
@ -397,7 +397,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
<li>I&rsquo;ll also add it to Tomcat&rsquo;s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers, just in case</li>
<li>Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
COPY 4518
dspace=# \q
$ csvcut -c 1 &lt; /tmp/affiliations.csv &gt; /tmp/affiliations-1.csv
@ -408,7 +408,7 @@ $ csvcut -c 1 &lt; /tmp/affiliations.csv &gt; /tmp/affiliations-1.csv
<ul>
<li>Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
COPY 4518
</code></pre><h2 id="2018-07-15">2018-07-15</h2>
<ul>
@ -420,7 +420,7 @@ COPY 4518
<li>Altmetric help said that <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/82810">according to OAI that item is only in one department</a></li>
<li>I noticed that indeed there was only one collection listed, so I forced an OAI re-import on CGSpace:</li>
</ul>
<pre><code>$ dspace oai import -c
<pre tabindex="0"><code>$ dspace oai import -c
OAI 2.0 manager action started
Clearing index
Index cleared
@ -438,19 +438,19 @@ OAI 2.0 manager action ended. It took 697 seconds.
<li>I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change</li>
<li>ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1020
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1158
</code></pre><ul>
<li>I combined the two lists and regenerated the names for all of our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2018-07-15-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
</code></pre><ul>
<li>Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via <code>% !sort</code> and then checked the formatting with tidy:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>I will check with the CGSpace team to see if they want me to add these to CGSpace</li>
<li>Help Udana from WLE understand some Altmetrics concepts</li>
@ -465,7 +465,7 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
<li>For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1500 requests</li>
<li>In there I see two bots making about 750 requests each, and this one is probably Altmetric:</li>
</ul>
<pre><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&quot; 200 58653 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
<pre tabindex="0"><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1&quot; 200 58653 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////200 HTTP/1.1&quot; 200 67950 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
...
178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] &quot;GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////73900 HTTP/1.1&quot; 200 25049 &quot;-&quot; &quot;Apache-HttpClient/4.5.2 (Java/1.8.0_121)&quot;
@ -474,7 +474,7 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
<li>I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve&hellip; does OAI use Tomcat sessions?</li>
<li>Appears not:</li>
</ul>
<pre><code>$ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100'
<pre tabindex="0"><code>$ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100'
GET /oai/request?verb=ListRecords&amp;resumptionToken=oai_dc////100 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -511,7 +511,7 @@ X-XSS-Protection: 1; mode=block
<li>They say that it is a burden for them to capture the issue dates, so I cautioned them that this is for their own benefit and future posterity, and that everyone else on CGSpace manages to capture the issue dates!</li>
<li>For future reference, as I had previously noted in <a href="/cgspace-notes/2018-04/">2018-04</a>, sort options are configured in <code>dspace.cfg</code>, for example:</li>
</ul>
<pre><code>webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
<pre tabindex="0"><code>webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
</code></pre><ul>
<li>Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)</li>
<li>I tested the Atmire Listings and Reports (L&amp;R) module one last time on my local test environment with a new snapshot of CGSpace&rsquo;s database and re-generated Discovery index and it worked fine</li>
@ -523,7 +523,7 @@ X-XSS-Protection: 1; mode=block
<li>Still discussing dates with IWMI</li>
<li>I looked in the database to see the breakdown of date formats used in <code>dc.date.issued</code>, i.e. YYYY, YYYY-MM, or YYYY-MM-DD:</li>
</ul>
<pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
count
-------
53292
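-- (a sketch) the same query with an adjusted regex should cover the YYYY-MM and YYYY-MM-DD formats:
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';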

View File

@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -136,7 +136,7 @@ I ran all system updates on DSpace Test and rebooted it
<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>
<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
@ -161,7 +161,7 @@ I ran all system updates on DSpace Test and rebooted it
<ul>
<li>DSpace Test crashed again, and the only error I see is this in <code>dmesg</code>:</li>
</ul>
<pre><code>[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
<pre tabindex="0"><code>[Thu Aug 2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
[Thu Aug 2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?</li>
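<li>Next time it might be worth comparing the killed PID from dmesg with Tomcat&rsquo;s main PID to confirm which Java process it actually was, something like (assuming the service is named tomcat7):</li>
</ul>
<pre tabindex="0"><code># dmesg -T | grep -E 'Out of memory|Killed process'
# systemctl status tomcat7 | grep 'Main PID'
</code></pre>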
@ -179,13 +179,13 @@ I ran all system updates on DSpace Test and rebooted it
<li>I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors</li>
<li>Finally I did a test run with the <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897"><code>fix-metadata-value.py</code></a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2018-08-16">2018-08-16</h2>
<ul>
<li>Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv;
</code></pre><ul>
<li>Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month</li>
<li>I might need to overhaul the <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration</li>
@ -195,7 +195,7 @@ $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspac
<li>I will have to update my script to extract the ORCID identifier and search for that</li>
<li>Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:</li>
</ul>
<pre><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
<pre tabindex="0"><code>$ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
@ -209,7 +209,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>This is less obvious and more error prone with names like &ldquo;Peters&rdquo; where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>
<pre><code>Verchot, Louis
<pre tabindex="0"><code>Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
@ -220,7 +220,7 @@ Verchot, Louis V.
<li>I&rsquo;ll just tag them all with Louis Verchot&rsquo;s ORCID identifier&hellip;</li>
<li>In the end, I&rsquo;ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Campbell, Bruce&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, Bruce M.&quot;,Bruce M Campbell: 0000-0002-0123-4859
&quot;Campbell, B.M&quot;,Bruce M Campbell: 0000-0002-0123-4859
@ -251,13 +251,13 @@ Verchot, Louis V.
</code></pre><ul>
<li>The invocation would be:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>I notice that I should add the Unicode character 0x00b4 (´, the acute accent) to my list of invalid characters to look for in Open Refine, making the latest version of the GREL expression:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -268,12 +268,12 @@ Verchot, Louis V.
<li>This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><ul>
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 72m12.570s
@ -282,7 +282,7 @@ sys 2m2.461s
</code></pre><ul>
<li>And then on CGSpace:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 79m44.392s
@ -292,7 +292,7 @@ sys 2m20.248s
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
@ -300,7 +300,7 @@ sys 2m20.248s
<li>I don&rsquo;t even know how it&rsquo;s possible for the bot to use MORE sessions than total requests&hellip;</li>
<li>The user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre><ul>
<li>So I&rsquo;m thinking we should add &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager valve, as we already have &ldquo;bot&rdquo; that catches Googlebot, Bingbot, etc.</li>
</ul>
@ -325,7 +325,7 @@ sys 2m20.248s
<ul>
<li>Something must have happened, as the <code>mvn package</code> <em>always</em> takes about two hours now, stopping for a very long time near the end at this step:</li>
</ul>
<pre><code>[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
<pre tabindex="0"><code>[INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
</code></pre><ul>
<li>It&rsquo;s the same on DSpace Test, my local laptop, and CGSpace&hellip;</li>
<li>It wasn&rsquo;t this way before when I was constantly building the previous 5.8 branch with Atmire patches&hellip;</li>
@ -335,7 +335,7 @@ sys 2m20.248s
<li>That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, now I should try vanilla DSpace 5.8</li>
<li>I notice that the step this pauses at is:</li>
</ul>
<pre><code>[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
<pre tabindex="0"><code>[INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
</code></pre><ul>
<li>And I notice that Atmire changed something in the XMLUI module&rsquo;s <code>pom.xml</code> as part of the DSpace 5.8 changes, specifically to remove the exclude for <code>node_modules</code> in the <code>maven-war-plugin</code> step</li>
<li>This exclude is <em>present</em> in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!</li>
@ -352,23 +352,23 @@ sys 2m20.248s
<li>It appears that the web UI&rsquo;s upload interface <em>requires</em> you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the <code>collections</code> file inside each item in the bundle</li>
<li>I imported the CTA items on CGSpace for Sisay:</li>
</ul>
<pre><code>$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
<pre tabindex="0"><code>$ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
</code></pre><h2 id="2018-08-26">2018-08-26</h2>
<ul>
<li>Doing the DSpace 5.8 upgrade on CGSpace (linode18)</li>
<li>I already finished the Maven build, now I&rsquo;ll take a backup of the PostgreSQL database and do a database cleanup just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
$ dspace cleanup -v
</code></pre><ul>
<li>Now I can stop Tomcat and do the install:</li>
</ul>
<pre><code>$ cd dspace/target/dspace-installer
<pre tabindex="0"><code>$ cd dspace/target/dspace-installer
$ ant update clean_backups update_geolite
</code></pre><ul>
<li>After the successful Ant update I can run the database migrations:</li>
</ul>
<pre><code>$ psql dspace dspace
<pre tabindex="0"><code>$ psql dspace dspace
dspace=&gt; \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
@ -380,7 +380,7 @@ $ dspace database migrate ignored
</code></pre><ul>
<li>Then I&rsquo;ll run all system updates and reboot the server:</li>
</ul>
<pre><code>$ sudo su -
<pre tabindex="0"><code>$ sudo su -
# apt update &amp;&amp; apt full-upgrade
# apt clean &amp;&amp; apt autoclean &amp;&amp; apt autoremove
# reboot
@ -391,11 +391,11 @@ $ dspace database migrate ignored
<li>I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject <code>GENDER</code> or <code>GENDER POVERTY AND INSTITUTIONS</code>, and CRP <code>Water, Land and Ecosystems</code></li>
<li>Then I extracted the Handle links from the report so I could export each item&rsquo;s metadata as CSV</li>
</ul>
<pre><code>$ grep -o -E &quot;[0-9]{5}/[0-9]{0,5}&quot; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ grep -o -E &quot;[0-9]{5}/[0-9]{0,5}&quot; listings-export.txt &gt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>Then on the DSpace server I exported the metadata for each item one by one:</li>
</ul>
<pre><code>$ while read -r line; do dspace metadata-export -f &quot;/tmp/${line/\//-}.csv&quot; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
<pre tabindex="0"><code>$ while read -r line; do dspace metadata-export -f &quot;/tmp/${line/\//-}.csv&quot; -i $line; sleep 2; done &lt; /tmp/iwmi-gender-items.txt
</code></pre><ul>
<li>But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them</li>
<li>I&rsquo;m not sure how to proceed without writing some script to parse and join the CSVs, and I don&rsquo;t think it&rsquo;s worth my time</li>

View File

@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -123,7 +123,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<li>Also, I&rsquo;ll re-run the <code>postgresql</code> tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month</li>
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<pre><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
<pre tabindex="0"><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
@ -184,7 +184,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>Playing with <a href="https://github.com/eykhagen/strest">strest</a> to test the DSpace REST API programmatically</li>
<li>For example, given this <code>test.yaml</code>:</li>
</ul>
<pre><code>version: 1
<pre tabindex="0"><code>version: 1
requests:
test:
@ -217,19 +217,19 @@ requests:
<li>We could eventually use this to test sanity of the API for creating collections etc</li>
<li>A user is getting an error in her workflow:</li>
</ul>
<pre><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
<pre tabindex="0"><code>2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step:
org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
</code></pre><ul>
<li>Seems to be during submit step, because it&rsquo;s workflow step 1&hellip;?</li>
<li>Move some top-level CRP communities to be below the new <a href="https://cgspace.cgiar.org/handle/10568/97114">CGIAR Research Programs and Platforms</a> community:</li>
</ul>
<pre><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
<pre tabindex="0"><code>$ dspace community-filiator --set -p 10568/97114 -c 10568/51670
$ dspace community-filiator --set -p 10568/97114 -c 10568/35409
$ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre><ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
<pre tabindex="0"><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
UPDATE 23
@ -246,7 +246,7 @@ UPDATE 15
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I see it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
@ -260,17 +260,17 @@ UPDATE 15
</code></pre><ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
<pre tabindex="0"><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
14133
</code></pre><ul>
<li>The user agent is still the same:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre><ul>
<li>I added <code>.*crawl.*</code> to the Tomcat Session Crawler Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -300,7 +300,7 @@ X-XSS-Protection: 1; mode=block
<li>Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more</li>
<li>Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:</li>
</ul>
<pre><code>$ sudo docker volume create --name dspacetest_data
<pre tabindex="0"><code>$ sudo docker volume create --name dspacetest_data
$ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
<li>Sisay is still having problems with the controlled vocabulary for top authors</li>
@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
@ -333,7 +333,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
</code></pre><ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
2
@ -343,7 +343,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>I said no, but that we might be able to piggyback on the Atmire statlet REST API</li>
<li>For example, when you expand the &ldquo;statlet&rdquo; at the bottom of an item like <a href="https://cgspace.cgiar.org/handle/10568/97103">10568/97103</a> you can see the following request in the browser console:</li>
</ul>
<pre><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
<pre tabindex="0"><code>https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
</code></pre><ul>
<li>That JSON file has the total page views and item downloads for the item&hellip;</li>
<li>Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds</li>
@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>There are some example queries on the <a href="https://wiki.lyrasis.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630">10568/10630</a>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
</code></pre><ul>
<li>The id in the Solr query is the item&rsquo;s internal database id, which you can get from the REST API (see the example after the next query)</li>
<li>Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre><ul>
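<li>For reference, one way to look up that internal id is the REST API&rsquo;s handle endpoint, shown here only as a sketch using the Handle from above (the returned JSON includes an <code>id</code> field):</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/rest/handle/10568/10630'
</code></pre><ul>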
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
</code></pre><ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre><ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
</code></pre><ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries</li>
@ -432,7 +432,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
<pre tabindex="0"><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
@ -443,7 +443,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Moayad from CodeObia asked if I could make the API able to paginate over all items, for example: /statistics?limit=100&amp;page=1</li>
<li>Getting all the item IDs from PostgreSQL is certainly easy:</li>
</ul>
<pre><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
<pre tabindex="0"><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
</code></pre><ul>
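<li>A paginated version of that query maps straight onto the proposed <code>limit</code> and <code>page</code> parameters, for example (a page size of 100 is just an illustration; this would be page 1):</li>
</ul>
<pre tabindex="0"><code>dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True order by item_id limit 100 offset 100;
</code></pre><ul>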
<li>The rest of the Falcon tooling will be more difficult&hellip;</li>
</ul>
@ -457,11 +457,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Contact Atmire to ask how we can buy more credits for future development (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=644">#644</a>)</li>
<li>I researched the Solr <code>filterCache</code> size and I found out that the formula for calculating the potential memory use of <strong>each entry</strong> in the cache is:</li>
</ul>
<pre><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
<pre tabindex="0"><code>((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
</code></pre><ul>
<li>Which means that, for our statistics core with <em>149 million</em> documents, each entry in our <code>filterCache</code> would use 8.9 GB!</li>
</ul>
<pre><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
<pre tabindex="0"><code>((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
</code></pre><ul>
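<li>That arithmetic is easy to sanity check in the shell (integer division, same numbers as above):</li>
</ul>
<pre tabindex="0"><code>$ echo $(( (149374568 / 8 + 128) * 512 ))
9560037888
</code></pre><ul>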
<li>So I think we can forget about tuning this for now!</li>
<li><a href="http://lucene.472066.n3.nabble.com/Calculating-filterCache-size-td4142526.html">Discussion on the mailing list about <code>filterCache</code> size</a></li>
@ -495,7 +495,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Trying to figure out how to get item views and downloads from SQLite in a join</li>
<li>It appears SQLite doesn&rsquo;t support <code>FULL OUTER JOIN</code> so some people on StackOverflow have emulated it with <code>LEFT JOIN</code> and <code>UNION</code>:</li>
</ul>
<pre><code>&gt; SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
<pre tabindex="0"><code>&gt; SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
LEFT JOIN itemdownloads downloads USING(id)
UNION ALL
SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
@ -505,7 +505,7 @@ WHERE views.id IS NULL;
<li>This &ldquo;works&rdquo; but the resulting rows are kinda messy so I&rsquo;d have to do extra logic in Python</li>
<li>Maybe we can use one &ldquo;items&rdquo; table with default values and UPSERT (aka insert&hellip; on conflict &hellip; do update):</li>
</ul>
<pre><code>sqlite&gt; CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
<pre tabindex="0"><code>sqlite&gt; CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
sqlite&gt; INSERT INTO items(id, views) VALUES(0, 52);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 171);
sqlite&gt; INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UPDATE SET downloads=176;
@ -521,7 +521,7 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
<li>Ok this is hilarious, I manually downloaded the <a href="https://packages.ubuntu.com/cosmic/libsqlite3-0">libsqlite3 3.24.0 deb from Ubuntu 18.10 &ldquo;cosmic&rdquo;</a> and installed it in Ubuntu 16.04 and now the Python <code>indexer.py</code> works</li>
<li>This is definitely a dirty hack, but the packages we use that depend on <code>libsqlite3-0</code> in Ubuntu 16.04 are actually pretty few:</li>
</ul>
<pre><code># apt-cache rdepends --installed libsqlite3-0 | sort | uniq
<pre tabindex="0"><code># apt-cache rdepends --installed libsqlite3-0 | sort | uniq
gnupg2
libkrb5-26-heimdal
libnss3
@ -530,7 +530,7 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
</code></pre><ul>
<li>I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:</li>
</ul>
<pre><code># python3
<pre tabindex="0"><code># python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
@ -542,7 +542,7 @@ Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;licen
<li>I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2&hellip; hmmm.</li>
<li>For reference, creating a PostgreSQL database for testing this locally (though <code>indexer.py</code> will create the table):</li>
</ul>
<pre><code>$ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
<pre tabindex="0"><code>$ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
$ createuser -h localhost -U postgres --pwprompt dspacestatistics
$ psql -h localhost -U postgres dspacestatistics
dspacestatistics=&gt; CREATE TABLE IF NOT EXISTS items
@ -558,7 +558,7 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>DSpace Test currently has about 2,000,000 documents with <code>isBot:true</code> in its Solr statistics core, and the size on disk is 2GB (it&rsquo;s not much, but I have to test this somewhere!)</li>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics+Maintenance">DSpace 5.x Solr documentation</a> I can use <code>dspace stats-util -f</code>, so let&rsquo;s try it:</li>
</ul>
<pre><code>$ dspace stats-util -f
<pre tabindex="0"><code>$ dspace stats-util -f
</code></pre><ul>
<li>The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with <code>isBot:true</code></li>
<li>I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it&rsquo;s 201 instead of 2,000,000, and the statistics core is only 30MB now!</li>
@ -576,11 +576,11 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>According to the <a href="https://support.google.com/webmasters/answer/80553">Googlebot FAQ</a> the domain name in the reverse DNS lookup should contain either <code>googlebot.com</code> or <code>google.com</code></li>
<li>In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):</li>
</ul>
<pre><code>*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
<pre tabindex="0"><code>*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
</code></pre><ul>
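<li>The <code>dns</code> values can also be spot checked from the shell with a reverse lookup; the IP and hostname below only illustrate the documented Googlebot pattern and are not taken from our logs:</li>
</ul>
<pre tabindex="0"><code>$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
</code></pre><ul>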
<li>I translate that into a delete command using the <code>/update</code> handler:</li>
</ul>
<pre><code>http://localhost:8081/solr/statistics/update?commit=true&amp;stream.body=&lt;delete&gt;&lt;query&gt;*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false&lt;/query&gt;&lt;/delete&gt;
<pre tabindex="0"><code>http://localhost:8081/solr/statistics/update?commit=true&amp;stream.body=&lt;delete&gt;&lt;query&gt;*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false&lt;/query&gt;&lt;/delete&gt;
</code></pre><ul>
<li>And magically all those 81,000 documents are gone!</li>
<li>After a few hours the Solr statistics core is down to 44GB on CGSpace!</li>
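<li>The size on disk is easy to check with a simple du; the Solr core path below is only an example, not necessarily our exact layout:</li>
</ul>
<pre tabindex="0"><code># du -sh /home/cgspace.cgiar.org/solr/statistics
</code></pre><ul>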
@ -588,7 +588,7 @@ dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
<li>Basically, it turns out that using <code>facet.mincount=1</code> is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways</li>
<li>I deployed the new version on CGSpace and now it looks pretty good!</li>
</ul>
<pre><code>Indexing item views (page 28 of 753)
<pre tabindex="0"><code>Indexing item views (page 28 of 753)
...
Indexing item downloads (page 260 of 260)
</code></pre><ul>
@ -606,12 +606,12 @@ Indexing item downloads (page 260 of 260)
<li>I will have to keep an eye on that over the next few weeks to see if things stay as they are</li>
<li>I did a batch replacement of the access rights with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script on DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
</code></pre><ul>
<li>This changes &ldquo;Open Access&rdquo; to &ldquo;Unrestricted Access&rdquo; and &ldquo;Limited Access&rdquo; to &ldquo;Restricted Access&rdquo;</li>
<li>After that I did a full Discovery reindex:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 77m3.755s
user 7m39.785s
@ -629,7 +629,7 @@ sys 2m18.485s
<li>Linode emailed to say that CGSpace&rsquo;s (linode19) CPU load was high for a few hours last night</li>
<li>Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;26/Sep/2018:(19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;26/Sep/2018:(19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
295 34.218.226.147
296 66.249.64.95
350 157.55.39.185
@ -645,7 +645,7 @@ sys 2m18.485s
<li><code>68.6.87.12</code> is on Cox Communications in the US (?)</li>
<li>These hosts are not using proper user agents and are not re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
758
@ -659,12 +659,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
<li>Peter sent me a list of 43 author names to fix, but as usual it had some encoding errors like <code>Belalcázar, John</code> (I will tell him to stop trying to export as UTF-8 because it never seems to work)</li>
<li>I did batch replaces for both on CGSpace with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
</code></pre><ul>
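<li>As an aside, a quick way to check whether a CSV like Peter&rsquo;s is at least valid UTF-8 at the byte level is to round-trip it through iconv, which exits non-zero on bad byte sequences (it won&rsquo;t catch mojibake that happens to be valid UTF-8):</li>
</ul>
<pre tabindex="0"><code>$ iconv -f UTF-8 -t UTF-8 -o /dev/null 2018-09-29-fix-authors.csv
</code></pre><ul>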
<li>Afterwards I started a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Linode sent an alert that both CGSpace and DSpace Test were using high CPU for the last two hours</li>
<li>It seems to be Moayad trying to do the AReS explorer indexing</li>
@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
<li>Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc</li>
<li>I think I should just batch export and update all languages&hellip;</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
</code></pre><ul>
<li>Then I can simply delete the &ldquo;Other&rdquo; and &ldquo;other&rdquo; ones because that&rsquo;s not useful at all:</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
DELETE 6
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
DELETE 79
</code></pre><ul>
<li>Looking through the list I see some weird language codes like <code>gh</code>, so I checked out those items:</li>
</ul>
<pre><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
resource_id
-------------
94530
@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
</code></pre><ul>
<li>Those items are from Ghana, so the submitter apparently thought <code>gh</code> was a language&hellip; I can safely delete them:</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
DELETE 2
</code></pre><ul>
<li>The next issue would be <code>jn</code>:</li>
</ul>
<pre><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
resource_id
-------------
94001
@ -718,7 +718,7 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
<li>Those items are about Japan, so I will update them to be <code>ja</code></li>
<li>Other replacements:</li>
</ul>
<pre><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';

View File

@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -121,7 +121,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
<ul>
<li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
@ -135,18 +135,18 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
</code></pre><ul>
<li>Of those, about 20% were HTTP 500 responses (!):</li>
</ul>
<pre><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
<pre tabindex="0"><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
118927 200
31435 500
</code></pre><ul>
<li>I added Phil Thornton and Sonal Henson&rsquo;s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
<pre tabindex="0"><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
</code></pre><ul>
<li>I found a new corner case error that I need to check, given <em>and</em> family names deactivated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre><ul>
<li>It appears to be Jim Lorenzen&hellip; I need to check that later!</li>
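<li>When I get around to checking it, I can query the public ORCID API for that iD directly, for example the v2.1 <code>person</code> endpoint (shown only as a sketch):</li>
</ul>
<pre tabindex="0"><code>$ http 'https://pub.orcid.org/v2.1/0000-0001-7930-5752/person' Accept:application/json
</code></pre><ul>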
@ -154,7 +154,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li>
<li>It seems that Moayad is making quite a lot of requests today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
@ -169,29 +169,29 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it&rsquo;s MUCH faster than using Atmire CUA&rsquo;s internal &ldquo;restlet&rdquo; API</li>
<li>I don&rsquo;t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
</code></pre><ul>
<li>Suspiciously, it&rsquo;s only grabbing the CGIAR System Office community (handle prefix 10947):</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
<pre tabindex="0"><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
</code></pre><ul>
<li>The user agent is suspicious too:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre><ul>
<li>It&rsquo;s clearly a bot and it&rsquo;s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list</li>
<li>I looked in Solr&rsquo;s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)&hellip; hmmm</li>
<li>I tagged all of Sonal and Phil&rsquo;s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Henson, Sonal P.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Henson, S.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Thornton, P.K.&quot;,Philip Thornton: 0000-0002-1854-0182
@ -214,7 +214,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>So it&rsquo;s fixed, but I&rsquo;m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
</code></pre><ul>
<li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li>
@ -242,7 +242,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>Peter noticed that some recently added PDFs don&rsquo;t have thumbnails</li>
<li>When I tried to force them to be generated I got an error that I&rsquo;ve never seen before:</li>
</ul>
<pre><code>$ dspace filter-media -v -f -i 10568/97613
<pre tabindex="0"><code>$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
</code></pre><ul>
<li>I see there was an update to Ubuntu&rsquo;s ImageMagick on 2018-10-05, so maybe something changed or broke?</li>
@ -251,7 +251,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick account!)</li>
<li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li>
</ul>
<pre><code> &lt;!--&lt;policy domain=&quot;coder&quot; rights=&quot;none&quot; pattern=&quot;PDF&quot; /&gt;--&gt;
<pre tabindex="0"><code> &lt;!--&lt;policy domain=&quot;coder&quot; rights=&quot;none&quot; pattern=&quot;PDF&quot; /&gt;--&gt;
</code></pre><ul>
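<li>A quick way to confirm the policy change took effect is to run identify against any PDF; before the change it fails with the &ldquo;not authorized&rdquo; error shown above (the file name here is just an example):</li>
</ul>
<pre tabindex="0"><code>$ identify /tmp/test.pdf
</code></pre><ul>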
<li>This works, but I&rsquo;m not sure what ImageMagick&rsquo;s long-term plan is if they are going to disable ALL image formats&hellip;</li>
<li>I suppose I need to enable a workaround for this in Ansible?</li>
@ -261,7 +261,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>I emailed DuraSpace to update <a href="https://duraspace.org/registry/entry/4188/?gvid=178">our entry in their DSpace registry</a> (the data was still on DSpace 3, JSPUI, etc)</li>
<li>Generate a list of the top 1500 values for <code>dc.subject</code> so Sisay can start making a controlled vocabulary for it:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!</li>
@ -269,7 +269,7 @@ COPY 1500
<li>Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <code>&lt;meta&gt;</code> tags in their page header, and using &ldquo;dct:identifier&rdquo; property instead of &ldquo;dc:identifier&rdquo;</li>
<li>I re-created my local DSpace database container using <a href="https://github.com/containers/libpod">podman</a> instead of Docker:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
<pre tabindex="0"><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
@ -283,13 +283,13 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>I can pull the <code>docker.bintray.io/jfrog/artifactory-oss:latest</code> image, but not start it</li>
<li>I decided to use a Sonatype Nexus repository instead:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
<pre tabindex="0"><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
</code></pre><ul>
<li>With a few changes to my local Maven <code>settings.xml</code> it is working well</li>
<li>Generate a list of the top 10,000 authors for Peter Ballantyne to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
</code></pre><ul>
<li>CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections</li>
@ -301,7 +301,7 @@ COPY 10000
<li>Look through Peter&rsquo;s list of 746 author corrections in OpenRefine</li>
<li>I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -311,7 +311,7 @@ COPY 10000
</code></pre><ul>
<li>Then I exported and applied them on my local test server:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
</code></pre><ul>
<li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay&rsquo;s author controlled vocabulary</li>
</ul>
@ -321,7 +321,7 @@ COPY 10000
<li>Switch to the new CGIAR LDAP server on CGSpace, as it&rsquo;s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)</li>
<li>Apply Peter&rsquo;s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Run all system updates on CGSpace (linode19) and reboot the server</li>
<li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li>
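<li>For the record, checking and restarting the service with systemd:</li>
</ul>
<pre tabindex="0"><code># systemctl status dspace-handle-server
# systemctl restart dspace-handle-server
</code></pre><ul>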
@ -352,7 +352,7 @@ COPY 10000
</li>
<li>I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:</li>
</ul>
<pre><code>$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
<pre tabindex="0"><code>$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
@ -364,12 +364,12 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
<ul>
<li>Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
</code></pre><ul>
<li>Talking to the CodeObia guys about the REST API I started to wonder why it&rsquo;s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it</li>
<li>Interestingly, the speed doesn&rsquo;t get better after you request the same thing multiple times; it&rsquo;s consistently bad on both CGSpace and DSpace Test!</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
@ -389,7 +389,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?</li>
<li>I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
@ -399,7 +399,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
</code></pre><ul>
<li>If I make a request without the expands it is ten times faster:</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
@ -414,7 +414,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
<li>Most of them are from Bioversity, and I asked Maria for permission before updating them</li>
<li>I manually went through and looked at the existing values and updated them in several batches:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
@ -436,7 +436,7 @@ UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND
<li>Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server</li>
<li>IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script, and regenerated the controlled vocabulary:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt;
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt;
2018-10-17-orcids.txt
$ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -444,7 +444,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I also decided to add the ORCID identifiers that MEL had sent us a few months ago&hellip;</li>
<li>One problem I had with the <code>resolve-orcids.py</code> script is that one user seems to have disabled their profile data since we last updated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
<pre tabindex="0"><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre><ul>
<li>So I need to handle that situation in the script for sure, but I&rsquo;m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?</li>
@ -457,7 +457,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
<li>After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially</li>
<li>I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually <a href="https://wiki.postgresql.org/wiki/Using_pg_upgrade_on_Ubuntu/Debian">migrate from 9.5 to 9.6</a>:</li>
</ul>
<pre><code># su - postgres
<pre tabindex="0"><code># su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
$ exit
# systemctl start postgresql
@ -468,7 +468,7 @@ $ exit
<li>Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon</li>
<li>Looking at the nginx logs around that time I see the following IPs making the most requests:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Oct/2018:(12|13|14|15)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Oct/2018:(12|13|14|15)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
361 207.46.13.179
395 181.115.248.74
485 66.249.64.93
@ -487,7 +487,7 @@ $ exit
<li>I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace&rsquo;s Solr configuration is for 4.9</li>
<li>This means our existing Solr configuration doesn&rsquo;t run in Solr 5.5:</li>
</ul>
<pre><code>$ sudo docker pull solr:5
<pre tabindex="0"><code>$ sudo docker pull solr:5
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
$ sudo docker logs my_solr
...
@ -498,7 +498,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>Linode sent a message that the CPU usage was high on CGSpace (linode18) last night</li>
<li>According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Oct/2018:(14|15|16)&quot; | awk '{print $1}' | sort
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Oct/2018:(14|15|16)&quot; | awk '{print $1}' | sort
| uniq -c | sort -n | tail -n 10
249 207.46.13.179
250 157.55.39.173
@ -513,7 +513,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
</code></pre><ul>
<li>This bot is only using the XMLUI and it does <em>not</em> seem to be re-using its sessions:</li>
</ul>
<pre><code># grep -c 5.9.6.51 /var/log/nginx/*.log
<pre tabindex="0"><code># grep -c 5.9.6.51 /var/log/nginx/*.log
/var/log/nginx/access.log:9323
/var/log/nginx/error.log:0
/var/log/nginx/library-access.log:0
@ -525,7 +525,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
</code></pre><ul>
<li>Last month I added &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager Valve&rsquo;s regular expression matching, and it seems to be working for MegaIndex&rsquo;s user agent:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'&quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;'
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'&quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;'
</code></pre><ul>
<li>So I&rsquo;m not sure why this bot uses so many sessions. Is it because it requests very slowly?</li>
</ul>
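<ul>
<li>One way to check would be the same session-count grep used for the other bots (a sketch; the log file date is only an example):</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq | wc -l
</code></pre>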
@ -539,7 +539,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>Change <code>build.properties</code> to use HTTPS for Handles in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>We will still need to do a batch update of the <code>dc.identifier.uri</code> and other fields in the database:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
</code></pre><ul>
<li>While I was doing that I found two items using CGSpace URLs instead of handles in their <code>dc.identifier.uri</code> so I corrected those</li>
<li>I also found several items that had invalid characters or multiple Handles in some related URL field like <code>cg.link.reference</code> so I corrected those too</li>
@ -547,7 +547,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
<li>I deployed the changes on CGSpace, ran all system updates, and rebooted the server</li>
<li>Also, I updated all Handles in the database to use HTTPS:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
UPDATE 76608
</code></pre><ul>
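<li>A quick sanity check, using the same table and fields as the update above, to confirm that no plain-HTTP Handle URIs remain:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
</code></pre><ul>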
<li>Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem</li>
@ -560,14 +560,14 @@ UPDATE 76608
<li>I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace</li>
<li>Testing REST login and logout via httpie because Felix from Earlham says he&rsquo;s having issues:</li>
</ul>
<pre><code>$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
<pre tabindex="0"><code>$ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
acef8a4a-41f3-4392-b870-e873790f696b
$ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
</code></pre><ul>
<li>Also works via curl (login, check status, logout, check status):</li>
</ul>
<pre><code>$ curl -H &quot;Content-Type: application/json&quot; --data '{&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;, &quot;password&quot;:&quot;deposit&quot;}' https://dspacetest.cgiar.org/rest/login
<pre tabindex="0"><code>$ curl -H &quot;Content-Type: application/json&quot; --data '{&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;, &quot;password&quot;:&quot;deposit&quot;}' https://dspacetest.cgiar.org/rest/login
e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
$ curl -X GET -H &quot;Content-Type: application/json&quot; -H &quot;Accept: application/json&quot; -H &quot;rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot; https://dspacetest.cgiar.org/rest/status
{&quot;okay&quot;:true,&quot;authenticated&quot;:true,&quot;email&quot;:&quot;testdeposit@cgiar.org&quot;,&quot;fullname&quot;:&quot;Test deposit&quot;,&quot;token&quot;:&quot;e09fb5e1-72b0-4811-a2e5-5c1cd78293cc&quot;}

View File

@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@ Today these are the top 10 IPs:
<li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li>
<li>Today these are the top 10 IPs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1300 66.249.64.63
1384 35.237.175.180
1430 138.201.52.218
@ -148,22 +148,22 @@ Today these are the top 10 IPs:
<li><code>70.32.83.92</code> is well known, probably CCAFS or something, as it&rsquo;s only a few thousand requests and always to REST API</li>
<li><code>84.38.130.177</code> is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
</code></pre><ul>
<li>They at least seem to be re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
342
</code></pre><ul>
<li><code>50.116.102.77</code> is also a regular REST API user</li>
<li><code>40.77.167.175</code> and <code>207.46.13.156</code> seem to be Bing</li>
<li><code>138.201.52.218</code> seems to be on Hetzner in Germany, but is using this user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>And it doesn&rsquo;t seem they are re-using their Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
1243
</code></pre><ul>
<li>Ah, we&rsquo;ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day&hellip;</li>
@ -171,7 +171,7 @@ Today these are the top 10 IPs:
<li>Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth</li>
<li>Looking at the nginx logs again I see the following top ten IPs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1979 50.116.102.77
1980 35.237.175.180
2186 207.46.13.156
@ -185,11 +185,11 @@ Today these are the top 10 IPs:
</code></pre><ul>
<li><code>78.46.89.18</code> is new since I last checked a few hours ago, and it&rsquo;s from Hetzner with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>It&rsquo;s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
8449
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
1
@ -200,7 +200,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>I think it&rsquo;s reasonable for a human to click one of those links five or ten times a minute&hellip;</li>
<li>To contrast, <code>78.46.89.18</code> made about 300 requests per minute for a few hours today:</li>
</ul>
<pre><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
286 03/Nov/2018:18:02
287 03/Nov/2018:18:21
289 03/Nov/2018:18:23
@ -232,7 +232,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again</li>
<li>Here are the top ten IPs active so far this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1083 2a03:2880:11ff:2::face:b00c
1105 2a03:2880:11ff:d::face:b00c
1111 2a03:2880:11ff:f::face:b00c
@ -246,7 +246,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
</code></pre><ul>
<li><code>78.46.89.18</code> is back&hellip; and it is still actually re-using its Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
8765
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
1
@ -254,7 +254,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
<li><em>Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly</em></li>
<li>Also, now we have a ton of Facebook crawlers:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
905 2a03:2880:11ff:b::face:b00c
955 2a03:2880:11ff:5::face:b00c
965 2a03:2880:11ff:e::face:b00c
@ -275,18 +275,18 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>They are really making shit tons of requests:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
</code></pre><ul>
<li><em>Updated on 2018-12-04 to correct the grep command to accurately show the number of requests</em></li>
<li>Their user agent is:</li>
</ul>
<pre><code>facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
<pre tabindex="0"><code>facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
</code></pre><ul>
<li>I will add it to the Tomcat Crawler Session Manager valve</li>
<li>Later in the evening&hellip; ok, this Facebook bot is getting super annoying:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Nov/2018&quot; | grep &quot;2a03:2880:11ff:&quot; | awk '{print $1}' | sort | uniq -c | sort -n
1871 2a03:2880:11ff:3::face:b00c
1885 2a03:2880:11ff:b::face:b00c
1941 2a03:2880:11ff:8::face:b00c
@ -307,7 +307,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
</code></pre><ul>
<li>Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
15206
@ -315,7 +315,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>I think we still need to limit more of the dynamic pages, like the &ldquo;most popular&rdquo; country, item, and author pages</li>
<li>It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!</li>
</ul>
<pre><code># grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
<pre tabindex="0"><code># grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
7033
</code></pre><ul>
<li>I added the &ldquo;most-popular&rdquo; pages to the list that return <code>X-Robots-Tag: none</code> to try to inform bots not to index or follow those pages</li>
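<li>To verify, one could pull a &ldquo;most-popular&rdquo; URL out of the access log and check for the header (a sketch; the handle path in the second command is only an example):</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'GET [^ ]*most-popular/[^ ]*' /var/log/nginx/access.log | head -n 5
$ http --print h 'https://cgspace.cgiar.org/handle/10568/1/most-popular/item' | grep -i x-robots-tag
</code></pre><ul>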
@ -325,14 +325,14 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>I wrote a small Python script <a href="https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5">add-dc-rights.py</a> to add usage rights (<code>dc.rights</code>) to CGSpace items based on the CSV Hector gave me from MARLO:</li>
</ul>
<pre><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>The file <code>marlo.csv</code> was cleaned up and formatted in Open Refine</li>
<li>165 of the items in their 2017 data are from CGSpace!</li>
<li>I will add the data to CGSpace this week (done!)</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep -c &quot;2a03:2880:11ff:&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Nov/2018&quot; | grep -c &quot;2a03:2880:11ff:&quot;
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
29763
@ -350,7 +350,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>While I was updating the <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a> script I noticed it was using <code>expand=all</code> to get the collection and community IDs</li>
<li>I realized I actually only need <code>expand=collections,subCommunities</code>, and I wanted to see how much overhead the extra expands created so I did three runs of each:</li>
</ul>
<pre><code>$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
<pre tabindex="0"><code>$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
</code></pre><ul>
<li>Average time with all expands was 14.3 seconds, and 12.8 seconds with <code>collections,subCommunities</code>, so <strong>1.5 seconds difference</strong>!</li>
</ul>
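<ul>
<li>For reference, a simple way to collect the three timings for each variant (a sketch, using the same invocation as above):</li>
</ul>
<pre tabindex="0"><code>$ for run in 1 2 3; do time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest; done
</code></pre>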
@ -403,22 +403,22 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<ul>
<li>Testing corrections and deletions for AGROVOC (<code>dc.subject</code>) that Sisay and Peter were working on earlier this month:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
</code></pre><ul>
<li>Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
</code></pre><h2 id="2018-11-20">2018-11-20</h2>
<ul>
<li>The Discovery re-indexing on CGSpace never finished yesterday&hellip; the command died after six minutes</li>
<li>The <code>dspace.log.2018-11-19</code> shows this at the time:</li>
</ul>
<pre><code>2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
<pre tabindex="0"><code>2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
java.lang.IllegalStateException: DSpace kernel cannot be null
at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
@ -479,13 +479,13 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<li><a href="https://cgspace.cgiar.org/handle/10568/97709">This WLE item</a> is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the <a href="https://cgspace.cgiar.org/handle/10568/41888">WLE R4D Learning Series</a> collection on CGSpace for some reason, and therefore does not show up on the WLE publication website</li>
<li>I tried to remove that collection from Discovery and do a simple re-index:</li>
</ul>
<pre><code>$ dspace index-discovery -r 10568/41888
<pre tabindex="0"><code>$ dspace index-discovery -r 10568/41888
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
</code></pre><ul>
<li>&hellip; but the item still doesn&rsquo;t appear in the collection</li>
<li>Now I will try a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Ah, Marianne had set the item as private when she uploaded it, so it was still private</li>
<li>I made it public and now it shows up in the collection list</li>
@ -497,7 +497,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>Linode alerted me that the outbound traffic rate on CGSpace (linode19) was very high</li>
<li>The top users this morning are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Nov/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
229 46.101.86.248
261 66.249.64.61
447 66.249.64.59
@ -512,7 +512,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 is new and appears to be another CCAFS harvester</li>
<li>I think we might want to prune some old accounts from CGSpace, perhaps users who haven&rsquo;t logged in in the last two years would be a conservative bunch:</li>
</ul>
<pre><code>$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
<pre tabindex="0"><code>$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
409
$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
</code></pre><ul>

View File

@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -135,7 +135,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
<ul>
<li>The error when I try to manually run the media filter for one item from the command line:</li>
</ul>
<pre><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d&quot; &quot;-f/tmp/magick-129895Bmp44lvUfxo&quot; &quot;-f/tmp/magick-12989C0QFG51fktLF&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
@ -157,13 +157,13 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
<li>I think we need to wait for a fix from Ubuntu</li>
<li>For what it&rsquo;s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
zsh: segmentation fault (core dumped) gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
</code></pre><ul>
<li>When I replace the <code>pngalpha</code> device with <code>png16m</code> as suggested in the StackOverflow comments it works:</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
DEBUG: FC_WEIGHT didn't match
</code></pre><ul>
<li>Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (<a href="https://dspacetest.cgiar.org/handle/10568/108298">IITA_Dec_1_1997 aka Daniel1807</a>)
@ -182,7 +182,7 @@ DEBUG: FC_WEIGHT didn't match
</li>
<li>Expand my &ldquo;encoding error&rdquo; detection GREL to include <code>~</code> as I saw a lot of that in some copy pasted French text recently:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -196,29 +196,29 @@ DEBUG: FC_WEIGHT didn't match
<li>I can successfully generate a thumbnail for another recent item (<a href="https://hdl.handle.net/10568/98394">10568/98394</a>), but not for <a href="https://hdl.handle.net/10568/98390">10568/98930</a></li>
<li>Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the <code>pngalpha</code> device, I can generate a thumbnail for the first one (10568/98394):</li>
</ul>
<pre><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
<pre tabindex="0"><code>$ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
</code></pre><ul>
<li>So it seems to be something about the PDFs themselves, perhaps related to alpha support?</li>
<li>The first item (10568/98394) has the following information:</li>
</ul>
<pre><code>$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
<pre tabindex="0"><code>$ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=&gt;Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
</code></pre><ul>
<li>And wow, I can&rsquo;t even run ImageMagick&rsquo;s <code>identify</code> on the first page of the second item (10568/98930):</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
</code></pre><ul>
<li>But with GraphicsMagick&rsquo;s <code>identify</code> it works:</li>
</ul>
<pre><code>$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<pre tabindex="0"><code>$ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
DEBUG: FC_WEIGHT didn't match
Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
</code></pre><ul>
<li>Interesting that ImageMagick&rsquo;s <code>identify</code> <em>does</em> work if you do not specify a page, perhaps as <a href="https://bugs.ghostscript.com/show_bug.cgi?id=699815">alluded to in the recent Ghostscript bug report</a>:</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf
Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
@ -228,7 +228,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</code></pre><ul>
<li>As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):</li>
</ul>
<pre><code>$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
<pre tabindex="0"><code>$ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
zsh: abort (core dumped) convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
$ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
DEBUG: FC_WEIGHT didn't match
@ -236,7 +236,7 @@ DEBUG: FC_WEIGHT didn't match
<li>I inspected the troublesome PDF using <a href="http://jhove.openpreservation.org/">jhove</a> and noticed that it is using <code>ISO PDF/A-1, Level B</code> and the other one doesn&rsquo;t list a profile, though I don&rsquo;t think this is relevant</li>
<li>I found another item that fails when generating a thumbnail (<a href="https://hdl.handle.net/10568/98391">10568/98391</a>); DSpace complains:</li>
</ul>
<pre><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
<pre tabindex="0"><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `&quot;gs&quot; -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 &quot;-sDEVICE=pngalpha&quot; -dTextAlphaBits=4 -dGraphicsAlphaBits=4 &quot;-r72x72&quot; -dFirstPage=1 -dLastPage=1 &quot;-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d&quot; &quot;-f/tmp/magick-14296Q0rJjfCeIj3w&quot; &quot;-f/tmp/magick-14296k_K6MWqwvpDm&quot;' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
at org.im4java.core.Info.getBaseInfo(Info.java:360)
at org.im4java.core.Info.&lt;init&gt;(Info.java:151)
@ -265,16 +265,16 @@ Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `
</code></pre><ul>
<li>And on my Arch Linux environment ImageMagick&rsquo;s <code>convert</code> also segfaults:</li>
</ul>
<pre><code>$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
<pre tabindex="0"><code>$ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
zsh: abort (core dumped) convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] x60
</code></pre><ul>
<li>But GraphicsMagick&rsquo;s <code>convert</code> works:</li>
</ul>
<pre><code>$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
<pre tabindex="0"><code>$ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
</code></pre><ul>
<li>So far the only thing that stands out is that the two files that don&rsquo;t work were created with Microsoft Office 2016:</li>
</ul>
<pre><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
Creator: Microsoft® Word 2016
Producer: Microsoft® Word 2016
$ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
@ -283,13 +283,13 @@ Producer: Microsoft® Word 2016
</code></pre><ul>
<li>And the one that works was created with Office 365:</li>
</ul>
<pre><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
<pre tabindex="0"><code>$ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
Creator: Microsoft® Word for Office 365
Producer: Microsoft® Word for Office 365
</code></pre><ul>
<li>I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:</li>
</ul>
<pre><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
<pre tabindex="0"><code>$ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
$ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</code></pre><ul>
<li>I&rsquo;ve tried a few times this week to register for the <a href="https://www.evisa.gov.et/">Ethiopian eVisa website</a>, but it is never successful</li>
@ -304,7 +304,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</ul>
</li>
</ul>
<pre><code>2018-12-03 15:44:00,030 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
<pre tabindex="0"><code>2018-12-03 15:44:00,030 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
...
2018-12-03 15:45:01,667 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
@ -312,7 +312,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<li>I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about tag cloud, so it may be unrelated), and Listings and Reports still asks you to log in again, despite already being logged in via XMLUI, but it does appear to work (I generated a report and exported a PDF)</li>
<li>I think the errors about missing Atmire components must be important, here on my local machine as well (though not the one about atmire-listings-and-reports):</li>
</ul>
<pre><code>2018-12-03 16:44:00,009 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
<pre tabindex="0"><code>2018-12-03 16:44:00,009 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
</code></pre><ul>
<li>This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness&hellip;?</li>
</ul>
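<ul>
<li>A quick way to see which Atmire module configurations are actually present (a sketch; assumes the standard DSpace 5.x <code>config/modules</code> layout):</li>
</ul>
<pre tabindex="0"><code>$ ls -1 ~/dspace/config/modules/ | grep -i atmire
</code></pre>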
@ -320,7 +320,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<ul>
<li>Last night Linode sent a message that the load on CGSpace (linode18) was too high, here&rsquo;s a list of the top users at the time and throughout the day:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018:1(5|6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Dec/2018:1(5|6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
226 66.249.64.63
232 46.101.86.248
@ -345,30 +345,30 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
</code></pre><ul>
<li><code>35.237.175.180</code> is known to us (CCAFS?), and I&rsquo;ve already added it to the list of bot IPs in nginx, which appears to be working:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
630
</code></pre><ul>
<li>I haven&rsquo;t seen <code>2a01:4f8:140:3192::2</code> before. Its user agent is some new bot:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
419
</code></pre><ul>
<li><code>78.46.79.71</code> is another host on Hetzner with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>This is not the first time a host on Hetzner has used a &ldquo;normal&rdquo; user agent to make thousands of requests</li>
<li>At least it is re-using its Tomcat sessions somehow:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
1
@ -385,7 +385,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
<li>Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night</li>
<li>I looked in the logs and there&rsquo;s nothing particular going on:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;05/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1225 157.55.39.177
1240 207.46.13.12
1261 207.46.13.101
@ -399,11 +399,11 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
</code></pre><ul>
<li><code>54.70.40.11</code> is some new bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
<pre tabindex="0"><code>Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
</code></pre><ul>
<li>But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
6980
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
1156
@ -446,7 +446,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<li>Linode alerted me twice today that the load on CGSpace (linode18) was very high</li>
<li>Looking at the nginx logs I see a few new IPs in the top 10:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;17/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;17/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
927 157.55.39.81
975 54.70.40.11
2090 50.116.102.77
@ -460,7 +460,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
</code></pre><ul>
<li><code>94.71.244.172</code> and <code>143.233.227.216</code> are both in Greece and use the following user agent:</li>
</ul>
<pre><code>Mozilla/3.0 (compatible; Indy Library)
<pre tabindex="0"><code>Mozilla/3.0 (compatible; Indy Library)
</code></pre><ul>
<li>I see that I added this bot to the Tomcat Crawler Session Manager valve in 2017-12 so its XMLUI sessions are getting re-used</li>
<li><code>2a01:4f8:173:1e85::2</code> is some new bot called <code>BLEXBot/1.0</code> which should be matching the existing &ldquo;bot&rdquo; pattern in the Tomcat Crawler Session Manager regex</li>
@ -477,7 +477,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
<ul>
<li>Testing compression of PostgreSQL backups with xz and gzip:</li>
</ul>
<pre><code>$ time xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz
<pre tabindex="0"><code>$ time xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz
xz -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.xz 48.29s user 0.19s system 99% cpu 48.579 total
$ time gzip -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.gz
gzip -c cgspace_2018-12-19.backup &gt; cgspace_2018-12-19.backup.gz 2.78s user 0.09s system 99% cpu 2.899 total
@ -492,7 +492,7 @@ $ ls -lh cgspace_2018-12-19.backup*
<li>Peter asked if we could create a controlled vocabulary for publishers (<code>dc.publisher</code>)</li>
<li>I see we have about 3500 distinct publishers:</li>
</ul>
<pre><code># SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
<pre tabindex="0"><code># SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
count
-------
3522
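-- a sketch: export the full distinct publisher list to seed a controlled vocabulary (the output path is an example)
# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39 GROUP BY text_value ORDER BY count DESC) TO /tmp/2018-12-20-publishers.csv WITH CSV HEADER;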
@ -501,17 +501,17 @@ $ ls -lh cgspace_2018-12-19.backup*
<li>I reverted the metadata changes related to &ldquo;Unrestricted Access&rdquo; and &ldquo;Restricted Access&rdquo; on DSpace Test because we&rsquo;re not pushing forward with the new status terms for now</li>
<li>Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:</li>
</ul>
<pre><code># dpkg -P oracle-java8-installer oracle-java8-set-default
<pre tabindex="0"><code># dpkg -P oracle-java8-installer oracle-java8-set-default
</code></pre><ul>
<li>Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
Connected to database.
Fixed 466 occurences of: Copyrighted; Any re-use allowed
</code></pre><ul>
<li>Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:</li>
</ul>
<pre><code># apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
<pre tabindex="0"><code># apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
# pg_ctlcluster 9.5 main stop
# tar -cvzpf var-lib-postgresql-9.5.tar.gz /var/lib/postgresql/9.5
# tar -cvzpf etc-postgresql-9.5.tar.gz /etc/postgresql/9.5
@ -525,7 +525,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
<li>Run all system updates on CGSpace (linode18) and restart the server</li>
<li>Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
- Deleting bitstream information (ID: 158227)
- Deleting bitstream record from database (ID: 158227)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
@ -534,7 +534,7 @@ Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign k
</code></pre><ul>
<li>As always, the solution is to delete those IDs manually in PostgreSQL:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
UPDATE 1
</code></pre><ul>
<li>After all that I started a full Discovery reindex to get the index name changes and rights updates</li>
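<li>That full re-index is the same invocation used earlier:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>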
@ -544,7 +544,7 @@ UPDATE 1
<li>CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat</li>
<li>The top IP addresses as of this evening are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Dec/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
963 40.77.167.152
987 35.237.175.180
1062 40.77.167.55
@ -558,7 +558,7 @@ UPDATE 1
</code></pre><ul>
<li>And just around the time of the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &quot;29/Dec/2018:1(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E &quot;29/Dec/2018:1(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
115 66.249.66.223
118 207.46.13.14
123 34.218.226.147

View File

@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -141,7 +141,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -155,7 +155,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
</code></pre><ul>
<li>Analyzing the types of requests made by the top few IPs during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 54.70.40.11 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | grep 54.70.40.11 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
30 bitstream
534 discover
352 handle
@ -168,7 +168,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
<li>It&rsquo;s not clear to me what was causing the outbound traffic spike</li>
<li>Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):</li>
</ul>
<pre><code>Moving: 81742 into core statistics-2010
<pre tabindex="0"><code>Moving: 81742 into core statistics-2010
Moving: 1837285 into core statistics-2011
Moving: 3764612 into core statistics-2012
Moving: 4557946 into core statistics-2013
@ -185,7 +185,7 @@ Moving: 18497180 into core statistics-2018
<ul>
<li>Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:</li>
</ul>
<pre><code>$ sudo docker pull postgres:9.6-alpine
<pre tabindex="0"><code>$ sudo docker pull postgres:9.6-alpine
$ sudo docker rm dspacedb
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
@ -197,7 +197,7 @@ $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/d
</li>
<li>The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:</li>
</ul>
<pre><code>2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
<pre tabindex="0"><code>2019-01-03 14:45:21,727 INFO org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
2019-01-03 14:45:21,971 INFO org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
2019-01-03 14:45:22,115 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
-- Method: GET
@ -283,7 +283,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
<ul>
<li>Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don&rsquo;t see anything around that time in the web server logs:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Jan/2019:1(7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Jan/2019:1(7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
189 207.46.13.192
217 31.6.77.23
340 66.249.70.29
@ -298,7 +298,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
<li>I&rsquo;m thinking about trying to validate our <code>dc.subject</code> terms against <a href="http://aims.fao.org/agrovoc/webservices">AGROVOC webservices</a></li>
<li>There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for <code>SOIL</code>:</li>
</ul>
<pre><code>$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&amp;lang=en
<pre tabindex="0"><code>$ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&amp;lang=en
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
@ -345,7 +345,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
<li>The API does not appear to be case sensitive (searches for <code>SOIL</code> and <code>soil</code> return the same thing)</li>
<li>I&rsquo;m a bit confused that there&rsquo;s no obvious return code or status when a term is not found, for example <code>SOILS</code>:</li>
</ul>
<pre><code>HTTP/1.1 200 OK
<pre tabindex="0"><code>HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: Keep-Alive
Content-Length: 367
@ -381,7 +381,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
<li>I guess the <code>results</code> object will just be empty&hellip;</li>
<li>Another way would be to try with SPARQL, perhaps using the Python 2.7 <a href="https://pypi.org/project/sparql-client/">sparql-client</a>:</li>
</ul>
<pre><code>$ python2.7 -m virtualenv /tmp/sparql
<pre tabindex="0"><code>$ python2.7 -m virtualenv /tmp/sparql
$ . /tmp/sparql/bin/activate
$ pip install sparql-client ipython
$ ipython
@ -466,7 +466,7 @@ In [14]: for row in result.fetchone():
</li>
<li>I am testing the speed of the WorldFish DSpace repository&rsquo;s REST API and it&rsquo;s five to ten times faster than CGSpace as I tested in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre><code>$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
0.16s user 0.03s system 3% cpu 5.185 total
0.17s user 0.02s system 2% cpu 7.123 total
@ -474,7 +474,7 @@ In [14]: for row in result.fetchone():
</code></pre><ul>
<li>In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;14/Jan/2019:(17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;14/Jan/2019:(17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
157 31.6.77.23
192 54.70.40.11
202 66.249.64.157
@ -599,11 +599,11 @@ In [14]: for row in result.fetchone():
<ul>
<li>In the Solr admin UI I see the following error:</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>Looking in the Solr log I see this:</li>
</ul>
<pre><code>2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
<pre tabindex="0"><code>2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:873)
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:646)
@ -721,7 +721,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>For 2019-01 alone the Usage Stats are already around 1.2 million</li>
<li>I tried to look in the nginx logs to see how many raw requests there are so far this month and it&rsquo;s about 1.4 million:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
1442874
real 0m17.161s
@ -786,7 +786,7 @@ sys 0m2.396s
<ul>
<li>That&rsquo;s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:</li>
</ul>
<pre><code># w
<pre tabindex="0"><code># w
04:46:14 up 213 days, 7:25, 4 users, load average: 1.94, 1.50, 1.35
</code></pre><ul>
<li>I&rsquo;ve definitely rebooted it several times in the past few months&hellip; according to <code>journalctl -b</code> it was a few weeks ago on 2019-01-02</li>
@ -803,7 +803,7 @@ sys 0m2.396s
<li>Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04&rsquo;s Tomcat 8.5</li>
<li>I could either run with a simple <code>tomcat7.service</code> like this:</li>
</ul>
<pre><code>[Unit]
<pre tabindex="0"><code>[Unit]
Description=Apache Tomcat 7 Web Application Container
After=network.target
[Service]
@ -817,7 +817,7 @@ WantedBy=multi-user.target
</code></pre><ul>
<li>Or try to adapt a real systemd service like Arch Linux&rsquo;s:</li>
</ul>
<pre><code>[Unit]
<pre tabindex="0"><code>[Unit]
Description=Tomcat 7 servlet container
After=network.target
@ -859,7 +859,7 @@ WantedBy=multi-user.target
<li>I think I might manage this the same way I do the restic releases in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>, where I download a specific version and symlink to some generic location without the version number</li>
<li>I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;33&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot;&gt;
@ -868,7 +868,7 @@ $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&amp;rows=0&a
<li>I don&rsquo;t think the <a href="https://solrclient.readthedocs.io/en/latest/">SolrClient library</a> we are currently using supports this type of query, so we might have to just do raw queries with requests</li>
<li>The <a href="https://github.com/django-haystack/pysolr">pysolr</a> library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):</li>
</ul>
<pre><code>import pysolr
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
print(results.facets['facet_fields'])
@ -876,7 +876,7 @@ print(results.facets['facet_fields'])
</code></pre><ul>
<li>If I double check one item from above, for example <code>77572</code>, it appears this is only working on the current statistics core and not the shards:</li>
</ul>
<pre><code>import pysolr
<pre tabindex="0"><code>import pysolr
solr = pysolr.Solr('http://localhost:3000/solr/statistics')
results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
print(results.hits)
@ -889,12 +889,12 @@ print(results.hits)
<li>So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON</li>
<li>This enumerates the list of Solr cores and returns JSON format:</li>
</ul>
<pre><code>http://localhost:3000/solr/admin/cores?action=STATUS&amp;wt=json
<pre tabindex="0"><code>http://localhost:3000/solr/admin/cores?action=STATUS&amp;wt=json
</code></pre><ul>
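<li>A quick sketch (assuming the Python <code>requests</code> library) of how one might pull the core names out of that JSON and keep only the statistics cores:</li>
</ul>
<pre tabindex="0"><code>import requests

# list the Solr cores via the admin STATUS endpoint used above
res = requests.get('http://localhost:3000/solr/admin/cores',
                   params={'action': 'STATUS', 'wt': 'json'})
# the 'status' object in the response is keyed by core name
cores = [core for core in res.json()['status'] if core.startswith('statistics')]
print(cores)
</code></pre><ul>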
<li>I think I figured out how to search across shards, I needed to give the whole URL to each other core</li>
<li>Now I get more results when I start adding the other statistics cores:</li>
</ul>
<pre><code>$ http 'http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound&lt;result name=&quot;response&quot; numFound=&quot;2061320&quot; start=&quot;0&quot;&gt;
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound&lt;result name=&quot;response&quot; numFound=&quot;2061320&quot; start=&quot;0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;16280292&quot; start=&quot;0&quot; maxScore=&quot;1.0&quot;&gt;
$ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&amp;indent=on&amp;rows=0&amp;q=*:*' | grep numFound
@ -913,7 +913,7 @@ $ http 'http://localhost:3000/solr/statistics/select?&amp;shards=localhost:8081/
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;275&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
$ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:2+id:11576&amp;fq=isBot:false&amp;fq=statistics_type:view&amp;shards=localhost:8081/solr/statistics-2018' | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;241&quot; start=&quot;0&quot; maxScore=&quot;12.205825&quot;&gt;
@ -924,7 +924,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<li>I deployed it on CGSpace (linode18) and restarted the indexer as well</li>
<li>Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Jan/2019:1(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;22/Jan/2019:1(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
155 40.77.167.106
176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
189 107.21.16.70
@ -939,12 +939,12 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<li>35.237.175.180 is known to us</li>
<li>I don&rsquo;t think we&rsquo;ve seen 196.191.127.37 before. Its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
</code></pre><ul>
<li>Interestingly this IP is located in Addis Ababa&hellip;</li>
<li>Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
</code></pre><h2 id="2019-01-23">2019-01-23</h2>
<ul>
<li>Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:</li>
@ -979,13 +979,13 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&amp;rows=0&amp;q=
<p>I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:</p>
</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
COPY 1109
</code></pre><ul>
<li>Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP</li>
<li>Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 54.226.25.74
241 40.77.167.13
272 46.101.86.248
@ -1019,7 +1019,7 @@ COPY 1109
<p>Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace&rsquo;s <code>filter-media</code>:</p>
</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
$ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
</code></pre><ul>
<li>Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12</li>
@ -1034,7 +1034,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
<li>I re-compiled Arch&rsquo;s ghostscript with the patch and then I was able to generate a thumbnail from one of the <a href="https://cgspace.cgiar.org/handle/10568/98390">troublesome PDFs</a></li>
<li>Before and after:</li>
</ul>
<pre><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
<pre tabindex="0"><code>$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
zsh: abort (core dumped) identify Food\ safety\ Kenya\ fruits.pdf\[0\]
$ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
Food safety Kenya fruits.pdf[0]=&gt;Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
@ -1044,7 +1044,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>I told Atmire to go ahead with the Metadata Quality Module addition based on our <code>5_x-dev</code> branch (<a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=657">657</a>)</li>
<li>Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
305 3.81.136.184
306 3.83.14.11
306 52.54.252.47
@ -1059,7 +1059,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>45.5.186.2 is CIAT and 66.249.64.155 is Google&hellip; hmmm.</li>
<li>Linode sent another alert this morning, here are the top ten IPs active during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:0(4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
360 3.89.134.93
362 34.230.15.139
366 100.24.48.177
@ -1073,7 +1073,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</code></pre><ul>
<li>Just double checking what CIAT is doing, they are mainly hitting the REST API:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:&quot; | grep 45.5.186.2 | grep -Eo &quot;GET /(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;24/Jan/2019:&quot; | grep 45.5.186.2 | grep -Eo &quot;GET /(handle|bitstream|rest|oai)/&quot; | sort | uniq -c | sort -n
</code></pre><ul>
<li>CIAT&rsquo;s community currently has 12,000 items in it so this is normal</li>
<li>The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again&hellip;</li>
@ -1102,7 +1102,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;27/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
189 40.77.167.108
191 157.55.39.2
263 34.218.226.147
@ -1132,7 +1132,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:0(6|7|8)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
67 207.46.13.50
105 41.204.190.40
117 34.218.226.147
@ -1153,7 +1153,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
</li>
<li>Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;28/Jan/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
310 45.5.184.2
425 5.143.231.39
526 54.70.40.11
@ -1168,12 +1168,12 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li>Of course there is CIAT&rsquo;s <code>45.5.186.2</code>, but also <code>45.5.184.2</code> appears to be CIAT&hellip; I wonder why they have two harvesters?</li>
<li><code>199.47.87.140</code> and <code>199.47.87.141</code> is TurnItIn with the following user agent:</li>
</ul>
<pre><code>TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
<pre tabindex="0"><code>TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
</code></pre><h2 id="2019-01-29">2019-01-29</h2>
<ul>
<li>Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Jan/2019:0(3|4|5|6|7)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;29/Jan/2019:0(3|4|5|6|7)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
334 45.5.184.72
429 66.249.66.223
522 35.237.175.180
@ -1198,7 +1198,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
273 46.101.86.248
301 35.237.175.180
334 45.5.184.72
@ -1216,7 +1216,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<ul>
<li>Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:(16|17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;30/Jan/2019:(16|17|18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
436 18.196.196.108
460 157.55.39.168
460 207.46.13.96
@ -1242,7 +1242,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
<li><code>45.5.186.2</code> and <code>45.5.184.2</code> are CIAT as always</li>
<li><code>85.25.237.71</code> is some new server in Germany that I&rsquo;ve never seen before with the user agent:</li>
</ul>
<pre><code>Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
<pre tabindex="0"><code>Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
</code></pre><!-- raw HTML omitted -->

View File

@ -72,7 +72,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -163,7 +163,7 @@ sys 0m1.979s
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -179,7 +179,7 @@ sys 0m1.979s
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
3018243
real 0m19.873s
@ -198,7 +198,7 @@ sys 0m1.979s
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
@ -219,7 +219,7 @@ sys 0m1.979s
<li>This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
@ -234,11 +234,11 @@ sys 0m1.979s
<li><code>45.5.184.2</code> is CIAT, <code>70.32.83.92</code> and <code>205.186.128.185</code> are Macaroni Bros harvesters for CCAFS I think</li>
<li><code>195.201.104.240</code> is a new IP address in Germany with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>This user was making 2060 requests per minute this morning&hellip; seems like I should try to block this type of behavior heuristically, regardless of user agent (see the sketch after the next output)!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
@ -262,7 +262,7 @@ sys 0m1.979s
</code></pre><ul>
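<li>A rough sketch of that heuristic, counting requests per IP per minute from the nginx access log with Python (the log path and the threshold of 60 requests per minute are just assumptions):</li>
</ul>
<pre tabindex="0"><code>import re
from collections import Counter

counts = Counter()
with open('/var/log/nginx/access.log') as log:
    for line in log:
        fields = line.split()
        if not fields:
            continue
        # the client IP is the first field; capture the timestamp down to the minute
        match = re.search(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})', line)
        if match:
            counts[(fields[0], match.group(1))] += 1

# flag any IP that made more than 60 requests in a single minute
for (ip, minute), hits in counts.most_common(20):
    if hits &gt; 60:
        print(ip, minute, hits)
</code></pre><ul>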
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
@ -280,14 +280,14 @@ sys 0m1.979s
<ul>
<li>Generate a list of CTA subjects from CGSpace for Peter:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
</code></pre><ul>
<li>Skype with Michael Victor about CKM and CGSpace</li>
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
@ -307,7 +307,7 @@ COPY 321
<li>Peter sent me corrections and deletions for the CTA subjects and, as usual, there were encoding errors with some accents (Á) in his file</li>
<li>In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use <code>toString()</code>:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -318,17 +318,17 @@ COPY 321
</code></pre><ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
</code></pre><ul>
<li>I applied them on DSpace Test and CGSpace and started a full Discovery re-index:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
</ul>
<pre><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
<pre tabindex="0"><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
@ -340,21 +340,21 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<ul>
<li>I dumped the CTA community so I can try to fix the subjects with multiple subjects that Peter indicated in his corrections:</li>
</ul>
<pre><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
</code></pre><ul>
<li>Then I used <code>csvcut</code> to get only the CTA subject columns:</li>
</ul>
<pre><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
<pre tabindex="0"><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
</code></pre><ul>
<li>After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values</li>
<li>Then I imported it back into CGSpace:</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
</code></pre><ul>
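<li>The same multi-value mappings could also be applied with a small Python script instead of OpenRefine; a sketch, where the mapping only covers the examples listed above and the output path is arbitrary:</li>
</ul>
<pre tabindex="0"><code>import csv

# map single subjects to their multi-value replacements (examples from above only)
mappings = {
    'FISHERIES AND AQUACULTURE': 'FISHERIES||AQUACULTURE',
    'MARKETING AND TRADE': 'MARKETING||TRADE',
}
columns = ['cg.subject.cta', 'cg.subject.cta[]', 'cg.subject.cta[en_US]']

with open('/tmp/cta-subjects.csv') as infile:
    with open('/tmp/cta-subjects-fixed.csv', 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for column in columns:
                subjects = row[column].split('||') if row[column] else []
                row[column] = '||'.join(mappings.get(s, s) for s in subjects)
            writer.writerow(row)
</code></pre><ul>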
<li>Another day, another alert about high load on CGSpace (linode18) from Linode</li>
<li>This time the load average was 370% and the top ten IPs before, during, and after that time were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Looking closer at the top users, I see <code>45.5.186.2</code> is in Brazil and was making over 100 requests per minute to the REST API:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
10411 200
1 301
7 302
@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Linode sent an alert last night that the load on CGSpace (linode18) was over 300%</li>
<li>Here are the top IPs in the web server and API logs before, during, and after that time, respectively:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.209
6 2a01:4f8:210:51ef::2
6 40.77.167.75
@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then again this morning another alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.223
8 104.198.9.108
13 110.54.160.222
@ -471,7 +471,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don&rsquo;t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
</code></pre><ul>
<li>Collection 1056 appears to be <a href="https://cgspace.cgiar.org/handle/10568/68741">IITA Posters and Presentations</a> and I see that its workflow step 1 (Accept/Reject) is empty:</li>
</ul>
@ -482,7 +482,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Bizuwork asked about the &ldquo;DSpace Submission Approved and Archived&rdquo; emails that stopped working last month</li>
<li>I tried the <code>test-email</code> command on DSpace and it indeed is not working:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: aorth@mjanja.ch
@ -503,7 +503,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the <code>test-email</code> script:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
</code></pre><ul>
<li>I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password</li>
@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
<li>Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!</li>
<li>This is just for this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
289 35.237.175.180
290 66.249.66.221
296 18.195.78.144
@ -539,7 +539,7 @@ Please see the DSpace documentation for assistance.
<li>I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot</li>
<li>Ooh, but 151.80.203.180 is some malicious bot making requests for <code>/etc/passwd</code> like this:</li>
</ul>
<pre><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
<pre tabindex="0"><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
</code></pre><ul>
<li>151.80.203.180 is on OVH so I sent a message to their abuse email&hellip;</li>
</ul>
@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
232 18.195.78.144
238 35.237.175.180
281 66.249.66.221
@ -572,14 +572,14 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>Another interesting thing might be the total number of requests for web and API services during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
15964
</code></pre><ul>
<li>Also, the number of unique IPs served during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
95
@ -610,7 +610,7 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
</code></pre><ul>
<li>I&rsquo;m not sure why I didn&rsquo;t know about this configuration option before, and always maintained multiple configurations for development and production
@ -620,7 +620,7 @@ Please see the DSpace documentation for assistance.
</li>
<li>I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:</li>
</ul>
<pre><code># docker rm nexus
<pre tabindex="0"><code># docker rm nexus
# docker pull sonatype/nexus3
# mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
# chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
@ -628,7 +628,7 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>For some reason my <code>mvn package</code> for DSpace is not working now&hellip; I might go back to <a href="https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/">using Artifactory for caching</a> instead:</li>
</ul>
<pre><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
<pre tabindex="0"><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
# mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
# chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
# docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -643,13 +643,13 @@ Please see the DSpace documentation for assistance.
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
</ul>
<pre><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
</code></pre><ul>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making &ldquo;top items&rdquo; endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I could use the following SQL queries very easily to get the top items by views or downloads:</li>
</ul>
<pre><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
<pre tabindex="0"><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads DESC LIMIT 10;
</code></pre><ul>
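<li>A rough sketch (assuming <code>psycopg2</code> and placeholder connection details) of running one of those queries from Python before wiring it into a new endpoint:</li>
</ul>
<pre tabindex="0"><code>import psycopg2

# connection details here are placeholders for the dspacestatistics database
conn = psycopg2.connect(dbname='dspacestatistics', user='dspacestatistics',
                        password='secret', host='localhost')
cursor = conn.cursor()
cursor.execute('SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10')
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
</code></pre><ul>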
<li>I&rsquo;d have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code></li>
@ -660,7 +660,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
</ul>
</li>
</ul>
<pre><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
<pre tabindex="0"><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
...
Colorspace: sRGB
</code></pre><ul>
@ -671,35 +671,35 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can&rsquo;t get it to send mail from DSpace&rsquo;s <code>test-email</code> utility</li>
<li>I even added extra mail properties to <code>dspace.cfg</code> as suggested by someone on the dspace-tech mailing list:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
</code></pre><ul>
<li>But the result is still:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
</code></pre><ul>
<li>I tried to log into the Outlook 365 web mail and it doesn&rsquo;t work so I&rsquo;ve emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace&rsquo;s mail configuration to be simply:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.enable=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.enable=true
</code></pre><ul>
<li>&hellip; and then I was able to send a mail using my personal account where I know the credentials work</li>
<li>The CGSpace account still gets this error message:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: javax.mail.AuthenticationFailedException
</code></pre><ul>
<li>I updated the <a href="https://github.com/ilri/DSpace/pull/410">DSpace SMTP settings in <code>dspace.cfg</code></a> as well as the <a href="https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c">variables in the DSpace role of the Ansible infrastructure scripts</a></li>
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre><code>$ dspace user --delete --email blah@cta.int
<pre tabindex="0"><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
</code></pre><ul>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
<li>Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):</li>
</ul>
<pre><code># podman pull postgres:9.6-alpine
<pre tabindex="0"><code># podman pull postgres:9.6-alpine
# podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
# podman pull docker.bintray.io/jfrog/artifactory-oss
# podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -707,7 +707,7 @@ $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int
<li>Totally works&hellip; awesome!</li>
<li>Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:</li>
</ul>
<pre><code>$ sudo touch /etc/subuid /etc/subgid
<pre tabindex="0"><code>$ sudo touch /etc/subuid /etc/subgid
$ usermod --add-subuids 10000-75535 aorth
$ usermod --add-subgids 10000-75535 aorth
$ sudo sysctl kernel.unprivileged_userns_clone=1
@ -717,7 +717,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Which totally works, but Podman&rsquo;s rootless support doesn&rsquo;t work with port mappings yet&hellip;</li>
<li>Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# apt remove tomcat7 tomcat7-admin
# useradd -m -r -s /bin/bash dspace
# mv /usr/share/tomcat7/.m2 /home/dspace
@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
</ul>
<pre><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>The issue last month was address space, which is now set as <code>LimitAS=infinity</code> in <code>tomcat7.service</code>&hellip;</li>
<li>I re-ran the Ansible playbook to make sure all configs etc were the same, then rebooted the server</li>
<li>Still the error persists after reboot</li>
<li>I will try to stop Tomcat and then remove the locks manually:</li>
</ul>
<pre><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
@ -747,19 +747,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<ul>
<li>Tomcat was killed around 3AM by the kernel&rsquo;s OOM killer according to <code>dmesg</code>:</li>
</ul>
<pre><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
<pre tabindex="0"><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
[Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>The <code>tomcat7</code> service shows:</li>
</ul>
<pre><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
<pre tabindex="0"><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
</code></pre><ul>
<li>I suspect it was related to the media-filter cron job that runs at 3AM but I don&rsquo;t see anything in particular in the log files</li>
<li>I want to try to normalize the <code>text_lang</code> values to make working with metadata easier</li>
<li>We currently have a bunch of weird values that DSpace uses like <code>NULL</code>, <code>en_US</code>, and <code>en</code> and others that have been entered manually by editors:</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
| 1069539
@ -778,7 +778,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!</li>
<li>I&rsquo;m going to normalize these to <code>NULL</code> at least on DSpace Test for now:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
</code></pre><ul>
<li>I started proofing IITA&rsquo;s 2019-01 records that Sisay uploaded this week
@ -790,7 +790,7 @@ UPDATE 1045410
<li>ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works</li>
<li>Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman&rsquo;s volumes:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
@ -803,7 +803,7 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
</ul>
<pre><code>$ podman volume create artifactory_data
<pre tabindex="0"><code>$ podman volume create artifactory_data
artifactory_data
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
$ buildah unshare
@ -817,13 +817,13 @@ $ podman start artifactory
<ul>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(162844) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
UPDATE 1
</code></pre><ul>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
@ -834,7 +834,7 @@ UPDATE 1
<li>Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):</li>
<li>There seems to have been a lot of activity in XMLUI:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1236 18.212.208.240
1276 54.164.83.99
1277 3.83.14.11
@ -864,7 +864,7 @@ UPDATE 1
<li>94.71.244.172 is in Greece and uses the user agent &ldquo;Indy Library&rdquo;</li>
<li>At least they are re-using their Tomcat session:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
</code></pre><ul>
<li>
<p>The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent &ldquo;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&rdquo;:</p>
@ -886,7 +886,7 @@ UPDATE 1
<p>For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:</p>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
1173 52.91.249.23
1176 107.22.118.106
1178 3.88.173.152
@ -920,7 +920,7 @@ UPDATE 1
</code></pre><ul>
<li>In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
10 18/Feb/2019:17:20
10 18/Feb/2019:17:22
10 18/Feb/2019:17:31
@ -935,7 +935,7 @@ UPDATE 1
<li>As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics</li>
<li>There were 92,000 requests from these IPs alone today!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
92756
</code></pre><ul>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">&ldquo;badbots&rdquo; rate limiting in our nginx configuration</a></li>
@ -943,7 +943,7 @@ UPDATE 1
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
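<ul>
<li>The <code>resolve-orcids.py</code> script itself is not included in these notes; a minimal sketch of the kind of lookup it does, assuming ORCID&rsquo;s public <code>/v3.0/{id}/person</code> endpoint and its <code>given-names</code> and <code>family-name</code> response fields, might look like this:</li>
</ul>
<pre tabindex="0"><code># Rough sketch of resolving an ORCID iD to a name, not the actual resolve-orcids.py.
# The pub.orcid.org /v3.0/{id}/person endpoint and its response fields are
# assumptions here.
import json
import urllib.request

def resolve_orcid(orcid):
    url = 'https://pub.orcid.org/v3.0/{0}/person'.format(orcid)
    request = urllib.request.Request(url, headers={'Accept': 'application/json'})
    with urllib.request.urlopen(request) as response:
        name = json.load(response).get('name') or {}
    given = (name.get('given-names') or {}).get('value', '')
    family = (name.get('family-name') or {}).get('value', '')
    return '{0} {1}'.format(given, family).strip()

print(resolve_orcid('0000-0002-1735-7458'))
</code></pre>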
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
11541 18.212.208.240
11560 3.81.136.184
11562 3.88.237.84
@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>The top requests in the API logs today are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
42 66.249.66.221
44 156.156.81.215
55 3.85.54.129
@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
</ul>
<pre><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
185
</code></pre><ul>
<li>Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:</li>
</ul>
<pre><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
346
</code></pre><ul>
<li>In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
1 139.162.146.60
1 157.55.39.159
1 196.188.127.94
@ -1032,7 +1032,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</code></pre><ul>
<li>That is so weird, they are all using this Android user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
</code></pre><ul>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization&rsquo;s name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a></li>
</ul>
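<ul>
<li>The script itself is not included in these notes; a minimal sketch of the kind of lookup it does, assuming ipapi.co&rsquo;s per-IP JSON endpoint and its <code>org</code>, <code>asn</code>, and <code>country_name</code> fields, might look like this:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Rough sketch of an ipapi.co lookup, not the actual resolve-addresses.py.
# The https://ipapi.co/{ip}/json/ endpoint and the org/asn/country_name fields
# in its response are assumptions here.
import json
import sys
import urllib.request

def resolve(ip):
    url = 'https://ipapi.co/{0}/json/'.format(ip)
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    return data.get('org'), data.get('asn'), data.get('country_name')

if __name__ == '__main__':
    # Pass one or more IP addresses on the command line
    for ip in sys.argv[1:]:
        org, asn, country = resolve(ip)
        print('{0}: {1} ({2}), {3}'.format(ip, org, asn, country))
</code></pre>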
@ -1042,7 +1042,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:</li>
</ul>
<pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: null}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;en_US&quot;}'
</code></pre><ul>
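<li>The same query can be scripted; here is a short sketch using Python&rsquo;s <code>requests</code> module, assuming the same endpoint and JSON payload as the curl examples above:</li>
</ul>
<pre tabindex="0"><code># Rough sketch of the same find-by-metadata-field query with Python requests.
# The endpoint and payload mirror the curl examples above; the 'language'
# attribute still has to match exactly ('', None, or 'en_US').
import requests

url = 'https://cgspace.cgiar.org/rest/items/find-by-metadata-field'
payload = {
    'key': 'cg.creator.id',
    'value': 'Alan S. Orth: 0000-0002-1735-7458',
    'language': 'en_US',
}
response = requests.post(url, json=payload, headers={'Accept': 'application/json'})
response.raise_for_status()

# The response is a JSON list of matching items
for item in response.json():
    print(item.get('handle'))
</code></pre><ul>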
@ -1063,23 +1063,23 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
<li>It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to</li>
<li>I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:</li>
</ul>
<pre><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
<pre tabindex="0"><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
</code></pre><ul>
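<li>For reference, a stripped-down sketch of the lookup that <code>agrovoc-lookup.py</code> performs, assuming the AGROVOC REST API&rsquo;s search endpoint (the same one queried from OpenRefine further down) and treating any non-empty <code>results</code> list as a match:</li>
</ul>
<pre tabindex="0"><code># Rough sketch of an AGROVOC term lookup, not the actual agrovoc-lookup.py.
# The /rest/v1/search endpoint and the 'results' list in its response are
# based on the example query shown further down in these notes.
import json
import urllib.parse
import urllib.request

def is_agrovoc_term(term, lang='en'):
    query = urllib.parse.urlencode({'query': term, 'lang': lang})
    url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?' + query
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # A match is any result, whether it hit a prefLabel or an altLabel
    return len(data.get('results', [])) != 0

for lang in ('en', 'es', 'fr'):
    print(lang, is_agrovoc_term('MAIZE', lang))
</code></pre><ul>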
<li>Then I generated a list of all the unique matched terms:</li>
</ul>
<pre><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
<pre tabindex="0"><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
</code></pre><ul>
<li>And then a list of all the unique <em>unmatched</em> terms using some utility I&rsquo;ve never heard of before called <code>comm</code> or with <code>diff</code>:</li>
</ul>
<pre><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
<pre tabindex="0"><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&quot;&quot; --unchanged-line-format=&quot;&quot; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
</code></pre><ul>
<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
@ -1124,7 +1124,7 @@ COPY 33
<p>I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:</p>
</li>
</ul>
<pre><code>import json
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2
@ -1148,7 +1148,7 @@ return &quot;unmatched&quot;
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre><code> &quot;results&quot;: [
<pre tabindex="0"><code> &quot;results&quot;: [
{
&quot;altLabel&quot;: &quot;corn (maize)&quot;,
&quot;lang&quot;: &quot;en&quot;,
@ -1176,7 +1176,7 @@ return &quot;unmatched&quot;
<li>There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year</li>
<li>I see some errors started recently in Solr (yesterday):</li>
</ul>
<pre><code>$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
<pre tabindex="0"><code>$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
/home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-13.xz:0
@ -1195,7 +1195,7 @@ return &quot;unmatched&quot;
<li>But I don&rsquo;t see anything interesting in yesterday&rsquo;s Solr log&hellip;</li>
<li>I see this in the Tomcat 7 logs yesterday:</li>
</ul>
<pre><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
<pre tabindex="0"><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
@ -1207,7 +1207,7 @@ Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.Statist
<li>In the Solr admin GUI I see we have the following error: &ldquo;statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:</li>
</ul>
<pre><code>Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
<pre tabindex="0"><code>Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
@ -1220,7 +1220,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAcce
<li>Also, now the Solr admin UI says &ldquo;statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>In the Solr log I see:</li>
</ul>
<pre><code>2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
<pre tabindex="0"><code>2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:873)
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:646)
@ -1243,7 +1243,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
</code></pre><ul>
<li>I tried to shutdown Tomcat and remove the locks:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &quot;*.lock&quot; -delete
# systemctl start tomcat7
</code></pre><ul>
@ -1254,7 +1254,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the <code>LimitAS</code> setting does work, and the <code>infinity</code> setting in systemd does get translated to &ldquo;unlimited&rdquo; on the service</li>
<li>I thought it might be open file limit, but it seems we&rsquo;re nowhere near the current limit of 16384:</li>
</ul>
<pre><code># lsof -u dspace | wc -l
<pre tabindex="0"><code># lsof -u dspace | wc -l
3016
</code></pre><ul>
<li>For what it&rsquo;s worth I see the same errors about <code>solr_update_time_stamp</code> on DSpace Test (linode19)</li>
@ -1270,7 +1270,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>I sent a mail to the dspace-tech mailing list about the &ldquo;solr_update_time_stamp&rdquo; error</li>
<li>A CCAFS user sent a message saying they got this error when submitting to CGSpace:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
</code></pre><ul>
<li>According to the <a href="https://cgspace.cgiar.org/rest/collections/1021">REST API</a> collection 1021 appears to be <a href="https://cgspace.cgiar.org/handle/10568/66581">CCAFS Tools, Maps, Datasets and Models</a></li>
<li>I looked at the <code>WORKFLOW_STEP_1</code> (Accept/Reject) and the group is of course empty</li>
@ -1287,7 +1287,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn&rsquo;t exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file&rsquo;s name:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
</code></pre><ul>
<li>Why don&rsquo;t they just derive the directory from the path to the zip file?</li>
<li>Working on Udana&rsquo;s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
@ -1303,12 +1303,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<ul>
<li>I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface)</li>
</ul>
<pre><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
<pre tabindex="0"><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
</code></pre><ul>
<li>Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out <em>sigh</em></li>
<li>Now I&rsquo;m getting this message when trying to use DSpace&rsquo;s <code>test-email</code> script:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: stfu@google.com


@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -151,7 +151,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<ul>
<li>Trying to finally upload IITA&rsquo;s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:</li>
</ul>
<pre><code>$ mkdir 2019-03-03-IITA-Feb14
<pre tabindex="0"><code>$ mkdir 2019-03-03-IITA-Feb14
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
</code></pre><ul>
<li>As I was inspecting the archive I noticed that there were some problems with the bitstreams:
@ -163,7 +163,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
</li>
<li>After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
<pre tabindex="0"><code>$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
</code></pre><ul>
<li>DSpace&rsquo;s export function doesn&rsquo;t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something</li>
<li>After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the <code>dspace cleanup</code> script</li>
@ -180,7 +180,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
<li>I suspect it&rsquo;s related to the email issue that ICT hasn&rsquo;t responded about since last week</li>
<li>As I thought, I still cannot send emails from CGSpace:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blah@stfu.com
@ -197,7 +197,7 @@ Error sending email:
<li>ICT reset the email password and I confirmed that it is working now</li>
<li>Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using <code>csvcut</code> and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:</li>
</ul>
<pre><code>$ csvcut -c name 2019-02-22-subjects.csv &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ csvcut -c name 2019-02-22-subjects.csv &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
$ # apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre><ul>
@ -217,7 +217,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</ul>
</li>
</ul>
<pre><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
<pre tabindex="0"><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
1076
</code></pre><ul>
<li>I restarted Tomcat and it&rsquo;s OK now&hellip;</li>
@ -238,11 +238,11 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
<li>The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms</li>
<li>I see 46 occurrences of these with this query:</li>
</ul>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
</code></pre><ul>
<li>I can replace these globally using the following SQL:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
UPDATE 44
@ -254,7 +254,7 @@ UPDATE 44
<li>Working on tagging IITA&rsquo;s items with their new research theme (<code>cg.identifier.iitatheme</code>) based on their existing IITA subjects (see <a href="/cgspace-notes/2018-02/">notes from 2019-02</a>)</li>
<li>I exported the entire IITA community from CGSpace and then used <code>csvcut</code> to extract only the needed fields:</li>
</ul>
<pre><code>$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>
<p>After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a <code>||</code>)</p>
@ -263,7 +263,7 @@ UPDATE 44
<p>I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:</p>
</li>
</ul>
<pre><code>if(isBlank(value), 'PLANT PRODUCTION &amp; HEALTH', value + '||PLANT PRODUCTION &amp; HEALTH')
<pre tabindex="0"><code>if(isBlank(value), 'PLANT PRODUCTION &amp; HEALTH', value + '||PLANT PRODUCTION &amp; HEALTH')
</code></pre><ul>
<li>Then it&rsquo;s more annoying because there are four IITA subject columns&hellip;</li>
<li>In total this would add research themes to 1,755 items</li>
@ -288,7 +288,7 @@ UPDATE 44
</li>
<li>This is a bit ugly, but it works (using the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL helper function</a> to resolve ID to handle):</li>
</ul>
<pre><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'&quot; | grep -oE '[0-9]{3,}'); do
<pre tabindex="0"><code>for id in $(psql -U postgres -d dspacetest -h localhost -c &quot;SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'&quot; | grep -oE '[0-9]{3,}'); do
echo &quot;Getting handle for id: ${id}&quot;
@ -300,7 +300,7 @@ done
</code></pre><ul>
<li>Then I couldn&rsquo;t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:</li>
</ul>
<pre><code>$ grep -oE '201[89]' /tmp/*.csv | sort -u
<pre tabindex="0"><code>$ grep -oE '201[89]' /tmp/*.csv | sort -u
/tmp/94834.csv:2018
/tmp/95615.csv:2018
/tmp/96747.csv:2018
@ -314,7 +314,7 @@ done
<li>CGSpace (linode18) has the blank page error again</li>
<li>I&rsquo;m not sure if it&rsquo;s related, but I see the following error in DSpace&rsquo;s log:</li>
</ul>
<pre><code>2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
@ -326,7 +326,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
</code></pre><ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, <del>but spikes of over 1,000 today</del>, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
<pre tabindex="0"><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
5 dspace.log.2019-02-27
11 dspace.log.2019-02-28
29 dspace.log.2019-03-01
@ -356,14 +356,14 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
<li>(Update on 2019-03-23 to use correct grep query)</li>
<li>There are not too many connections currently in PostgreSQL:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
6 dspaceApi
10 dspaceCli
15 dspaceWeb
</code></pre><ul>
<li>I didn&rsquo;t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today <em>might</em> be related?</li>
</ul>
<pre><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
<pre tabindex="0"><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
at java.util.Stack.peek(Stack.java:102)
at java.util.Stack.pop(Stack.java:84)
@ -436,13 +436,13 @@ java.util.EmptyStackException
<li>I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api</li>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(164496) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code># su - postgres
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
UPDATE 1
</code></pre><h2 id="2019-03-18">2019-03-18</h2>
@ -455,7 +455,7 @@ UPDATE 1
</li>
<li>Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv &gt; 2019-03-18-top-1500-subject.csv
@ -474,7 +474,7 @@ $ wc -l 2019-03-18-subjects-unmatched.txt
<li>Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (<a href="https://github.com/ilri/DSpace/pull/416">#416</a>)</li>
<li>We are getting the blank page issue on CGSpace again today and I see a <del>large number</del> of the &ldquo;SQL QueryTable Error&rdquo; in the DSpace log again (last time was 2019-03-15):</li>
</ul>
<pre><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
<pre tabindex="0"><code>$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
@ -482,7 +482,7 @@ dspace.log.2019-03-18:1038
</code></pre><ul>
<li>Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the &ldquo;binary file matches&rdquo; result with <code>-I</code>:</li>
</ul>
<pre><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
<pre tabindex="0"><code>$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
9 dspace.log.2019-03-08
@ -495,7 +495,7 @@ $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F
<li>It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use <code>-I</code> to say binary files don&rsquo;t match</li>
<li>Anyways, the full error in DSpace&rsquo;s log is:</li>
</ul>
<pre><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
@ -504,7 +504,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is c
</code></pre><ul>
<li>There is a low number of connections to PostgreSQL currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | wc -l
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
6 dspaceApi
@ -513,13 +513,13 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:</li>
</ul>
<pre><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
<pre tabindex="0"><code>2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR: column &quot;waiting&quot; does not exist at character 217
</code></pre><ul>
<li>This is unrelated and apparently due to <a href="https://github.com/munin-monitoring/munin/issues/746">Munin checking a column that was changed in PostgreSQL 9.6</a></li>
<li>I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it&rsquo;s a Cocoon thing?</li>
<li>Looking in the cocoon logs I see a large number of warnings about &ldquo;Can not load requested doc&rdquo; around 11AM and 12PM:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
2 2019-03-18 00:
6 2019-03-18 02:
3 2019-03-18 04:
@ -535,7 +535,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And a few days ago on 2019-03-15, when this last happened, it was in the afternoon, and the same pattern occurs around 12PM:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
4 2019-03-15 01:
3 2019-03-15 02:
1 2019-03-15 03:
@ -561,7 +561,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
</code></pre><ul>
<li>And again on 2019-03-08, surprise surprise, it happened in the morning:</li>
</ul>
<pre><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
11 2019-03-08 01:
3 2019-03-08 02:
1 2019-03-08 03:
@ -581,7 +581,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
<li>I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for some pretty confusing debugging&hellip;</li>
<li>I will replace these in the database immediately to save myself the headache later:</li>
</ul>
<pre><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
<pre tabindex="0"><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
count
-------
84
@ -591,7 +591,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
<li>CGSpace (linode18) is having problems with Solr again, I&rsquo;m seeing &ldquo;Error opening new searcher&rdquo; in the Solr logs and there are no stats for previous years</li>
<li>Apparently the Solr statistics shards didn&rsquo;t load properly when we restarted Tomcat <em>yesterday</em>:</li>
</ul>
<pre><code>2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
<pre tabindex="0"><code>2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
...
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
@ -603,7 +603,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>For reference, I don&rsquo;t see the <code>ulimit -v unlimited</code> in the <code>catalina.sh</code> script, though the <code>tomcat7</code> systemd service has <code>LimitAS=infinity</code></li>
<li>The limits of the current Tomcat java process are:</li>
</ul>
<pre><code># cat /proc/27182/limits
<pre tabindex="0"><code># cat /proc/27182/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
@ -629,7 +629,7 @@ Max realtime timeout unlimited unlimited us
</li>
<li>For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname &quot;*.lock&quot; -delete
# systemctl start tomcat7
</code></pre><ul>
@ -660,7 +660,7 @@ Max realtime timeout unlimited unlimited us
<ul>
<li>It&rsquo;s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
3 2019-03-20 00:
12 2019-03-20 02:
$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
@ -704,7 +704,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
<ul>
<li>CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
2 2019-03-22 00:
69 2019-03-22 01:
1 2019-03-22 02:
@ -742,7 +742,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
<li>I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn&rsquo;t</li>
<li>Trying to drill down more, I see that the bulk of the errors started around 21:20:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
1 2019-03-22 21:0
1 2019-03-22 21:1
59 2019-03-22 21:2
@ -752,11 +752,11 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
</code></pre><ul>
<li>Looking at the Cocoon log around that time I see the full error is:</li>
</ul>
<pre><code>2019-03-22 21:21:34,378 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<pre tabindex="0"><code>2019-03-22 21:21:34,378 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
</code></pre><ul>
<li>A few milliseconds before that time I see this in the DSpace log:</li>
</ul>
<pre><code>2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
<pre tabindex="0"><code>2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: This statement has been closed.
at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
@ -824,7 +824,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
<li>I did some more tests with the <a href="https://github.com/gnosly/TomcatJdbcConnectionTest">TomcatJdbcConnectionTest</a> thing, and while monitoring the number of active connections in jconsole after adjusting the limits quite low, I eventually saw some connections get abandoned</li>
<li>I forgot that to connect to a remote JMX session with jconsole you need to use a dynamic SSH SOCKS proxy (as I originally <a href="/cgspace-notes/2017-11/">discovered in 2017-11</a>):</li>
</ul>
<pre><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
<pre tabindex="0"><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
</code></pre><ul>
<li>I need to remember to check the active connections next time we have issues with blank item pages on CGSpace</li>
<li>In other news, I&rsquo;ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing</li>
@ -855,7 +855,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
</li>
<li>Also, CGSpace doesn&rsquo;t have many Cocoon errors yet this morning:</li>
</ul>
<pre><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
4 2019-03-25 00:
1 2019-03-25 01:
</code></pre><ul>
@ -869,7 +869,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
<li>Uptime Robot reported that CGSpace went down and I see the load is very high</li>
<li>The top IPs around the time in the nginx API and web logs were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;25/Mar/2019:(18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
9 190.252.43.162
12 157.55.39.140
18 157.55.39.54
@ -894,16 +894,16 @@ org.postgresql.util.PSQLException: This statement has been closed.
</code></pre><ul>
<li>The IPs look pretty normal except we&rsquo;ve never seen <code>93.179.69.74</code> before, and it uses the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
</code></pre><ul>
<li>Surprisingly they are re-using their Tomcat session:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
1
</code></pre><ul>
<li>That&rsquo;s weird because the total number of sessions today seems low compared to recent days:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
5657
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
17710
@ -914,7 +914,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
</code></pre><ul>
<li>PostgreSQL seems to be pretty busy:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
11 dspaceApi
10 dspaceCli
67 dspaceWeb
@ -931,7 +931,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>UptimeRobot says CGSpace went down again and I see the load is again at 14.0!</li>
<li>Here are the top IPs in nginx logs in the last hour:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(06|07)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
3 35.174.184.209
3 66.249.66.81
4 104.198.9.108
@ -960,7 +960,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I will add these three to the &ldquo;bad bot&rdquo; rate limiting that I originally used for Baidu</li>
<li>Going further, these are the IPs making requests to Discovery and Browse pages so far today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;(discover|browse)&quot; | grep -E &quot;26/Mar/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;(discover|browse)&quot; | grep -E &quot;26/Mar/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
120 34.207.146.166
128 3.91.79.74
132 108.179.57.67
@ -978,7 +978,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)</li>
<li>Looking at the database usage I&rsquo;m wondering why there are so many connections from the DSpace CLI:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
10 dspaceCli
13 dspaceWeb
@ -987,19 +987,19 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
<li>Make a minor edit to my <code>agrovoc-lookup.py</code> script to match subject terms with parentheses like <code>COCOA (PLANT)</code></li>
<li>Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
</code></pre><ul>
<li>UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0</li>
<li>Looking at the nginx logs I don&rsquo;t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:</li>
</ul>
<pre><code># grep SemrushBot /var/log/nginx/access.log | grep -E &quot;26/Mar/2019&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># grep SemrushBot /var/log/nginx/access.log | grep -E &quot;26/Mar/2019&quot; | grep -E '(discover|browse)' | wc -l
2931
</code></pre><ul>
<li>So I&rsquo;m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with &ldquo;bot&rdquo; in the name for a few days to see if things calm down&hellip; maybe not just yet</li>
<li>Otherwise, these are the top users in the web and API logs in the last hour (18-19):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
54 41.216.228.158
65 199.47.87.140
75 157.55.39.238
@ -1025,7 +1025,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>For the XMLUI I see <code>18.195.78.144</code> and <code>18.196.196.108</code> requesting only CTA items and with no user agent</li>
<li>They are responsible for almost 1,000 XMLUI sessions today:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
937
</code></pre><ul>
<li>I will add their IPs to the list of bot IPs in nginx so I can tag them as bots to let Tomcat&rsquo;s Crawler Session Manager Valve to force them to re-use their session</li>
@ -1033,19 +1033,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request</li>
<li>I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &quot;26/Mar/2019:&quot; | grep -E '(discover|browse)' | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E &quot;26/Mar/2019:&quot; | grep -E '(discover|browse)' | wc -l
119
</code></pre><ul>
<li>What&rsquo;s strange is that I can&rsquo;t see any of their requests in the DSpace log&hellip;</li>
</ul>
<pre><code>$ grep -I -c 45.5.184.72 dspace.log.2019-03-26
<pre tabindex="0"><code>$ grep -I -c 45.5.184.72 dspace.log.2019-03-26
0
</code></pre><h2 id="2019-03-28">2019-03-28</h2>
<ul>
<li>Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery</li>
<li>What the hell is going on with this CTA publication?</li>
</ul>
<pre><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
1 37.48.65.147
1 80.113.172.162
2 108.174.5.117
@ -1077,7 +1077,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
</li>
<li>In other news, I see that it&rsquo;s not even the end of the month yet and we have 3.6 million hits already:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
3654911
</code></pre><ul>
<li>In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up</li>
@ -1105,7 +1105,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
<li>It is frustrating to see that the load spikes from our own legitimate load on the server were <em>very</em> aggravated and drawn out by the contention for CPU on this host</li>
<li>We had 4.2 million hits this month according to the web server logs:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Mar/2019&quot;
4218841
real 0m26.609s
@ -1114,7 +1114,7 @@ sys 0m2.551s
</code></pre><ul>
<li>Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in <a href="/cgspace-notes/2018-10/">2018-10</a>:</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
<pre tabindex="0"><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.33s user 0.07s system 2% cpu 17.167 total
0.27s user 0.04s system 1% cpu 16.643 total
@ -1137,7 +1137,7 @@ sys 0m2.551s
<li>Looking at the weird issue with shitloads of downloads on the <a href="https://cgspace.cgiar.org/handle/10568/100289">CTA item</a> again</li>
<li>The item was added on 2019-03-13 and these three IPs have attempted to download the item&rsquo;s bitstream 43,000 times since it was added eighteen days ago:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
42 196.43.180.134
621 185.247.144.227
8102 18.194.46.84
@ -1152,7 +1152,7 @@ sys 0m2.551s
</ul>
</li>
</ul>
<pre><code>2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
<pre tabindex="0"><code>2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
</code></pre><ul>
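<li>Out of curiosity, user 9492 from that error can be looked up in the <code>eperson</code> table (standard DSpace 5.x schema):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT eperson_id, email FROM eperson WHERE eperson_id=9492;
</code></pre><ul>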
<li>IWMI people emailed to ask why two items with the same DOI don&rsquo;t have the same Altmetric score:
<ul>
@ -1168,15 +1168,15 @@ sys 0m2.551s
</ul>
</li>
</ul>
<pre><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
</code></pre><ul>
<li>The response payload for the second one is the same:</li>
</ul>
<pre><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
<pre tabindex="0"><code>_altmetric.embed_callback({&quot;title&quot;:&quot;Distilling the role of ecosystem services in the Sustainable Development Goals&quot;,&quot;doi&quot;:&quot;10.1016/j.ecoser.2017.10.010&quot;,&quot;tq&quot;:[&quot;Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of&quot;,&quot;Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers&quot;,&quot;How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!&quot;,&quot;Excellent paper about the contribution of #ecosystemservices to SDGs&quot;,&quot;So great to work with amazing collaborators&quot;],&quot;altmetric_jid&quot;:&quot;521611533cf058827c00000a&quot;,&quot;issns&quot;:[&quot;2212-0416&quot;],&quot;journal&quot;:&quot;Ecosystem Services&quot;,&quot;cohorts&quot;:{&quot;sci&quot;:58,&quot;pub&quot;:239,&quot;doc&quot;:3,&quot;com&quot;:2},&quot;context&quot;:{&quot;all&quot;:{&quot;count&quot;:12732768,&quot;mean&quot;:7.8220956572788,&quot;rank&quot;:56146,&quot;pct&quot;:99,&quot;higher_than&quot;:12676701},&quot;journal&quot;:{&quot;count&quot;:549,&quot;mean&quot;:7.7567299270073,&quot;rank&quot;:2,&quot;pct&quot;:99,&quot;higher_than&quot;:547},&quot;similar_age_3m&quot;:{&quot;count&quot;:386919,&quot;mean&quot;:11.573702536454,&quot;rank&quot;:3299,&quot;pct&quot;:99,&quot;higher_than&quot;:383619},&quot;similar_age_journal_3m&quot;:{&quot;count&quot;:28,&quot;mean&quot;:9.5648148148148,&quot;rank&quot;:1,&quot;pct&quot;:96,&quot;higher_than&quot;:27}},&quot;authors&quot;:[&quot;Sylvia L.R. Wood&quot;,&quot;Sarah K. Jones&quot;,&quot;Justin A. Johnson&quot;,&quot;Kate A. Brauman&quot;,&quot;Rebecca Chaplin-Kramer&quot;,&quot;Alexander Fremier&quot;,&quot;Evan Girvetz&quot;,&quot;Line J. Gordon&quot;,&quot;Carrie V. Kappel&quot;,&quot;Lisa Mandle&quot;,&quot;Mark Mulligan&quot;,&quot;Patrick O'Farrell&quot;,&quot;William K. Smith&quot;,&quot;Louise Willemen&quot;,&quot;Wei Zhang&quot;,&quot;Fabrice A. 
DeClerck&quot;],&quot;type&quot;:&quot;article&quot;,&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],&quot;handle&quot;:&quot;10568/89975&quot;,&quot;altmetric_id&quot;:29816439,&quot;schema&quot;:&quot;1.5.4&quot;,&quot;is_oa&quot;:false,&quot;cited_by_posts_count&quot;:377,&quot;cited_by_tweeters_count&quot;:302,&quot;cited_by_fbwalls_count&quot;:1,&quot;cited_by_gplus_count&quot;:1,&quot;cited_by_policies_count&quot;:2,&quot;cited_by_accounts_count&quot;:306,&quot;last_updated&quot;:1554039125,&quot;score&quot;:208.65,&quot;history&quot;:{&quot;1y&quot;:54.75,&quot;6m&quot;:10.35,&quot;3m&quot;:5.5,&quot;1m&quot;:5.5,&quot;1w&quot;:1.5,&quot;6d&quot;:1.5,&quot;5d&quot;:1.5,&quot;4d&quot;:1.5,&quot;3d&quot;:1.5,&quot;2d&quot;:1,&quot;1d&quot;:1,&quot;at&quot;:208.65},&quot;url&quot;:&quot;http://dx.doi.org/10.1016/j.ecoser.2017.10.010&quot;,&quot;added_on&quot;:1512153726,&quot;published_on&quot;:1517443200,&quot;readers&quot;:{&quot;citeulike&quot;:0,&quot;mendeley&quot;:248,&quot;connotea&quot;:0},&quot;readers_count&quot;:248,&quot;images&quot;:{&quot;small&quot;:&quot;https://badges.altmetric.com/?size=64&amp;score=209&amp;types=tttttfdg&quot;,&quot;medium&quot;:&quot;https://badges.altmetric.com/?size=100&amp;score=209&amp;types=tttttfdg&quot;,&quot;large&quot;:&quot;https://badges.altmetric.com/?size=180&amp;score=209&amp;types=tttttfdg&quot;},&quot;details_url&quot;:&quot;http://www.altmetric.com/details.php?citation_id=29816439&quot;})
</code></pre><ul>
<li>Very interesting to see this in the response:</li>
</ul>
<pre><code>&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],
<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10568/89975&quot;,&quot;10568/89846&quot;],
&quot;handle&quot;:&quot;10568/89975&quot;
</code></pre><ul>
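<li>Both Handles can also be queried directly against the Altmetric API, which is presumably the same data the badge callback above is built from:</li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.altmetric.com/v1/handle/10568/89975'
$ http 'https://api.altmetric.com/v1/handle/10568/89846'
</code></pre><ul>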
<li>On further inspection I see that the Altmetric explorer pages for each of these Handles are actually doing the right thing:

View File

@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -163,13 +163,13 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
@ -191,26 +191,26 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
</code></pre><ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
<pre tabindex="0"><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
</code></pre><ul>
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user&rsquo;s name has changed so I will update those using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
</code></pre><ul>
<li>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it&rsquo;s still going:</li>
</ul>
<pre><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
<pre tabindex="0"><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>
<pre><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
<pre tabindex="0"><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
1
3 http://localhost:8081/solr//statistics-2017
5662 http://localhost:8081/solr//statistics-2018
@ -222,14 +222,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
10 dspaceCli
250 dspaceWeb
</code></pre><ul>
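<li>A quick way to see whether those are active queries or just idle connections sitting in the pool (assuming the pool name is what shows up in <code>application_name</code>, which is what the grep above is matching):</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;'
</code></pre><ul>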
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
</ul>
<pre><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
<pre tabindex="0"><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
</ul>
@ -242,7 +242,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
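<li>The Solr CoreAdmin API is a quicker way to check which cores actually loaded after a restart (same port as the statistics queries above):</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json'
</code></pre><ul>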
<li>I restarted it again and all the Solr cores came up properly&hellip;</li>
</ul>
@ -257,7 +257,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
@ -282,17 +282,17 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</code></pre><ul>
<li><code>45.5.184.72</code> is in Colombia so it&rsquo;s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT&rsquo;s datasets collection:</li>
</ul>
<pre><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
<pre tabindex="0"><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
</code></pre><ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: &ldquo;GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1&rdquo;</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it&rsquo;s only 11AM):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;06/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;06/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
22077 /handle/10568/72970/discover
</code></pre><ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &quot;05/Apr/2019&quot; | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
142 200
@ -315,7 +315,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</ul>
</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;response&quot;: {
&quot;docs&quot;: [],
@ -341,7 +341,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</code></pre><ul>
<li>Strangely I don&rsquo;t see many hits in 2019-04:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;response&quot;: {
&quot;docs&quot;: [],
@ -367,7 +367,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
</code></pre><ul>
<li>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</li>
</ul>
<pre><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
<pre tabindex="0"><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
@ -419,7 +419,7 @@ X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>And from the server side, the nginx logs show:</li>
</ul>
<pre><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &quot;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 68078 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] &quot;GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 68078 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] &quot;HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1&quot; 200 0 &quot;-&quot; &quot;HTTPie/1.0.2&quot;
</code></pre><ul>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
@ -428,7 +428,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
<pre tabindex="0"><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre><ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
@ -437,7 +437,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
<pre tabindex="0"><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre><ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>&hellip; very weird
<ul>
@ -448,7 +448,7 @@ X-XSS-Protection: 1; mode=block
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;response&quot;: {
&quot;docs&quot;: [],
@ -496,12 +496,12 @@ X-XSS-Protection: 1; mode=block
<li>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10–30 percent right now&hellip;</li>
<li>The load average is super high right now, as I&rsquo;ve noticed the last few times UptimeRobot said that CGSpace went down:</li>
</ul>
<pre><code>$ cat /proc/loadavg
<pre tabindex="0"><code>$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
</code></pre><ul>
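<li>For the record, the steal figure comes from <code>iostat</code>; <code>vmstat</code> shows the same number in its last column (<code>st</code>):</li>
</ul>
<pre tabindex="0"><code>$ vmstat 1 5
</code></pre><ul>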
<li>According to the server logs there is actually not much going on right now:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;07/Apr/2019:(18|19|20)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
118 18.195.78.144
128 207.46.13.219
129 167.114.64.100
@ -529,7 +529,7 @@ X-XSS-Protection: 1; mode=block
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
7 dspaceCli
23 dspaceWeb
@ -546,7 +546,7 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
<pre tabindex="0"><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
</code></pre><ul>
<li>After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
<ul>
@ -555,34 +555,34 @@ X-XSS-Protection: 1; mode=block
</ul>
</li>
</ul>
<pre><code>if(cell.recon.matched, cell.recon.match.name, value)
<pre tabindex="0"><code>if(cell.recon.matched, cell.recon.match.name, value)
</code></pre><ul>
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre><ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
</code></pre><ul>
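<li>To check for any stragglers with embedded carriage returns, a regex match in PostgreSQL should do it:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value ~ E'\r';
</code></pre><ul>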
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre><ul>
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
</code></pre><ul>
<li>UptimeRobot said that CGSpace (linode18) went down tonight
@ -592,14 +592,14 @@ $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db
</ul>
</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
7 dspaceCli
250 dspaceWeb
</code></pre><ul>
<li>On a related note I see connection pool errors in the DSpace log:</li>
</ul>
<pre><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
<pre tabindex="0"><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>But still I see 10 to 30% CPU steal in <code>iostat</code> that is also reflected in the Munin graphs:</li>
@ -609,7 +609,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode Support still didn&rsquo;t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E &quot;08/Apr/2019:(17|18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
124 40.77.167.135
135 95.108.181.88
139 157.55.39.206
@ -636,7 +636,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;09/Apr/2019:(06|07|08)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
18 66.249.79.139
21 157.55.39.160
29 66.249.79.137
@ -661,11 +661,11 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
</code></pre><ul>
<li><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
</code></pre><ul>
<li>Database connection usage looks fine:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
7 dspaceCli
11 dspaceWeb
@ -683,13 +683,13 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
<li>Abenet pointed out a possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked</li>
</ul>
<pre><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
</code></pre><ul>
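<li>To pull just the matching funder names out of that response, jq works nicely (the hits are under <code>message.items</code>):</li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org' | jq -r '.message.items[].name'
</code></pre><ul>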
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF format</a></li>
<li>I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn&rsquo;t match will need a human to go and do some manual checking and informed decision making&hellip;</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>
<pre><code>from habanero import Crossref
<pre tabindex="0"><code>from habanero import Crossref
cr = Crossref(mailto=&quot;me@cgiar.org&quot;)
x = cr.funders(query = &quot;mercator&quot;)
</code></pre><h2 id="2019-04-11">2019-04-11</h2>
@ -720,7 +720,7 @@ x = cr.funders(query = &quot;mercator&quot;)
</li>
<li>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA&rsquo;s records, so I applied them to DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
</code></pre><ul>
<li>Answer more questions about DOIs and Altmetric scores from WLE</li>
@ -753,7 +753,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
<ul>
<li>Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:</li>
</ul>
<pre><code>GC_TUNE=&quot;-XX:NewRatio=3 \
<pre tabindex="0"><code>GC_TUNE=&quot;-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
@ -786,7 +786,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
</ul>
</li>
</ul>
<pre><code>import json
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2
@ -809,7 +809,7 @@ return item_id
</li>
<li>I ran a full Discovery indexing on CGSpace because I didn&rsquo;t do it after all the metadata updates last week:</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 82m45.324s
user 7m33.446s
@ -1001,7 +1001,7 @@ sys 2m13.463s
<li>For future reference, Linode mentioned that they consider CPU steal above 8% to be significant</li>
<li>Regarding the other Linode issue about speed, I did a test with <code>iperf</code> between linode18 and linode19:</li>
</ul>
<pre><code># iperf -s
<pre tabindex="0"><code># iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
@ -1049,11 +1049,11 @@ TCP window size: 85.0 KByte (default)
</li>
<li>I want to get rid of this annoying warning that is constantly in our DSpace logs:</li>
</ul>
<pre><code>2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
<pre tabindex="0"><code>2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):</li>
</ul>
<pre><code>$ grep -c 'Falling back to request address' dspace.log.2019-04-20
<pre tabindex="0"><code>$ grep -c 'Falling back to request address' dspace.log.2019-04-20
dspace.log.2019-04-20:1515
</code></pre><ul>
<li>I will fix it in <code>dspace/config/modules/oai.cfg</code></li>
@ -1098,7 +1098,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre><code>$ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
<pre tabindex="0"><code>$ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv &gt; /tmp/iita.csv
</code></pre><ul>
<li>Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
<ul>
@ -1108,7 +1108,7 @@ dspace.log.2019-04-20:1515
</ul>
</li>
</ul>
<pre><code>$ curl -f -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
curl: (22) The requested URL returned error: 401
</code></pre><ul>
<li>Note that curl only shows the HTTP 401 error if you use <code>-f</code> (fail), and only then if you <em>don&rsquo;t</em> include <code>-s</code>
@ -1118,7 +1118,7 @@ curl: (22) The requested URL returned error: 401
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
count
-------
376
@ -1138,7 +1138,7 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</code></pre><ul>
<li>I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn&rsquo;t have permission to access&hellip; from the DSpace log:</li>
</ul>
<pre><code>2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
<pre tabindex="0"><code>2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
@ -1146,14 +1146,14 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</code></pre><ul>
<li>Nevertheless, if I request using the <code>null</code> language I get 1020 results, plus 179 for a blank language attribute:</li>
</ul>
<pre><code>$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: null}' | jq length
<pre tabindex="0"><code>$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: null}' | jq length
1020
$ curl -s -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;&quot;}' | jq length
179
</code></pre><ul>
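<li>The breakdown by <code>text_lang</code> straight from the database makes the mismatch easier to see (same query pattern as the language census further down):</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' GROUP BY text_lang;
</code></pre><ul>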
<li>This is weird because I see 942–1156 items with &ldquo;WATER MANAGEMENT&rdquo; (depending on wildcard matching for errors in subject spelling):</li>
</ul>
<pre><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
<pre tabindex="0"><code>dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
count
-------
942
@ -1177,13 +1177,13 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
</li>
<li>I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:</li>
</ul>
<pre><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/login&quot; -d '{&quot;email&quot;:&quot;example@me.com&quot;,&quot;password&quot;:&quot;fuuuuu&quot;}'
<pre tabindex="0"><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/login&quot; -d '{&quot;email&quot;:&quot;example@me.com&quot;,&quot;password&quot;:&quot;fuuuuu&quot;}'
$ curl -f -H &quot;Content-Type: application/json&quot; -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -X GET &quot;https://dspacetest.cgiar.org/rest/status&quot;
$ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
</code></pre><ul>
<li>I created a normal user for Carlos to try as an unprivileged user:</li>
</ul>
<pre><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
<pre tabindex="0"><code>$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
</code></pre><ul>
<li>But still I get the HTTP 401 and I have no idea which item is causing it</li>
<li>I enabled more verbose logging in <code>ItemsResource.java</code> and now I can at least see the item ID that causes the failure&hellip;
@ -1192,7 +1192,7 @@ $ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot;
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM item WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2016-03-30 09:00:52.131+00 | | t
@ -1212,7 +1212,7 @@ $ curl -f -H &quot;rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b&quot;
<ul>
<li>Export a list of authors for Peter to look through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
COPY 65752
</code></pre><h2 id="2019-04-28">2019-04-28</h2>
<ul>
@ -1222,7 +1222,7 @@ COPY 65752
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM item WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
@ -1230,7 +1230,7 @@ COPY 65752
</code></pre><ul>
<li>And I tried the <code>curl</code> command from above again, but I still get the HTTP 401 and the same error in the DSpace log:</li>
</ul>
<pre><code>2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
<pre tabindex="0"><code>2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
</code></pre><ul>
<li>I even tried to &ldquo;expunge&rdquo; the item using an <a href="https://wiki.lyrasis.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems">action in CSV</a> (see the sketch below), and it said &ldquo;EXPUNGED!&rdquo; but the item is still there&hellip;</li>
</ul>
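<ul>
<li>For reference, such an action CSV looks roughly like the following sketch (column layout per the batch metadata editing docs; the exact file I used may have differed slightly):</li>
</ul>
<pre tabindex="0"><code>id,collection,action
74648,,expunge
</code></pre>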
@ -1239,7 +1239,7 @@ COPY 65752
<li>Send mail to the dspace-tech mailing list to ask about the item expunge issue</li>
<li>Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:</li>
</ul>
<pre><code>$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
<pre tabindex="0"><code>$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
<li>Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I&rsquo;ll try to do a CSV
<ul>
@ -1247,7 +1247,7 @@ COPY 65752
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
text_lang | count
-----------+---------
| 358647
View File
@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -145,7 +145,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
</li>
<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
</ul>
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre><ul>
<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
@ -158,7 +158,7 @@ DELETE 1
</ul>
</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
dspace=# DELETE FROM item WHERE item_id=74648;
</code></pre><ul>
@ -168,12 +168,12 @@ dspace=# DELETE FROM item WHERE item_id=74648;
</ul>
</li>
</ul>
<pre><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
curl: (22) The requested URL returned error: 401 Unauthorized
</code></pre><ul>
<li>The DSpace log shows the item ID (because I modified the error text):</li>
</ul>
<pre><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
<pre tabindex="0"><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
</code></pre><ul>
<li>If I delete that one I get another, making the list of item IDs so far:
<ul>
@ -202,7 +202,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
<pre tabindex="0"><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
</code></pre><h2 id="2019-05-03">2019-05-03</h2>
<ul>
<li>A user from CIAT emailed to say that CGSpace submission emails have not been working the last few weeks
@ -211,7 +211,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
</ul>
</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: woohoo@cgiar.org
@ -255,11 +255,11 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>As well as this error in the logs:</li>
</ul>
<pre><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>Strangely enough, I <em>do</em> see the statistics-2018, statistics-2017, etc. cores in the Admin UI&hellip;</li>
<li>I restarted Tomcat a few times (and even deleted all the Solr write locks) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete
@ -282,7 +282,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>The number of unique sessions today is <em>ridiculously</em> high compared to the last few days considering it&rsquo;s only 12:30PM right now:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
101108
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
14618
@ -297,7 +297,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc
</code></pre><ul>
<li>The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
7127
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1231
@ -312,7 +312,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc
</code></pre><ul>
<li>Just this morning between the hours of 2 and 6 the number of unique sessions was <em>very</em> high compared to previous mornings:</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
83650
$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2547
@ -327,7 +327,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
</code></pre><ul>
<li>Most of the requests were GETs:</li>
</ul>
<pre><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot;(GET|HEAD|POST|PUT)&quot; | sort | uniq -c | sort -n
<pre tabindex="0"><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot;(GET|HEAD|POST|PUT)&quot; | sort | uniq -c | sort -n
1 PUT
98 POST
2845 HEAD
@ -336,19 +336,19 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<li>I&rsquo;m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?</li>
<li>Looking again, I see 84,000 requests to <code>/handle</code> this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in <code>access.log</code>):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E &quot; /handle/[0-9]+/[0-9]+&quot;
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E &quot; /handle/[0-9]+/[0-9]+&quot;
84350
</code></pre><ul>
<li>But it would be difficult to find a pattern for those requests because they cover 78,000 <em>unique</em> Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+ HTTP&quot; | sort | uniq | wc -l
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+ HTTP&quot; | sort | uniq | wc -l
78104
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+/(discover|browse)&quot; | wc -l
2492
</code></pre><ul>
<li>In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:</li>
</ul>
<pre><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
<pre tabindex="0"><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre><ul>
@ -363,7 +363,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
<ul>
<li>The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
13969
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
5936
@ -374,7 +374,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
</code></pre><ul>
<li>Total number of sessions yesterday was <em>much</em> higher compared to days last week:</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
144160
$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
57269
@ -407,7 +407,7 @@ $ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq |
</ul>
</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: wooooo@cgiar.org
@ -423,7 +423,7 @@ Please see the DSpace documentation for assistance.
<li>Help Moayad with certbot-auto for Let&rsquo;s Encrypt scripts on the new AReS server (linode20)</li>
<li>Normalize all <code>text_lang</code> values for metadata on CGSpace and DSpace Test (as I had tested last month):</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
<pre tabindex="0"><code>UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
</code></pre><ul>
@ -454,7 +454,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
</li>
<li>All of the IPs from these networks are using generic user agents like this one, but there are MANY more, and they change frequently:</li>
</ul>
<pre><code>&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&quot;
<pre tabindex="0"><code>&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&quot;
</code></pre><ul>
<li>I found a <a href="https://www.qurium.org/alerts/azerbaijan/azerbaijan-and-the-region40-ddos-service/">blog post from 2018 detailing an attack from a DDoS service</a> that matches our pattern exactly</li>
<li>They specifically mention:</li>
@ -473,7 +473,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
<ul>
<li>I see that the Unpaywall bot is responsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):</li>
</ul>
<pre><code>$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2206
</code></pre><ul>
<li>I added &ldquo;Unpaywall&rdquo; to the list of bots in the Tomcat Crawler Session Manager Valve</li>
@ -505,7 +505,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
<ul>
<li>Export a list of all investors (<code>dc.description.sponsorship</code>) for Peter to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
COPY 995
</code></pre><ul>
<li>Fork the <a href="https://github.com/icarda-git/AReS">ICARDA AReS v1 repository</a> to <a href="https://github.com/ilri/AReS">ILRI&rsquo;s GitHub</a> and give access to CodeObia guys
@ -519,19 +519,19 @@ COPY 995
<li>Peter sent me a bunch of fixes for investors from yesterday</li>
<li>I did a quick check in Open Refine (trim and collapse whitespace, clean smart quotes, etc) and then applied them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically</li>
<li>Instead, I exported a new list and asked Peter to look at it again</li>
<li>Apply Peter&rsquo;s new corrections on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/423">#423</a>)
@ -564,7 +564,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
</li>
<li>Generate Simple Archive Format bundle with SAFBuilder and import into the <a href="https://cgspace.cgiar.org/handle/10568/101106">AfricaRice Articles in Journals</a> collection on CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
</code></pre><h2 id="2019-05-27">2019-05-27</h2>
<ul>
<li>Peter sent me over two thousand corrections for the authors on CGSpace that I had dumped last month
@ -573,16 +573,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
</code></pre><ul>
<li>Then start a full Discovery re-indexing on each server:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Export new list of all authors from CGSpace database to send to Peter:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
COPY 64871
</code></pre><ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
@ -605,11 +605,11 @@ COPY 64871
<ul>
<li>I see the following error in the DSpace log when the user tries to log in with her CGIAR email and password on the LDAP login:</li>
</ul>
<pre><code>2019-05-30 07:19:35,166 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
<pre tabindex="0"><code>2019-05-30 07:19:35,166 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
</code></pre><ul>
<li>For now I just created an eperson with her personal email address until I have time to check LDAP to see what&rsquo;s up with her CGIAR account:</li>
</ul>
<pre><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
</code></pre><!-- raw HTML omitted -->
View File
@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -169,7 +169,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
<ul>
<li>Thierry noticed that the CUA statistics were missing previous years again, and I see that the Solr admin UI has the following message:</li>
</ul>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I had to restart Tomcat a few times for all the stats cores to get loaded with no issue (a quick way to check them is sketched below)</li>
</ul>
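<ul>
<li>A quick way to see which yearly statistics cores actually loaded after each restart is to query Solr&rsquo;s cores API (a sketch; the Solr URL and port here are assumptions and should be adjusted for the local setup):</li>
</ul>
<pre tabindex="0"><code># systemctl restart tomcat7
# curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json' | grep -o -E 'statistics-[0-9]+' | sort -u
</code></pre>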
@ -197,13 +197,13 @@ Skype with Marie-Angélique and Abenet about CG Core v2
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
COPY 192
$ csvcut -l -c 0 /tmp/countries.csv &gt; 2019-06-10-countries.csv
</code></pre><ul>
<li>Get a list of all the unique AGROVOC subject terms in IITA&rsquo;s data and export it to a text file so I can validate them with my <code>agrovoc-lookup.py</code> script:</li>
</ul>
<pre><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
<pre tabindex="0"><code>$ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u &gt; iita-agrovoc.txt
$ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
$ wc -l iita-agrovoc*
402 iita-agrovoc-matches.txt
@ -212,11 +212,11 @@ $ wc -l iita-agrovoc*
</code></pre><ul>
<li>Combine these IITA matches with the subjects I matched a few months ago:</li>
</ul>
<pre><code>$ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u &gt; 2019-06-10-subjects-matched.txt
<pre tabindex="0"><code>$ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u &gt; 2019-06-10-subjects-matched.txt
</code></pre><ul>
<li>Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to <code>id</code>:</li>
</ul>
<pre><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' &gt; 2019-06-10-subjects-matched.csv
<pre tabindex="0"><code>$ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' &gt; 2019-06-10-subjects-matched.csv
</code></pre><h2 id="2019-06-20">2019-06-20</h2>
<ul>
<li>Share some feedback about AReS v2 with the colleagues and encourage them to do the same</li>
@ -231,14 +231,14 @@ $ wc -l iita-agrovoc*
</li>
<li>Update my local PostgreSQL container:</li>
</ul>
<pre><code>$ podman pull docker.io/library/postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull docker.io/library/postgres:9.6-alpine
$ podman rm dspacedb
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><h2 id="2019-06-25">2019-06-25</h2>
<ul>
<li>Normalize <code>text_lang</code> values for metadata on DSpace Test and CGSpace:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE 1551
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 2070
@ -291,7 +291,7 @@ UPDATE 2
</ul>
</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
</code></pre><ul>
<li>I sent feedback about a few missing PDFs and one duplicate to Ibnou to check</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
View File
@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -153,12 +153,12 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
<pre tabindex="0"><code>org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I restarted Tomcat <em>ten times</em> and it never worked&hellip;</li>
<li>I tried to stop Tomcat and delete the write locks:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# find /dspace/solr/statistics* -iname &quot;*.lock&quot; -print -delete
/dspace/solr/statistics/data/index/write.lock
/dspace/solr/statistics-2010/data/index/write.lock
@ -176,23 +176,23 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
<li>But it still didn&rsquo;t work!</li>
<li>I stopped Tomcat, deleted the old locks, and will try to use the &ldquo;simple&rdquo; lock file type in <code>solr/statistics/conf/solrconfig.xml</code>:</li>
</ul>
<pre><code>&lt;lockType&gt;${solr.lock.type:simple}&lt;/lockType&gt;
<pre tabindex="0"><code>&lt;lockType&gt;${solr.lock.type:simple}&lt;/lockType&gt;
</code></pre><ul>
<li>And after restarting Tomcat it still doesn&rsquo;t work</li>
<li>Now I&rsquo;ll try going back to &ldquo;native&rdquo; locking with <code>unlockAtStartup</code>:</li>
</ul>
<pre><code>&lt;unlockOnStartup&gt;true&lt;/unlockOnStartup&gt;
<pre tabindex="0"><code>&lt;unlockOnStartup&gt;true&lt;/unlockOnStartup&gt;
</code></pre><ul>
<li>Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can&rsquo;t access any stats before 2018</li>
<li>I filed an <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">issue with Atmire</a>, so let&rsquo;s see if they can help</li>
<li>And since I&rsquo;m annoyed and it&rsquo;s been a few months, I&rsquo;m going to move the JVM heap settings that I&rsquo;ve been testing on DSpace Test to CGSpace</li>
<li>The old ones were:</li>
</ul>
<pre><code>-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
<pre tabindex="0"><code>-Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
</code></pre><ul>
<li>And the new ones come from Solr 4.10.x&rsquo;s startup scripts:</li>
</ul>
<pre><code> -Djava.awt.headless=true
<pre tabindex="0"><code> -Djava.awt.headless=true
-Xms8192m -Xmx8192m
-Dfile.encoding=UTF-8
-XX:NewRatio=3
@ -221,14 +221,14 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
</ul>
</li>
</ul>
<pre><code>$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
<pre tabindex="0"><code>$ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
$ echo &quot;10568/101992&quot; &gt;&gt; item_*/collections
$ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
</code></pre><ul>
<li>I noticed that all twenty-seven items had double dates like &ldquo;2019-05||2019-05&rdquo; so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection</li>
<li>Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:</li>
</ul>
<pre><code>$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
<pre tabindex="0"><code>$ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
</code></pre><ul>
<li>Peter pointed out that the Sharefair dates I fixed were not actually fixed
<ul>
@ -249,20 +249,20 @@ $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair
</ul>
</li>
</ul>
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-07-04-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
</code></pre><ul>
<li>Send and merge a pull request for the new ORCID identifiers (<a href="https://github.com/ilri/DSpace/pull/428">#428</a>)</li>
<li>I created a CSV with some ORCID identifiers that I had seen change, so I could update any existing ones in the database:</li>
</ul>
<pre><code>cg.creator.id,correct
<pre tabindex="0"><code>cg.creator.id,correct
&quot;Marius Ekué: 0000-0002-5829-6321&quot;,&quot;Marius R.M. Ekué: 0000-0002-5829-6321&quot;
&quot;Mwungu: 0000-0001-6181-8445&quot;,&quot;Chris Miyinzi Mwungu: 0000-0001-6181-8445&quot;
&quot;Mwungu: 0000-0003-1658-287X&quot;,&quot;Chris Miyinzi Mwungu: 0000-0003-1658-287X&quot;
</code></pre><ul>
<li>But when I ran <code>fix-metadata-values.py</code> I didn&rsquo;t see any changes:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
</code></pre><h2 id="2019-07-06">2019-07-06</h2>
<ul>
<li>Send a reminder to Marie about my notes on the <a href="https://github.com/AgriculturalSemantics/cg-core/issues/2">CG Core v2 issue I created two weeks ago</a></li>
@ -282,7 +282,7 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
</li>
<li>Playing with the idea of using <a href="https://github.com/BurntSushi/xsv">xsv</a> to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:</li>
</ul>
<pre><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
field,value,count
cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
@ -291,13 +291,13 @@ dc.title,Reference evapotranspiration prediction using hybridized fuzzy model wi
</code></pre><ul>
<li>Or perhaps if DOIs are valid or not (having doi.org in the URL):</li>
</ul>
<pre><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
<pre tabindex="0"><code>$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
field,value,count
cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
</code></pre><ul>
<li>Or perhaps items with invalid ISSNs (according to the <a href="https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format">ISSN code format</a>):</li>
</ul>
<pre><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '&quot;' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
<pre tabindex="0"><code>$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '&quot;' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
dc.identifier.issn
978-3-319-71997-9
978-3-319-71997-9
@ -333,7 +333,7 @@ dc.identifier.issn
<li>Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: &ldquo;Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.&rdquo;</li>
<li>I looked in the DSpace logs and found this right around the time of the screenshot he sent me:</li>
</ul>
<pre><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
<pre tabindex="0"><code>2019-07-10 11:50:27,433 INFO org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
</code></pre><ul>
<li>I&rsquo;m assuming something happened in his browser (like a refresh) after the item was submitted&hellip;</li>
</ul>
@ -350,24 +350,24 @@ dc.identifier.issn
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Try to run <code>dspace cleanup -v</code> on CGSpace and ran into an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(167394) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code># su - postgres
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
UPDATE 1
</code></pre><h2 id="2019-07-16">2019-07-16</h2>
<ul>
<li>Completely reset the Podman configuration on my laptop because there were some layers that I couldn&rsquo;t delete and it had been some time since I did a cleanup:</li>
</ul>
<pre><code>$ podman system prune -a -f --volumes
<pre tabindex="0"><code>$ podman system prune -a -f --volumes
$ sudo rm -rf ~/.local/share/containers
</code></pre><ul>
<li>Then pull a new PostgreSQL 9.6 image and load a CGSpace database dump into a new local test container:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
@ -388,7 +388,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</li>
<li>Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blahh@cgiar.org
@ -414,7 +414,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Create an account for Lionelle Samnick on CGSpace because the registration isn&rsquo;t working for some reason:</li>
</ul>
<pre><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
<pre tabindex="0"><code>$ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
</code></pre><ul>
<li>I added her as a submitter to <a href="https://cgspace.cgiar.org/handle/10568/74536">CTA ISF Pro-Agro series</a></li>
<li>Start looking at 1429 records for the Bioversity batch import
@ -442,7 +442,7 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code> &lt;dct:coverage&gt;
<pre tabindex="0"><code> &lt;dct:coverage&gt;
&lt;dct:spatial&gt;
&lt;type&gt;Country&lt;/type&gt;
&lt;dct:identifier&gt;http://sws.geonames.org/192950&lt;/dct:identifier&gt;
@ -484,14 +484,14 @@ Please see the DSpace documentation for assistance.
<p>I might be able to use <a href="https://pypi.org/project/isbnlib/">isbnlib</a> to validate ISBNs in Python:</p>
</li>
</ul>
<pre><code>if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
<pre tabindex="0"><code>if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
print(&quot;Yes&quot;)
else:
print(&quot;No&quot;)
</code></pre><ul>
<li>Or with <a href="https://github.com/arthurdejong/python-stdnum">python-stdnum</a>:</li>
</ul>
<pre><code>from stdnum import isbn
<pre tabindex="0"><code>from stdnum import isbn
from stdnum import issn
isbn.validate('978-92-9043-389-7')
@ -510,7 +510,7 @@ issn.validate('1020-3362')
<p>I figured out a GREL to trim spaces in multi-value cells without splitting them:</p>
</li>
</ul>
<pre><code>value.replace(/\s+\|\|/,&quot;||&quot;).replace(/\|\|\s+/,&quot;||&quot;)
<pre tabindex="0"><code>value.replace(/\s+\|\|/,&quot;||&quot;).replace(/\|\|\s+/,&quot;||&quot;)
</code></pre><ul>
<li>I whipped up a quick script using Python Pandas to do whitespace cleanup, roughly along the lines of the sketch below</li>
</ul>
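<ul>
<li>A minimal sketch of that kind of Pandas cleanup (hypothetical file names, not the exact script):</li>
</ul>
<pre tabindex="0"><code>import pandas as pd

# read everything as strings so numeric-looking columns are not mangled
df = pd.read_csv('/tmp/input.csv', dtype=str)

# strip leading/trailing whitespace and collapse runs of internal whitespace
df = df.apply(lambda col: col.str.strip().str.replace(r'\s+', ' ', regex=True))

df.to_csv('/tmp/output.csv', index=False)
</code></pre>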
View File
@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -194,7 +194,7 @@ Run system updates on DSpace Test (linode19) and reboot it
</ul>
</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/^.*.*$/)),
isNotNull(value.match(/^.*é.*$/)),
isNotNull(value.match(/^.*á.*$/)),
@ -235,14 +235,14 @@ Run system updates on DSpace Test (linode19) and reboot it
</ul>
</li>
</ul>
<pre><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
<pre tabindex="0"><code># /opt/certbot-auto renew --standalone --pre-hook &quot;/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld&quot; --post-hook &quot;/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx&quot;
</code></pre><ul>
<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04&rsquo;s <a href="https://ssl-config.mozilla.org/#server=nginx&amp;server-version=1.16.0&amp;config=intermediate&amp;openssl-version=1.1.0g&amp;hsts=false&amp;ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
<li>Run all system updates on AReS dev server (linode20) and reboot it</li>
<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
<pre tabindex="0"><code>$ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload.csv
$ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
$ grep -B1 &quot;Download failed&quot; /tmp/2019-08-08-download-pdfs2.txt | grep &quot;Downloading&quot; | sed -e 's/&gt; Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 &gt; /tmp/user-upload2.csv
@ -277,7 +277,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
</ul>
</li>
</ul>
<pre><code>proxy_set_header Host dev.ares.codeobia.com;
<pre tabindex="0"><code>proxy_set_header Host dev.ares.codeobia.com;
</code></pre><ul>
<li>Though I am really wondering why this happened now, because the configuration has been working for months&hellip;</li>
<li>Improve the output of the suspicious characters check in <a href="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</li>
@ -329,7 +329,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
<ul>
<li>Create a test user on DSpace Test for Mohammad Salem to attempt depositing:</li>
</ul>
<pre><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
<pre tabindex="0"><code>$ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
</code></pre><ul>
<li>Create and merge a pull request (<a href="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
@ -339,13 +339,13 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
</li>
<li>Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
<pre tabindex="0"><code>$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
...
java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre><ul>
<li>I increased the heap size to 1536m and tried again:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1536m&quot;
$ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</code></pre><ul>
<li>This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM</li>
@ -361,7 +361,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
$ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
$ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</code></pre><ul>
@ -377,7 +377,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
<li>Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linde18)</li>
<li>After restarting Tomcat one of the Solr statistics cores failed to start up:</li>
</ul>
<pre><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I decided to run all system updates on the server and reboot it</li>
<li>After reboot the statistics-2018 core failed to load so I restarted <code>tomcat7</code> again</li>
@ -393,7 +393,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
</ul>
</li>
</ul>
<pre><code>import os
<pre tabindex="0"><code>import os
return os.path.basename(value)
</code></pre><ul>
@ -429,7 +429,7 @@ return os.path.basename(value)
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Apply the corrections on CGSpace and DSpace Test
<ul>
@ -437,7 +437,7 @@ return os.path.basename(value)
</ul>
</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 81m47.057s
user 8m5.265s
@ -478,21 +478,21 @@ sys 2m24.715s
</ul>
</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
COPY 65597
</code></pre><ul>
<li>Then I created a new CSV with two author columns (edit title of second column after):</li>
</ul>
<pre><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
<pre tabindex="0"><code>$ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv &gt; /tmp/all-authors.csv
</code></pre><ul>
<li>Then I ran my script on the new CSV, skipping one of the author columns:</li>
</ul>
<pre><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
<pre tabindex="0"><code>$ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
</code></pre><ul>
<li>This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc.</li>
<li>Then I ran the corrections on my test server and there were 185 of them!</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
</code></pre><ul>
<li>I very well might run these on CGSpace soon&hellip;</li>
</ul>
@ -506,7 +506,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec ./cgcore-xsl-replacements.sed {} \;
</code></pre><ul>
<li>I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
<ul>
@ -526,7 +526,7 @@ COPY 65597
</ul>
</li>
</ul>
<pre><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
<pre tabindex="0"><code>&quot;handles&quot;:[&quot;10986/30568&quot;,&quot;10568/97825&quot;],&quot;handle&quot;:&quot;10986/30568&quot;
</code></pre><ul>
<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn&rsquo;t show it because it seems to be a secondary handle or something (a quick API check is sketched below)</li>
</ul>
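<ul>
<li>One way to compare what Altmetric returns for each handle is to query the badge API directly for both of them (a sketch, assuming the v1 handle endpoint):</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'https://api.altmetric.com/v1/handle/10986/30568'
$ curl -s 'https://api.altmetric.com/v1/handle/10568/97825'
</code></pre>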
@ -535,7 +535,7 @@ COPY 65597
<li>Run system updates on DSpace Test (linode19) and reboot the server</li>
<li>Run the author fixes on DSpace Test and CGSpace and start a full Discovery re-index:</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 90m47.967s
user 8m12.826s
View File
@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -163,7 +163,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -189,18 +189,18 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<li><code>3.94.211.189</code> is MauiBot, and most of its requests are to Discovery and get rate limited with HTTP 503</li>
<li><code>163.172.71.23</code> is some IP on Online SAS in France and its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>It actually got mostly HTTP 200 responses:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
1775 200
703 499
72 503
</code></pre><ul>
<li>And it was mostly requesting Discover pages:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | grep 163.172.71.23 | grep -o -E &quot;(bitstream|discover|handle)&quot; | sort | uniq -c
2350 discover
71 handle
</code></pre><ul>
@ -279,16 +279,16 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
</ul>
</li>
</ul>
<pre><code>2019-09-15 15:32:18,137 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<pre tabindex="0"><code>2019-09-15 15:32:18,137 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
</code></pre><ul>
<li>Around the same time I see the following in the DSpace log:</li>
</ul>
<pre><code>2019-09-15 15:32:18,079 INFO org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
<pre tabindex="0"><code>2019-09-15 15:32:18,079 INFO org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644
2019-09-15 15:32:18,135 WARN org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name=&quot;METSRIGHTS&quot;
</code></pre><ul>
<li>I see a lot of these errors today, but not earlier this month:</li>
</ul>
<pre><code># grep -c 'Cannot find named plugin' dspace.log.2019-09-*
<pre tabindex="0"><code># grep -c 'Cannot find named plugin' dspace.log.2019-09-*
dspace.log.2019-09-01:0
dspace.log.2019-09-02:0
dspace.log.2019-09-03:0
@ -307,7 +307,7 @@ dspace.log.2019-09-15:808
</code></pre><ul>
<li>Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:</li>
</ul>
<pre><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.METSRightsCrosswalk&quot;, name=&quot;METSRIGHTS&quot;
<pre tabindex="0"><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.METSRightsCrosswalk&quot;, name=&quot;METSRIGHTS&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.OREDisseminationCrosswalk&quot;, name=&quot;ore&quot;
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class=&quot;org.dspace.content.crosswalk.DIMDisseminationCrosswalk&quot;, name=&quot;dim&quot;
</code></pre><ul>
@ -321,7 +321,7 @@ dspace.log.2019-09-15:808
<ul>
<li>For some reason my podman PostgreSQL container isn&rsquo;t working so I had to use Docker to re-create it for my testing work today:</li>
</ul>
<pre><code># docker pull docker.io/library/postgres:9.6-alpine
<pre tabindex="0"><code># docker pull docker.io/library/postgres:9.6-alpine
# docker volume create dspacedb_data
# docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
@ -338,7 +338,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Kihara, Job&quot;,&quot;Job Kihara: 0000-0002-4394-9553&quot;
&quot;Twyman, Jennifer&quot;,&quot;Jennifer Twyman: 0000-0002-8581-5668&quot;
&quot;Ishitani, Manabu&quot;,&quot;Manabu Ishitani: 0000-0002-6950-4018&quot;
@ -358,7 +358,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</code></pre><ul>
<li>I tested the file on my local development machine with the following invocation:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>In my test environment this added 390 ORCID identifiers</li>
<li>I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update</li>
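<li>For reference, the re-index invocation is the same one used elsewhere in these notes (a sketch; the schedtool/ionice wrapping is just to keep the load down):
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre>
</li>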
@ -386,15 +386,15 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<li>Follow up with Marissa again about the CCAFS phase II project tags</li>
<li>Generate a list of the top 1500 authors on CGSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I used <code>csvcut</code> to select the column of author names, strip the header and quote characters, and saved the sorted file:</li>
</ul>
<pre><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/&quot;//g' | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/&quot;//g' | sort &gt; dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>After adding the XML formatting back to the file I formatted it using XML tidy:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
<pre tabindex="0"><code>$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre><ul>
<li>I created and merged <a href="https://github.com/ilri/DSpace/pull/433">a pull request for the updates</a>
<ul>
@ -416,7 +416,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
<pre tabindex="0"><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
</code></pre><ul>
<li>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
<ul>
@ -426,7 +426,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
</ul>
</li>
</ul>
<pre><code>$ rename -v 's/___/_/g' *.pdf
<pre tabindex="0"><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
</code></pre><ul>
<li>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
@ -436,15 +436,15 @@ $ rename -v 's/__/_/g' *.pdf
</ul>
</li>
</ul>
<pre><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
<pre tabindex="0"><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
</code></pre><ul>
<li>The second targets cities and countries after names like &ldquo;International Livestock Research Intstitute, Kenya&rdquo;:</li>
</ul>
<pre><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
<pre tabindex="0"><code>replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
</code></pre><ul>
<li>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them in two batches of about 700 each):</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre><ul>
@ -513,7 +513,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
</li>
<li>Get a list of institutions from CCAFS&rsquo;s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
</ul>
<pre><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
<pre tabindex="0"><code>$ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
</code></pre><ul>
<li>The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode</li>

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -113,7 +113,7 @@
</ul>
</li>
</ul>
<pre><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
<pre tabindex="0"><code>$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv &gt; /tmp/iwmi-title-region-subregion-river.csv
</code></pre><ul>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can&rsquo;t figure out the correct sed syntax to do it directly from the pipe above</li>
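<li>For the record, a sed equivalent that should work with GNU sed is to match the UTF-8 bytes of U+00A0 directly (a sketch; the output filename is illustrative):
<pre tabindex="0"><code>$ sed 's/\xc2\xa0/ /g' /tmp/iwmi-title-region-subregion-river.csv &gt; /tmp/iwmi-title-region-subregion-river-fixed.csv
</code></pre>
</li>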
<li>I uploaded those to CGSpace and then re-exported the metadata</li>
@ -121,7 +121,7 @@
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):</li>
</ul>
<pre><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
</code></pre><ul>
<li>That fixed 153 items (unnecessary Unicode, duplicates, comma-space fixes, etc)</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
@ -134,7 +134,7 @@
<ul>
<li>Create an account for Bioversity&rsquo;s ICT consultant Francesco on DSpace Test:</li>
</ul>
<pre><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
<pre tabindex="0"><code>$ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
</code></pre><ul>
<li>Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
<ul>
@ -193,19 +193,19 @@
</ul>
</li>
</ul>
<pre><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
</code></pre><h2 id="2019-10-11">2019-10-11</h2>
<ul>
<li>I ran the DSpace cleanup function on CGSpace and it found some errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
...
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(171221) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution, as always, is (repeat as many times as needed):</li>
</ul>
<pre><code># su - postgres
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
UPDATE 1
</code></pre><h2 id="2019-10-12">2019-10-12</h2>
@ -223,7 +223,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre><code>from,to
<pre tabindex="0"><code>from,to
CIAT,International Center for Tropical Agriculture
International Centre for Tropical Agriculture,International Center for Tropical Agriculture
International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
@ -234,7 +234,7 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
</code></pre><ul>
<li>Then I applied it with my <code>fix-metadata-values.py</code> script on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
</code></pre><ul>
<li>I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
<ul>
@ -260,17 +260,17 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
</ul>
</li>
</ul>
<pre><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 82m35.993s
</code></pre><ul>
<li>After the re-indexing the top authors still list the following:</li>
</ul>
<pre><code>Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
<pre tabindex="0"><code>Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
</code></pre><ul>
<li>I looked in the database to find authors that had &ldquo;|&rdquo; in them:</li>
</ul>
<pre><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
<pre tabindex="0"><code>dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
text_value | resource_id
----------------------------------+-------------
Anandajayasekeram, P.|Puskur, R. | 157
@ -280,7 +280,7 @@ real 82m35.993s
</code></pre><ul>
<li>Then I found their handles and corrected them, for example:</li>
</ul>
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
handle
-----------
10568/129
@ -304,7 +304,7 @@ real 82m35.993s
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
$ mkdir 2019-10-15-Bioversity
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
$ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&gt;/d' 2019-10-15-Bioversity/*/dublin_core.xml
@ -312,12 +312,12 @@ $ sed -i '/&lt;dcvalue element=&quot;identifier&quot; qualifier=&quot;uri&quot;&
<li>It&rsquo;s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>Then I imported a test subset of them in my local test environment:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
</code></pre><ul>
<li>I had forgotten (again) that the <code>dspace export</code> command doesn&rsquo;t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import&hellip;</li>
<li>On CGSpace I will increase the RAM of the command line Java process for good luck before import&hellip;</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
</code></pre><ul>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>

View File

@ -58,7 +58,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -152,7 +152,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694
@ -160,14 +160,14 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
</code></pre><ul>
<li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
1 PUT
8 PROPFIND
283 OPTIONS
@ -177,16 +177,16 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
</code></pre><ul>
<li>Their user agent is one I&rsquo;ve never seen before:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
</code></pre><ul>
<li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -o -E &quot;GET /(bitstream|discover|handle)&quot; | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -o -E &quot;GET /(bitstream|discover|handle)&quot; | sort | uniq -c
6566 GET /bitstream
351928 GET /handle
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep Amazonbot | grep -E &quot;GET /(bitstream|discover|handle)&quot; | grep -c discover
@ -196,12 +196,12 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</code></pre><ul>
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'
</code></pre><ul>
<li>Still, those requests are CPU intensive so I will add their user agent to the &ldquo;badbots&rdquo; rate limiting in nginx to reduce the impact on server load</li>
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
</code></pre><ul>
<li>On the topic of spiders, I have been wanting to update DSpace&rsquo;s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire&rsquo;s COUNTER-Robots</a> project
<ul>
@ -210,13 +210,13 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
</ul>
</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;iskanie&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;iskanie&quot;
</code></pre><ul>
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;1&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;3&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
@ -224,7 +224,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</code></pre><ul>
<li>Now I want to make similar requests with a user agent that is included in DSpace&rsquo;s current user agent list:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;celestial&quot;
</code></pre><ul>
@ -234,7 +234,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
<pre tabindex="0"><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
</code></pre><ul>
<li>Apparently that is part of Atmire&rsquo;s CUA, despite being in a standard DSpace configuration file&hellip;</li>
<li>I tried with some other garbage user agents like &ldquo;fuuuualan&rdquo; and they were visible in Solr
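<li>A sketch of the kind of query I use to confirm that, following the same pattern as the earlier checks:
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:fuuuualan&amp;fq=dateYearMonth%3A2019-11&amp;rows=0' | xmllint --format - | grep numFound
</code></pre>
</li>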
@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
</ul>
</li>
</ul>
<pre><code>else if (line.hasOption('m'))
<pre tabindex="0"><code>else if (line.hasOption('m'))
{
SolrLogger.markRobotsByIP();
}
@ -263,12 +263,12 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
<ul>
<li>I added &ldquo;alanfuuu2&rdquo; to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu1&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu1&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;alanfuuu2&quot;
</code></pre><ul>
<li>After committing the changes in Solr I saw one request for &ldquo;alanfuuu1&rdquo; and no requests for &ldquo;alanfuuu2&rdquo;:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@ -281,12 +281,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
</li>
<li>I&rsquo;m curious how the special character matching is in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnyp.com&quot;
</code></pre><ul>
<li>Then commit changes to Solr so we don&rsquo;t have to wait:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@ -314,12 +314,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
</ul>
</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;62944&quot; start=&quot;0&quot;&gt;
</code></pre><ul>
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:
*com.plumanalytics*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;28256&quot; start=&quot;0&quot;&gt;
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
@ -329,7 +329,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</code></pre><ul>
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=&lt;delete&gt;&lt;query&gt;userAgent:*ltx71*&lt;/query&gt;&lt;query&gt;type:0&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&amp;facet.field=ip&amp;facet.mincount=1&amp;type:0&amp;q=userAgent:*ltx71*' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
</code></pre><ul>
@ -341,7 +341,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&amp;facet.field=dateYearMonth&amp;facet.mincount=1&amp;facet.offset=0&amp;facet.limit=
12&amp;q=userAgent:*Unpaywall*' | xmllint --format - | less
...
&lt;lst name=&quot;facet_counts&quot;&gt;
@ -394,7 +394,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
</ul>
</li>
</ul>
<pre><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
<pre tabindex="0"><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
</code></pre><ul>
<li>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</li>
@ -423,7 +423,7 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr&rsquo;s regex search can&rsquo;t use those</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
$ http &quot;http://localhost:8081/solr/statistics/update?commit=true&quot;
$ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*&quot; | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
@ -433,7 +433,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
<li>I realized that it&rsquo;s easier to search Solr from curl via POST using this syntax:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;
</code></pre><ul>
<li>If the parameters include something like &ldquo;[0-9]&rdquo; then curl interprets it as a range and will make ten requests
<ul>
@ -441,7 +441,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
</ul>
</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
</code></pre><ul>
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I&rsquo;m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
</ul>
@ -450,7 +450,7 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-11-14-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -513,7 +513,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Most of the curl hits were from CIAT in mid-2019, where they were using <a href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</li>
</ul>
<pre><code>Guzzle/&lt;Guzzle_Version&gt; curl/&lt;curl_version&gt; PHP/&lt;PHP_VERSION&gt;
<pre tabindex="0"><code>Guzzle/&lt;Guzzle_Version&gt; curl/&lt;curl_version&gt; PHP/&lt;PHP_VERSION&gt;
</code></pre><ul>
<li>Run system updates on DSpace Test and reboot the server</li>
</ul>
@ -564,7 +564,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Buck is one I&rsquo;ve never heard of before, its user agent is:</li>
</ul>
<pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
<pre tabindex="0"><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
</code></pre><ul>
<li>All in all that&rsquo;s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
</ul>

View File

@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -142,14 +142,14 @@ Make sure all packages are up to date and the package manager is up to date, the
</ul>
</li>
</ul>
<pre><code># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
# apt-get autoremove &amp;&amp; apt-get autoclean
# dpkg -C
# reboot
</code></pre><ul>
<li>Take some backups:</li>
</ul>
<pre><code># dpkg -l &gt; 2019-12-01-linode18-dpkg.txt
<pre tabindex="0"><code># dpkg -l &gt; 2019-12-01-linode18-dpkg.txt
# tar czf 2019-12-01-linode18-etc.tar.gz /etc
</code></pre><ul>
<li>Then check all third-party repositories in /etc/apt to see if everything using &ldquo;xenial&rdquo; has packages available for &ldquo;bionic&rdquo; and then update the sources:</li>
@ -157,18 +157,18 @@ Make sure all packages are up to date and the package manager is up to date, the
<li>Pause the Uptime Robot monitoring for CGSpace</li>
<li>Make sure the update manager is installed and do the upgrade:</li>
</ul>
<pre><code># apt install update-manager-core
<pre tabindex="0"><code># apt install update-manager-core
# do-release-upgrade
</code></pre><ul>
<li>After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:</li>
</ul>
<pre><code># apt purge openjdk-11-jre-headless
<pre tabindex="0"><code># apt purge openjdk-11-jre-headless
# apt install 'nginx=1.16.1-1~bionic'
# reboot
</code></pre><ul>
<li>After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it&rsquo;s working:</li>
</ul>
<pre><code># rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
<pre tabindex="0"><code># rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
# rm -rf /opt/ilri/dspace-statistics-api/venv
# /opt/certbot-auto
</code></pre><ul>
@ -195,7 +195,7 @@ Make sure all packages are up to date and the package manager is up to date, the
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/cgspace-104030.xml
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/cgspace-104030.xml
$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/104030' &gt; /tmp/dspacetest-104030.xml
</code></pre><ul>
<li>The DSpace Test ones actually now capture the DOI, whereas the CGSpace ones don&rsquo;t&hellip;</li>
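<li>A quick way to eyeball the difference between the two responses (a sketch, using the files saved above):
<pre tabindex="0"><code>$ xmllint --format /tmp/cgspace-104030.xml &gt; /tmp/cgspace-104030-pretty.xml
$ xmllint --format /tmp/dspacetest-104030.xml &gt; /tmp/dspacetest-104030-pretty.xml
$ diff -u /tmp/cgspace-104030-pretty.xml /tmp/dspacetest-104030-pretty.xml
</code></pre>
</li>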
@ -209,7 +209,7 @@ $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPref
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
COPY 48
</code></pre><h2 id="2019-12-05">2019-12-05</h2>
<ul>
@ -288,13 +288,13 @@ COPY 48
<li>I looked into creating RTF documents from HTML in Node.js and there is a library called <a href="https://www.npmjs.com/package/html-to-rtf">html-to-rtf</a> that works well, but doesn&rsquo;t support images</li>
<li>Export a list of all investors (<code>dc.description.sponsorship</code>) for Peter to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.sponsor&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.sponsor&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
COPY 643
</code></pre><h2 id="2019-12-18">2019-12-18</h2>
<ul>
<li>Apply the investor corrections and deletions from Peter on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
</code></pre><ul>
<li>Peter asked about the &ldquo;Open Government Licence 3.0&rdquo; that is used by <a href="">some items</a>
@ -304,7 +304,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dsp
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
<pre tabindex="0"><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
text_value
-----------------------------
Open Government License 3.0
@ -321,7 +321,7 @@ UPDATE 2
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru
27320
</code></pre><ul>
<li>I see they <em>did</em> check <code>robots.txt</code> and their requests are only going to XMLUI item pages&hellip; so I guess I just leave them alone</li>
@ -338,12 +338,12 @@ UPDATE 2
<ul>
<li>I ran the <code>dspace cleanup</code> process on CGSpace (linode18) and had an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(179441) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is to delete that bitstream manually:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
UPDATE 1
</code></pre><ul>
<li>Adjust <a href="/cgspace-notes/cgspace-cgcorev2-migration/">CG Core v2 migration notes</a> to use <code>cg.review-status</code> instead of <code>cg.peer-reviewed</code>

View File

@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -166,17 +166,17 @@ I tweeted the CGSpace repository link
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
</ul>
<pre><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
<pre tabindex="0"><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
</code></pre><ul>
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
</ul>
<pre><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
<pre tabindex="0"><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
5227: &quot;Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &quot;
@ -190,7 +190,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
</code></pre><ul>
<li><del>According to the blog post linked above the troublesome character is probably the &ldquo;High Octect Preset&rdquo; (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
</ul>
<pre><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
<pre tabindex="0"><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
</code></pre><ul>
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it&rsquo;s stored incorrectly in the database&hellip;</li>
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like &ldquo;ž&rdquo; and &ldquo;é&rdquo; that <em>are</em> legitimate UTF-8 characters</li>
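<li>If a Windows-encoded copy were really needed, one workaround would be glibc iconv&rsquo;s transliteration suffix, so unmappable characters get approximated instead of aborting the conversion (a sketch):
<pre tabindex="0"><code>$ iconv -f utf-8 -t windows-1252//TRANSLIT /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
</code></pre>
</li>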
@ -207,7 +207,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
</ul>
</li>
</ul>
<pre><code>Exception: Read timed out
<pre tabindex="0"><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
<li>I am not sure how I will fix that shard&hellip;</li>
@ -225,7 +225,7 @@ java.net.SocketTimeoutException: Read timed out
</ul>
</li>
</ul>
<pre><code>In [7]: unicodedata.is_normalized('NFC', 'é')
<pre tabindex="0"><code>In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', 'é')
@ -235,7 +235,7 @@ Out[8]: True
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
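<li>The fix boils down to NFC normalization (as the <code>is_normalized</code> checks above suggest), which can be sanity-checked from the shell (a sketch; <code>is_normalized</code> needs Python 3.8+):
<pre tabindex="0"><code>$ python3 -c &quot;import unicodedata; s = 'e\u0301'; print(unicodedata.is_normalized('NFC', s), unicodedata.normalize('NFC', s))&quot;
False é
</code></pre>
</li>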
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.bioversity&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
@ -243,12 +243,12 @@ COPY 1325
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
<ul>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
@ -301,7 +301,7 @@ COPY 35
<ul>
<li>I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:</li>
</ul>
<pre><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
<pre tabindex="0"><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
</code></pre><ul>
<li>They started <a href="https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/">limiting public access to the database in December, 2019 due to GDPR and CCPA</a>
<ul>
@ -315,11 +315,11 @@ COPY 35
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
</code></pre><ul>
<li>Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
@ -331,7 +331,7 @@ $ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
@ -339,11 +339,11 @@ $ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dsp
</code></pre><ul>
<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
</ul>
<pre><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Then I generated a new list for Peter:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
</code></pre><ul>
<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author &ldquo;Hung, Nguyen&rdquo;
@ -352,7 +352,7 @@ COPY 6162
</ul>
</li>
</ul>
<pre><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
<pre tabindex="0"><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
46 hung-nguyen-ares-handles.txt
@ -374,7 +374,7 @@ $ wc -l hung-nguyen-a*handles.txt
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2020:0[12345678]&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2020:0[12345678]&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top two hosts according to the amount of data transferred are:
<ul>
@ -388,12 +388,12 @@ $ wc -l hung-nguyen-a*handles.txt
<li>They are apparently using this Drupal module to generate the thumbnails: <code>sites/all/modules/contrib/pdf_to_imagefield</code></li>
<li>I see some excellent suggestions in this <a href="https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589">ImageMagick thread from 2012</a> that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as <a href="https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/">this blog post</a>:</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
<pre tabindex="0"><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
</code></pre><ul>
<li>Here I&rsquo;m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using <code>-flatten</code> like DSpace already does</li>
<li>I did some tests with a modified version of the above that uses <code>-flatten</code> and drops the sampling-factor and colorspace, but bumps the image size up to 600px (the default on CGSpace is currently 300):</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
<pre tabindex="0"><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
$ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
@ -404,7 +404,7 @@ $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
<li>The file size is about double the old ones, but the quality is very good and it&rsquo;s still nowhere near ilri.org&rsquo;s 400KiB PNG!</li>
<li>Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
<pre tabindex="0"><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2020-01-26">2020-01-26</h2>
@ -422,11 +422,11 @@ $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db
</ul>
</li>
</ul>
<pre><code>$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
<pre tabindex="0"><code>$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
</code></pre><ul>
<li>One thing worth mentioning was this syntax for extracting bits from JSON in bash using <code>jq</code>:</li>
</ul>
<pre><code>$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
<pre tabindex="0"><code>$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;) | .retrieveLink'
&quot;/bitstreams/172559/retrieve&quot;
</code></pre><h2 id="2020-01-27">2020-01-27</h2>
@ -438,7 +438,7 @@ $ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;)
</ul>
</li>
</ul>
<pre><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
<pre tabindex="0"><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
</code></pre><ul>
<li>Now this appears to be a Solr limit of some kind (&ldquo;too many boolean clauses&rdquo;)
@ -453,7 +453,7 @@ org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError:
<ul>
<li>Generate a list of CIP subjects for Abenet:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.cip&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.cip&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
COPY 77
</code></pre><ul>
<li>Start looking over the IITA records from earlier this month (<a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a>)
@ -483,7 +483,7 @@ COPY 77
<ul>
<li>Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using old format:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
<pre tabindex="0"><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
@ -492,24 +492,24 @@ UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.sli
</code></pre><ul>
<li>I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT resource_id as &quot;id&quot;, text_value as &quot;dc.identifier.issn&quot; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT resource_id as &quot;id&quot;, text_value as &quot;dc.identifier.issn&quot; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
COPY 23339
</code></pre><ul>
<li>Then, after spending two hours correcting 1,000 ISSNs, I realized that I need to normalize the <code>text_lang</code> fields in the database first, or else these will all look like changes due to the mix of &ldquo;en_US&rdquo;, NULL, etc. (for both ISSN and ISBN):</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
UPDATE 30454
</code></pre><ul>
<li>Then I realized that my initial PostgreSQL query wasn&rsquo;t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when <code>dspace metadata-import</code> sees it, the change will be removed and added, or added and removed, depending on the order it is seen!</li>
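<li>For example, an item with two ISSNs comes out of that query as two rows with the same ID (hypothetical values shown here), which <code>dspace metadata-import</code> then treats as a remove plus an add:</li>
</ul>
<pre tabindex="0"><code>id,dc.identifier.issn
102353,0378-5955
102353,2077-0472
</code></pre><ul>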
<li>A better course of action is to select the distinct ones and then correct them using <code>fix-metadata-values.py</code>&hellip;</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.identifier.issn[en_US]&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.identifier.issn[en_US]&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
COPY 2900
</code></pre><ul>
<li>I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later</li>
<li>Then I applied 181 fixes for ISSNs using <code>fix-metadata-values.py</code> on DSpace Test and CGSpace (after testing locally):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
</code></pre><h2 id="2020-01-30">2020-01-30</h2>
<ul>
<li>About to start working on the DSpace 6 port and I&rsquo;m looking at commits that are in the not-yet-tagged DSpace 6.4:

View File

@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -138,11 +138,11 @@ The code finally builds and runs with a fresh install
<ul>
<li>Now we don&rsquo;t specify the build environment because site modifications are in <code>local.cfg</code>, so we just build like this:</li>
</ul>
<pre><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
</code></pre><ul>
<li>And it seems that we need to enable the <code>pgcrypto</code> extension now (used for UUIDs):</li>
</ul>
<pre><code>$ psql -h localhost -U postgres dspace63
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
CREATE EXTENSION pgcrypto;
</code></pre><ul>
@ -153,11 +153,11 @@ CREATE EXTENSION pgcrypto;
</ul>
</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
<pre><code>$ ~/dspace63/bin/dspace database migrate
<pre tabindex="0"><code>$ ~/dspace63/bin/dspace database migrate
Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
Migrating database to latest version... (Check dspace logs for details)
@ -225,7 +225,7 @@ Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatav
<li>A thread on the dspace-tech mailing list regarding this migration noticed that his database had some views created that were using the <code>resource_id</code> column</li>
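<li>A quick way to find any such leftover views is to search the view definitions for the old column name (a sketch using <code>information_schema</code>, not a query from the thread):</li>
</ul>
<pre tabindex="0"><code>dspace63=# SELECT table_name FROM information_schema.views WHERE view_definition LIKE '%resource_id%';
</code></pre><ul>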
<li>Our database had the same issue, where the <code>eperson_metadata</code> view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:</li>
</ul>
<pre><code>dspace63=# DROP VIEW eperson_metadata;
<pre tabindex="0"><code>dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
</code></pre><ul>
<li>After that the migration succeeded and DSpace starts up successfully and begins indexing
@ -252,7 +252,7 @@ DROP VIEW
</li>
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
</ul>
<pre><code>2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception:
<pre tabindex="0"><code>2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception:
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
@ -260,11 +260,11 @@ org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
<li>If I look in Solr&rsquo;s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now&hellip;</li>
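<li>A rough way to count the leftover integer IDs is a regex query on that field (a sketch, assuming <code>search.resourceid</code> is indexed as a string; <code>-g</code> stops curl from globbing the brackets):</li>
</ul>
<pre tabindex="0"><code>$ curl -sg 'http://localhost:8080/solr/search/select?q=search.resourceid:/[0-9]%2B/&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
</code></pre><ul>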
<li>I dropped all the documents in the search core:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
</code></pre><ul>
<li>Still didn&rsquo;t work, so I&rsquo;m going to try a clean database import and migration:</li>
</ul>
<pre><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
<pre tabindex="0"><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
@ -301,7 +301,7 @@ $ ~/dspace63/bin/dspace database migrate
</ul>
</li>
</ul>
<pre><code>$ git checkout -b 6_x-dev64 6_x-dev
<pre tabindex="0"><code>$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
</code></pre><ul>
<li>I finally understand why our themes show all the &ldquo;Browse by&rdquo; buttons on community and collection pages in DSpace 6.x
@ -321,7 +321,7 @@ $ git rebase -i upstream/dspace-6_x
<li>UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it</li>
<li>Testing Discovery indexing speed on my local DSpace 6.3:</li>
</ul>
<pre><code>$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3771.78s user 93.63s system 41% cpu 2:34:19.53 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3360.28s user 82.63s system 38% cpu 2:30:22.07 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 4678.72s user 138.87s system 42% cpu 3:08:35.72 total
@ -329,7 +329,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3334.19s user 86.54s s
</code></pre><ul>
<li>DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!</li>
</ul>
<pre><code>$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 299.53s user 69.67s system 20% cpu 30:34.47 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s system 19% cpu 29:01.38 total
</code></pre><ul>
@ -360,7 +360,7 @@ schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s syst
<li>I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6</li>
<li>I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:</li>
</ul>
<pre><code>$ podman pull postgres:10-alpine
<pre tabindex="0"><code>$ podman pull postgres:10-alpine
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
@ -379,29 +379,29 @@ dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the &ldquo;Jersey/2.6&rdquo; bot in CGSpace&rsquo;s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
<pre tabindex="0"><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s &quot;statistics-${year}&quot; -u http://localhost:8081/solr; done
</code></pre><ul>
<li>I noticed another user agent in the logs that we should add to the list:</li>
</ul>
<pre><code>ReactorNetty/0.9.2.RELEASE
<pre tabindex="0"><code>ReactorNetty/0.9.2.RELEASE
</code></pre><ul>
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb 6 09:26 /tmp/statistics-2019-01.json
</code></pre><ul>
<li>Then I tested importing this by creating a new core in my development environment:</li>
</ul>
<pre><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace/solr/statistics&amp;dataDir=/home/aorth/dspace/solr/statistics-2019/data'
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace/solr/statistics&amp;dataDir=/home/aorth/dspace/solr/statistics-2019/data'
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
</code></pre><ul>
<li>This imports the records into the core, but DSpace can&rsquo;t see them, and when I restart Tomcat the core is not seen by Solr&hellip;</li>
<li>I got the core to load by adding it to <code>dspace/solr/solr.xml</code> manually, ie:</li>
</ul>
<pre><code> &lt;cores adminPath=&quot;/admin/cores&quot;&gt;
<pre tabindex="0"><code> &lt;cores adminPath=&quot;/admin/cores&quot;&gt;
...
&lt;core name=&quot;statistics&quot; instanceDir=&quot;statistics&quot; /&gt;
&lt;core name=&quot;statistics-2019&quot; instanceDir=&quot;statistics&quot;&gt;
@ -415,11 +415,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download
<li>Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:</li>
<li>First, create a Solr statistics core using the DSpace 7 config:</li>
</ul>
<pre><code>$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
<pre tabindex="0"><code>$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
</code></pre><ul>
<li>Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>OK that imported! I wonder if it works&hellip; maybe I&rsquo;ll try another day</li>
</ul>
@ -433,7 +433,7 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download
</ul>
</li>
</ul>
<pre><code>$ cd ~/src/git/perf-map-agent
<pre tabindex="0"><code>$ cd ~/src/git/perf-map-agent
$ cmake .
$ make
$ ./bin/create-links-in ~/.local/bin
@ -467,7 +467,7 @@ $ perf-java-flames 11359
<ul>
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<pre><code># CGSpace 5.8
<pre tabindex="0"><code># CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
@ -483,7 +483,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s
</code></pre><ul>
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering out the java process:</li>
</ul>
<pre><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
<pre tabindex="0"><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS=&quot;-XX:+PreserveFramePointer&quot;
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &amp;
@ -525,14 +525,14 @@ $ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' |
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-02-11-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-02-11-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
</code></pre><ul>
<li>On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
<ul>
@ -540,7 +540,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Staver, Charles&quot;,charles staver: 0000-0002-4532-6077
&quot;Staver, C.&quot;,charles staver: 0000-0002-4532-6077
&quot;Fungo, R.&quot;,Robert Fungo: 0000-0002-4264-6905
@ -556,7 +556,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</code></pre><ul>
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 144 ORCID iDs to items on CGSpace!</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
@ -575,7 +575,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>Peter asked me to update John McIntire&rsquo;s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
@ -607,12 +607,12 @@ UPDATE 26
<ul>
<li>I see a new spider in the nginx logs on CGSpace:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
<pre tabindex="0"><code>Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
</code></pre><ul>
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least&hellip;</li>
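<li>A quick way to check is to grep a local checkout of the list for it (the file name here is an assumption):</li>
</ul>
<pre tabindex="0"><code>$ grep -i linespider ~/src/git/COUNTER-Robots/COUNTER_Robots_list.json
</code></pre><ul>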
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
</code></pre><ul>
<li>Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent</li>
<li>I will add the IP addresses to the nginx badbots list</li>
@ -622,7 +622,7 @@ UPDATE 26
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=dns:/squeeze3.bronco.co.uk./&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=dns:/squeeze3.bronco.co.uk./&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;4&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;dns:/squeeze3.bronco.co.uk./&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;86044&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
@ -641,7 +641,7 @@ UPDATE 26
</li>
<li>I will purge them from each core one by one, ie:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:squeeze3.bronco.co.uk.&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)</li>
@ -654,12 +654,12 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=tru
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(183996) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code># su - postgres
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
UPDATE 1
</code></pre><ul>
@ -671,7 +671,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
</code></pre><ul>
<li>For some reason the Atmire Content and Usage Analysis (CUA) module&rsquo;s Usage Statistics is drawing blank graphs
<ul>
@ -679,7 +679,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre><code>2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
<pre tabindex="0"><code>2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not
initialize class org.jfree.chart.JFreeChart
</code></pre><ul>
@ -694,11 +694,11 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</li>
<li>I copied the <code>jfreechart-1.0.5.jar</code> file to the Tomcat lib folder and then there was a different error when I loaded Atmire CUA:</li>
</ul>
<pre><code>2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request! org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
<pre tabindex="0"><code>2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request! org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
</code></pre><ul>
<li>Some search results suggested commenting out the following line in <code>/etc/java-8-openjdk/accessibility.properties</code>:</li>
</ul>
<pre><code>assistive_technologies=org.GNOME.Accessibility.AtkWrapper
<pre tabindex="0"><code>assistive_technologies=org.GNOME.Accessibility.AtkWrapper
</code></pre><ul>
<li>After removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test&hellip;
<ul>
@ -708,7 +708,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</ul>
</li>
</ul>
<pre><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
<pre tabindex="0"><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
@ -724,7 +724,7 @@ dspace.log.2020-01-21:4
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics&hellip;</li>
<li>On an unrelated note, something weird is going on: I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia&rsquo;s AReS explorer, which should only be using REST and therefore should not generate any Solr statistics&hellip;?</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/select&quot; -d &quot;q=ip:34.218.226.147&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/select&quot; -d &quot;q=ip:34.218.226.147&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;811&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;5536097&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
@ -732,7 +732,7 @@ dspace.log.2020-01-21:4
</code></pre><ul>
<li>And there are apparently two million from last month (2020-01):</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=ip:34.218.226.147&amp;fq=dateYearMonth:2020-01&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=ip:34.218.226.147&amp;fq=dateYearMonth:2020-01&amp;rows=0&quot;
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;response&gt;
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;248&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:34.218.226.147&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2020-01&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;2173455&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
@ -740,7 +740,7 @@ dspace.log.2020-01-21:4
</code></pre><ul>
<li>But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>
</ul>
<pre><code># zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
<pre tabindex="0"><code># zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
84322
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
84322
@ -758,7 +758,7 @@ dspace.log.2020-01-21:4
</li>
<li>Anyways, I faceted by IP in 2020-01 and see:</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-01&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip'
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-01&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip'
...
&quot;172.104.229.92&quot;,2686876,
&quot;34.218.226.147&quot;,2173455,
@ -769,19 +769,19 @@ dspace.log.2020-01-21:4
<li>Surprise surprise, the top two IPs are from AReS servers&hellip; wtf.</li>
<li>The next three are from Online in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>And all the same three are already inflating the statistics for 2020-02&hellip; hmmm.</li>
<li>I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests&hellip;</li>
<li>Shiiiiit, I see 84,000 requests from the AReS IP today alone:</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true'
...
&quot;response&quot;:{&quot;numFound&quot;:84594,&quot;start&quot;:0,&quot;docs&quot;:[]
</code></pre><ul>
<li>Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:</li>
</ul>
<pre><code> &quot;2a01:7e00::f03c:91ff:fe9a:3a37&quot;,35512,
<pre tabindex="0"><code> &quot;2a01:7e00::f03c:91ff:fe9a:3a37&quot;,35512,
&quot;2a01:7e00::f03c:91ff:fe18:7396&quot;,26155,
</code></pre><ul>
<li>I need to try to make some requests for these URLs and observe if they make a statistics hit:
@ -793,7 +793,7 @@ dspace.log.2020-01-21:4
<li>Those are the requests AReS and ILRI servers are making&hellip; nearly 150,000 per day!</li>
<li>Well that settles it!</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&amp;fq=ip:78.128.99.24&amp;rows=10&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:12,&quot;start&quot;:0,&quot;docs&quot;:[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
@ -817,12 +817,12 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+s
<li>I tried to add the IPs to our nginx IP bot mapping but it doesn&rsquo;t seem to work&hellip; WTF, why is everything broken?!</li>
<li>Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:</li>
</ul>
<pre><code>$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&amp;rows=0&amp;wt=json&amp;indent=true' | grep numFound
&quot;response&quot;:{&quot;numFound&quot;:42395486,&quot;start&quot;:0,&quot;docs&quot;:[]
</code></pre><ul>
<li>I modified my <code>check-spider-hits.sh</code> script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
@ -856,11 +856,11 @@ Total number of bot hits purged: 5535399
</ul>
</li>
</ul>
<pre><code>add_header X-debug-message &quot;ua is $ua&quot; always;
<pre tabindex="0"><code>add_header X-debug-message &quot;ua is $ua&quot; always;
</code></pre><ul>
<li>Then in the HTTP response you see:</li>
</ul>
<pre><code>X-debug-message: ua is bot
<pre tabindex="0"><code>X-debug-message: ua is bot
</code></pre><ul>
<li>So the IP to bot mapping is working, phew.</li>
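<li>For reference, the IP-to-bot mapping is roughly of this form (a sketch of the idea, not the actual CGSpace nginx configuration):</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
    default            $http_user_agent;
    34.218.226.147     'bot';
    172.104.229.92     'bot';
}
</code></pre><ul>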
<li>More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
@ -880,7 +880,7 @@ Total number of bot hits purged: 5535399
<li>These IPs are all active in the REST API logs over the last few months and they account for <em>thirty-four million</em> more hits in the statistics!</li>
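<li>The purge is effectively a Solr delete-by-query per IP, roughly like this (a sketch of the idea, not the actual <code>check-spider-ip-hits.sh</code> script):</li>
</ul>
<pre tabindex="0"><code>$ while read -r ip; do curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;ip:${ip}&lt;/query&gt;&lt;/delete&gt;&quot;; done &lt; /tmp/ips.txt
</code></pre><ul>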
<li>I purged them from CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
@ -910,7 +910,7 @@ Total number of bot hits purged: 1752548
</li>
<li>The client at 3.225.28.105 is using the following user agent:</li>
</ul>
<pre><code>Apache-HttpClient/4.3.4 (java 1.5)
<pre tabindex="0"><code>Apache-HttpClient/4.3.4 (java 1.5)
</code></pre><ul>
<li>But I don&rsquo;t see any hits for it in the statistics core for some reason</li>
<li>Looking more into the 2015 statistics I see some questionable IPs:
@ -925,7 +925,7 @@ Total number of bot hits purged: 1752548
</li>
<li>For the IPs I purged them using <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics
@ -966,7 +966,7 @@ Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn&rsquo;t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;dns:*crawl.baidu.com.&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Jesus, the more I keep looking, the more I see ridiculous stuff&hellip;</li>
<li>In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network&hellip;
@ -982,7 +982,7 @@ Total number of bot hits purged: 2228
<li>Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130</li>
<li>I purged a bunch more from all cores:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
@ -1024,7 +1024,7 @@ Total number of bot hits purged: 14110
<li>Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like &ldquo;Microsoft Office Word 2014&rdquo;</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\&quot; '{print $6}' | sort | uniq -c | sort -h
<pre tabindex="0"><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\&quot; '{print $6}' | sort | uniq -c | sort -h
1 Microsoft Office Word 2014
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
@ -1038,7 +1038,7 @@ Total number of bot hits purged: 14110
</code></pre><ul>
<li>I see lots of requests coming from the following user agents:</li>
</ul>
<pre><code>&quot;Apache-HttpClient/4.5.7 (Java/11.0.3)&quot;
<pre tabindex="0"><code>&quot;Apache-HttpClient/4.5.7 (Java/11.0.3)&quot;
&quot;Apache-HttpClient/4.5.7 (Java/11.0.2)&quot;
&quot;LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)&quot;
&quot;EventMachine HttpClient&quot;
@ -1054,7 +1054,7 @@ Total number of bot hits purged: 14110
</li>
<li>More weird user agents in 2019:</li>
</ul>
<pre><code>ecolink (+https://search.ecointernet.org/)
<pre tabindex="0"><code>ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
@ -1062,12 +1062,12 @@ EcoInternet http://ecointernet.org/
<ul>
<li>And what&rsquo;s the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I&rsquo;m purging them all</li>
<li>I looked deeper in the Solr statistics and found a bunch more weird user agents:</li>
</ul>
<pre><code>LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
<pre tabindex="0"><code>LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
@ -1098,13 +1098,13 @@ HTTPie/1.0.2
</ul>
</li>
</ul>
<pre><code>Link.?Check
<pre tabindex="0"><code>Link.?Check
Http.?Client
ecointernet
</code></pre><ul>
<li>That removes another 500,000 or so:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
@ -1171,12 +1171,12 @@ Total number of bot hits purged: 159
</li>
<li>I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-util.log.$(date --iso-8601)
</code></pre><ul>
<li>Interestingly I saw this in the Solr log:</li>
</ul>
<pre><code>2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
<pre tabindex="0"><code>2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&amp;name=statistics-2019&amp;action=CREATE&amp;instanceDir=statistics&amp;wt=javabin&amp;version=2} status=0 QTime=590
</code></pre><ul>
<li>The process has been going for several hours now and I suspect it will fail eventually
@ -1186,7 +1186,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-ut
</li>
<li>Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:</li>
</ul>
<pre><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace63/solr/statistics&amp;dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&amp;name=statistics-2019&amp;instanceDir=/home/aorth/dspace63/solr/statistics&amp;dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
</code></pre><ul>
<li>After that the <code>statistics-2019</code> core was immediately available in the Solr UI, but after restarting Tomcat it was gone
<ul>
@ -1195,11 +1195,11 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s &gt;&gt; log/cron-stats-ut
</li>
<li>First export a small slice of 2019 stats from the main CGSpace <code>statistics</code> core, skipping Atmire schema additions:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>Then import into my local <code>statistics</code> core:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
$ ~/dspace63/bin/dspace stats-util -s
Moving: 21993 into core statistics-2019
</code></pre><ul>
@ -1226,7 +1226,7 @@ Moving: 21993 into core statistics-2019
</ul>
</li>
</ul>
<pre><code>&lt;meta content=&quot;Thu hoạch v&amp;agrave; bảo quản c&amp;agrave; ph&amp;ecirc; ch&amp;egrave; đ&amp;uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)&quot; name=&quot;citation_title&quot;&gt;
<pre tabindex="0"><code>&lt;meta content=&quot;Thu hoạch v&amp;agrave; bảo quản c&amp;agrave; ph&amp;ecirc; ch&amp;egrave; đ&amp;uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)&quot; name=&quot;citation_title&quot;&gt;
&lt;meta name=&quot;citation_title&quot; content=&quot;Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)&quot; /&gt;
</code></pre><ul>
<li><a href="https://jira.lyrasis.org/browse/DS-4397">DS-4397 controlled vocabulary loading speedup</a></li>
@ -1250,7 +1250,7 @@ Moving: 21993 into core statistics-2019
</li>
<li>I added some debugging to the Solr core loading in DSpace 6.4-SNAPSHOT (<code>SolrLoggerServiceImpl.java</code>) and I see this when DSpace starts up now:</li>
</ul>
<pre><code>2020-02-27 12:26:35,695 INFO org.dspace.statistics.SolrLoggerServiceImpl @ Alan Ping of Solr Core [statistics-2019] Failed with [org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException]. New Core Will be Created
<pre tabindex="0"><code>2020-02-27 12:26:35,695 INFO org.dspace.statistics.SolrLoggerServiceImpl @ Alan Ping of Solr Core [statistics-2019] Failed with [org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException]. New Core Will be Created
</code></pre><ul>
<li>When I check Solr I see the <code>statistics-2019</code> core loaded (from <code>stats-util -s</code> yesterday, not manually created)</li>
</ul>

View File

@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -141,7 +141,7 @@ You need to download this into the DSpace 6.x source and compile it
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</code></pre><h2 id="2020-03-03">2020-03-03</h2>
<ul>
@ -160,7 +160,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
</code></pre><ul>
<li>But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it&hellip;</li>
<li>Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs</li>
@ -177,7 +177,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
<li>I want to try to consolidate our yearly Solr statistics cores back into one <code>statistics</code> core using the solr-import-export-json tool</li>
<li>I will try it on DSpace Test, doing one year at a time:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;time:2010*&lt;/query&gt;&lt;/delete&gt;&quot;
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
@ -196,7 +196,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
</code></pre><ul>
<li>Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
<ul>
@ -204,7 +204,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code># apt install postgresql-10 postgresql-contrib-10
<pre tabindex="0"><code># apt install postgresql-10 postgresql-contrib-10
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
@ -232,11 +232,11 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
<pre tabindex="0"><code>Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
</code></pre><ul>
<li>It seems to only be a problem in the last week:</li>
</ul>
<pre><code># zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
<pre tabindex="0"><code># zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
/var/log/nginx/rest.log.1:0
/var/log/nginx/rest.log.2:0
/var/log/nginx/rest.log.3:0
@ -250,22 +250,22 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
<li>In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean</li>
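<li>This usually means the <code>X-Forwarded-For</code> header from nginx is not making it to the statistics logger; the relevant settings are DSpace&rsquo;s <code>useProxies</code> option and/or Tomcat&rsquo;s <code>RemoteIpValve</code> (a sketch of the kind of configuration involved, not verified against this server):</li>
</ul>
<pre tabindex="0"><code># dspace.cfg
useProxies = true

# Tomcat server.xml
&lt;Valve className=&quot;org.apache.catalina.valves.RemoteIpValve&quot; remoteIpHeader=&quot;X-Forwarded-For&quot; /&gt;
</code></pre><ul>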
<li>I will purge them from Solr statistics:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)&quot;&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)&quot;&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
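<li>Before purging by user agent it can be useful to count the matching hits first, for example (a sketch in the same style as the other Solr queries in these notes):</li>
</ul>
<pre tabindex="0"><code># hypothetical sanity check: count hits for this user agent before purging
$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; --data-urlencode 'q=userAgent:&quot;Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)&quot;' -d 'rows=0' | grep -oE 'numFound=&quot;[0-9]+&quot;'
</code></pre><ul>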
<li>Another user agent that seems to be a bot is:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx&rsquo;s logs I see it belongs to three IPs on Online.net in France:</li>
</ul>
<pre><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
<pre tabindex="0"><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
63090 163.172.68.99
183428 163.172.70.248
147608 163.172.71.24
</code></pre><ul>
<li>It is making 10,000 to 40,000 requests to XMLUI per day&hellip;</li>
</ul>
<pre><code># zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
<pre tabindex="0"><code># zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
@ -284,7 +284,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</code></pre><ul>
<li>I will purge those hits too!</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&quot;&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)&quot;&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>Shit, and something happened and a few thousand hits from user agents with &ldquo;Bot&rdquo; in their user agent got through
<ul>
@ -292,7 +292,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f /tmp/bots -d -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: Citoid
Purging 11 hits from Citoid in statistics
@ -337,7 +337,7 @@ Purging 62 hits from [Ss]pider in statistics
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
</code></pre><ul>
<li>Then I exported the metadata from DSpace Test and imported it into OpenRefine
<ul>
@ -346,7 +346,7 @@ Purging 62 hits from [Ss]pider in statistics
</li>
<li>I exported a new list of affiliations from the database, added line numbers with <code>csvcut</code>, and then validated them in OpenRefine using <code>reconcile-csv</code>:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
dspace=# \q
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' &gt; /tmp/affiliations.csv
$ lein run /tmp/affiliations.csv name id
@ -417,14 +417,14 @@ $ lein run /tmp/affiliations.csv name id
<li>Update Tomcat to version 7.0.103 in the Ansible infrastructure playbooks and deploy on DSpace Test (linode26)</li>
<li>Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-03-26-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-03-26-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;King, Brian&quot;,&quot;Brian King: 0000-0002-7056-9214&quot;
&quot;Ortiz-Crespo, Berta&quot;,&quot;Berta Ortiz-Crespo: 0000-0002-6664-0815&quot;
&quot;Ekesa, Beatrice&quot;,&quot;Beatrice Ekesa: 0000-0002-2630-258X&quot;
@ -434,7 +434,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</code></pre><ul>
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 32 ORCID iDs to items on CGSpace!</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
<ul>
@ -449,7 +449,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ul>
<li>Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Snook, L.K.&quot;,&quot;Laura Snook: 0000-0002-9168-1301&quot;
&quot;Snook, L.&quot;,&quot;Laura Snook: 0000-0002-9168-1301&quot;
&quot;Zheng, S.J.&quot;,&quot;Sijun Zheng: 0000-0003-1550-3738&quot;

View File

@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -171,23 +171,23 @@ On the same note, the one item Abenet pointed out last week now has a donut with
</ul>
</li>
</ul>
<pre><code>$ psql -h localhost -U postgres dspace -c &quot;DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';&quot;
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspace -c &quot;DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';&quot;
DELETE 97
$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>I used this CSV with the script (all records with his name have the name standardized like this):</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Ballantyne, Peter G.&quot;,&quot;Peter G. Ballantyne: 0000-0001-9346-2893&quot;
</code></pre><ul>
<li>Then I tried another way, to identify all duplicate ORCID identifiers for a given resource ID and group them so I can see if count is greater than 1:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT(resource_id, text_value) as distinct_orcid, COUNT(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 240 GROUP BY distinct_orcid ORDER BY count DESC) TO /tmp/2020-04-07-duplicate-orcids.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT(resource_id, text_value) as distinct_orcid, COUNT(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 240 GROUP BY distinct_orcid ORDER BY count DESC) TO /tmp/2020-04-07-duplicate-orcids.csv WITH CSV HEADER;
COPY 15209
</code></pre><ul>
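<li>To see only the rows that actually have duplicates, the export can be filtered for counts greater than one, for example with csvkit (a sketch; assumes the <code>count</code> column produced by the query above):</li>
</ul>
<pre tabindex="0"><code># hypothetical filter: keep rows where count is 2 or more
$ csvgrep -c count -r '^([2-9]|[1-9][0-9]+)$' /tmp/2020-04-07-duplicate-orcids.csv | csvlook
</code></pre><ul>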
<li>Of those, about nine authors had duplicate ORCID identifiers over about thirty records, so I created a CSV with all their name variations and ORCID identifiers:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Ballantyne, Peter G.&quot;,&quot;Peter G. Ballantyne: 0000-0001-9346-2893&quot;
&quot;Ramirez-Villegas, Julian&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
&quot;Villegas-Ramirez, J&quot;,&quot;Julian Ramirez-Villegas: 0000-0002-8044-583X&quot;
@ -207,12 +207,12 @@ COPY 15209
</code></pre><ul>
<li>Then I deleted <em>all</em> their existing ORCID identifier records:</li>
</ul>
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
DELETE 994
</code></pre><ul>
<li>And then I added them again using the <code>add-orcid-identifiers</code> records:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>I ran the fixes on DSpace Test and CGSpace as well</li>
<li>I started testing the <a href="https://github.com/ilri/DSpace/pull/445">pull request</a> sent by Atmire yesterday
@ -230,7 +230,7 @@ DELETE 994
</ul>
</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
dspace63=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>Then DSpace 6.3 started up OK and I was able to see some statistics in the Content and Usage Analysis (CUA) module, but not on community, collection, or item pages
@ -239,11 +239,11 @@ dspace63=# CREATE EXTENSION pgcrypto;
</ul>
</li>
</ul>
<pre><code>2020-04-12 16:34:33,363 ERROR com.atmire.dspace.app.xmlui.aspect.statistics.editorparts.DataTableTransformer @ java.lang.IllegalArgumentException: Invalid UUID string: 1
<pre tabindex="0"><code>2020-04-12 16:34:33,363 ERROR com.atmire.dspace.app.xmlui.aspect.statistics.editorparts.DataTableTransformer @ java.lang.IllegalArgumentException: Invalid UUID string: 1
</code></pre><ul>
<li>And I remembered I actually need to run the DSpace 6.4 Solr UUID migrations:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
</code></pre><ul>
<li>Run system updates on DSpace Test (linode26) and reboot it</li>
@ -258,7 +258,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
<li>I realized that <code>solr-upgrade-statistics-6x</code> only processes 100,000 records by default so I think we actually need to finish running it for all legacy Solr records before asking Atmire why CUA statlets and detailed statistics aren&rsquo;t working</li>
<li>For now I am just doing 250,000 records at a time on my local environment:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx2000m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx2000m -Dfile.encoding=UTF-8&quot;
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
</code></pre><ul>
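<li>To check how many legacy records remain after each pass, a count query along these lines can be used (a sketch that reuses the same non-UUID filter that appears later in these notes; adjust the Solr port for the environment):</li>
</ul>
<pre tabindex="0"><code># hypothetical progress check: count records whose id is not a UUID and not yet marked unmigrated
$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; --data-urlencode 'q=(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)' -d 'rows=0' | grep -oE 'numFound=&quot;[0-9]+&quot;'
</code></pre><ul>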
<li>Despite running the migration for all of my local 1.5 million Solr records, I still see a few hundred thousand like <code>-1</code> and <code>0-unmigrated</code>
@ -269,14 +269,14 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
</ul>
</li>
</ul>
<pre><code>/** DSpace site type */
<pre tabindex="0"><code>/** DSpace site type */
public static final int SITE = 5;
</code></pre><ul>
<li>Even after deleting those documents and re-running <code>solr-upgrade-statistics-6x</code> I still get the UUID errors when using CUA and the statlets</li>
<li>I have sent some feedback and questions to Atmire (including about the issue with glyphicons in the header trail)</li>
<li>In other news, my local Artifactory container stopped working for some reason so I re-created it and it seems some things have changed upstream (port 8082 for web UI?):</li>
</ul>
<pre><code>$ podman rm artifactory
<pre tabindex="0"><code>$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory
@ -284,7 +284,7 @@ $ podman start artifactory
<ul>
<li>A few days ago Peter asked me to update an author&rsquo;s name on CGSpace and in the controlled vocabularies:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
</code></pre><ul>
<li>I updated his existing records on CGSpace, changed the controlled lists, added his ORCID identifier to the controlled list, and tagged his thirty-nine items with the ORCID iD</li>
<li>The new DSpace 6 stuff that Atmire sent modifies Mirage 2&rsquo;s <code>pom.xml</code> to copy each theme&rsquo;s resulting <code>node_modules</code> into the theme after building and installing with <code>ant update</code>, because they moved some packages from Bower to npm and now reference them in <code>page-structure.xsl</code>
@ -315,7 +315,7 @@ $ podman start artifactory
<ul>
<li>Looking into a high rate of outgoing bandwidth from yesterday on CGSpace (linode18):</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Apr/2020:0[6789]&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Apr/2020:0[6789]&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>One host in Russia (91.241.19.70) downloaded 23GiB over those few hours in the morning
<ul>
@ -323,18 +323,18 @@ $ podman start artifactory
</ul>
</li>
</ul>
<pre><code># grep -c 91.241.19.70 /var/log/nginx/access.log.1
<pre tabindex="0"><code># grep -c 91.241.19.70 /var/log/nginx/access.log.1
8900
# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
8900
</code></pre><ul>
<li>I thought the host might have been Yandex misbehaving, but its user agent is:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_3; nl-nl) AppleWebKit/527 (KHTML, like Gecko) Version/3.1.1 Safari/525.20
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_3; nl-nl) AppleWebKit/527 (KHTML, like Gecko) Version/3.1.1 Safari/525.20
</code></pre><ul>
<li>I will purge that IP from the Solr statistics using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -d -f /tmp/ip -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -d -f /tmp/ip -p
(DEBUG) Using spider IPs file: /tmp/ip
(DEBUG) Checking for hits from spider IP: 91.241.19.70
Purging 8909 hits from 91.241.19.70 in statistics
@ -343,11 +343,11 @@ Total number of bot hits purged: 8909
</code></pre><ul>
<li>While investigating that I noticed ORCID identifiers missing from a few authors&rsquo; names, so I added them with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>The contents of <code>2020-04-20-add-orcids.csv</code> was:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
&quot;Schut, Marc&quot;,&quot;Marc Schut: 0000-0002-3361-4581&quot;
&quot;Schut, M.&quot;,&quot;Marc Schut: 0000-0002-3361-4581&quot;
&quot;Kamau, G.&quot;,&quot;Geoffrey Kamau: 0000-0002-6995-4801&quot;
@ -387,17 +387,17 @@ Total number of bot hits purged: 8909
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(184980) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
UPDATE 1
</code></pre><ul>
<li>I spent some time working on the XMLUI themes in DSpace 6
@ -412,7 +412,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre><code>.breadcrumb &gt; li + li:before {
<pre tabindex="0"><code>.breadcrumb &gt; li + li:before {
content: &quot;/\00a0&quot;;
}
</code></pre><h2 id="2020-04-27">2020-04-27</h2>
@ -421,7 +421,7 @@ UPDATE 1
<li>My changes to DSpace XMLUI Mirage 2 build process mean that we don&rsquo;t need Ruby gems at all anymore! We can completely build without them!</li>
<li>Trying to test the <code>com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI</code> script but there is an error:</li>
</ul>
<pre><code>Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered &quot; &quot;}&quot; &quot;} &quot;&quot; at line 1, column 32.
<pre tabindex="0"><code>Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered &quot; &quot;}&quot; &quot;} &quot;&quot; at line 1, column 32.
Was expecting one of:
&quot;TO&quot; ...
&lt;RANGE_QUOTED&gt; ...
@ -429,7 +429,7 @@ Was expecting one of:
</code></pre><ul>
<li>Seems something is wrong with the variable interpolation, and I see two configurations in the <code>atmire-cua.cfg</code> file:</li>
</ul>
<pre><code>atmire-cua.cua.version.number=${cua.version.number}
<pre tabindex="0"><code>atmire-cua.cua.version.number=${cua.version.number}
atmire-cua.version.number=${cua.version.number}
</code></pre><ul>
<li>I sent a message to Atmire to check</li>
@ -473,7 +473,7 @@ atmire-cua.version.number=${cua.version.number}
</ul>
</li>
</ul>
<pre><code>Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
<pre tabindex="0"><code>Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
@ -508,7 +508,7 @@ Caused by: java.lang.NullPointerException
</ul>
</li>
</ul>
<pre><code>$ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
<pre tabindex="0"><code>$ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
@ -524,20 +524,20 @@ Caused by: java.lang.NullPointerException
<ul>
<li>Database connections do seem high:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
6 dspaceCli
88 dspaceWeb
</code></pre><ul>
<li>Most of those are idle in transaction:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c &quot;idle in transaction&quot;
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c &quot;idle in transaction&quot;
67
</code></pre><ul>
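<li>To see how long those transactions have been idle, <code>pg_stat_activity</code> can be queried directly, for example (a sketch, not a query from the original notes):</li>
</ul>
<pre tabindex="0"><code># hypothetical check: age of the oldest idle-in-transaction connections
$ psql -c &quot;SELECT pid, usename, age(clock_timestamp(), query_start) AS idle_age FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY idle_age DESC LIMIT 10;&quot;
</code></pre><ul>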
<li>I don&rsquo;t see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong&hellip; I think the solution to clear these idle connections is probably to just restart Tomcat</li>
<li>I looked at the Solr stats for this month and saw lots of suspicious IPs:</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-04&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&amp;fq=dateYearMonth:2020-04&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip
&quot;88.99.115.53&quot;,23621, # Hetzner, using XMLUI and REST API with no user agent
&quot;104.154.216.0&quot;,11865,# Google cloud, scraping XMLUI with no user agent
@ -555,13 +555,13 @@ Caused by: java.lang.NullPointerException
<li>I need to start blocking requests without a user agent&hellip;</li>
<li>I purged these user agents using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code>$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
<pre tabindex="0"><code>$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
</code></pre><ul>
<li>Then I added a few of them to the bot mapping in the nginx config because it appears they have been harvesting regularly since 2018</li>
<li>Looking through the Solr stats faceted by the <code>userAgent</code> field I see some interesting ones:</li>
</ul>
<pre><code>$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=userAgent'
<pre tabindex="0"><code>$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&amp;rows=0&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=userAgent'
...
&quot;Delphi 2009&quot;,50725,
&quot;OgScrper/1.0.0&quot;,12421,
@ -580,13 +580,13 @@ $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
<li>I don&rsquo;t know why, but my <code>check-spider-hits.sh</code> script doesn&rsquo;t seem to handle the user agents with spaces properly, so I will delete those manually afterwards</li>
<li>First delete the ones without spaces, creating a temp file in <code>/tmp/agents</code> containing the patterns:</li>
</ul>
<pre><code>$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
<pre tabindex="0"><code>$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
</code></pre><ul>
<li>That&rsquo;s about 300,000 hits purged&hellip;</li>
<li>Then remove the ones with spaces manually, checking the query syntax first, then deleting in yearly cores and the statistics core:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Delphi 2009/&amp;rows=0&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Delphi 2009/&amp;rows=0&quot;
...
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;52&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;userAgent:/Delphi 2009/&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;38760&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
$ for year in {2010..2019}; do curl -s &quot;http://localhost:8081/solr/statistics-$year/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;userAgent:&quot;Delphi 2009&quot;&lt;/query&gt;&lt;/delete&gt;'; done
@ -606,7 +606,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quo
</ul>
</li>
</ul>
<pre><code># mv /etc/letsencrypt /etc/letsencrypt.bak
<pre tabindex="0"><code># mv /etc/letsencrypt /etc/letsencrypt.bak
# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook &quot;/bin/systemctl stop nginx&quot; --post-hook &quot;/bin/systemctl start nginx&quot;
# /opt/certbot-auto revoke --cert-path /etc/letsencrypt.bak/live/dspacetest.cgiar.org/cert.pem
# rm -rf /etc/letsencrypt.bak
@ -618,7 +618,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quo
<ul>
<li>But I don&rsquo;t see a lot of connections in PostgreSQL itself:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
6 dspaceCli
14 dspaceWeb
@ -636,7 +636,7 @@ $ psql -c 'select * from pg_stat_activity' | wc -l
<ul>
<li>The PostgreSQL log shows a lot of errors about deadlocks and queries waiting on other processes&hellip;</li>
</ul>
<pre><code>ERROR: deadlock detected
<pre tabindex="0"><code>ERROR: deadlock detected
</code></pre><!-- raw HTML omitted -->

View File

@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -166,7 +166,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;07/May/2020:(01|03|04)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;07/May/2020:(01|03|04)&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
<ul>
@ -176,7 +176,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
</ul>
</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
Purging 171641 hits from 212.34.8.188 in statistics
Purging 20691 hits from 188.134.31.88 in statistics
@ -209,7 +209,7 @@ Total number of bot hits purged: 192332
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-11-add-orcids.csv
<pre tabindex="0"><code>$ cat 2020-05-11-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Lutakome, P.&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
&quot;Lutakome, Pius&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
@ -263,7 +263,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-19-add-orcids.csv
<pre tabindex="0"><code>$ cat 2020-05-19-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Bahta, Sirak T.&quot;,&quot;Sirak Bahta: 0000-0002-5728-2489&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
@ -298,7 +298,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-25-add-orcids.csv
<pre tabindex="0"><code>$ cat 2020-05-25-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Díaz, Manuel F.&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
&quot;Díaz, Manuel Francisco&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
@ -327,7 +327,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log.1 | grep -E &quot;29/May/2020:(02|03|04|05)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 | grep -E &quot;29/May/2020:(02|03|04|05)&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top is 172.104.229.92, which is the AReS harvester (still not using a user agent, but it&rsquo;s tagged as a bot in the nginx mapping)</li>
<li>Second is 188.134.31.88, which is a Russian host that we also saw in the last few weeks, using a browser user agent and hitting the XMLUI (but it is tagged as a bot in nginx as well)</li>
@ -358,7 +358,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre><code>$ sudo su - postgres
<pre tabindex="0"><code>$ sudo su - postgres
$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest superuser;'
@ -372,14 +372,14 @@ $ exit
</code></pre><ul>
<li>Now switch to the DSpace 6.x branch and start a build:</li>
</ul>
<pre><code>$ chrt -i 0 ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false package
<pre tabindex="0"><code>$ chrt -i 0 ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false package
...
[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:6.3: Failed to collect dependencies at com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Failed to read artifact descriptor for com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Could not transfer artifact com.atmire:atmire-listings-and-reports-api:pom:6.x-2.10.8-0-SNAPSHOT from/to atmire.com-snapshots (https://atmire.com/artifactory/atmire.com-snapshots): Not authorized , ReasonPhrase:Unauthorized. -&gt; [Help 1]
</code></pre><ul>
<li>Great! I will have to send Atmire a note about this&hellip; but for now I can sync over my local <code>~/.m2</code> directory and the build completes</li>
<li>After the Maven build completed successfully I installed the updated code with Ant (make sure to delete the old spring directory):</li>
</ul>
<pre><code>$ cd dspace/target/dspace-installer
<pre tabindex="0"><code>$ cd dspace/target/dspace-installer
$ rm -rf /blah/dspacetest/config/spring
$ ant update
</code></pre><ul>
@ -391,7 +391,7 @@ $ ant update
<li>I had a mistake in my Solr internal URL parameter so DSpace couldn&rsquo;t find it, but once I fixed that DSpace starts up OK!</li>
<li>Once the initial Discovery reindexing was completed (after three hours or so!) I started the Solr statistics UUID migration:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ dspace solr-upgrade-statistics-6x -i statistics -n 250000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
@ -400,7 +400,7 @@ $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
<li>It&rsquo;s taking about 35 minutes for 1,000,000 records&hellip;</li>
<li>Some issues towards the end of this core:</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -425,17 +425,17 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Now the UUID conversion script says there is nothing left to convert, so I can try to run the Atmire CUA conversion utility:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 1
</code></pre><ul>
<li>The processing is very slow and there are lots of errors like this:</li>
</ul>
<pre><code>Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
<pre tabindex="0"><code>Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)

View File

@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@ I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Tes
<li>In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working</li>
<li>I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:</li>
</ul>
<pre><code>$ dspace oai import -c
<pre tabindex="0"><code>$ dspace oai import -c
OAI 2.0 manager action started
Loading @mire database changes for module MQM
Changes have been processed
@ -161,7 +161,7 @@ java.lang.NullPointerException
</ul>
</li>
</ul>
<pre><code>$ curl http://localhost:8080/solr/oai/update -H &quot;Content-type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl http://localhost:8080/solr/oai/update -H &quot;Content-type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;'
$ curl http://localhost:8080/solr/oai/update -H &quot;Content-type: text/xml&quot; --data-binary '&lt;commit /&gt;'
$ ~/dspace63/bin/dspace oai import
OAI 2.0 manager action started
@ -213,7 +213,7 @@ java.lang.NullPointerException
</ul>
</li>
</ul>
<pre><code>$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 125m37.423s
user 11m20.312s
@ -250,7 +250,7 @@ sys 3m19.965s
</ul>
</li>
</ul>
<pre><code>$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 101m41.195s
user 10m9.569s
@ -264,7 +264,7 @@ sys 3m13.929s
<li>Peter said he was annoyed with a CSV export from CGSpace because of the different <code>text_lang</code> attributes and asked if we could fix them</li>
<li>The last time I normalized these was in 2019-06, and currently it looks like this:</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-------------+---------
en_US | 2158377
@ -279,7 +279,7 @@ sys 3m13.929s
<li>In theory we can have different languages for metadata fields but in practice we don&rsquo;t do that, so we might as well normalize everything to &ldquo;en_US&rdquo; (and perhaps I should make a curation task to do this)</li>
<li>For now I will do it manually on CGSpace and DSpace Test:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
UPDATE 2414738
</code></pre><ul>
<li>Note: DSpace Test doesn&rsquo;t have the <code>resource_type_id</code> column because it&rsquo;s running DSpace 6 and <a href="https://wiki.lyrasis.org/display/DSPACE/DSpace+Service+based+api">the schema changed to use an object model there</a>
@ -288,7 +288,7 @@ UPDATE 2414738
</ul>
</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
</code></pre><ul>
<li>Peter asked if it was possible to find all ILRI items that have &ldquo;zoonoses&rdquo; or &ldquo;zoonotic&rdquo; in their titles and check if they have the ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; (and add it if not)
<ul>
@ -319,7 +319,7 @@ UPDATE 2414738
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
$ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv &gt; /tmp/ilri.csv
</code></pre><ul>
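<li>From that CSV the items whose titles mention zoonoses but that lack the ILRI subject could then be filtered with csvkit, roughly like this (a sketch; the regexes and output file name are assumptions):</li>
</ul>
<pre tabindex="0"><code># hypothetical filter: titles matching zoonoses/zoonotic, then invert-match on the ILRI subject
$ csvgrep -c 'dc.title[en_US]' -r '(?i)zoono' /tmp/ilri.csv | csvgrep -c 'cg.subject.ilri[en_US]' -i -r 'ZOONOTIC DISEASES' &gt; /tmp/ilri-zoonoses-missing-subject.csv
</code></pre><ul>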
<li>Moayad asked why he&rsquo;s getting HTTP 500 errors on CGSpace
@ -329,12 +329,12 @@ $ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-I
</ul>
</li>
</ul>
<pre><code># journalctl --since=today -u tomcat7 | grep -c 'Internal Server Error'
<pre tabindex="0"><code># journalctl --since=today -u tomcat7 | grep -c 'Internal Server Error'
482
</code></pre><ul>
<li>They are all related to the REST API, like:</li>
</ul>
<pre><code>Jun 07 02:00:27 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
<pre tabindex="0"><code>Jun 07 02:00:27 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 07 02:00:27 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 07 02:00:27 linode18 tomcat7[6286]: at org.dspace.rest.Resource.processException(Resource.java:151)
Jun 07 02:00:27 linode18 tomcat7[6286]: at org.dspace.rest.ItemsResource.getItems(ItemsResource.java:195)
@ -346,7 +346,7 @@ Jun 07 02:00:27 linode18 tomcat7[6286]: at com.sun.jersey.spi.container.
</code></pre><ul>
<li>And:</li>
</ul>
<pre><code>Jun 08 09:28:29 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
<pre tabindex="0"><code>Jun 08 09:28:29 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 08 09:28:29 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 08 09:28:29 linode18 tomcat7[6286]: at org.dspace.rest.Resource.processFinally(Resource.java:169)
Jun 08 09:28:29 linode18 tomcat7[6286]: at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
@ -356,7 +356,7 @@ Jun 08 09:28:29 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
</code></pre><ul>
<li>And:</li>
</ul>
<pre><code>Jun 06 08:19:54 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
<pre tabindex="0"><code>Jun 06 08:19:54 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 06 08:19:54 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 06 08:19:54 linode18 tomcat7[6286]: at org.dspace.rest.Resource.processException(Resource.java:151)
Jun 06 08:19:54 linode18 tomcat7[6286]: at org.dspace.rest.CollectionsResource.getCollectionItems(CollectionsResource.java:289)
@ -366,12 +366,12 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
</code></pre><ul>
<li>Looking back, I see ~800 of these errors since I changed the database configuration last week:</li>
</ul>
<pre><code># journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
<pre tabindex="0"><code># journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
795
</code></pre><ul>
<li>And only ~280 in the entire month before that&hellip;</li>
</ul>
<pre><code># journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
<pre tabindex="0"><code># journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
286
</code></pre><ul>
<li>So it seems to be related to the database, perhaps that there are less connections in the pool?
@ -390,11 +390,11 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
</code></pre><ul>
<li>Looking at the nginx access logs I see that, other than something that seems like Google Feedburner, all hosts using this user agent are all in Sweden!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
1624 192.36.136.246
1627 192.36.241.95
1629 192.165.45.204
@ -419,7 +419,7 @@ Jun 06 08:19:54 linode18 tomcat7[6286]: at java.lang.reflect.Method.invo
<li>The earliest I see any of these hosts is 2020-06-05 (three days ago)</li>
<li>I will purge them from the Solr statistics and add them to abusive IPs ipset in the Ansible deployment scripts</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
Purging 1423 hits from 192.36.136.246 in statistics
Purging 1387 hits from 192.36.241.95 in statistics
Purging 1398 hits from 192.165.45.204 in statistics
@ -480,7 +480,7 @@ Total number of bot hits purged: 29025
</ul>
</li>
</ul>
<pre><code>172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] &quot;GET /rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=0 HTTP/1.1&quot; 403 260 &quot;-&quot; &quot;-&quot;
<pre tabindex="0"><code>172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] &quot;GET /rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=50&amp;offset=0 HTTP/1.1&quot; 403 260 &quot;-&quot; &quot;-&quot;
</code></pre><ul>
<li>I created an nginx map based on the host&rsquo;s IP address that sets a temporary user agent (ua) and then changed the conditional in the REST API location block so that it checks this mapped ua instead of the default one
<ul>
@ -497,11 +497,11 @@ Total number of bot hits purged: 29025
</ul>
</li>
</ul>
<pre><code>$ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq &gt; /tmp/cip-collections.txt
<pre tabindex="0"><code>$ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq &gt; /tmp/cip-collections.txt
</code></pre><ul>
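<li>The handle list can be turned into a quoted, comma-separated list for the SQL <code>IN</code> clause with something like this (a sketch, not a command from these notes):</li>
</ul>
<pre tabindex="0"><code># hypothetical formatting step for the IN (...) list used in the query below
$ sed -e &quot;s/^/'/&quot; -e &quot;s/$/'/&quot; /tmp/cip-collections.txt | paste -sd, -
</code></pre><ul>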
<li>Then I formatted it into a SQL query and exported a CSV:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM hANDle WHERE hANDle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM hANDle WHERE hANDle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
COPY 3917
</code></pre><h2 id="2020-06-15">2020-06-15</h2>
<ul>
@ -632,7 +632,7 @@ COPY 3917
</li>
<li>I also notice that there is a <a href="https://www.crossref.org/services/funder-registry/">CrossRef funders registry</a> with 23,000+ funders that you can <a href="https://gitlab.com/crossref/open_funder_registry">download as RDF</a> or <a href="https://www.crossref.org/education/funder-registry/accessing-the-funder-registry/">access via an API</a></li>
</ul>
<pre><code>$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&amp;mailto=a.orth@cgiar.org'
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&amp;mailto=a.orth@cgiar.org'
</code></pre><ul>
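<li>The relevant fields can be pulled out of the JSON response with jq, for example (a sketch; assumes jq is available and the usual CrossRef response structure):</li>
</ul>
<pre tabindex="0"><code># hypothetical: print the first matching funder name and its alt-names
$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&amp;mailto=a.orth@cgiar.org' | jq -r '.message.items[0].name, .message.items[0][&quot;alt-names&quot;][]'
</code></pre><ul>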
<li>Searching for &ldquo;Bill and Melinda Gates&rdquo; we can see the <code>name</code> literal and a list of <code>alt-names</code> literals
<ul>
@ -645,7 +645,7 @@ COPY 3917
<li>I made a pull request on CG Core v2 to recommend using persistent identifiers for DOIs and ORCID iDs (<a href="https://github.com/AgriculturalSemantics/cg-core/pull/26">#26</a>)</li>
<li>I exported sponsors/funders from CGSpace and wrote a script to query the CrossRef API for matches:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
COPY 682
</code></pre><ul>
<li>The script is <code>crossref-funders-lookup.py</code> and it is based on <code>agrovoc-lookup.py</code>
@ -656,7 +656,7 @@ COPY 682
</li>
<li>I tested the script on our funders:</li>
</ul>
<pre><code>$ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
<pre tabindex="0"><code>$ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
$ wc -l /tmp/2020-06-29-sponsors.csv
682 /tmp/2020-06-29-sponsors.csv
$ wc -l /tmp/sponsors-*
@ -684,7 +684,7 @@ $ wc -l /tmp/sponsors-*
</li>
<li>Gabriela from CIP sent me a list of erroneously added CIP subjects to remove from CGSpace:</li>
</ul>
<pre><code>$ cat /tmp/2020-06-30-remove-cip-subjects.csv
<pre tabindex="0"><code>$ cat /tmp/2020-06-30-remove-cip-subjects.csv
cg.subject.cip
INTEGRATED PEST MANAGEMENT
ORANGE FLESH SWEET POTATOES
@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dsp
</code></pre><ul>
<li>She also wants to change their <code>SWEET POTATOES</code> term to <code>SWEETPOTATOES</code>, both in the CIP subject list and in existing items, so I updated those too:</li>
</ul>
<pre><code>$ cat /tmp/2020-06-30-fix-cip-subjects.csv
<pre tabindex="0"><code>$ cat /tmp/2020-06-30-fix-cip-subjects.csv
cg.subject.cip,correct
SWEET POTATOES,SWEETPOTATOES
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
@ -710,7 +710,7 @@ $ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u
<li>I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs</li>
<li>I spent about two hours manually checking our sponsors that were rejected from CrossRef and found about fifty-five corrections that I ran on CGSpace:</li>
</ul>
<pre><code>$ cat 2020-06-29-fix-sponsors.csv
<pre tabindex="0"><code>$ cat 2020-06-29-fix-sponsors.csv
dc.description.sponsorship,correct
&quot;Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil&quot;,&quot;Conselho Nacional de Desenvolvimento Científico e Tecnológico&quot;
&quot;Claussen Simon Stiftung&quot;,&quot;Claussen-Simon-Stiftung&quot;
@ -772,7 +772,7 @@ $ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dsp
</code></pre><ul>
<li>Then I started a full re-index at batch CPU priority:</li>
</ul>
<pre><code>$ time chrt --batch 0 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt --batch 0 dspace index-discovery -b
real 99m16.230s
user 11m23.245s
@ -784,7 +784,7 @@ sys 2m56.635s
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Xmx512m -Dfile.encoding=UTF-8&quot;
$ dspace metadata-export -i 10568/1 -f /tmp/ilri.cs
$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv &gt; /tmp/ilri-covid19.csv
</code></pre><ul>

View File

@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -139,7 +139,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
<li>First looking at the traffic in the morning:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E &quot;01/Jul/2020:(00|01|02|03|04)&quot; | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E &quot;01/Jul/2020:(00|01|02|03|04)&quot; | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
@ -148,23 +148,23 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
</code></pre><ul>
<li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul>
<li>I will purge hits from that IP from Solr</li>
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Turnitin.*/&amp;rows=0&quot; | grep -oE 'numFound=&quot;[0-9]+&quot;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Turnitin.*/&amp;rows=0&quot; | grep -oE 'numFound=&quot;[0-9]+&quot;'
numFound=&quot;41694&quot;
</code></pre><ul>
<li>They used to be &ldquo;TurnitinBot&rdquo;&hellip; hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that&rsquo;s impressive! I don&rsquo;t need to add them to the &ldquo;bad bot&rdquo; rate limit list in nginx</li>
<li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x each making a few requests with this user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre><ul>
<li>The IPs all belong to HostRoyale:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
@ -269,7 +269,7 @@ numFound=&quot;41694&quot;
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
<li>I will revert the default &ldquo;example&rdquo; agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven&rsquo;t merged yet:</li>
</ul>
<pre><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
<pre tabindex="0"><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
@ -285,7 +285,7 @@ Typhoeus
</code></pre><ul>
<li>Just a note that I <em>still</em> can&rsquo;t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li>
</ul>
<pre><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
<pre tabindex="0"><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre><ul>
<li>I had told Atmire about this several weeks ago&hellip; but I reminded them again in the ticket
<ul>
@ -308,7 +308,7 @@ Typhoeus
</ul>
</li>
</ul>
<pre><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&amp;fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&amp;rows=0&amp;wt=json&amp;indent=true'
<pre tabindex="0"><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&amp;fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
@ -324,12 +324,12 @@ Typhoeus
</code></pre><ul>
<li>But not in solr-import-export-json&hellip; hmmm&hellip; seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
</code></pre><ul>
<li>Then import it on my local dev environment:</li>
</ul>
<pre><code>$ zstd -d statistics-2019-1.json.zst
<pre tabindex="0"><code>$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
</code></pre><h2 id="2020-07-05">2020-07-05</h2>
<ul>
@ -358,11 +358,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
</li>
<li>I noticed that we have 20,000 distinct values for <code>dc.subject</code>, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:</li>
</ul>
<pre><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
<pre tabindex="0"><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>Note the use of the POSIX character class :)</li>
<li>I suggest that we generate a list of the top 5,000 values that don&rsquo;t match AGROVOC so that Sisay can correct them
@ -371,14 +371,14 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 &gt; 2020-07-05-cgspace-subjects.txt
</code></pre><ul>
<li>Then start looking them up using <code>agrovoc-lookup.py</code>:</li>
</ul>
<pre><code>$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
<pre tabindex="0"><code>$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
</code></pre><h2 id="2020-07-06">2020-07-06</h2>
<ul>
<li>I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script
@ -399,12 +399,12 @@ $ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 &gt; 2020-07-05-c
<ul>
<li>Peter asked me to send him a list of sponsors on CGSpace</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
COPY 707
</code></pre><ul>
<li>I ran it quickly through my <code>csv-metadata-quality</code> tool and found two issues that I will correct with <code>fix-metadata-values.py</code> on CGSpace immediately:</li>
</ul>
<pre><code>$ cat 2020-07-07-fix-sponsors.csv
<pre tabindex="0"><code>$ cat 2020-07-07-fix-sponsors.csv
dc.description.sponsorship,correct
&quot;Ministe`re des Affaires Etrange`res et Européennes, France&quot;,&quot;Ministère des Affaires Étrangères et Européennes, France&quot;
&quot;Global Food Security Programme, United Kingdom&quot;,&quot;Global Food Security Programme, United Kingdom&quot;
@ -432,7 +432,7 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
<ul>
<li>Generate a CSV of all the AGROVOC subjects that didn&rsquo;t match from the top 6500 I exported earlier this week:</li>
</ul>
<pre><code>$ csvgrep -c 'number of matches' -r &quot;^0$&quot; 2020-07-05-cgspace-subjects.csv | csvcut -c 1 &gt; 2020-07-05-cgspace-invalid-subjects.csv
<pre tabindex="0"><code>$ csvgrep -c 'number of matches' -r &quot;^0$&quot; 2020-07-05-cgspace-subjects.csv | csvcut -c 1 &gt; 2020-07-05-cgspace-invalid-subjects.csv
</code></pre><ul>
<li>Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of &ldquo;funny character&rdquo; issues with reports generated from CGSpace
<ul>
@ -442,7 +442,7 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
</ul>
</li>
</ul>
<pre><code>$ csvgrep -c 2 -r &quot;^.+$&quot; ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r &quot;^.*[À-ú].*$&quot; | csvgrep -c 2 -r &quot;^.*[À-ú].*$&quot; -i | csvcut -c 1,2
<pre tabindex="0"><code>$ csvgrep -c 2 -r &quot;^.+$&quot; ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r &quot;^.*[À-ú].*$&quot; | csvgrep -c 2 -r &quot;^.*[À-ú].*$&quot; -i | csvcut -c 1,2
dc.contributor.author,correction
&quot;López, G.&quot;,&quot;Lopez, G.&quot;
&quot;Gómez, R.&quot;,&quot;Gomez, R.&quot;
@ -475,11 +475,11 @@ dc.contributor.author,correction
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I stripped the CSV header and quotes to make it a plain text file and ran <code>ror-lookup.py</code>:</li>
</ul>
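<pre tabindex="0"><code># sketch: one way to drop the CSV header and the quotes (the exact one-liner was not recorded)
$ csvcut -c 1 /tmp/2020-07-08-affiliations.csv | sed -e '1d' -e 's/&quot;//g' &gt; /tmp/2020-07-08-affiliations.txt
</code></pre><ul>
<li>Then the lookup and the match counts:</li>
</ul>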
<pre><code>$ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
<pre tabindex="0"><code>$ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
@ -500,7 +500,7 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
</li>
<li>I updated <code>ror-lookup.py</code> to check aliases and acronyms as well and now the results are better for CGSpace&rsquo;s affiliation list:</li>
</ul>
<pre><code>$ wc -l /tmp/2020-07-08-affiliations.txt
<pre tabindex="0"><code>$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1516
@ -510,16 +510,16 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
<li>So now our matching improves to 1515 out of 5866 (25.8%)</li>
<li>Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
</code></pre><ul>
<li>Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>Start a full Discovery re-index on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 dspace index-discovery -b
real 94m21.413s
user 9m40.364s
@ -527,7 +527,7 @@ sys 2m37.246s
</code></pre><ul>
<li>I modified <code>crossref-funders-lookup.py</code> to be case insensitive and now CGSpace&rsquo;s sponsors match 173 out of 534 (32.4%):</li>
</ul>
<pre><code>$ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
<pre tabindex="0"><code>$ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
$ wc -l 2020-07-09-cgspace-sponsors.txt
534 2020-07-09-cgspace-sponsors.txt
$ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
@ -552,7 +552,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
</ul>
</li>
</ul>
<pre><code># grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code># grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2815
</code></pre><ul>
<li>So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session</li>
@ -563,11 +563,11 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
</code></pre><ul>
<li>Generate a list of sponsors to update our controlled vocabulary:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY &quot;dc.description.sponsorship&quot; ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
COPY 125
dspace=# \q
$ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv &gt; dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
@ -590,12 +590,12 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-descripti
<ul>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(189618) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
UPDATE 1
</code></pre><ul>
<li>Udana from WLE asked me about some items that didn&rsquo;t show Altmetric donuts
@ -616,12 +616,12 @@ UPDATE 1
<li>All four IWMI items that I tweeted yesterday have Altmetric donuts with a score of 1 now&hellip;</li>
<li>Export CGSpace countries to check them against ISO 3166-1 and ISO 3166-3 (historic countries):</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
COPY 194
</code></pre><ul>
<li>I wrote a script <code>iso3166-lookup.py</code> to check them:</li>
</ul>
<pre><code>$ ./iso3166-1-lookup.py -i /tmp/2020-07-15-countries.csv -o /tmp/2020-07-15-countries-resolved.csv
<pre tabindex="0"><code>$ ./iso3166-1-lookup.py -i /tmp/2020-07-15-countries.csv -o /tmp/2020-07-15-countries-resolved.csv
$ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv
country,match type,matched
CAPE VERDE,,false
@ -642,16 +642,16 @@ IRAN,,false
</code></pre><ul>
<li>Check the database for DOIs that are not in the preferred &ldquo;https://doi.org/&rdquo; format:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT text_value as &quot;cg.identifier.doi&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT text_value as &quot;cg.identifier.doi&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
COPY 186
</code></pre><ul>
<li>Then I imported them into OpenRefine and replaced them in a new &ldquo;correct&rdquo; column using this GREL transform:</li>
</ul>
<pre><code>value.replace(&quot;dx.doi.org&quot;, &quot;doi.org&quot;).replace(&quot;http://&quot;, &quot;https://&quot;).replace(&quot;https://dx,doi,org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://doi.dx.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org&quot;).replace(&quot;DOI: &quot;, &quot;https://doi.org/&quot;).replace(&quot;doi: &quot;, &quot;https://doi.org/&quot;).replace(&quot;http://dx.doi.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx. doi.org. &quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org/&quot;).replace(&quot;hdl.handle.net&quot;, &quot;doi.org&quot;)
<pre tabindex="0"><code>value.replace(&quot;dx.doi.org&quot;, &quot;doi.org&quot;).replace(&quot;http://&quot;, &quot;https://&quot;).replace(&quot;https://dx,doi,org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://doi.dx.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org&quot;).replace(&quot;DOI: &quot;, &quot;https://doi.org/&quot;).replace(&quot;doi: &quot;, &quot;https://doi.org/&quot;).replace(&quot;http://dx.doi.org&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx. doi.org. &quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi&quot;, &quot;https://doi.org&quot;).replace(&quot;https://dx.doi:&quot;, &quot;https://doi.org/&quot;).replace(&quot;hdl.handle.net&quot;, &quot;doi.org&quot;)
</code></pre><ul>
<li>Then I fixed the DOIs on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
</code></pre><ul>
<li>I filed <a href="https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/10">an issue on Debian&rsquo;s iso-codes</a> project to ask why &ldquo;Swaziland&rdquo; does not appear in the ISO 3166-3 list of historical country names despite it being changed to &ldquo;Eswatini&rdquo; in 2018.</li>
<li>Atmire responded about the Solr issue
@ -666,7 +666,7 @@ COPY 186
<ul>
<li>Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:</li>
</ul>
<pre><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] &quot;GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0&quot; 302 138 &quot;-&quot; &quot;ILRI Livestock Website Publications importer BOT&quot;
<pre tabindex="0"><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] &quot;GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0&quot; 302 138 &quot;-&quot; &quot;ILRI Livestock Website Publications importer BOT&quot;
</code></pre><ul>
<li>I still see 12,000 records in Solr from this user agent, though.
<ul>
@ -683,7 +683,7 @@ COPY 186
<li>I re-ran the <code>check-spider-hits.sh</code> script with the new lists and purged around 14,000 more stats hits from each of several yearly cores (2020, 2019, 2018, 2017, 2016), around 70,000 in total</li>
<li>I looked at the <a href="https://clarisa.cgiar.org/">CLARISA</a> institutions list again, since I hadn&rsquo;t looked at it in over six months:</li>
</ul>
<pre><code>$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
<pre tabindex="0"><code>$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/&quot;//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' &gt; /tmp/clarisa-institutions.csv
</code></pre><ul>
<li>The API still needs a key unless you query it from the Swagger web interface
<ul>
@ -700,7 +700,7 @@ COPY 186
</ul>
</li>
</ul>
<pre><code>$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
<pre tabindex="0"><code>$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
Removing excessive whitespace (name): Comitato Internazionale per lo Sviluppo dei Popoli / International Committee for the Development of Peoples
Removing excessive whitespace (name): Deutsche Landwirtschaftsgesellschaft / German agriculture society
Removing excessive whitespace (name): Institute of Arid Regions of Medenine
@ -732,7 +732,7 @@ Removing unnecessary Unicode (U+200B): Agencia de Servicios a la Comercializaci
</li>
<li>I started processing the 2019 stats in a batch of 1 million on DSpace Test:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
...
*** Statistics Records with Legacy Id ***
@ -749,7 +749,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
</code></pre><ul>
<li>The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
*** Statistics Records with Legacy Id ***
@ -765,7 +765,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
</code></pre><ul>
<li>Moayad finally made OpenRXV use a unique user agent:</li>
</ul>
<pre><code>OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
<pre tabindex="0"><code>OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
</code></pre><ul>
<li>I see nearly 200,000 hits in Solr from the IP address, though, so I need to make sure those are old ones from before today
<ul>
@ -793,12 +793,12 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
</ul>
</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
</code></pre><ul>
<li>There were four records so I deleted them:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:10&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:10&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes</li>
</ul>
@ -826,7 +826,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
<pre tabindex="0"><code>Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
</code></pre><ul>
<li>Also, in the same month with the same <em>exact</em> user agent, I see 300,000 from 192.157.89.x
<ul>
@ -842,7 +842,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>In statistics-2018 I see more weird IPs
<ul>
@ -860,7 +860,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
</code></pre><ul>
<li>Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it&rsquo;s definitely CodeObia / ICARDA and I will purge them</li>
<li>Jesus, and then there are 100,000 from the ILRI harvester on Linode on 2a01:7e00::f03c:91ff:fe0a:d645</li>
@ -869,7 +869,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<li>Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent</li>
<li>I will purge the hits from all the following IPs:</li>
</ul>
<pre><code>192.157.89.4
<pre tabindex="0"><code>192.157.89.4
192.157.89.5
192.157.89.6
192.157.89.7
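# (sketch) each of these can then be purged per yearly core with a delete-by-query, e.g.:
# curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;ip:192.157.89.4&lt;/query&gt;&lt;/delete&gt;'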
@ -898,7 +898,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</li>
<li>I noticed a few other user agents that should be purged too:</li>
</ul>
<pre><code>^Java\/\d{1,2}.\d
<pre tabindex="0"><code>^Java\/\d{1,2}.\d
FlipboardProxy\/\d
API scraper
RebelMouse\/\d
@ -932,7 +932,7 @@ mailto\:team@impactstory\.org
</li>
<li>Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:</li>
</ul>
<pre><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
<pre tabindex="0"><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>
<p>Run system updates on DSpace Test (linode26) and reboot it</p>
@ -1036,11 +1036,11 @@ mailto\:team@impactstory\.org
<p>I started processing Solr stats with the Atmire tool now:</p>
</li>
</ul>
<pre><code>$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
<pre tabindex="0"><code>$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
</code></pre><ul>
<li>This one failed after a few hours:</li>
</ul>
<pre><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
<pre tabindex="0"><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@ -1063,7 +1063,7 @@ If run the update again with the resume option (-r) they will be reattempted
<li>I started the same script for the statistics-2019 core (12 million records&hellip;)</li>
<li>Update an ILRI author&rsquo;s name on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
Fixed 13 occurences of: Muloi, D.
Fixed 4 occurences of: Muloi, D.M.
</code></pre><h2 id="2020-07-28">2020-07-28</h2>
@ -1110,7 +1110,7 @@ Fixed 4 occurences of: Muloi, D.M.
</ul>
</li>
</ul>
<pre><code># grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
<pre tabindex="0"><code># grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '&quot;name&quot;:' /usr/share/iso-codes/json/iso_3166-1.json
249

View File

@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -150,7 +150,7 @@ It is class based so I can easily add support for other vocabularies, and the te
</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
@ -192,14 +192,14 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
<pre tabindex="0"><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
&quot;numberItems&quot; : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
</code></pre><ul>
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
</ul>
<pre><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
<pre tabindex="0"><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
&quot;numberItems&quot; : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
@ -210,7 +210,7 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
id | collection_id | item_id
--------+---------------+---------
133698 | 966 | 107687
@ -220,12 +220,12 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</code></pre><ul>
<li>So for each id you can delete one duplicate mapping:</li>
</ul>
<pre><code>dspace=# DELETE FROM collection2item WHERE id='134686';
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
</code></pre><ul>
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter&rsquo;s preferred display names</li>
</ul>
<pre><code>$ cat 2020-08-04-PB-new-countries.csv
<pre tabindex="0"><code>$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
@ -267,7 +267,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
@ -276,7 +276,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep -E &quot;(63.32.242.35|64.62.202.71)&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E &quot;(63.32.242.35|64.62.202.71)&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
</code></pre><ul>
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don&rsquo;t misuse the resources
@ -291,18 +291,18 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep &quot;38.128.66.10&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;38.128.66.10&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep &quot;64.62.202.71&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
</code></pre><ul>
<li>38.128.66.10 isn&rsquo;t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
</code></pre><ul>
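<li>A Crawler Session Manager Valve entry in Tomcat&rsquo;s server.xml along these lines would do that (just a sketch; the regex we actually deploy covers more agents):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
       crawlerUserAgents=&quot;.*[bB]ot.*|.*brokenlinkcheck\.com.*&quot;
       sessionInactiveInterval=&quot;60&quot;/&gt;
</code></pre><ul>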
<li>64.62.202.71 is using a user agent I&rsquo;ve never seen before:</li>
</ul>
<pre><code>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
<pre tabindex="0"><code>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
</code></pre><ul>
<li>So now our &ldquo;bot&rdquo; regex can&rsquo;t even match that&hellip;
<ul>
@ -310,7 +310,7 @@ $ cat dspace.log.2020-08-04 | grep &quot;64.62.202.71&quot; | grep -E 'session_i
</ul>
</li>
</ul>
<pre><code>RTB website BOT
<pre tabindex="0"><code>RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
@ -318,7 +318,7 @@ Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
</ul>
<pre><code>$ cat dspace.log.2020-08-04 | grep &quot;199.47.87.145&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep &quot;199.47.87.145&quot; | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
</code></pre><ul>
@ -377,7 +377,7 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<ul>
<li>The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:</li>
</ul>
<pre><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
<pre tabindex="0"><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
@ -398,13 +398,13 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/[0-9]+/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-09">2020-08-09</h2>
<ul>
<li>The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space&hellip;</li>
<li>I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:</li>
</ul>
<pre><code># grep -oE &quot;Record uid: ([a-f0-9\\-]*){1} couldn't be processed&quot; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
<pre tabindex="0"><code># grep -oE &quot;Record uid: ([a-f0-9\\-]*){1} couldn't be processed&quot; /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 &gt; /tmp/not-processed-errors.txt
# wc -l /tmp/not-processed-errors.txt
2202973 /tmp/not-processed-errors.txt
# sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
@ -421,7 +421,7 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</code></pre><ul>
<li>I looked at some of those records and saw strange objects in their <code>containerCommunity</code>, <code>containerCollection</code>, etc&hellip;</li>
</ul>
<pre><code>{
<pre tabindex="0"><code>{
&quot;responseHeader&quot;: {
&quot;status&quot;: 0,
&quot;QTime&quot;: 0,
@ -470,7 +470,7 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</code></pre><ul>
<li>I deleted those 11,724 records with the strange &ldquo;set&rdquo; object in the collections and communities, as well as 360,000 records with <code>id: -1</code></li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the solr cores didn&rsquo;t all come back up OK
@ -485,7 +485,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>$ cat 2020-08-09-add-ILRI-orcids.csv
<pre tabindex="0"><code>$ cat 2020-08-09-add-ILRI-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Grace, Delia&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Delia Grace&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
@ -501,7 +501,7 @@ dc.contributor.author,cg.creator.id
</code></pre><ul>
<li>That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
COPY 2095
dspace=# \q
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
@ -517,7 +517,7 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
@ -534,7 +534,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
<pre tabindex="0"><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
count
-------
50812
@ -573,7 +573,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:</li>
</ul>
<pre><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@ -598,7 +598,7 @@ Caused by: java.lang.NullPointerException
</li>
<li>I purged the unmigrated docs and continued processing:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
@ -608,7 +608,7 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' &gt; /tmp/0.xml
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' &gt; /tmp/0.xml
$ for num in {100..1300..100}; do http &quot;https://cgspace.cgiar.org/oai/request?verb=ListSets&amp;resumptionToken=////$num&quot; &gt; /tmp/$num.xml; sleep 2; done
$ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets.xml; done
</code></pre><ul>
@ -620,7 +620,7 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets
<li>The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs&hellip;</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones&hellip; so I purged them in 2015 and all the rest of the statistics cores</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
@ -715,13 +715,13 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0' User-Agent:'curl' &gt; /tmp/wle-trade-off-page1.xml
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=0' User-Agent:'curl' &gt; /tmp/wle-trade-off-page1.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=100' User-Agent:'curl' &gt; /tmp/wle-trade-off-page2.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&amp;query=trade+off&amp;rpp=100&amp;start=200' User-Agent:'curl' &gt; /tmp/wle-trade-off-page3.xml
</code></pre><ul>
<li>Ugh, and to extract the <code>&lt;id&gt;</code> from each <code>&lt;entry&gt;</code> we have to use an XPath query, but use a <a href="http://blog.powered-up-games.com/wordpress/archives/70">hack to ignore the default namespace by setting each element&rsquo;s local name</a>:</li>
</ul>
<pre><code>$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
<pre tabindex="0"><code>$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page1.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page2.xml &gt;&gt; /tmp/ids.txt
$ xmllint --xpath '//*[local-name()=&quot;entry&quot;]/*[local-name()=&quot;id&quot;]/text()' /tmp/wle-trade-off-page3.xml &gt;&gt; /tmp/ids.txt
$ sort -u /tmp/ids.txt &gt; /tmp/ids-sorted.txt
@ -764,7 +764,7 @@ $ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt &gt; /tmp/handles.txt
<ul>
<li>I ran the CountryCodeTagger on CGSpace and it was very fast:</li>
</ul>
<pre><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-08-27-countrycodetagger.log
<pre tabindex="0"><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-08-27-countrycodetagger.log
real 2m7.643s
user 1m48.740s
sys 0m14.518s

View File

@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -153,7 +153,7 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
<ul>
<li>I ran the country code tagger on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
<pre tabindex="0"><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
...
real 2m10.516s
user 1m43.953s
@ -169,11 +169,11 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
</ul>
</li>
</ul>
<pre><code>2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
<pre tabindex="0"><code>2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
</code></pre><ul>
<li>I tried to query LDAP directly using the application credentials with ldapsearch and it works:</li>
</ul>
<pre><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;applicationaccount@cgiarad.org&quot; -W &quot;(sAMAccountName=me)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;applicationaccount@cgiarad.org&quot; -W &quot;(sAMAccountName=me)&quot;
</code></pre><ul>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication">DSpace 6 docs</a> we need to escape commas in our LDAP parameters due to the new configuration system
<ul>
@ -191,7 +191,7 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-03-fix-review-status.csv
<pre tabindex="0"><code>$ cat 2020-09-03-fix-review-status.csv
dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
@ -225,7 +225,7 @@ $ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dsp
</ul>
</li>
</ul>
<pre><code>Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
<pre tabindex="0"><code>Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
@ -259,7 +259,7 @@ java.lang.NullPointerException
</li>
<li>I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
</code></pre><ul>
@ -285,7 +285,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
<pre tabindex="0"><code>https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
</code></pre><ul>
<li>So they end up getting rate limited due to the XMLUI rate limits
<ul>
@ -308,7 +308,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
</ul>
</li>
</ul>
<pre><code>$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
<pre tabindex="0"><code>$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
</code></pre><h2 id="2020-09-10">2020-09-10</h2>
<ul>
<li>I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night&hellip; w00t</li>
@ -318,7 +318,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-10-fix-cgspace-regions.csv
<pre tabindex="0"><code>$ cat 2020-09-10-fix-cgspace-regions.csv
cg.coverage.region,correct
EAST AFRICA,EASTERN AFRICA
WEST AFRICA,WESTERN AFRICA
@ -417,15 +417,15 @@ Would fix 3 occurences of: SOUTHWEST ASIA
</ul>
</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
<pre tabindex="0"><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre><ul>
<li>Then I created a SAF bundle with SAFBuilder:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
<pre tabindex="0"><code>$ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
</code></pre><ul>
<li>And imported them into my local test instance of CGSpace:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
</code></pre><ul>
<li>Then I uploaded them to CGSpace</li>
</ul>
@ -475,7 +475,7 @@ Would fix 3 occurences of: SOUTHWEST ASIA
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-17-add-bioversity-orcids.csv
<pre tabindex="0"><code>$ cat 2020-09-17-add-bioversity-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Etten, Jacob van&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
&quot;van Etten, Jacob&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
@ -496,7 +496,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/open-search/discover?query=type:&quot;Journal Article&quot; AND status:&quot;Open Access&quot; AND crpsubject:&quot;Water, Land and Ecosystems&quot; AND &quot;tradeoffs&quot;&amp;rpp=100
<pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=type:&quot;Journal Article&quot; AND status:&quot;Open Access&quot; AND crpsubject:&quot;Water, Land and Ecosystems&quot; AND &quot;tradeoffs&quot;&amp;rpp=100
</code></pre><ul>
<li>I noticed that my <code>move-collections.sh</code> script didn&rsquo;t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection <code>resource_id</code> parameters in the PostgreSQL query (see the sketch below)</li>
</ul>
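<ul>
<li>As a rough sketch of that change (the table, handle, and UUID below are hypothetical, not the script&rsquo;s actual query): the legacy integer IDs could be passed unquoted, while the DSpace 6 UUIDs must be quoted as strings:</li>
</ul>
<pre tabindex="0"><code># DSpace 5: integer resource_id, no quoting needed
$ psql -d dspace -c &quot;UPDATE handle SET resource_id=1234 WHERE handle='10568/99999';&quot;
# DSpace 6: resource_id is a UUID and must be quoted
$ psql -d dspace -c &quot;UPDATE handle SET resource_id='49d8e0ad-1b46-4c23-9d17-5ad99c9b23e0' WHERE handle='10568/99999';&quot;
</code></pre>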
@ -522,7 +522,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp
</ul>
</li>
</ul>
<pre><code>dspacestatistics=# SELECT SUM(views) FROM items;
<pre tabindex="0"><code>dspacestatistics=# SELECT SUM(views) FROM items;
sum
----------
15714024
@ -536,7 +536,7 @@ dspacestatistics=# SELECT SUM(downloads) FROM items;
</code></pre><ul>
<li>I deleted &ldquo;Report&rdquo; from twelve items that had it in their peer review field:</li>
</ul>
<pre><code>dspace=# BEGIN;
<pre tabindex="0"><code>dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
DELETE 12
@ -572,7 +572,7 @@ dspace=# COMMIT;
</ul>
</li>
</ul>
<pre><code>...
<pre tabindex="0"><code>...
item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
...
@ -598,7 +598,7 @@ solr_query_params = {
<ul>
<li>I did some more work on the dspace-statistics-api and finalized the support for sending a POST to <code>/items</code>:</li>
</ul>
<pre><code>$ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
<pre tabindex="0"><code>$ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
{
&quot;currentPage&quot; : 0,
&quot;limit&quot; : 10,


@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -144,7 +144,7 @@ During the FlywayDB migration I got an error:
</ul>
</li>
</ul>
<pre><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
<pre tabindex="0"><code>2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
Detail: Key (short_description)=(EPUB) already exists. Call getNextException to see other errors in the batch.
2020-10-06 21:36:04,138 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint &quot;bitstreamformatregistry_short_description_key&quot;
@ -212,7 +212,7 @@ org.hibernate.exception.ConstraintViolationException: could not execute batch
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
Loading @mire database changes for module MQM
Changes have been processed
-----------------------------------------------------------
@ -259,7 +259,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</code></pre><ul>
<li>Also, I tested Listings and Reports and there are still no hits for &ldquo;Orth, Alan&rdquo; as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:</li>
</ul>
<pre><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&amp;fq=dateIssued.year:[2013+TO+2021]&amp;rows=500&amp;wt=javabin&amp;version=2} hits=18 status=0 QTime=10
<pre tabindex="0"><code>2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&amp;fq=dateIssued.year:[2013+TO+2021]&amp;rows=500&amp;wt=javabin&amp;version=2} hits=18 status=0 QTime=10
</code></pre><ul>
<li>Solr returns <code>hits=18</code> for the L&amp;R query, but there are no results shown in the L&amp;R UI</li>
<li>I sent all this feedback to Atmire&hellip;</li>
@ -278,16 +278,16 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
</ul>
</li>
</ul>
<pre><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
<pre tabindex="0"><code>$ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
$ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
</code></pre><ul>
<li>Then we post an item in JSON format to <code>/rest/collections/{uuid}/items</code>:</li>
</ul>
<pre><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE &lt; item-object.json
<pre tabindex="0"><code>$ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE &lt; item-object.json
</code></pre><ul>
<li>Format of JSON is:</li>
</ul>
<pre><code>{ &quot;metadata&quot;: [
<pre tabindex="0"><code>{ &quot;metadata&quot;: [
{
&quot;key&quot;: &quot;dc.title&quot;,
&quot;value&quot;: &quot;Testing REST API post&quot;,
@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
</ul>
</li>
</ul>
<pre><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
<pre tabindex="0"><code>$ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
$ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
$ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 &lt; item-object.json
</code></pre><ul>
@ -408,7 +408,7 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
</ul>
</li>
</ul>
<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
@ -438,7 +438,7 @@ $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-438
<li>I added <code>[Ss]pider</code> to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID</li>
<li>I added a few of the patterns from above to our local agents list and ran the <code>check-spider-hits.sh</code> on CGSpace:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
Purging 228916 hits from RTB website BOT in statistics
Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
@ -472,7 +472,7 @@ Total number of bot hits purged: 3684
</li>
<li>I can update the country metadata in PostgreSQL like this:</li>
</ul>
<pre><code>dspace=&gt; BEGIN;
<pre tabindex="0"><code>dspace=&gt; BEGIN;
dspace=&gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=&gt; COMMIT;
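dspace=&gt; -- INITCAP() upper-cases the first letter of each word, giving title-case country names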
@ -483,7 +483,7 @@ dspace=&gt; COMMIT;
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: <code>value.toTitlecase()</code>
@ -493,7 +493,7 @@ COPY 195
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
<pre tabindex="0"><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (aka &ldquo;lookaround&rdquo; in PCRE?) to match words that are <em>not</em> &ldquo;pair&rdquo;, &ldquo;displayed&rdquo;, etc because we don&rsquo;t want to edit the XML tags themselves&hellip;
<ul>
@ -509,18 +509,18 @@ COPY 195
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I did the same as for the countries: in OpenRefine for the database values and in vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
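# -f: the metadata field to fix, -t: the CSV column with the corrected value, -m: the field's metadata_field_id (228 = country, 227 = region)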
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 88m21.678s
user 7m59.182s
@ -579,7 +579,7 @@ sys 2m22.713s
<li>I posted a message on Yammer to inform all our users about the changes to countries, regions, and AGROVOC subjects</li>
<li>I modified all AGROVOC subjects to be lower case in PostgreSQL and then exported a list of the top 1500 to update the controlled vocabulary in our submission form:</li>
</ul>
<pre><code>dspace=&gt; BEGIN;
<pre tabindex="0"><code>dspace=&gt; BEGIN;
dspace=&gt; UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
UPDATE 335063
dspace=&gt; COMMIT;
@ -588,7 +588,7 @@ COPY 1500
</code></pre><ul>
<li>Use my <code>agrovoc-lookup.py</code> script to validate subject terms against the AGROVOC REST API, extract matches with <code>csvgrep</code>, and then update and format the controlled vocabulary:</li>
</ul>
<pre><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
<pre tabindex="0"><code>$ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 &gt; /tmp/subjects.txt
$ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
$ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' &gt; dspace/config/controlled-vocabularies/dc-subject.xml
# apply formatting in XML file
@ -596,7 +596,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
</code></pre><ul>
<li>Then I started a full re-indexing on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 88m21.678s
user 7m59.182s
@ -614,7 +614,7 @@ sys 2m22.713s
<li>They are using the user agent &ldquo;CCAFS Website Publications importer BOT&rdquo; so they are getting rate limited by nginx</li>
<li>Ideally they would use the REST <code>find-by-metadata-field</code> endpoint, but it is <em>really</em> slow for large result sets (like twenty minutes!):</li>
</ul>
<pre><code>$ curl -f -H &quot;CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -f -H &quot;CCAFS Website Publications importer BOT&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100&quot; -d '{&quot;key&quot;:&quot;cg.contributor.crp&quot;, &quot;value&quot;:&quot;Climate Change, Agriculture and Food Security&quot;,&quot;language&quot;: &quot;en_US&quot;}'
</code></pre><ul>
<li>For now I will whitelist their user agent so that they can continue scraping /browse</li>
<li>I figured out that the mappings for AReS are stored in Elasticsearch
@ -624,7 +624,7 @@ sys 2m22.713s
</ul>
</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_delete_by_query&quot; -H 'Content-Type: application/json' -d'
{
&quot;query&quot;: {
&quot;match&quot;: {
@ -635,7 +635,7 @@ sys 2m22.713s
</code></pre><ul>
<li>I added a new find/replace:</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
@ -645,11 +645,11 @@ sys 2m22.713s
<li>I see it in Kibana, and I can search it in Elasticsearch, but I don&rsquo;t see it in OpenRXV&rsquo;s mapping values dashboard</li>
<li>Now I deleted everything in the <code>openrxv-values</code> index:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
</code></pre><ul>
<li>Then I tried posting it again:</li>
</ul>
<pre><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
<pre tabindex="0"><code>$ curl -XPOST &quot;localhost:9200/openrxv-values/_doc?pretty&quot; -H 'Content-Type: application/json' -d'
{
&quot;find&quot;: &quot;ALAN1&quot;,
&quot;replace&quot;: &quot;ALAN2&quot;,
@ -682,12 +682,12 @@ sys 2m22.713s
<ul>
<li>Last night I learned how to POST mappings to Elasticsearch for AReS:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-values
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @./mapping.json
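# drop the existing openrxv-values index, then bulk-POST the mappings; the _bulk endpoint expects an action line ({&quot;index&quot;:{}}) followed by the document for each mapping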
</code></pre><ul>
<li>The JSON file looks like this, with one instruction on each line:</li>
</ul>
<pre><code>{&quot;index&quot;:{}}
<pre tabindex="0"><code>{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;CRP on Dryland Systems - DS&quot;, &quot;replace&quot;: &quot;Dryland Systems&quot; }
{&quot;index&quot;:{}}
{ &quot;find&quot;: &quot;FISH&quot;, &quot;replace&quot;: &quot;Fish&quot; }
@ -737,7 +737,7 @@ f<span style="color:#f92672">.</span>close()
<li>It filters out all upper and lower case strings as well as any replacements that end in an acronym like &ldquo;- ILRI&rdquo;, reducing the number of mappings from around 4,000 to about 900</li>
<li>I deleted the existing <code>openrxv-values</code> Elasticsearch index and then POSTed it:</li>
</ul>
<pre><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
<pre tabindex="0"><code>$ ./convert-mapping.py &gt; /tmp/elastic-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elastic-mappings.txt
</code></pre><ul>
@ -762,17 +762,17 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(192921) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
UPDATE 1
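# clearing the bundle's primary_bitstream_id reference resolves the foreign key violation so the cleanup can delete the bitstream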
</code></pre><ul>
<li>After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:</li>
</ul>
<pre><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
Purging 2474 hits from ShortLinkTranslate in statistics
Purging 2568 hits from RI\/1\.0 in statistics
@ -794,7 +794,7 @@ Total number of bot hits purged: 8174
</ul>
</li>
</ul>
<pre><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
<pre tabindex="0"><code>$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:&quot;RTB website BOT&quot;
$ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</code></pre><ul>
<li>And I saw three hits in Solr with <code>isBot: true</code>!!!
@ -817,7 +817,7 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
@ -833,7 +833,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>Bosede was getting this error on CGSpace yesterday:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
</code></pre><ul>
<li>Collection 1072 appears to be <a href="https://cgspace.cgiar.org/handle/10568/69542">IITA Miscellaneous</a>
<ul>
@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
<pre tabindex="0"><code>$ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&amp;size=10000&amp;q=*:*' &gt; /tmp/affiliations.json
</code></pre><ul>
<li>Then I decided to try a different approach and I adjusted my <code>convert-mapping.py</code> script to re-consider some replacement patterns with acronyms from the original AReS <code>mapping.json</code> file to hopefully address some MEL to CGSpace mappings
<ul>
@ -893,7 +893,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
<ul>
<li>I re-installed DSpace Test with a fresh snapshot of CGSpace&rsquo;s to test the DSpace 6 upgrade (the last time was in 2020-05, and we&rsquo;ve fixed a lot of issues since then):</li>
</ul>
<pre><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
<pre tabindex="0"><code>$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
</code></pre><ul>
<li>Then I started processing the Solr stats one core and 1 million records at a time:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
@ -920,7 +920,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
</code></pre><ul>
<li>After the fifth or so run I got this error:</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I restarted the <code>solr-upgrade-statistics-6x</code> process, which apparently had no records left to process</li>
<li>I started processing the statistics-2019 core&hellip;
@ -958,7 +958,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>The statistics processing on the statistics-2018 core errored after 1.8 million records:</li>
</ul>
<pre><code>Exception: Java heap space
<pre tabindex="0"><code>Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
@ -967,7 +967,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Then I started processing the statistics-2017 core&hellip;
<ul>
@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code>$ curl -s &quot;http://localhost:8083/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Also I purged 2.7 million unmigrated records from the statistics-2019 core</li>
<li>I filed an issue with Atmire about the duplicate values in the <code>owningComm</code> and <code>containerCommunity</code> fields in Solr: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -1002,7 +1002,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Peter asked me to add the new preferred AGROVOC subject &ldquo;covid-19&rdquo; to all items to which we had previously added &ldquo;coronavirus disease&rdquo;, and to make sure all items with ILRI subject &ldquo;ZOONOTIC DISEASES&rdquo; have the AGROVOC subject &ldquo;zoonoses&rdquo;
<ul>
@ -1010,7 +1010,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>$ dspace metadata-export -f /tmp/cgspace.csv
<pre tabindex="0"><code>$ dspace metadata-export -f /tmp/cgspace.csv
$ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-subjects.csv
</code></pre><ul>
<li>I sanity checked the CSV in <code>csv-metadata-quality</code> after exporting from OpenRefine, then applied the changes to 453 items on CGSpace</li>
@ -1040,7 +1040,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
</ul>
</li>
</ul>
<pre><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
<pre tabindex="0"><code>$ ./create-mappings.py &gt; /tmp/elasticsearch-mappings.txt
$ ./convert-mapping.py &gt;&gt; /tmp/elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/elasticsearch-mappings.txt
@ -1048,12 +1048,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
<li>After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up</li>
<li>I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:</li>
</ul>
<pre><code>$ docker-compose up --build -d angular_nginx
<pre tabindex="0"><code>$ docker-compose up --build -d angular_nginx
</code></pre><h2 id="2020-10-28">2020-10-28</h2>
<ul>
<li>Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:</li>
</ul>
<pre><code>$ docker-compose up --build -d --force-recreate angular_nginx
<pre tabindex="0"><code>$ docker-compose up --build -d --force-recreate angular_nginx
</code></pre><ul>
<li>Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like &ldquo;Burkina faso&rdquo; is due to the country formatter (see: <code>backend/src/harvester/consumers/fetch.consumer.ts</code>)
<ul>
@ -1079,7 +1079,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</ul>
</li>
</ul>
<pre><code>$ cat 2020-10-28-update-regions.csv
<pre tabindex="0"><code>$ cat 2020-10-28-update-regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
@ -1092,7 +1092,7 @@ $ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m14.294s
user 7m59.840s
@ -1115,7 +1115,7 @@ sys 2m22.327s
</li>
<li>Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
COPY 6357
dspace=&gt; \COPY (SELECT DISTINCT text_value as &quot;dc.description.sponsorship&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
COPY 730
@ -1134,7 +1134,7 @@ COPY 5598
</ul>
</li>
</ul>
<pre><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
<pre tabindex="0"><code>$ grep -c '&quot;find&quot;' /tmp/elasticsearch-mappings*
/tmp/elasticsearch-mappings2.txt:350
/tmp/elasticsearch-mappings.txt:1228
$ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | wc -l
@ -1148,7 +1148,7 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | u
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><ul>
@ -1159,14 +1159,14 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-T
</li>
<li>Lower case some straggling AGROVOC subjects on CGSpace:</li>
</ul>
<pre><code>dspace=# BEGIN;
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 123
dspace=# COMMIT;
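dspace=# -- the ~ '[[:upper:]]' filter limits the update to values that still contain at least one upper-case letter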
</code></pre><ul>
<li>Move some top-level communities to the CGIAR System community for Peter:</li>
</ul>
<pre><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
<pre tabindex="0"><code>$ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
$ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><h2 id="2020-10-30">2020-10-30</h2>
<ul>
@ -1187,7 +1187,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</ul>
</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -1198,7 +1198,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
</code></pre><ul>
<li>Then I did a test to apply the corrections and deletions on my local DSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
$ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
$ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
$ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
@ -1214,12 +1214,12 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
</li>
<li>Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
$ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><!-- raw HTML omitted -->


@ -32,7 +32,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -150,12 +150,12 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><ul>
<li>Then I started a Discovery re-index on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m24.993s
user 8m11.858s
@ -190,7 +190,7 @@ sys 2m26.931s
<li>The statistics-2014 core finished processing after five hours, so I started processing the statistics-2013 core on DSpace Test</li>
<li>Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:</li>
</ul>
<pre><code>dspace=# BEGIN;
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 164
dspace=# COMMIT;
@ -211,7 +211,7 @@ dspace=# COMMIT;
</ul>
</li>
</ul>
<pre><code>2020-11-10 08:43:59,634 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
<pre tabindex="0"><code>2020-11-10 08:43:59,634 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:43:59,687 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:43:59,707 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:44:00,004 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
@ -227,7 +227,7 @@ dspace=# COMMIT;
</ul>
</li>
</ul>
<pre><code>2020-11-10 08:51:03,007 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
<pre tabindex="0"><code>2020-11-10 08:51:03,007 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
2020-11-10 08:51:03,008 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,137 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:51:03,153 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
@ -281,11 +281,11 @@ dspace=# COMMIT;
</li>
<li>First we get the total number of communities with stats (using calcdistinct):</li>
</ul>
<pre><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=owningComm&amp;stats.calcdistinct=true&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
<pre tabindex="0"><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=owningComm&amp;stats.calcdistinct=true&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>Then get stats themselves, iterating 100 items at a time with limit and offset:</li>
</ul>
<pre><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=100&amp;facet.offset=0&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
<pre tabindex="0"><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=100&amp;facet.offset=0&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>I was surprised to see 10,000,000 docs with <code>isBot:true</code> when I was testing on DSpace Test&hellip;
<ul>
@ -309,7 +309,7 @@ dspace=# COMMIT;
</ul>
</li>
</ul>
<pre><code>$ dspace cleanup -v
<pre tabindex="0"><code>$ dspace cleanup -v
$ git checkout origin/6_x-dev-atmire-modules
$ npm install -g yarn
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package
@ -329,7 +329,7 @@ $ sudo systemctl start tomcat7
</ul>
</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
@ -345,7 +345,7 @@ $ sudo systemctl start tomcat7
<li>I disabled the dspace-statistics-api for now because it won&rsquo;t work until I migrate all the Solr statistics anyways</li>
<li>Start a full Discovery re-indexing:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 211m30.726s
user 134m40.124s
@ -353,13 +353,13 @@ sys 2m17.979s
</code></pre><ul>
<li>Towards the end of the indexing there were a few dozen of these messages:</li>
</ul>
<pre><code>2020-11-15 13:23:21,685 INFO com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
<pre tabindex="0"><code>2020-11-15 13:23:21,685 INFO com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
</code></pre><ul>
<li>I updated all the Ansible infrastructure and DSpace branches to be the DSpace 6 ones</li>
<li>I will wait until the Discovery indexing is finished to start doing the Solr statistics migration</li>
<li>I tested the email functionality and it seems to need more configuration:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blah@cgiar.org
@ -372,12 +372,12 @@ Error sending email:
<li>I copied the <code>mail.extraproperties = mail.smtp.starttls.enable=true</code> setting from the old DSpace 5 <code>dspace.cfg</code> and now the emails are working</li>
<li>After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
</code></pre><ul>
<li>After about 6,000,000 records I got the same error that I&rsquo;ve gotten every time I test this migration process:</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -407,7 +407,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>There are almost 1,500 locks:</li>
</ul>
<pre><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1494
</code></pre><ul>
<li>I sent a mail to the dspace-tech mailing list to ask for help&hellip;
@ -417,7 +417,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</li>
<li>While processing the statistics-2018 Solr core I got the <em>same</em> memory error that I have gotten every time I processed this core in testing:</li>
</ul>
<pre><code>Exception: Java heap space
<pre tabindex="0"><code>Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
@ -454,7 +454,7 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
@ -486,7 +486,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
<ul>
<li>There are over 2,000 locks:</li>
</ul>
<pre><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2071
</code></pre><h2 id="2020-11-18">2020-11-18</h2>
<ul>
@ -534,7 +534,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</li>
<li>Peter got a strange message this evening when trying to update metadata:</li>
</ul>
<pre><code>2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
<pre tabindex="0"><code>2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,316 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [13]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,385 INFO org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl @ HHH000010: On release of batch it still contained JDBC statements
</code></pre><ul>
@ -603,25 +603,25 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
<pre tabindex="0"><code>dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
COPY 87411
</code></pre><ul>
<li>Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API</li>
<li>Facet by owningComm to see total number of distinct communities (136):</li>
</ul>
<pre><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=id&amp;stats.calcdistinct=true
<pre tabindex="0"><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=id&amp;stats.calcdistinct=true
</code></pre><ul>
<li>Facet by owningComm and get the first 5 distinct:</li>
</ul>
<pre><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=5&amp;facet.offset=0&amp;facet.pivot=id,countryCode
<pre tabindex="0"><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=5&amp;facet.offset=0&amp;facet.pivot=id,countryCode
</code></pre><ul>
<li>Facet by owningComm and countryCode using facet.pivot and maybe I can just skip the normal facet params?</li>
</ul>
<pre><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;facet.pivot=owningComm,countryCode
<pre tabindex="0"><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;facet.pivot=owningComm,countryCode
</code></pre><ul>
<li>Facet by owningComm and countryCode using facet.pivot and limiting to top five countries&hellip; fuck it&rsquo;s possible!</li>
</ul>
<pre><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;f.countryCode.facet.limit=5&amp;facet.pivot=owningComm,countryCode
<pre tabindex="0"><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;f.countryCode.facet.limit=5&amp;facet.pivot=owningComm,countryCode
</code></pre><h2 id="2020-11-23">2020-11-23</h2>
<ul>
<li>I created the sub-communities and collections for IWMI&rsquo;s Strategic Priorities and Research Groups on CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/110259">https://cgspace.cgiar.org/handle/10568/110259</a></li>
@ -688,18 +688,18 @@ COPY 87411
</ul>
</li>
</ul>
<pre><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
<pre tabindex="0"><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
</code></pre><ul>
<li>IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-11-30-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-11-30-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>I used my <code>fix-metadata-values.py</code> script to update the old occurrences of Hung&rsquo;s ORCID and some others that I see have changed:</li>
</ul>
<pre><code>$ cat 2020-11-30-fix-hung-orcid.csv
<pre tabindex="0"><code>$ cat 2020-11-30-fix-hung-orcid.csv
cg.creator.id,correct
&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;,&quot;Hung Nguyen-Viet: 0000-0003-1549-2733&quot;
&quot;Adriana Tofiño: 0000-0001-7115-7169&quot;,&quot;Adriana Tofiño Rivera: 0000-0001-7115-7169&quot;


@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@ I started processing those (about 411,000 records):
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
</code></pre><ul>
<li>AReS went down when the <code>renew-letsencrypt</code> service stopped the <code>angular_nginx</code> container in the pre-update hook and failed to bring it back up
<ul>
@ -151,7 +151,7 @@ I started processing those (about 411,000 records):
</li>
<li>Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
@ -179,13 +179,13 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>First the 2010 core:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Judging by the DSpace logs all these cores had a problem starting up in the last month:</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -rsI &quot;Unable to create core&quot; [dspace]/log/dspace.log.2020-* | grep -o -E &quot;statistics-[0-9]+&quot; | sort | uniq -c
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -rsI &quot;Unable to create core&quot; [dspace]/log/dspace.log.2020-* | grep -o -E &quot;statistics-[0-9]+&quot; | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
@ -193,7 +193,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</code></pre><ul>
<li>The message is always this:</li>
</ul>
<pre><code>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
<pre tabindex="0"><code>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
</code></pre><ul>
<li>I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
<ul>
@ -223,7 +223,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>There are apparently 1,700 locks right now:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739
</code></pre><h2 id="2020-12-08">2020-12-08</h2>
<ul>
@ -233,7 +233,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code>Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
<pre tabindex="0"><code>Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@ -270,7 +270,7 @@ Caused by: java.lang.UnsupportedOperationException
<ul>
<li>I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:</li>
</ul>
<pre><code>Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
<pre tabindex="0"><code>Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
Run 1 —  67% — 10,000/14,935 docs — 6m 6s — 6m 6s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
@ -279,7 +279,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
<li>I increased the JVM heap to 2048m and tried again, but it failed with a memory error again&hellip;</li>
<li>I increased the JVM heap to 4096m and tried again, but it failed with another error:</li>
</ul>
<pre><code>Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
<pre tabindex="0"><code>Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
Exception: parsing error
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: parsing error
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:530)
@ -341,7 +341,7 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
<ul>
<li>I can see it in the <code>openrxv-items-final</code> index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
{
&quot;_shards&quot; : {
&quot;failed&quot; : 0,
@ -355,14 +355,14 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/64">https://github.com/ilri/OpenRXV/issues/64</a></li>
<li>For now I will try to delete the index and start a re-harvest in the Admin UI:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-items-final
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-items-final
{&quot;acknowledged&quot;:true}%
</code></pre><ul>
<li>Moayad said he&rsquo;s working on the harvesting so I stopped it for now to re-deploy his latest changes</li>
<li>I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server</li>
<li>I deleted both items indexes and restarted the harvesting:</li>
</ul>
<pre><code>$ curl -XDELETE http://localhost:9200/openrxv-items-final
<pre tabindex="0"><code>$ curl -XDELETE http://localhost:9200/openrxv-items-final
$ curl -XDELETE http://localhost:9200/openrxv-items-temp
</code></pre><ul>
<li>Peter asked me for a list of all submitters and approvers that were active recently on CGSpace
@ -371,7 +371,7 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
</code></pre><h2 id="2020-12-14">2020-12-14</h2>
<ul>
<li>The re-harvesting finished last night on AReS but there are no records in the <code>openrxv-items-final</code> index
@ -380,7 +380,7 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
&quot;count&quot; : 99992,
&quot;_shards&quot; : {
@ -397,14 +397,14 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&quot;acknowledged&quot;:true,&quot;shards_acknowledged&quot;:true,&quot;index&quot;:&quot;openrxv-items-final&quot;}
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
<li>Now I see that the <code>openrxv-items-final</code> index has items, but there are still none in AReS Explorer UI!</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
{
&quot;count&quot; : 99992,
&quot;_shards&quot; : {
@ -417,7 +417,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</code></pre><ul>
<li>The api logs show this from last night after the harvesting:</li>
</ul>
<pre><code class="language-console" data-lang="console">[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
<pre tabindex="0"><code class="language-console" data-lang="console">[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
@ -432,7 +432,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
<li>I cloned the <code>openrxv-items-final</code> index to <code>openrxv-items</code> index and now I see items in the explorer UI</li>
<li>The PDF report was broken and I looked in the API logs and saw this:</li>
</ul>
<pre><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
<pre tabindex="0"><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
</code></pre><ul>
@ -457,7 +457,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</ul>
</li>
</ul>
<pre><code>$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=0' | json_pp &gt; /tmp/policy1.json
<pre tabindex="0"><code>$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=0' | json_pp &gt; /tmp/policy1.json
$ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&amp;limit=100&amp;offset=100' | json_pp &gt; /tmp/policy2.json
$ query-json '.items | length' /tmp/policy1.json
100
@ -487,7 +487,7 @@ $ query-json '.items | length' /tmp/policy2.json
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2020-12-15">2020-12-15</h2>
@ -499,12 +499,12 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</li>
<li>I checked the 1,534 fixes in OpenRefine (had to fix a few UTF-8 errors, as always from Peter&rsquo;s CSVs) and then applied them using the <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
</code></pre><ul>
<li>Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# BEGIN;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
@ -513,7 +513,7 @@ COMMIT
</code></pre><ul>
<li>I also updated the Font Awesome icon classes for version 5 syntax:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# BEGIN;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
@ -522,7 +522,7 @@ dspace=# COMMIT;
</code></pre><ul>
<li>Then I started a full Discovery re-index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 265m11.224s
@ -544,7 +544,7 @@ sys 2m41.097s
<ul>
<li>After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the <code>openrxv-items-temp</code> index was empty and that the backup index I made yesterday was still there:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
{
&quot;acknowledged&quot; : true
}
@ -576,7 +576,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100046,
&quot;_shards&quot; : {
@ -611,7 +611,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
</li>
<li>Generate a list of submitters and approvers active in the last months using the Provenance field on CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -h localhost -U postgres dspace -c &quot;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'&quot; &gt; /tmp/provenance.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -h localhost -U postgres dspace -c &quot;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'&quot; &gt; /tmp/provenance.txt
$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E &quot;( on |checksum)&quot; | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq &gt; /tmp/recent-submitters-approvers.csv
</code></pre><ul>
<li>Peter wanted it to send some mail to the users&hellip;</li>
@ -620,7 +620,7 @@ $ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E &quot;( on |checksum)&quo
<ul>
<li>I see some errors from CUA in our Tomcat logs:</li>
</ul>
<pre><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
<pre tabindex="0"><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
@ -636,7 +636,7 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
</li>
<li>I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author&rsquo;s names, but it throws an error&hellip;</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
@ -657,7 +657,7 @@ java.lang.NullPointerException
</code></pre><ul>
<li>I did it via CSV with <code>fix-metadata-values.py</code> instead:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2020-12-17-update-ILRI-author.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2020-12-17-update-ILRI-author.csv
dc.contributor.author,correct
&quot;Padmakumar, V.P.&quot;,&quot;Varijakshapanicker, Padmakumar&quot;
$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
@ -668,7 +668,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
</ul>
</li>
</ul>
<pre><code>$ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 &gt; /tmp/limited-2020.csv
<pre tabindex="0"><code>$ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 &gt; /tmp/limited-2020.csv
</code></pre><h2 id="2020-12-18">2020-12-18</h2>
<ul>
<li>I added support for indexing community views and downloads to <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>
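<ul>
<li>Presumably the API&rsquo;s indexer has to be re-run so the new community and collection statistics get populated; a sketch of the usual invocation (assumed, not verified here):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ python -m dspace_statistics_api.indexer
</code></pre>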
@ -689,7 +689,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
<ul>
<li>The DeduplicateValuesProcessor has been running on DSpace Test for the last two days and it almost completed its second twelve-hour run, but crashed near the end:</li>
</ul>
<pre><code class="language-console" data-lang="console">...
<pre tabindex="0"><code class="language-console" data-lang="console">...
Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
@ -744,7 +744,7 @@ java.lang.OutOfMemoryError: Java heap space
<li>The AReS harvest finished this morning and I moved the Elasticsearch index manually</li>
<li>First, check the number of records in the temp index to make sure it seems complete and not with double data:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100135,
&quot;_shards&quot; : {
@ -757,13 +757,13 @@ java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>Then delete the old backup and clone the current items index as a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
</code></pre><ul>
<li>Then delete the current items index and clone it from temp:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
@ -806,11 +806,11 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</ul>
</li>
</ul>
<pre><code>statistics-2012: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
<pre tabindex="0"><code>statistics-2012: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
@ -824,7 +824,7 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100135,
&quot;_shards&quot; : {
@ -842,7 +842,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings?pretty&quot; -H 'Cont
<ul>
<li>The indexing on AReS finished so I cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> and deleted the backup index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'

View File

@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -160,12 +160,12 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100278,
&quot;_shards&quot; : {
@ -214,7 +214,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
</code></pre><ul>
<li>Help Udana export IWMI records from AReS
<ul>
@ -261,12 +261,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&quot;TX35636856957739531161091194485578658698&quot;)
<pre tabindex="0"><code class="language-console" data-lang="console">2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&quot;TX35636856957739531161091194485578658698&quot;)
</code></pre><ul>
<li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=907">a bug on Atmire&rsquo;s issue tracker</a></li>
<li>Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
Loading @mire database changes for module MQM
Changes have been processed
Exception: null
@ -301,7 +301,7 @@ java.lang.UnsupportedOperationException
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
... after ten hours
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
@ -331,7 +331,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
0
</code></pre><ul>
<li>So now I should really add it to the DSpace spider agent list so it doesn&rsquo;t create Solr hits
@ -341,7 +341,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</li>
<li>I purged the existing hits using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</code></pre><h2 id="2021-01-11">2021-01-11</h2>
<ul>
<li>The AReS indexing finished this morning and I moved the <code>openrxv-items-temp</code> core to <code>openrxv-items</code> (see above)
@ -351,7 +351,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</li>
<li>I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
</code></pre><h2 id="2021-01-12">2021-01-12</h2>
<ul>
<li>IWMI is really pressuring us to have a periodic CSV export of their community
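<ul>
<li>A minimal sketch of how that could work (the cron schedule, user, and IWMI community handle below are assumptions), using the normal metadata export:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># /etc/cron.d/iwmi-csv (sketch only)
0 4 * * 1 dspace [dspace]/bin/dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
</code></pre>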
@ -393,12 +393,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100540,
&quot;_shards&quot; : {
@ -445,7 +445,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
</ul>
</li>
</ul>
<pre><code>localhost/dspace63= &gt; BEGIN;
<pre tabindex="0"><code>localhost/dspace63= &gt; BEGIN;
localhost/dspace63= &gt; DELETE FROM metadatavalue WHERE metadata_field_id IN (115, 116, 117, 118);
DELETE 27
localhost/dspace63= &gt; COMMIT;
@ -462,7 +462,7 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker exec -it api /bin/bash
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker exec -it api /bin/bash
# apt update &amp;&amp; apt install unoconv
</code></pre><ul>
<li>Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
@ -512,12 +512,12 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100699,
&quot;_shards&quot; : {
@ -579,7 +579,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
</ul>
</li>
</ul>
<pre><code>Jan 26, 2021 10:47:23 AM org.apache.coyote.http11.AbstractHttp11Processor process
<pre tabindex="0"><code>Jan 26, 2021 10:47:23 AM org.apache.coyote.http11.AbstractHttp11Processor process
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in the request target [/discover/search/csv?query=*&amp;scope=~&amp;filters=author:(Alan\%20Orth)]. The valid characters are defined in RFC 7230 and RFC 3986
@ -601,12 +601,12 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
<li>I <a href="https://jira.lyrasis.org/browse/DS-4566">filed a bug</a> on DSpace&rsquo;s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)</li>
<li>Looking into Linode report that the load outbound traffic rate was high this morning:</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The culprit seems to be the ILRI publications importer, so that&rsquo;s OK</li>
<li>But I also see an IP in Jordan hitting the REST API 1,100 times today:</li>
</ul>
<pre><code>80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] &quot;GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0&quot; 302 138 &quot;http://wp.local/&quot; &quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36&quot;
<pre tabindex="0"><code>80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] &quot;GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0&quot; 302 138 &quot;http://wp.local/&quot; &quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36&quot;
</code></pre><ul>
<li>Seems to be someone from CodeObia working on WordPress
<ul>
@ -615,7 +615,7 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</li>
<li>I purged all ~3,000 statistics hits that have the &ldquo;<a href="http://wp.local/%22">http://wp.local/&quot;</a> referrer:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Tag version 0.4.3 of the csv-metadata-quality tool on GitHub: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3</a>
<ul>
@ -661,7 +661,7 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><ul>
<li>Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku</li>

View File

@ -60,7 +60,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
}
}
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -157,7 +157,7 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
@ -170,18 +170,18 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
</code></pre><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
</code></pre><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
</code></pre><ul>
@ -196,7 +196,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
</li>
<li>I tried to export the ILRI community from CGSpace but I got an error:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
@ -234,16 +234,16 @@ java.lang.NullPointerException
<li>Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart&rsquo;s iD</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
</code></pre><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>Then I added all the changed names plus Stefan&rsquo;s incorrect ones to a CSV and processed them with <code>fix-metadata-values.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv
cg.creator.id,correct
Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
@ -263,7 +263,7 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
<ul>
<li>Tag forty-three items from Bioversity&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv
dc.contributor.author,cg.creator.id
&quot;Nchanji, E.&quot;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&quot;Nchanji, Eileen&quot;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
@ -300,7 +300,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 dspace index-discovery -b
$ dspace oai import -c
</code></pre><ul>
<li>Attend Accenture meeting for repository managers
@ -333,7 +333,7 @@ $ dspace oai import -c
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
</code></pre><ul>
<li>The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
<ul>
@ -358,7 +358,7 @@ $ dspace oai import -c
<li>I ended up using <a href="https://github.com/LuminosoInsight/python-ftfy">python-ftfy</a> to fix those very easily, then replaced them in the CSV</li>
<li>Then I trimmed whitespace at the beginning, end, and around the &ldquo;;&rdquo;, and applied the 1,600 fixes using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
</code></pre><ul>
<li>Help Peter debug an issue with one of Alan Duncan&rsquo;s new FEAST Data reports on CGSpace
<ul>
@ -372,7 +372,7 @@ $ dspace oai import -c
<li>Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server</li>
<li>After the server came back up I started a full Discovery re-indexing:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 247m30.850s
user 160m36.657s
@ -385,13 +385,13 @@ sys 2m26.050s
</li>
<li>Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><h2 id="2021-02-08">2021-02-08</h2>
<ul>
<li>Finish rotating the AReS indexes after the harvesting last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100983,
&quot;_shards&quot; : {
@ -429,7 +429,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
30354
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
18555
@ -452,15 +452,15 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
</code></pre><ul>
<li>I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:</li>
</ul>
<pre><code class="language-console" data-lang="console">if(diff(value,&quot;01/01/2010&quot;.toDate(),&quot;days&quot;)&lt;0, true, false)
<pre tabindex="0"><code class="language-console" data-lang="console">if(diff(value,&quot;01/01/2010&quot;.toDate(),&quot;days&quot;)&lt;0, true, false)
</code></pre><ul>
<li>Then I filtered by publisher to make sure they were only ours:</li>
</ul>
<pre><code class="language-console" data-lang="console">or(
<pre tabindex="0"><code class="language-console" data-lang="console">or(
value.contains(&quot;International Livestock Research Institute&quot;),
value.contains(&quot;ILRI&quot;),
value.contains(&quot;International Livestock Centre for Africa&quot;),
@ -488,7 +488,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
<li>Run system updates, deploy latest <code>6_x-prod</code> branch, and reboot CGSpace (linode18)</li>
<li>Normalize <code>text_lang</code> of DSpace item metadata on CGSpace:</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2567413
@ -504,7 +504,7 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
<ul>
<li>Clear the OpenRXV temp items index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</code></pre><ul>
<li>Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard</li>
<li>Peter asked me about a few other recently submitted FEAST items that are restricted
@ -521,12 +521,12 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
</code></pre><h2 id="2021-02-15">2021-02-15</h2>
<ul>
<li>Check the results of the AReS Harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 101126,
&quot;_shards&quot; : {
@ -539,12 +539,12 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
</code></pre><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
</code></pre><ul>
<li>Delete the current items index and clone the temp one:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
@ -563,18 +563,18 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
</li>
<li>They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
4007
$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
2128
</code></pre><ul>
<li>Ah, actually 45.146.165.203 is making requests like this:</li>
</ul>
<pre><code class="language-console" data-lang="console">&quot;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">&quot;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO&quot;
</code></pre><ul>
<li>I purged the hits from these two using my <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 4005 hits from 45.146.165.203 in statistics
Purging 3493 hits from 130.255.161.231 in statistics
@ -582,7 +582,7 @@ Total number of bot hits purged: 7498
</code></pre><ul>
<li>Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 27163 hits from 45.146.164.176 in statistics
Purging 19556 hits from 45.146.165.105 in statistics
Purging 15927 hits from 45.146.165.83 in statistics
@ -596,7 +596,7 @@ Total number of bot hits purged: 70731
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 3 hits from 130.255.161.231 in statistics
Purging 16773 hits from 64.39.99.15 in statistics
Purging 6976 hits from 64.39.99.13 in statistics
@ -627,7 +627,7 @@ Total number of bot hits purged: 23789
<li>Abenet asked me to add Tom Randolph&rsquo;s ORCID identifier to CGSpace</li>
<li>I also tagged all his 247 existing items on CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv
dc.contributor.author,cg.creator.id
&quot;Randolph, Thomas F.&quot;,&quot;Thomas Fitz Randolph: 0000-0003-1849-9877&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
@ -640,7 +640,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
<li>Start the CG Core v2 migration on CGSpace (linode18)</li>
<li>After deploying the latest <code>6_x-prod</code> branch and running <code>migrate-fields.sh</code> I started a full Discovery reindex:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 311m12.617s
user 217m3.102s
@ -648,7 +648,7 @@ sys 2m37.363s
</code></pre><ul>
<li>Then update OAI:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace oai import -c
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace oai import -c
$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
</code></pre><ul>
<li>Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet (a rough sketch of one such query follows below)
@ -668,14 +668,14 @@ $ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
</ul>
</li>
</ul>
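<ul>
<li>A rough sketch of the kind of REST API query that could work for that, against a single ILRI collection (the UUID here is a placeholder):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'https://cgspace.cgiar.org/rest/collections/COLLECTION-UUID/items?limit=100&amp;offset=0'
</code></pre>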
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</code></pre><ul>
<li>The process took an hour or so!</li>
<li>I added colorized output to the csv-metadata-quality tool and tagged <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.4">version 0.4.4 on GitHub</a></li>
<li>I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><h2 id="2021-02-22">2021-02-22</h2>
<ul>
@ -687,7 +687,7 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
</code></pre><ul>
<li>As for splitting the other values, I think I can export the <code>dspace_object_id</code> and <code>text_value</code> and then upload it as a CSV rather than writing a Python script to create the new metadata values</li>
@ -696,7 +696,7 @@ UPDATE 104
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 101380,
&quot;_shards&quot; : {
@ -709,18 +709,18 @@ UPDATE 104
</code></pre><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
</code></pre><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
</code></pre><h2 id="2021-02-23">2021-02-23</h2>
@ -732,21 +732,21 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
</li>
<li>Remove semicolons from series names without numbers:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# BEGIN;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
dspace=# COMMIT;
</code></pre><ul>
<li>Set all <code>text_lang</code> values on CGSpace to <code>en_US</code> to make the series replacements easier (this didn&rsquo;t work, read below):</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# BEGIN;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 911
cgspace=# COMMIT;
</code></pre><ul>
<li>Then export all series with their IDs to CSV:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as &quot;dcterms.isPartOf[en_US]&quot; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as &quot;dcterms.isPartOf[en_US]&quot; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
</code></pre><ul>
<li>In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
<ul>
@ -761,22 +761,22 @@ cgspace=# COMMIT;
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
UPDATE 1
</code></pre><ul>
<li>This also seems to work, using the id for just that one item:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
UPDATE 37
</code></pre><ul>
<li>This seems to work better for some reason:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
<pre tabindex="0"><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 18659
</code></pre><ul>
<li>I split the CSV file in batches of 5,000 using xsv, then imported them one by one in CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv
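# the 5,000-row chunks (0.csv, 5000.csv, ...) were created beforehand with
# xsv, roughly like this (the cleaned input filename is just a placeholder):
$ xsv split -s 5000 /tmp /tmp/2021-02-23-series-cleaned.csv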
</code></pre><ul>
<li>It took FOREVER to import each file&hellip; like several hours <em>each</em>. MY GOD DSpace 6 is slow.</li>
<li>Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
@ -785,7 +785,7 @@ UPDATE 18659
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &quot;GET /rest/communities?limit=1000 HTTP/1.1&quot; 200 188779 &quot;https://cgspace.cgiar.org/rest /communities?limit=1000&quot; &quot;RTB website BOT&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &quot;GET /rest/communities?limit=1000 HTTP/1.1&quot; 200 188779 &quot;https://cgspace.cgiar.org/rest /communities?limit=1000&quot; &quot;RTB website BOT&quot;
104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] &quot;GET /rest/communities//communities HTTP/1.1&quot; 404 714 &quot;https://cgspace.cgiar.org/rest/communities//communities&quot; &quot;RTB website BOT&quot;
</code></pre><ul>
<li>The first request is OK, but the second one is malformed for sure</li>
@ -794,12 +794,12 @@ UPDATE 18659
<ul>
<li>Export a list of journals for Peter to look through:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.journal&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.journal&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 3345
</code></pre><ul>
<li>Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
</code></pre><ul>
<li>Also, I want to include the new series name/number cleanups so it&rsquo;s not a total waste of time</li>
@ -808,7 +808,7 @@ COPY 3345
<ul>
<li>Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 99546,
&quot;_shards&quot; : {
@ -843,7 +843,7 @@ COPY 3345
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&quot;&quot;)
<pre tabindex="0"><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&quot;&quot;)
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;)
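# a worked example, assuming a combined value like: Agricultural Systems;24(2)
# the partition fragment [1] is 24(2); the first replace above gives 24 (the
# volume) and the second gives 2 (the issue)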
</code></pre><ul>
<li>This <code>value.partition</code> was new to me&hellip; and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with <code>value.replace</code></li>
@ -857,7 +857,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;)
<li>Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow</li>
<li>It seems the BatchEditConsumer log spam is gone since I applied <a href="https://github.com/ilri/DSpace/pull/462">Atmire&rsquo;s patch</a></li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
dspace.log.2021-02-10:5067
dspace.log.2021-02-11:2647
dspace.log.2021-02-12:4231


@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -163,14 +163,14 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
<ul>
<li>I looked at the number of connections in PostgreSQL and it&rsquo;s definitely high again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
</code></pre><ul>
<li>I reported it to Atmire to take a look, on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851">same issue</a> we had been tracking this before</li>
<li>Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell</li>
<li>I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-03-04-add-zoe-campbell-orcid.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-03-04-add-zoe-campbell-orcid.csv
dc.contributor.author,cg.creator.identifier
&quot;Campbell, Zoë&quot;,&quot;Zoe Campbell: 0000-0002-4759-9976&quot;
&quot;Campbell, Zoe A.&quot;,&quot;Zoe Campbell: 0000-0002-4759-9976&quot;
@ -183,7 +183,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
</code></pre><ul>
<li>I used OpenRefine to remove all journal values that didn&rsquo;t contain one of these characters: ; ( )
@ -193,7 +193,7 @@ COPY 32087
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">value.partition(';')[0].trim() # to get journal names
<pre tabindex="0"><code class="language-console" data-lang="console">value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&quot;$1&quot;) # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;) # to get journal issues
</code></pre><ul>
@ -233,7 +233,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;) #
<ul>
<li>I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml down
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
@ -249,12 +249,12 @@ $ docker-compose -f docker/docker-compose.yml up -d
<li>I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit</li>
<li>Delete the <code>openrxv-items-temp</code> index to test a fresh harvesting:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</code></pre><h2 id="2021-03-05-1">2021-03-05</h2>
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 101761,
&quot;_shards&quot; : {
@ -267,18 +267,18 @@ $ docker-compose -f docker/docker-compose.yml up -d
</code></pre><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
</code></pre><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</code></pre><ul>
@ -298,7 +298,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
@ -308,7 +308,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</code></pre><ul>
<li>But on AReS production <code>openrxv-items</code> has somehow become a concrete index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
&quot;openrxv-items&quot;: {
&quot;aliases&quot;: {}
@ -322,7 +322,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</code></pre><ul>
<li>I fixed the issue on production by cloning the <code>openrxv-items</code> index to <code>openrxv-items-final</code>, deleting <code>openrxv-items</code>, and then re-creating it as an alias:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
@ -331,7 +331,7 @@ $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application
</code></pre><ul>
<li>Delete backups and remove read-only mode on <code>openrxv-items</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
<li>Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
@ -340,11 +340,11 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Typ
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the usual IPs for CCAFS and ILRI importer bots, but also <code>143.233.242.132</code> which appears to be for GARDIAN:</li>
</ul>
<pre><code class="language-console" data-lang="console"># zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
<pre tabindex="0"><code class="language-console" data-lang="console"># zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
@ -375,7 +375,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Typ
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
</code></pre><ul>
<li>On 2021-03-03 the PostgreSQL transactions started rising:</li>
@ -409,7 +409,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Typ
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
</code></pre><ul>
@ -434,7 +434,7 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-03-10">2021-03-10</h2>
<ul>
<li>Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses <code>cg.isijournal</code> and MELSpace uses <code>mel.impact-factor</code>
@ -444,7 +444,7 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
</li>
<li>Peter said he doesn&rsquo;t see &ldquo;Source Code&rdquo; or &ldquo;Software&rdquo; in the <a href="https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type">output type facet on the ILRI community</a>, but I see it on the home page, so I will try to do a full Discovery re-index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 318m20.485s
user 215m15.196s
@ -467,7 +467,7 @@ sys 2m51.529s
<ul>
<li>Switch to linux-kvm kernel on linode20 and linode18:</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update &amp;&amp; apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove &amp;&amp; apt autoclean
@ -478,13 +478,13 @@ sys 2m51.529s
<li>Last week Peter added OpenRXV to CGSpace: <a href="https://hdl.handle.net/10568/112982">https://hdl.handle.net/10568/112982</a></li>
<li>Back up the current <code>openrxv-items-final</code> index on AReS to start a new harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
<li>After the harvesting finished it seems the indexes got messed up again, as <code>openrxv-items</code> is an alias of <code>openrxv-items-temp</code> instead of <code>openrxv-items-final</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
@ -535,7 +535,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</li>
<li>Back up the current <code>openrxv-items-final</code> index to start a fresh AReS harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
@ -545,7 +545,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
<ul>
<li>The harvesting on AReS yesterday completed, but somehow I have twice the number of items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
{
&quot;count&quot; : 206204,
&quot;_shards&quot; : {
@ -558,7 +558,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</code></pre><ul>
<li>Hmmm and even my backup index has a strange number of items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty'
{
&quot;count&quot; : 844,
&quot;_shards&quot; : {
@ -571,7 +571,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</code></pre><ul>
<li>I deleted all indexes and re-created the openrxv-items alias:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
&quot;openrxv-items-temp&quot;: {
@ -591,7 +591,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</li>
<li>The AReS harvest finally finished, with 1047 pages of items, but the <code>openrxv-items-final</code> index is empty and the <code>openrxv-items-temp</code> index has 103,000 items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 103162,
&quot;_shards&quot; : {
@ -604,12 +604,12 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</code></pre><ul>
<li>I tried to clone the temp index to the final, but got an error:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&quot;error&quot;:{&quot;root_cause&quot;:[{&quot;type&quot;:&quot;resource_already_exists_exception&quot;,&quot;reason&quot;:&quot;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&quot;,&quot;index_uuid&quot;:&quot;LmxH-rQsTRmTyWex2d8jxw&quot;,&quot;index&quot;:&quot;openrxv-items-final&quot;}],&quot;type&quot;:&quot;resource_already_exists_exception&quot;,&quot;reason&quot;:&quot;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&quot;,&quot;index_uuid&quot;:&quot;LmxH-rQsTRmTyWex2d8jxw&quot;,&quot;index&quot;:&quot;openrxv-items-final&quot;},&quot;status&quot;:400}%
</code></pre><ul>
<li>I looked in the Docker logs for Elasticsearch and saw a few memory errors:</li>
</ul>
<pre><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
<pre tabindex="0"><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
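# the errors show up in the Elasticsearch container logs, found with something
# like this (the container name is an assumption):
$ docker logs docker_elasticsearch_1 2&gt;&amp;1 | grep -c OutOfMemoryError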
</code></pre><ul>
<li>According to <code>/usr/share/elasticsearch/config/jvm.options</code> in the Elasticsearch container, the default JVM heap is 1g (a quick check is sketched below)
<ul>
@ -622,7 +622,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</ul>
</li>
</ul>
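<ul>
<li>A quick way to confirm those defaults from the host (the container name is an assumption):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker exec docker_elasticsearch_1 grep -E '^-Xm[sx]' /usr/share/elasticsearch/config/jvm.options
</code></pre>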
<pre><code class="language-console" data-lang="console"> &quot;openrxv-items-final&quot;: {
<pre tabindex="0"><code class="language-console" data-lang="console"> &quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
},
&quot;openrxv-items-temp&quot;: {
@ -634,7 +634,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<ul>
<li>For reference you can also get the Elasticsearch JVM stats from the API:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
</code></pre><ul>
<li>I re-deployed AReS with 1.5GB of heap using the <code>ES_JAVA_OPTS</code> environment variable
<ul>
@ -644,7 +644,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<li>Then I fixed the aliases to make sure <code>openrxv-items</code> was an alias of <code>openrxv-items-final</code>, similar to how I did a few weeks ago</li>
<li>I re-created the temp index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
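# the alias fix mentioned above was the same _aliases call as earlier this month:
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'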
</code></pre><h2 id="2021-03-24">2021-03-24</h2>
<ul>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">ticket about the Duplicate Checker</a>
@ -659,18 +659,18 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># du -s /home/dspacetest.cgiar.org/solr/statistics
<pre tabindex="0"><code class="language-console" data-lang="console"># du -s /home/dspacetest.cgiar.org/solr/statistics
57861236 /home/dspacetest.cgiar.org/solr/statistics
</code></pre><ul>
<li>I applied their changes to <code>config/spring/api/atmire-cua-update.xml</code> and started the duplicate processor:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
</code></pre><ul>
<li>The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)</li>
<li>Hah, I still got a memory error after only a few minutes:</li>
</ul>
<pre><code class="language-console" data-lang="console">...
<pre tabindex="0"><code class="language-console" data-lang="console">...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
@ -678,7 +678,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
<li>I guess we really do have to use <code>-r 100</code></li>
<li>Now the thing runs for a few minutes and &ldquo;finishes&rdquo;:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
Loading @mire database changes for module MQM
Changes have been processed
@ -796,7 +796,7 @@ Run 1 took 5m 53s
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
<pre tabindex="0"><code class="language-console" data-lang="console">2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</code></pre><ul>
<li>But the item mapper only displays ten items, with no pagination
<ul>
@ -845,7 +845,7 @@ r <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</sp
</code></pre></div><ul>
<li>I exported a list of all our ISSNs from CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081
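# checking a single ISSN against the Crossref REST API looks like this (the
# ISSN is only an example; a 404 means Crossref does not know it):
$ curl -s 'https://api.crossref.org/journals/0021-8596'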
</code></pre><ul>
<li>I wrote a script to check the ISSNs against Crossref&rsquo;s API: <code>crossref-issn-lookup.py</code>


@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -153,16 +153,16 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-account&quot; -W &quot;(sAMAccountName=otheraccounttoquery)&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-account&quot; -W &quot;(sAMAccountName=otheraccounttoquery)&quot;
</code></pre><h2 id="2021-04-04">2021-04-04</h2>
<ul>
<li>Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</code></pre><ul>
<li>Then set the <code>openrxv-items-final</code> index to read-only so we can make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
{&quot;acknowledged&quot;:true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{&quot;acknowledged&quot;:true,&quot;shards_acknowledged&quot;:true,&quot;index&quot;:&quot;openrxv-items-final-backup&quot;}%
@ -181,7 +181,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
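# the CSV columns match the -f/-t flags above, roughly:
# cg.issn,correct
# e-ISSN: 1234-5678,1234-5678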
</code></pre><ul>
<li>For now I only fixed obvious errors like &ldquo;1234-5678.&rdquo; and &ldquo;e-ISSN: 1234-5678&rdquo; etc, but there are still lots of invalid ones which need more manual work:
<ul>
@ -196,7 +196,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
<ul>
<li>The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
@ -218,7 +218,7 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
sed '1d' | \
csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, &quot;&quot;)||COALESCE(c, &quot;&quot;)||COALESCE(d, &quot;&quot;) AS issued FROM stdin' | \
@ -257,13 +257,13 @@ $ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.
</code></pre></div><ul>
<li>Then I submitted the file three times (changing the page parameter):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page1.json
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page3.json
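# only the page field changes between the three runs; the POST body looks
# roughly like: { &quot;limit&quot;: 100, &quot;page&quot;: 0, &quot;items&quot;: [ ... ] }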
</code></pre><ul>
<li>Then I extracted the views and downloads in the most ridiculous way:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
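# a tidier alternative with jq, assuming each response has a statistics array
# with per-item views and downloads:
$ jq -s '[.[].statistics[].views] | add' /tmp/page*.json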
@ -290,16 +290,16 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413
</code></pre><ul>
<li>The system journal shows thousands of these messages; this is the first one:</li>
</ul>
<pre><code class="language-console" data-lang="console">Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
<pre tabindex="0"><code class="language-console" data-lang="console">Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
</code></pre><ul>
<li>Around that time in the dspace log I see nothing unusual, but maybe these?</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
</code></pre><ul>
<li>(BTW what is the deal with the &ldquo;200/127&rdquo;? I should send a comment to Atmire)
<ul>
@ -308,7 +308,7 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
</li>
<li>I restarted the PostgreSQL and Tomcat services and now I see fewer connections, but still WAY high:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
@ -318,7 +318,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<li>After ten minutes or so it went back down&hellip;</li>
<li>And now it&rsquo;s back up in the thousands&hellip; I am seeing a lot of stuff in the dspace log like this:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
@ -354,17 +354,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<li>I had a meeting with Peter and Abenet about CGSpace TODOs</li>
<li>CGSpace went down again and the PostgreSQL locks are through the roof:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
</code></pre><ul>
<li>I don&rsquo;t see any activity on the REST API, but in the last four hours there have been 3,500 DSpace sessions:</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547
</code></pre><ul>
<li>I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:</li>
</ul>
<pre><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E &quot;2021-0(3|4)-[0-9]{2} (13|14|15|16|17):&quot; &quot;$file&quot; | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
<pre tabindex="0"><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E &quot;2021-0(3|4)-[0-9]{2} (13|14|15|16|17):&quot; &quot;$file&quot; | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
@ -390,7 +390,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</code></pre><ul>
<li>What about the total number of sessions per day?</li>
</ul>
<pre><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo &quot;$file:&quot;; grep -a -o -E 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
<pre tabindex="0"><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo &quot;$file:&quot;; grep -a -o -E 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
@ -421,7 +421,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>The locks in PostgreSQL shot up again&hellip;</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
@ -440,7 +440,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<ul>
<li>While looking at the nginx logs I see that MEL is trying to log into CGSpace&rsquo;s REST API and delete items:</li>
</ul>
<pre><code class="language-console" data-lang="console">34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] &quot;POST /rest/login HTTP/1.1&quot; 401 727 &quot;-&quot; &quot;MEL&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] &quot;POST /rest/login HTTP/1.1&quot; 401 727 &quot;-&quot; &quot;MEL&quot;
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] &quot;DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1&quot; 401 704 &quot;-&quot; &quot;-&quot;
</code></pre><ul>
<li>I see a few of these per day going back several months
@ -450,7 +450,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>Also annoying, I see tons of what look like penetration testing requests from Qualys:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user &quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user &quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=&quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email=&quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
@ -464,19 +464,19 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>10PM and the server is down again, with locks through the roof:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12198
</code></pre><ul>
<li>I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
<pre tabindex="0"><code class="language-console" data-lang="console">$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
1838
$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
3
</code></pre><ul>
<li>I even restarted the server and connections were low for a few minutes until they shot back up:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8651
@ -488,12 +488,12 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<li>I had to go to bed and I bet it will crash and be down for hours until I wake up&hellip;</li>
<li>What the hell is this user agent?</li>
</ul>
<pre><code>54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] &quot;GET /handle/10568/16499 HTTP/1.1&quot; 499 0 &quot;-&quot; &quot;GetUrl/1.0 wdestiny@umich.edu (Linux)&quot;
<pre tabindex="0"><code>54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] &quot;GET /handle/10568/16499 HTTP/1.1&quot; 499 0 &quot;-&quot; &quot;GetUrl/1.0 wdestiny@umich.edu (Linux)&quot;
</code></pre><h2 id="2021-04-07">2021-04-07</h2>
<ul>
<li>CGSpace was still down from last night of course, with tons of database locks:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12168
</code></pre><ul>
<li>I restarted the server again and the locks came back</li>
@ -504,7 +504,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
</code></pre><ul>
<li>The issue is not the named user above, but a member of the group&hellip;</li>
<li>And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):</li>
@ -513,7 +513,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<ul>
<li>I extracted all the group IDs from recent logs that had users with invalid email addresses:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
@ -565,12 +565,12 @@ fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
</ul>
</li>
</ul>
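<ul>
<li>To see who is actually in one of those groups, a query against the EPerson tables would list the members&rsquo; email addresses (a sketch, assuming the stock DSpace 6 schema where <code>epersongroup2eperson</code> links groups to epersons by UUID):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# SELECT e.email FROM eperson e JOIN epersongroup2eperson g2e ON g2e.eperson_id = e.uuid WHERE g2e.eperson_group_id = '0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6';
</code></pre>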
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12070
</code></pre><ul>
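<li>To see where the locks are coming from, the same <code>pg_locks</code>/<code>pg_stat_activity</code> join can be grouped by application and state, something like:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT psa.application_name, psa.state, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY psa.application_name, psa.state ORDER BY COUNT(*) DESC;'
</code></pre><ul>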
<li>I restarted PostgreSQL and Tomcat and the locks go straight back up!</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
986
@ -608,7 +608,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
@ -616,18 +616,18 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
</code></pre><ul>
<li>Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<li>Then I realized I have to clone the backup index directly to <code>openrxv-items-final</code>, and re-create the <code>openrxv-items</code> alias:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT &quot;localhost:9200/openrxv-items-backup/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
</code></pre><ul>
<li>Now I see both <code>openrxv-items-final</code> and <code>openrxv-items</code> have the current number of items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
{
&quot;count&quot; : 103373,
&quot;_shards&quot; : {
@ -672,24 +672,24 @@ $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<ul>
<li>13,000 requests in the last two months from a user with user agent <code>SomeRandomText</code>, for example:</li>
</ul>
<pre><code class="language-console" data-lang="console">84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] &quot;GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1&quot; 404 10890 &quot;-&quot; &quot;SomeRandomText&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] &quot;GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1&quot; 404 10890 &quot;-&quot; &quot;SomeRandomText&quot;
</code></pre><ul>
<li>I purged them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics
Total number of bot hits purged: 13159
</code></pre><ul>
<li>I noticed there were 78 items submitted in the hour before CGSpace crashed:</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
</code></pre><ul>
<li>Of those 78, 77 were from Udana</li>
<li>Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:</li>
</ul>
<pre><code class="language-console" data-lang="console"># for num in {01..13}; do grep -a -E &quot;2021-04-$num 0&quot; /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {01..13}; do grep -a -E &quot;2021-04-$num 0&quot; /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
add_item; done
32
0
@ -723,7 +723,7 @@ Total number of bot hits purged: 13159
</li>
<li>Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
</code></pre><ul>
<li>I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
<ul>
@ -735,12 +735,12 @@ Total number of bot hits purged: 13159
<ul>
<li>Update all containers on AReS (linode20):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<li>Then run all system updates and reboot the server</li>
<li>I learned a new command for Elasticsearch:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl http://localhost:9200/_cat/indices
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl http://localhost:9200/_cat/indices
yellow open openrxv-values ChyhGwMDQpevJtlNWO1vcw 1 1 1579 0 537.6kb 537.6kb
yellow open openrxv-items-temp PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-shared J_8cxIz6QL6XTRZct7UBBQ 1 1 127 0 115.7kb 115.7kb
@ -754,7 +754,7 @@ yellow open users M0t2LaZhSm2NrF5xb64dnw 1 1 2 0 1
</code></pre><ul>
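<li>Relatedly, the <code>_cat/aliases</code> endpoint shows which concrete index each alias currently points to, which makes the recurring <code>openrxv-items</code> confusion easier to spot, for example:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl http://localhost:9200/_cat/aliases
</code></pre><ul>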
<li>Somehow the <code>openrxv-items-final</code> index only has a few items and the majority are in <code>openrxv-items-temp</code>, via the <code>openrxv-items</code> alias (which is in the temp index):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
{
&quot;count&quot; : 103585,
&quot;_shards&quot; : {
@ -767,7 +767,7 @@ yellow open users M0t2LaZhSm2NrF5xb64dnw 1 1 2 0 1
</code></pre><ul>
<li>I found a cool tool to help with exporting and restoring Elasticsearch indexes:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
...
Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
@ -776,20 +776,20 @@ Sun, 18 Apr 2021 06:27:07 GMT | dump complete
<li>It took only two or three minutes to export everything&hellip;</li>
<li>I did a test to restore the index:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
</code></pre><ul>
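<li>To confirm the test restore is complete, the document count of <code>openrxv-items-test</code> should match the &ldquo;Total Writes&rdquo; that elasticdump reported, for example:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-test/_count?q=*&amp;pretty'
</code></pre><ul>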
<li>So that&rsquo;s pretty cool!</li>
<li>I deleted the <code>openrxv-items-final</code> and <code>openrxv-items-temp</code> indexes and then restored the mappings to <code>openrxv-items-final</code>, added the <code>openrxv-items</code> alias, and started restoring the data to <code>openrxv-items</code> with elasticdump:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
</code></pre><ul>
<li>AReS seems to be working fine after that, so I created the <code>openrxv-items-temp</code> index and then started a fresh harvest on AReS Explorer:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp&quot;
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system</li>
<li>I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions</li>
@ -798,13 +798,13 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
<ul>
<li>The AReS harvesting last night seems to have completed successfully, but the number of results is strange:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
yellow open openrxv-items-final HFc3uytTRq2GPpn13vkbmg 1 1 970 0 2.3mb 2.3mb
</code></pre><ul>
<li>The indices endpoint doesn&rsquo;t include the <code>openrxv-items</code> alias, but it is currently in the <code>openrxv-items-temp</code> index so the number of items is the same:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
{
&quot;count&quot; : 103741,
&quot;_shards&quot; : {
@ -821,7 +821,7 @@ yellow open openrxv-items-final HFc3uytTRq2GPpn13vkbmg 1 1 970 0
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace test-email
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace test-email
...
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
@ -850,7 +850,7 @@ Error sending email:
</ul>
</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ cp atmire-cua-update.xml-20210124-132112.old /home/dspacetest.cgiar.org/config/spring/api/atmire-cua-update.xml
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12 -g
</code></pre><ul>
@ -869,7 +869,7 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
@ -883,13 +883,13 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
<ul>
<li>The AReS harvest last night seems to have finished successfully and the number of items looks good:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
</code></pre><ul>
<li>And the aliases seem correct for once:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
@ -904,7 +904,7 @@ yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb
<li>That&rsquo;s 250 new items in the index since the last harvest!</li>
<li>Re-create my local Artifactory container because I&rsquo;m getting errors starting it and it has been a few months since it was updated:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ podman rm artifactory
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory
@ -925,11 +925,11 @@ $ podman start artifactory
</li>
<li>I tried to delete all the Atmire SQL migrations:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
</code></pre><ul>
<li>But I got an error when running <code>dspace database migrate</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate
Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
@ -961,11 +961,11 @@ Detected applied migration not resolved locally: 6.0.2017.09.25
</code></pre><ul>
<li>I deleted those migrations:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
</code></pre><ul>
<li>Then when I ran the migration again it failed for a new reason, related to the configurable workflow:</li>
</ul>
<pre><code class="language-console" data-lang="console">Database URL: jdbc:postgresql://localhost:5432/dspace7b5
<pre tabindex="0"><code class="language-console" data-lang="console">Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
@ -993,12 +993,12 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
</code></pre><ul>
<li>The <a href="https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace">DSpace 7 upgrade docs</a> say I need to apply these previously optional migrations:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate ignored
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate ignored
</code></pre><ul>
<li>Now I see all migrations have completed and DSpace actually starts up fine!</li>
<li>I will try to do a full re-index to see how long it takes:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time ~/dspace7b5/bin/dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time ~/dspace7b5/bin/dspace index-discovery -b
...
~/dspace7b5/bin/dspace index-discovery -b 25156.71s user 64.22s system 97% cpu 7:11:09.94 total
</code></pre><ul>
@ -1012,7 +1012,7 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' &gt; /tmp/dois.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' &gt; /tmp/dois.txt
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>He will Tweet them&hellip;</li>


@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -147,7 +147,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&quot; 200 458201 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'||lower('')||' HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'%2Brtrim('')%2B' HTTP/1.1&quot; 200 458209 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
@ -155,7 +155,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
<li>I will report the IP on abuseipdb.com and purge their hits from Solr</li>
<li>The second IP is in Colombia and is making thousands of requests for what looks like some test site:</li>
</ul>
<pre><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
</code></pre><ul>
<li>But this site does not exist (yet?)
@ -165,11 +165,11 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</li>
<li>The third IP is in Russia apparently, and the user agent has the <code>pl-PL</code> locale with thousands of requests like this:</li>
</ul>
<pre><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &quot;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&quot; 200 918998 &quot;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&quot; &quot;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &quot;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&quot; 200 918998 &quot;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&quot; &quot;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&quot;
</code></pre><ul>
<li>I will purge these all with my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics
@ -179,7 +179,7 @@ Total number of bot hits purged: 61347
<ul>
<li>Check the AReS Harvester indexes:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
@ -195,13 +195,13 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
</code></pre><ul>
<li>I think they look OK (<code>openrxv-items</code> is an alias of <code>openrxv-items-final</code>), but I took a backup just in case:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>
<li>Then I started an indexing in the AReS Explorer admin dashboard</li>
<li>The indexing finished, but it looks like the aliases are messed up again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</code></pre><h2 id="2021-05-05">2021-05-05</h2>
@ -229,7 +229,7 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
</code></pre><ul>
<li>Nope! Still slow, and still no mapped item&hellip;
@ -244,7 +244,7 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</li>
<li>The indexes on AReS Explorer are messed up after last week&rsquo;s harvesting:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
@ -262,21 +262,21 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>&hellip;</li>
<li>I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2021-05-10">2021-05-10</h2>
<ul>
<li>Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp 8thRX0WVRUeAzmd2hkG6TA 1 1 0 0 283b 283b
yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
yellow open openrxv-items-final BtvV9kwVQ3yBYCZvJS1QyQ 1 1 0 0 283b 283b
</code></pre><ul>
<li>I fixed the indexes manually by re-creating them and cloning from the backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp-backup/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
@ -284,11 +284,11 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
</code></pre><ul>
<li>Also I ran all updates on the server and updated all Docker images, then rebooted the server (linode20):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<li>I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>
<li>Discuss CGSpace statistics with the CIP team
@ -329,7 +329,7 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
</li>
<li>I checked the CLARISA list against ROR&rsquo;s April, 2020 release (&ldquo;Version 9&rdquo;, on figshare, though it is version 8 in the dump):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
</code></pre><ul>
@ -341,7 +341,7 @@ $ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
<ul>
<li>Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
UPDATE 1132
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
UPDATE 1803
@ -367,7 +367,7 @@ UPDATE 1803
<ul>
<li>I have to fix the Elasticsearch indexes on AReS after last week&rsquo;s harvesting because, as always, the <code>openrxv-items</code> index should be an alias of <code>openrxv-items-final</code> instead of <code>openrxv-items-temp</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
},
@ -380,13 +380,13 @@ UPDATE 1803
</code></pre><ul>
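<li>For reference, the alias itself can be moved atomically with a single <code>_aliases</code> call that removes it from the temp index and adds it to the final one in the same request (assuming the data is already in <code>openrxv-items-final</code>), something like:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;remove&quot; : { &quot;index&quot; : &quot;openrxv-items-temp&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}, {&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
</code></pre><ul>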
<li>I took a backup of the <code>openrxv-items</code> index with elasticdump so I can re-create them manually before starting a new harvest tomorrow:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><h2 id="2021-05-16">2021-05-16</h2>
<ul>
<li>I deleted and re-created the Elasticsearch indexes on AReS:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
@ -394,7 +394,7 @@ $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application
</code></pre><ul>
<li>Then I re-imported the backup that I created with elasticdump yesterday:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
<li>Then I started a new harvest on AReS</li>
@ -403,7 +403,7 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
<ul>
<li>The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn&rsquo;t have to fix them next time&hellip;</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 0 0 283b 283b
yellow open openrxv-items-final TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
@ -423,7 +423,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-ldap@cgiarad.org&quot; -W &quot;(sAMAccountName=aorth)&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-ldap@cgiarad.org&quot; -W &quot;(sAMAccountName=aorth)&quot;
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
@ -446,11 +446,11 @@ ldap_bind: Invalid credentials (49)
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;ccafsprojectpii&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
<pre tabindex="0"><code class="language-console" data-lang="console">$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;ccafsprojectpii&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
</code></pre><ul>
<li>I formatted the input file with tidy, especially because one of the new project tags has an ampersand character&hellip; grrr:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml
line 3658 column 26 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_EU-IFAD&quot;
line 3659 column 23 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_EU-IFAD&quot;
</code></pre><ul>
@ -461,16 +461,16 @@ line 3659 column 23 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_E
<li>Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-05-18-combined.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
</code></pre><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</code></pre><ul>
<li>Tag fifty-five items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Urioste Daza, Sergio&quot;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&quot;Urioste, Sergio&quot;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
@ -504,7 +504,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 47405
</code></pre><ul>
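<li>The same <code>WHERE</code> clause with a <code>SELECT COUNT(*)</code> instead of the <code>UPDATE</code> would have shown how many values still had uppercase characters before changing anything, for example:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# SELECT COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
</code></pre><ul>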
<li>That&rsquo;s interesting because we lowercased them all a few months ago, so these must all be new&hellip; wow
@ -518,7 +518,7 @@ UPDATE 47405
<ul>
<li>Export the top 5,000 AGROVOC terms to validate them:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d &gt; /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
@ -545,7 +545,7 @@ $ csvgrep -c &quot;number of matches&quot; -r '^0$' /tmp/2021-05-20-agrovoc-resu
<ul>
<li>Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Patel, Ekta&quot;,&quot;Ekta Patel: 0000-0001-9400-6988&quot;
&quot;Dessie, Tadelle&quot;,&quot;Tadelle Dessie: 0000-0002-1630-0417&quot;
@ -562,7 +562,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u
</code></pre><ul>
<li>A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
</code></pre><ul>
<li>The indexes look OK so I started a harvesting on AReS</li>
@ -571,13 +571,13 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
<ul>
<li>The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvesting:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
</code></pre><ul>
<li>Update all docker images on the AReS server (linode20):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
</code></pre><ul>
@ -585,7 +585,7 @@ $ docker-compose -f docker/docker-compose.yml build
<li>Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317&hellip; so it was actually correct before!</li>
<li>For reference, this is how I re-created everything:</li>
</ul>
<pre><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
@ -605,7 +605,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</li>
<li>Looking in the DSpace log for this morning I see a big hole in the logs at that time (UTC+2 server time):</li>
</ul>
<pre><code>2021-05-26 02:17:52,808 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
<pre tabindex="0"><code>2021-05-26 02:17:52,808 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
2021-05-26 02:17:52,853 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
2021-05-26 03:00:05,772 INFO org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.spidersfile:null
2021-05-26 03:00:05,773 INFO org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.server:http://localhost:8081/solr/statistics
@ -613,7 +613,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
<li>There are no logs between 02:17 and 03:00&hellip; hmmm.</li>
<li>I see a similar gap in the Solr log, though it starts at 02:15:</li>
</ul>
<pre><code>2021-05-26 02:15:07,968 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={f.location.coll.facet.sort=count&amp;facet.field=location.comm&amp;facet.field=location.coll&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=NOT(discoverable:false)&amp;rows=0&amp;version=2&amp;q=*:*&amp;f.location.coll.facet.limit=-1&amp;facet.mincount=1&amp;facet=true&amp;f.location.comm.facet.sort=count&amp;wt=javabin&amp;facet.offset=0&amp;f.location.comm.facet.limit=-1} hits=90792 status=0 QTime=6
<pre tabindex="0"><code>2021-05-26 02:15:07,968 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={f.location.coll.facet.sort=count&amp;facet.field=location.comm&amp;facet.field=location.coll&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=search.resourcetype:2&amp;fq=NOT(discoverable:false)&amp;rows=0&amp;version=2&amp;q=*:*&amp;f.location.coll.facet.limit=-1&amp;facet.mincount=1&amp;facet=true&amp;f.location.comm.facet.sort=count&amp;wt=javabin&amp;facet.offset=0&amp;f.location.comm.facet.limit=-1} hits=90792 status=0 QTime=6
2021-05-26 02:15:09,446 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/update params={wt=javabin&amp;version=2} status=0 QTime=1
2021-05-26 02:28:03,602 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2021-05-26 02:28:03,630 INFO org.apache.solr.core.SolrCore @ SolrDeletionPolicy.onCommit: commits: num=2
@ -626,19 +626,19 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</code></pre><ul>
<li>Ah, it seems to have been a <a href="https://status.linode.com/incidents/byqmt6nss9l0">Linode network issue in the Frankfurt region</a>:</li>
</ul>
<pre><code>May 26, 2021
<pre tabindex="0"><code>May 26, 2021
Connectivity Issue - Frankfurt
Resolved - We havent observed any additional connectivity issues in our Frankfurt data center, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
May 26, 02:57 UTC
</code></pre><ul>
<li>While looking in the logs I noticed an error about SMTP:</li>
</ul>
<pre><code>2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ Failed to send subscription to eperson_id=934cb92f-2e77-4881-89e2-6f13ad4b1378
<pre tabindex="0"><code>2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ Failed to send subscription to eperson_id=934cb92f-2e77-4881-89e2-6f13ad4b1378
2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
</code></pre><ul>
<li>And indeed the email seems to be broken:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace test-email
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace test-email
About to send test email:
- To: fuuuuuu


@ -36,7 +36,7 @@ I simply started it and AReS was running again:
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@ I simply started it and AReS was running again:
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre><ul>
<li>Margarita from CCAFS emailed me to say that workflow alerts haven&rsquo;t been working lately
<ul>
@ -152,7 +152,7 @@ I simply started it and AReS was running again:
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&amp;scope=10568/16814&amp;order=DESC&amp;rpp=100&amp;sort_by=2&amp;start=1
<pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&amp;scope=10568/16814&amp;order=DESC&amp;rpp=100&amp;sort_by=2&amp;start=1
</code></pre><ul>
<li>That will sort by date issued (see: <code>webui.itemlist.sort-option.2</code> in dspace.cfg), give 100 results per page, and start on item 1</li>
<li>Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week</li>
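<li>The feed can also be checked from the command line; assuming the endpoint returns Atom/RSS XML, something like this pretty-prints the first entries:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&amp;scope=10568/16814&amp;order=DESC&amp;rpp=100&amp;sort_by=2&amp;start=1' | xmllint --format - | head -n 30
</code></pre>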
@ -162,7 +162,7 @@ I simply started it and AReS was running again:
<ul>
<li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li>
</ul>
<pre><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
<pre tabindex="0"><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
@ -208,7 +208,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre><ul>
<li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it&rsquo;s much faster
<ul>
@ -231,7 +231,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
@ -255,11 +255,11 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre><ul>
<li>Then I used <code>csvcut</code> to extract just the columns I needed and do the replacement into a new CSV:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre><ul>
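<li>To see how many rows the replacement would actually touch, one could first count the matches in the subject column, something like:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c 'dcterms.subject[en_US]' -m 'farmer managed irrigation systems' /tmp/2021-06-20-IWMI.csv | sed 1d | wc -l
</code></pre><ul>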
<li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li>
<li>Start a harvest on AReS</li>
@ -278,7 +278,7 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | wc -l
90937
$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort -u | wc -l
85709
@ -289,7 +289,7 @@ $ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort | uniq -c | sort -h
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort | uniq -c | sort -h
</code></pre><ul>
<li>Unfortunately I found no pattern:
<ul>
@ -312,7 +312,7 @@ $ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
5
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
@ -355,7 +355,7 @@ $ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
@ -368,7 +368,7 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
@ -397,7 +397,7 @@ Total number of bot hits purged: 5522
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
<pre tabindex="0"><code class="language-console" data-lang="console"># journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
@ -412,16 +412,16 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
</code></pre><ul>
<li>Looking in the DSpace log, the first &ldquo;pool empty&rdquo; message I saw this morning was at 4AM:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre><ul>
<li>We can purge them, as this is not user traffic: <a href="https://about.flipboard.com/browserproxy/">https://about.flipboard.com/browserproxy/</a>
<ul>
@ -448,7 +448,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
@ -456,7 +456,7 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
<li>This number is probably unique for that particular harvest, but I don&rsquo;t think it represents the true number of items&hellip;</li>
<li>The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;DSpace Test&quot;' 2021-06-23-openrxv-items-final-local.json | grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;DSpace Test&quot;' 2021-06-23-openrxv-items-final-local.json | grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' | sort | uniq | wc -l
90990
</code></pre><ul>
<li>So the harvest on the live site is missing items, then why didn&rsquo;t the add missing items plugin find them?!
@ -469,7 +469,7 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &quot;GET /sitemap HTTP/1.1&quot; 503 190 &quot;-&quot; &quot;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &quot;GET /sitemap HTTP/1.1&quot; 503 190 &quot;-&quot; &quot;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&quot;
</code></pre><ul>
<li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins&hellip; now it&rsquo;s checking 180,000+ handles to see if they are collections or items&hellip;
<ul>
@ -478,7 +478,7 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</li>
<li>According to the api logs we will be adding 5,697 items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
</code></pre><ul>
<li>Spent a few hours with Moayad troubleshooting and improving OpenRXV
@ -496,7 +496,7 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ redis-cli
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli
127.0.0.1:6379&gt; SCAN 0 COUNT 5
1) &quot;49152&quot;
2) 1) &quot;bull:plugins:476595&quot;
@ -507,14 +507,14 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</code></pre><ul>
<li>We can apparently get the names of the jobs in each hash using <code>hget</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
<pre tabindex="0"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
hash
127.0.0.1:6379&gt; HGET bull:plugins:401827 name
&quot;dspace_add_missing_items&quot;
</code></pre><ul>
<li>I whipped up a one liner to get the keys for all plugin jobs, convert to redis <code>HGET</code> commands to extract the value of the name field, and then sort them by their counts:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ redis-cli KEYS &quot;bull:plugins:*&quot; \
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli KEYS &quot;bull:plugins:*&quot; \
| sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
| ncat -w 3 localhost 6379 \
| grep -v -E '^\$' | sort | uniq -c | sort -h
@ -544,7 +544,7 @@ hash
<ul>
<li>Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ for file in dspace.log.2021-06-[12]*; do echo &quot;$file&quot;; grep -oE 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
<pre tabindex="0"><code class="language-console" data-lang="console">$ for file in dspace.log.2021-06-[12]*; do echo &quot;$file&quot;; grep -oE 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
dspace.log.2021-06-10
19072
dspace.log.2021-06-11
@ -584,7 +584,7 @@ dspace.log.2021-06-27
</code></pre><ul>
<li>I see 15,000 unique IPs in the XMLUI logs alone on that day:</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
15835
</code></pre><ul>
<li>Annoyingly I found 37,000 more hits from Bing using <code>dns:*msnbot* AND dns:*.msn.com.</code> as a Solr filter
@ -628,7 +628,7 @@ dspace.log.2021-06-27
</li>
<li>The DSpace log shows:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>The first one of these I see is from last night at 2021-06-29 at 10:47 PM</li>
<li>I restarted Tomcat 7 and CGSpace came back up&hellip;</li>
@ -641,12 +641,12 @@ dspace.log.2021-06-27
</li>
<li>Export a list of all CGSpace&rsquo;s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &quot;dcterms.subject&quot;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &quot;dcterms.subject&quot; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &quot;dcterms.subject&quot;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &quot;dcterms.subject&quot; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
</code></pre><ul>
<li>Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
</code></pre><ul>
<li>Fix an issue in the Ansible infrastructure playbooks for the DSpace role
@ -657,12 +657,12 @@ COPY 1710
</li>
<li>I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):</li>
</ul>
<pre><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
<pre tabindex="0"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre><ul>
<li>What&rsquo;s even crazier is that it is twice that on CGSpace (linode18)!</li>
<li>Apparently OpenJDK defaults to using <code>/dev/random</code> (see <code>/etc/java-8-openjdk/security/java.security</code>):</li>
</ul>
<pre><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
<pre tabindex="0"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre><ul>
<li><code>/dev/random</code> blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
<ul>

View File

@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -120,13 +120,13 @@ COPY 20994
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre><h2 id="2021-07-04">2021-07-04</h2>
<ul>
<li>Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cd OpenRXV
<pre tabindex="0"><code class="language-console" data-lang="console">$ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
@ -172,7 +172,7 @@ $ docker-compose -f docker/docker-compose.yml build
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
@ -189,7 +189,7 @@ Total number of bot hits purged: 15030
<li>Meet with the CGIARAGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC</li>
<li>I extracted another list of all subjects to check against AGROVOC:</li>
</ul>
<pre><code class="language-console" data-lang="console">\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d &gt; /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
</code></pre><ul>
@ -205,7 +205,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
10693
2021-06-11
@ -243,7 +243,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</code></pre><ul>
<li>Similarly, the number of connections to the REST API was around the average for the recent weeks before:</li>
</ul>
<pre><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/rest.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/rest.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
1183
2021-06-11
@ -281,7 +281,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</code></pre><ul>
<li>According to goaccess, the traffic spike started at 2AM (remember that the first &ldquo;Pool empty&rdquo; error in dspace.log was at 4:01AM):</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>Moayad sent a fix for the add missing items plugins issue (<a href="https://github.com/ilri/OpenRXV/pull/107">#107</a>)
<ul>
@ -311,7 +311,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2302
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2564
@ -320,7 +320,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</code></pre><ul>
<li>The locks are held by XMLUI, not REST API or OAI:</li>
</ul>
<pre><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
57 dspaceApi
2671 dspaceWeb
</code></pre><ul>
@ -338,7 +338,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code class="language-console" data-lang="console"># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
32 91.243.191.124
33 91.243.191.129
33 91.243.191.200
@ -392,7 +392,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./asn -n 45.80.217.235
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./asn -n 45.80.217.235
╭──────────────────────────────╮
│ ASN lookup for 45.80.217.235 │
@ -410,7 +410,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</code></pre><ul>
<li>Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:</li>
</ul>
<pre><code class="language-csv" data-lang="csv">IP, Organization, Website, Network
<pre tabindex="0"><code class="language-csv" data-lang="csv">IP, Organization, Website, Network
45.148.126.246, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.126.0/24 (Net-traffictransitsolution-15)
45.138.102.253, TrafficTransitSolution LLC, traffictransitsolution.us, 45.138.102.0/24 (Net-traffictransitsolution-11)
45.140.205.104, Bulgakov Alexey Yurievich, finegroupservers.com, 45.140.204.0/23 (CHINA_NETWORK)
@ -496,17 +496,17 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; /var/log/nginx/access.log | awk '{print $1}' | sort | uniq &gt; /tmp/ips-sorted.txt
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; /var/log/nginx/access.log | awk '{print $1}' | sort | uniq &gt; /tmp/ips-sorted.txt
# wc -l /tmp/ips-sorted.txt
10776 /tmp/ips-sorted.txt
</code></pre><ul>
<li>Then resolve them all:</li>
</ul>
<pre><code class="language-console:" data-lang="console:">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
<pre tabindex="0"><code class="language-console:" data-lang="console:">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
</code></pre><ul>
<li>Then get the top 10 organizations and top ten ASNs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
213 AMAZON-AES
218 ASN-QUADRANET-GLOBAL
246 Silverstar Invest Limited
@ -531,7 +531,7 @@ $ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
</code></pre><ul>
<li>I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I&rsquo;m concerned about Global Layer because it&rsquo;s a huge ASN that seems to have legit hosts too&hellip;?</li>
</ul>
<pre><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
<pre tabindex="0"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
@ -543,7 +543,7 @@ $ wc -l /tmp/abusive-networks.txt
</code></pre><ul>
<li>Combining with my existing rules and filtering uniques:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
2298
</code></pre><ul>
<li><a href="https://scamalytics.com/ip/isp/2021-06">According to Scamalytics all these are high risk ISPs</a> (as recently as 2021-06) so I will just keep blocking them</li>
@ -558,7 +558,7 @@ $ wc -l /tmp/abusive-networks.txt
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E &quot; (200|499) &quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E &quot; (200|499) &quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/all-ips-to-block.txt
$ wc -l /tmp/all-ips-to-block.txt
@ -571,7 +571,7 @@ $ wc -l /tmp/all-ips-to-block.txt
</li>
<li>I decided to extract the networks from the GeoIP database with <code>resolve-addresses-geoip2.py</code> so I can block them more efficiently than using the 5,000 IPs in an ipset:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
2354
</code></pre><ul>
@ -582,7 +582,7 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
</li>
<li>Then I got a list of all the 5,095 IPs from above and used <code>check-spider-ip-hits.sh</code> to purge them from Solr:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
...
Total number of bot hits purged: 197116
</code></pre><ul>
@ -592,13 +592,13 @@ Total number of bot hits purged: 197116
<ul>
<li>Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, the number is much higher than I noticed yesterday:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5643
</code></pre><ul>
<li>I purged 27,000 more hits from the Solr stats using this new list of IPs with my <code>check-spider-ip-hits.sh</code> script</li>
<li>Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips-june-23.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips-june-23.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
265 GOOGLE,15169
@ -619,12 +619,12 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
</code></pre><ul>
<li>Again it was over 5,000 IPs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5228
</code></pre><ul>
<li>Interestingly, it seems these are five thousand <em>different</em> IP addresses from the ones in last weekend&rsquo;s attack, as there are over 10,000 unique ones if I combine them!</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
10458
</code></pre><ul>
<li>I purged all the (26,000) hits from these new IP addresses from Solr as well</li>
@ -636,7 +636,7 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
</li>
<li>Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10,900:</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/ddos-ips.txt
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/ddos-ips.txt
# wc -l /tmp/ddos-ips.txt
54002 /tmp/ddos-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
@ -649,7 +649,7 @@ $ wc -l /tmp/ddos-networks-to-block.txt
</code></pre><ul>
<li>The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
<pre tabindex="0"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
https://asn.ipinfo.app/api/text/nginx/AS46844 \
https://asn.ipinfo.app/api/text/nginx/AS206485 \
https://asn.ipinfo.app/api/text/nginx/AS62282 \

View File

@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -122,14 +122,14 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt update &amp;&amp; apt dist-upgrade
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update &amp;&amp; apt dist-upgrade
# apt autoremove &amp;&amp; apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
@ -144,13 +144,13 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<li>&hellip; but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt install -f
<pre tabindex="0"><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
<pre tabindex="0"><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove &amp;&amp; apt autoclean
</code></pre><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
@ -190,7 +190,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
@ -200,7 +200,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
@ -220,7 +220,7 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don&rsquo;t see them logging in at all, only scraping&hellip; so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
@ -232,13 +232,13 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&quot;
</code></pre><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
<pre tabindex="0"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more but that&rsquo;s most of them over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
@ -272,7 +272,7 @@ Total number of bot hits purged: 90485
</code></pre><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
@ -289,7 +289,7 @@ Total number of hits from bots: 4492
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSN/ISBNs themselves
<ul>
@ -303,19 +303,19 @@ Total number of hits from bots: 4492
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,&quot;same&quot;,&quot;different&quot;)
<pre tabindex="0"><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,&quot;same&quot;,&quot;different&quot;)
</code></pre><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
@ -332,7 +332,7 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
<pre tabindex="0"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
@ -359,7 +359,7 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
@ -367,7 +367,7 @@ $ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don&rsquo;t match, so I&rsquo;d have to go check many of them manually before selecting a match or fixing them&hellip;
@ -421,7 +421,7 @@ COPY 3245
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</code></pre><ul>
@ -446,17 +446,17 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
</code></pre><ul>
<li>Also update some DOIs that use the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
</code></pre><ul>
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 322m16.917s
user 226m43.121s
@ -464,7 +464,7 @@ sys 3m17.469s
</code></pre><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
-H 'Content-Type: application/json' \
-d '{
&quot;size&quot;: 10,
@ -525,17 +525,17 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
</code></pre><ul>
<li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre><ul>
<li>Tag existing items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Chege, Christine G. Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Chege, Christine Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;

View File

@ -26,7 +26,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-09/" />
<meta property="article:published_time" content="2021-09-01T09:14:07+03:00" />
<meta property="article:modified_time" content="2021-09-04T21:16:03+03:00" />
<meta property="article:modified_time" content="2021-09-06T12:31:11+03:00" />
@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -58,9 +58,9 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
"@type": "BlogPosting",
"headline": "September, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-09/",
"wordCount": "176",
"wordCount": "637",
"datePublished": "2021-09-01T09:14:07+03:00",
"dateModified": "2021-09-04T21:16:03+03:00",
"dateModified": "2021-09-06T12:31:11+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -154,7 +154,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
<ul>
<li>Update Docker images on AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre><ul>
<li>Then run system updates and reboot the server
@ -163,6 +163,61 @@ $ docker-compose build
</ul>
</li>
</ul>
<h2 id="2021-09-07">2021-09-07</h2>
<ul>
<li>Checking last month&rsquo;s Solr statistics to see if there are any new bots that I need to purge and add to the list
<ul>
<li>78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: <code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36</code></li>
<li>It&rsquo;s a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser</li>
<li>130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0</code></li>
<li>35.174.144.154 is on Amazon and made 28,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36</code></li>
<li>192.121.135.6 is in Sweden and made 9,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0</code></li>
<li>185.38.40.66 is in Germany and made 6,000 requests with this user agent: <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4</code></li>
<li>3.225.28.105 is in Amazon and made 3,000 requests with this user agent: <code>Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36</code></li>
<li>I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
<li>I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.</li>
<li>I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again</li>
<li>While looking at the MSN requests I noticed tons of requests from another strange host using reverse IP DNS: malta2095.startdedicated.com., astra5139.startdedicated.com., and many others</li>
<li>They must be related, because I see them all using the exact same user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
<li>So this startdedicated.com DNS is some Bing bot also&hellip;</li>
</ul>
</li>
<li>I extracted all the IPs and purged them using my <code>check-spider-ip-hits.sh</code> script (a rough sketch of the workflow follows this list)
<ul>
<li>In total I purged 225,000 hits&hellip;</li>
</ul>
</li>
</ul>
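<ul>
<li>A rough sketch of that workflow (the IP list path here is just an example): confirm that an address belongs to Bing via its reverse DNS, collect the matching IPs into a file, and purge them with <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ host 40.77.167.105
105.167.77.40.in-addr.arpa domain name pointer msnbot-40-77-167-105.search.msn.com.
# /tmp/2021-09-07-bot-ips.txt is a hypothetical file holding the Bing and startdedicated.com IPs extracted above
$ ./ilri/check-spider-ip-hits.sh -f /tmp/2021-09-07-bot-ips.txt -p
</code></pre>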
<h2 id="2021-09-12">2021-09-12</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2021-09-13">2021-09-13</h2>
<ul>
<li>Mishell Portilla asked me about thumbnails on CGSpace being small
<ul>
<li>For example, <a href="https://cgspace.cgiar.org/handle/10568/114576">10568/114576</a> has a lot of white space on the left side</li>
<li>I created a new thumbnail with vipsthumbnail:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
</code></pre><ul>
<li>Looking at the PDF&rsquo;s metadata (a quick way to check this is sketched after this list) I see:
<ul>
<li>Producer: iLovePDF</li>
<li>Creator: Adobe InDesign 15.0 (Windows)</li>
<li>Format: PDF-1.7</li>
</ul>
</li>
<li>Eventually I should do more tests on this and perhaps file a bug with DSpace&hellip;</li>
<li>Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
<ul>
<li>I told them I can give them access to DSpace Test and that we should have a meeting soon</li>
<li>We need to figure out what controlled vocabularies they should use</li>
</ul>
</li>
</ul>
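<ul>
<li>For reference, a quick way to check that PDF metadata is with <code>pdfinfo</code> from poppler-utils (the output below is reconstructed from the values I noted above, so treat it as a sketch):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ pdfinfo ARRTB2020ST.pdf | grep -E '(Creator|Producer|PDF version)'
Creator:        Adobe InDesign 15.0 (Windows)
Producer:       iLovePDF
PDF version:    1.7
</code></pre>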
<!-- raw HTML omitted -->

View File

@ -17,7 +17,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="404 Page not found"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -127,7 +127,7 @@
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@ -152,7 +152,7 @@
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
@ -315,7 +315,7 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {

View File

@ -41,7 +41,7 @@
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
@ -57,7 +57,7 @@
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -164,7 +164,7 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
@ -471,7 +471,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
# apt-get autoremove &amp;amp;&amp;amp; apt-get autoclean
# dpkg -C
# reboot
@ -492,7 +492,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1277694
@ -500,7 +500,7 @@ COPY 20994
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
106781
@ -527,7 +527,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -628,7 +628,7 @@ COPY 20994
&lt;/li&gt;
&lt;li&gt;The item seems to be in a pre-submitted state, so I tried to delete it from there:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;But after this I tried to delete the item from the XMLUI and it is &lt;em&gt;still&lt;/em&gt; present&amp;hellip;&lt;/li&gt;
@ -654,13 +654,13 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
3018243
real 0m19.873s
@ -737,7 +737,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -825,7 +825,7 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;DSpace Test had crashed at some point yesterday morning and I see the following in &lt;code&gt;dmesg&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
@ -848,11 +848,11 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;During the &lt;code&gt;mvn package&lt;/code&gt; stage on the 5.8 branch I kept getting issues with java running out of memory:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -872,12 +872,12 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -958,19 +958,19 @@ sys 2m7.289s
&lt;li&gt;In dspace.log around that time I see many errors like &amp;ldquo;Client closed the connection before file download was complete&amp;rdquo;&lt;/li&gt;
&lt;li&gt;And just before that I see this:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Ah hah! So the pool was actually empty!&lt;/li&gt;
&lt;li&gt;I need to increase that, let&amp;rsquo;s try to bump it up from 50 to 75&lt;/li&gt;
&lt;li&gt;After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&amp;rsquo;t know what the hell Uptime Robot saw&lt;/li&gt;
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1068,7 +1068,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;Peter emailed to point out that many items in the &lt;a href=&#34;https://cgspace.cgiar.org/handle/10568/2703&#34;&gt;ILRI archive collection&lt;/a&gt; have multiple handles:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;There appears to be a pattern but I&amp;rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine&lt;/li&gt;
&lt;li&gt;Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections&lt;/li&gt;

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -195,7 +195,7 @@
</ul>
</li>
</ul>
<pre><code># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
# apt-get autoremove &amp;&amp; apt-get autoclean
# dpkg -C
# reboot
@ -225,7 +225,7 @@
</ul>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694
@ -233,7 +233,7 @@
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
@ -278,7 +278,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -101,7 +101,7 @@
</li>
<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
</ul>
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre><ul>
<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
@ -136,13 +136,13 @@ DELETE 1
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
@ -201,7 +201,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -217,7 +217,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
3018243
real 0m19.873s
@ -246,7 +246,7 @@ sys 0m1.979s
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -379,7 +379,7 @@ sys 0m1.979s
<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>
<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -94,11 +94,11 @@
<ul>
<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre><ul>
<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
</ul>
<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2018-07/'>Read more →</a>
</article>
@ -127,12 +127,12 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -258,19 +258,19 @@ sys 2m7.289s
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -366,12 +366,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>
@ -395,7 +395,7 @@ COPY 54701
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGIAR Library Migration"/>
<meta name="twitter:description" content="Notes on the migration of the CGIAR Library to CGSpace"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -132,7 +132,7 @@
<li><input checked="" disabled="" type="checkbox"> Temporarily disable nightly <code>index-discovery</code> cron job because the import process will be taking place during some of this time and I don&rsquo;t want them to be competing to update the Solr index</li>
<li><input checked="" disabled="" type="checkbox"> Copy HTTPS certificate key pair from CGIAR Library server&rsquo;s Tomcat keystore:</li>
</ul>
<pre><code>$ keytool -list -keystore tomcat.keystore
<pre tabindex="0"><code>$ keytool -list -keystore tomcat.keystore
$ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
$ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
$ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem
@ -140,7 +140,7 @@ $ wget https://certs.godaddy.com/repository/gdroot-g2.crt https://certs.godaddy.
$ cat library.cgiar.org.crt.pem gdig2.crt.pem &gt; library.cgiar.org-chained.pem
</code></pre><h2 id="migration-process">Migration Process</h2>
<p><strong>Export all top-level communities and collections from DSpace Test:</strong></p>
<pre><code>$ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
<pre tabindex="0"><code>$ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2515 10947-2515/10947-2515.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2516 10947-2516/10947-2516.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2517 10947-2517/10947-2517.zip
@ -158,12 +158,12 @@ $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/1 10947-1/10947-1.zip
<li><input checked="" disabled="" type="checkbox"> Copy all exports from DSpace Test</li>
<li><input checked="" disabled="" type="checkbox"> Add ingestion overrides to <code>dspace.cfg</code> before import:</li>
</ul>
<pre><code>mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
<pre tabindex="0"><code>mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
</code></pre><ul>
<li><input checked="" disabled="" type="checkbox"> Import communities and collections, paying attention to options to skip missing parents and ignore handles:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1&quot;
$ export PATH=$PATH:/home/cgspace.cgiar.org/bin
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2516/10947-2516.zip
@ -189,7 +189,7 @@ $ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@
</ul>
</li>
</ul>
<pre><code>$ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
<pre tabindex="0"><code>$ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
$ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
</code></pre><ul>
<li><input checked="" disabled="" type="checkbox"> Create <em>CGIAR System Management Office</em> sub-community: <code>10568/83537</code>
@ -199,17 +199,17 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
</ul>
</li>
</ul>
<pre><code>$ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
</code></pre><p><strong>Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:</strong></p>
<pre><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; '2017-05-01T00:00:00Z');
</code></pre><ul>
<li>Export them from the CGIAR Library:</li>
</ul>
<pre><code># for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
<pre tabindex="0"><code># for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
</code></pre><ul>
<li>Import on CGSpace:</li>
</ul>
<pre><code>$ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
<pre tabindex="0"><code>$ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
</code></pre><h2 id="post-migration">Post Migration</h2>
<ul>
<li><input checked="" disabled="" type="checkbox"> Shut down Tomcat and run <code>update-sequences.sql</code> as the system&rsquo;s <code>postgres</code> user</li>
@ -218,7 +218,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
<li><input checked="" disabled="" type="checkbox"> Enable nightly <code>index-discovery</code> cron job</li>
<li><input checked="" disabled="" type="checkbox"> Adjust CGSpace&rsquo;s <code>handle-server/config.dct</code> to add the new prefix alongside our existing 10568, ie:</li>
</ul>
<pre><code>&quot;server_admins&quot; = (
<pre tabindex="0"><code>&quot;server_admins&quot; = (
&quot;300:0.NA/10568&quot;
&quot;300:0.NA/10947&quot;
)
@ -244,22 +244,22 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
<li><input checked="" disabled="" type="checkbox"> Run system updates and reboot server</li>
<li><input checked="" disabled="" type="checkbox"> Switch to Let&rsquo;s Encrypt HTTPS certificates (after DNS is updated and server isn&rsquo;t busy):</li>
</ul>
<pre><code>$ sudo systemctl stop nginx
<pre tabindex="0"><code>$ sudo systemctl stop nginx
$ /opt/certbot-auto certonly --standalone -d library.cgiar.org
$ sudo systemctl start nginx
</code></pre><h2 id="troubleshooting">Troubleshooting</h2>
<h3 id="foreign-key-error-in-dspace-cleanup">Foreign Key Error in <code>dspace cleanup</code></h3>
<p>The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports. If you see the following error with <code>dspace cleanup -v</code>:</p>
<pre><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
<pre tabindex="0"><code>Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(119841) is still referenced from table &quot;bundle&quot;.
</code></pre><p>The solution is to set the <code>primary_bitstream_id</code> to NULL in PostgreSQL:</p>
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
</code></pre><h3 id="psqlexception-during-aip-ingest">PSQLException During AIP Ingest</h3>
<p>After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):</p>
<pre><code>org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
<pre tabindex="0"><code>org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &quot;handle_pkey&quot;
Detail: Key (handle_id)=(86227) already exists.
</code></pre><p>The normal solution is to run the <code>update-sequences.sql</code> script (with Tomcat shut down) but it doesn&rsquo;t seem to work in this case. Finding the maximum <code>handle_id</code> and manually updating the sequence seems to work:</p>
<pre><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
dspace=# select setval('handle_seq',86873);
</code></pre>

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace CG Core v2 Migration"/>
<meta name="twitter:description" content="Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -440,7 +440,7 @@
</ul>
<hr>
<p>¹ Not committed yet because I don&rsquo;t want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:</p>
<pre><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
<pre tabindex="0"><code>$ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname &quot;*.xsl&quot; -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
</code></pre>

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace DSpace 6 Upgrade"/>
<meta name="twitter:description" content="Documenting the DSpace 6 upgrade."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -129,14 +129,14 @@
</ul>
<h3 id="re-import-oai-with-clean-index">Re-import OAI with clean index</h3>
<p>After the upgrade is complete, re-index all items into OAI with a clean index:</p>
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
$ dspace oai -c import
</code></pre><p>The process ran out of memory several times so I had to keep trying again with more JVM heap memory.</p>
<h3 id="processing-solr-statistics-with-solr-upgrade-statistics-6x">Processing Solr Statistics With solr-upgrade-statistics-6x</h3>
<p>After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with <code>solr-upgrade-statistics-6x</code> to migrate all IDs to UUIDs.</p>
<h2 id="statistics">statistics</h2>
<p>First process the current year&rsquo;s statistics core:</p>
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
@ -159,10 +159,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
<li>698,000: <code>*:* NOT id:/.{36}/</code></li>
<li>Majority are <code>type: 5</code> (aka SITE, according to <code>Constants.java</code>) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2019">statistics-2019</h2>
<p>Processing the statistics-2019 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -184,10 +184,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
<li>4,184,896: <code>*:* NOT id:/.{36}/</code></li>
<li>4,172,929 are <code>type: 5</code> (aka SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2019/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2019/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2018">statistics-2018</h2>
<p>Processing the statistics-2018 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -203,7 +203,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
5,561,166 TOTAL
=================================================================
</code></pre><p>After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:</p>
<pre><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
</code></pre><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<ul>
@ -212,10 +212,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>923,158: <code>*:* NOT id:/.{36}/</code></li>
<li>823,293: are <code>type: 5</code> so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2017">statistics-2017</h2>
<p>Processing the statistics-2017 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -237,10 +237,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>1,702,177: <code>*:* NOT id:/.{36}/</code></li>
<li>1,660,524 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2016">statistics-2016</h2>
<p>Processing the statistics-2016 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -261,10 +261,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>1,477,155: <code>*:* NOT id:/.{36}/</code></li>
<li>1,469,706 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2015">statistics-2015</h2>
<p>Processing the statistics-2015 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -286,10 +286,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>262,439: <code>*:* NOT id:/.{36}/</code></li>
<li>247,400 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2014">statistics-2014</h2>
<p>Processing the statistics-2014 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -312,10 +312,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>222,078: <code>*:* NOT id:/.{36}/</code></li>
<li>188,791 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2013">statistics-2013</h2>
<p>Processing the statistics-2013 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -338,10 +338,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>32,320: <code>*:* NOT id:/.{36}/</code></li>
<li>15,691 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2013/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2013/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2012">statistics-2012</h2>
<p>Processing the statistics-2012 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -360,10 +360,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>33,161: <code>*:* NOT id:/.{36}/</code></li>
<li>33,161 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2011">statistics-2011</h2>
<p>Processing the statistics-2011 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -382,10 +382,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>17,551: <code>*:* NOT id:/.{36}/</code></li>
<li>12,116 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2011/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2011/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2010">statistics-2010</h2>
<p>Processing the statistics-2010 core:</p>
<pre><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
...
=================================================================
*** Statistics Records with Legacy Id ***
@ -404,52 +404,52 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<li>1,012: <code>*:* NOT id:/.{36}/</code></li>
<li>654 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h3 id="processing-solr-statistics-with-atomicstatisticsupdatecli">Processing Solr statistics with AtomicStatisticsUpdateCLI</h3>
<p>On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.</p>
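Before starting the atomic updates it is worth confirming that the purges above really removed every legacy record. A quick check (a sketch that reuses the delete query's syntax; the core name is only an example, and curl's `-g` flag disables URL globbing so the `{36}` in the regex is passed through literally) is to ask each core how many non-UUID ids remain and expect `numFound` to be 0:

```console
$ curl -sg 'http://localhost:8081/solr/statistics-2010/select?q=*:*+NOT+id:/.{36}/&rows=0'
```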
<h2 id="statistics-1">statistics</h2>
<p>First the current year&rsquo;s statistics core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><p>It took ~38 hours to finish processing this core.</p>
<h2 id="statistics-2019-1">statistics-2019</h2>
<p>The statistics-2019 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2019
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2019
</code></pre><p>It took ~32 hours to finish processing this core.</p>
<h2 id="statistics-2018-1">statistics-2018</h2>
<p>The statistics-2018 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2018
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2018
</code></pre><p>It took ~28 hours to finish processing this core.</p>
<h2 id="statistics-2017-1">statistics-2017</h2>
<p>The statistics-2017 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2017
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2017
</code></pre><p>It took ~24 hours to finish processing this core.</p>
<h2 id="statistics-2016-1">statistics-2016</h2>
<p>The statistics-2016 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><p>It took ~20 hours to finish processing this core.</p>
<h2 id="statistics-2015-1">statistics-2015</h2>
<p>The statistics-2015 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
</code></pre><p>It took ~21 hours to finish processing this core.</p>
<h2 id="statistics-2014-1">statistics-2014</h2>
<p>The statistics-2014 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2014
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2014
</code></pre><p>It took ~12 hours to finish processing this core.</p>
<h2 id="statistics-2013-1">statistics-2013</h2>
<p>The statistics-2013 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2013
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2013
</code></pre><p>It took ~3 hours to finish processing this core.</p>
<h2 id="statistics-2012-1">statistics-2012</h2>
<p>The statistics-2012 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2012
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2012
</code></pre><p>It took ~2 hours to finish processing this core.</p>
<h2 id="statistics-2011-1">statistics-2011</h2>
<p>The statistics-2011 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2011
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2011
</code></pre><p>It took 1 hour to finish processing this core.</p>
<h2 id="statistics-2010-1">statistics-2010</h2>
<p>The statistics-2010 core, in 12-hour batches:</p>
<pre><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2010
<pre tabindex="0"><code>$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2010
</code></pre><p>It took five minutes to finish processing this core.</p>
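All eleven invocations above differ only in the core name, so the same run could be driven by one loop, sketched below. This assumes each invocation completes its core in a single run, as it did above, and that roughly 180 hours of sequential runtime (the sum of the durations above) is acceptable:

```console
$ for core in statistics statistics-{2019..2010}; do chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c "$core"; done
```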

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -142,7 +142,7 @@
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@ -167,7 +167,7 @@
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
@ -330,7 +330,7 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {

View File

@ -41,7 +41,7 @@
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
@ -57,7 +57,7 @@
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -164,7 +164,7 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
@ -471,7 +471,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
# apt-get autoremove &amp;amp;&amp;amp; apt-get autoclean
# dpkg -C
# reboot
@ -492,7 +492,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1277694
@ -500,7 +500,7 @@ COPY 20994
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
106781
@ -527,7 +527,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -628,7 +628,7 @@ COPY 20994
&lt;/li&gt;
&lt;li&gt;The item seems to be in a pre-submitted state, so I tried to delete it from there:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;But after this I tried to delete the item from the XMLUI and it is &lt;em&gt;still&lt;/em&gt; present&amp;hellip;&lt;/li&gt;
@ -654,13 +654,13 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
3018243
real 0m19.873s
@ -737,7 +737,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -825,7 +825,7 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;DSpace Test had crashed at some point yesterday morning and I see the following in &lt;code&gt;dmesg&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
@ -848,11 +848,11 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;During the &lt;code&gt;mvn package&lt;/code&gt; stage on the 5.8 branch I kept getting issues with java running out of memory (one possible workaround is sketched after this excerpt):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
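The "There is insufficient memory for the Java Runtime Environment to continue" error above means the JVM could not obtain memory from the operating system, not that the Java heap setting was too low, so adding temporary swap (or freeing RAM) before re-running the build is the usual way around it. The commands below are only a sketch of that workaround, not a record of what was done at the time:

```console
# fallocate -l 2G /swapfile
# chmod 600 /swapfile
# mkswap /swapfile
# swapon /swapfile
$ mvn package
```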
@ -872,12 +872,12 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -958,19 +958,19 @@ sys 2m7.289s
&lt;li&gt;In dspace.log around that time I see many errors like &amp;ldquo;Client closed the connection before file download was complete&amp;rdquo;&lt;/li&gt;
&lt;li&gt;And just before that I see this:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Ah hah! So the pool was actually empty!&lt;/li&gt;
&lt;li&gt;I need to increase that, let&amp;rsquo;s try to bump it up from 50 to 75&lt;/li&gt;
&lt;li&gt;After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&amp;rsquo;t know what the hell Uptime Robot saw&lt;/li&gt;
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1068,7 +1068,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;Peter emailed to point out that many items in the &lt;a href=&#34;https://cgspace.cgiar.org/handle/10568/2703&#34;&gt;ILRI archive collection&lt;/a&gt; have multiple handles:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;There appears to be a pattern but I&amp;rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine&lt;/li&gt;
&lt;li&gt;Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections&lt;/li&gt;
@ -1182,7 +1182,7 @@ COPY 54701
&lt;li&gt;Remove redundant/duplicate text in the DSpace submission license&lt;/li&gt;
&lt;li&gt;Testing the CMYK patch on a collection with 650 items:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1208,7 +1208,7 @@ COPY 54701
&lt;li&gt;Discovered that the ImageMagick &lt;code&gt;filter-media&lt;/code&gt; plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK&lt;/li&gt;
&lt;li&gt;Interestingly, it seems DSpace 4.x&amp;rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&amp;rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see &lt;a href=&#34;https://cgspace.cgiar.org/handle/10568/51999&#34;&gt;10568/51999&lt;/a&gt;):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ identify ~/Desktop/alc_contrastes_desafios.jpg
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1223,7 +1223,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;An item was mapped twice erroneously again, so I had to remove one of the mappings manually:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -1263,7 +1263,7 @@ DELETE 1
&lt;li&gt;CGSpace was down for five hours in the morning while I was sleeping&lt;/li&gt;
&lt;li&gt;While looking in the logs for errors, I see tons of warnings about Atmire MQM:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;quot;dc.title&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;quot;THUMBNAIL&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;quot;-1&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
@ -1305,7 +1305,7 @@ DELETE 1
&lt;/li&gt;
&lt;li&gt;I exported a random item&amp;rsquo;s metadata as CSV, deleted &lt;em&gt;all columns&lt;/em&gt; except id and collection, and made a new column called &lt;code&gt;ORCID:dc.contributor.author&lt;/code&gt; with the following random ORCIDs from the ORCID registry (see the CSV sketch after this excerpt):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
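A minimal sketch of what that test CSV could have looked like, following the column names described above; the filename, item id, and collection handle are made up for illustration, and only the ORCID string comes from the note itself:

```console
$ cat /tmp/orcid-test.csv
id,collection,ORCID:dc.contributor.author
80278,10568/16498,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
```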
@ -1322,7 +1322,7 @@ DELETE 1
&lt;li&gt;We had been using &lt;code&gt;DC=ILRI&lt;/code&gt; to determine whether a user was ILRI or not&lt;/li&gt;
&lt;li&gt;It looks like we might be able to use OUs now, instead of DCs:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1341,7 +1341,7 @@ DELETE 1
&lt;li&gt;Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of &lt;code&gt;fonts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Start working on DSpace 5.5 port:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ git checkout -b 55new 5_x-prod
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
&lt;/code&gt;&lt;/pre&gt;</description>
@ -1358,7 +1358,7 @@ $ git rebase -i dspace-5.5
&lt;li&gt;Add &lt;code&gt;dc.description.sponsorship&lt;/code&gt; to Discovery sidebar facets and make investors clickable in item view (&lt;a href=&#34;https://github.com/ilri/DSpace/issues/232&#34;&gt;#232&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I think this query should find and replace all authors that have &amp;ldquo;,&amp;rdquo; at the end of their names:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
@ -1398,7 +1398,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;I have blocked access to the API now&lt;/li&gt;
&lt;li&gt;There are 3,000 IPs accessing the REST API in a 24-hour period!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1476,7 +1476,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;lzop&lt;/code&gt; with &lt;code&gt;xz&lt;/code&gt; in log compression cron jobs on DSpace Test—it uses less space:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# cd /home/dspacetest.cgiar.org/log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@ -1496,7 +1496,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;Looks like DSpace exhausted its PostgreSQL connection pool&lt;/li&gt;
&lt;li&gt;Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
&lt;/code&gt;&lt;/pre&gt;</description>
</item>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -210,7 +210,7 @@
</ul>
</li>
</ul>
<pre><code># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
# apt-get autoremove &amp;&amp; apt-get autoclean
# dpkg -C
# reboot
@ -240,7 +240,7 @@
</ul>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694
@ -248,7 +248,7 @@
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
@ -293,7 +293,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -116,7 +116,7 @@
</li>
<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
</ul>
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre><ul>
<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
@ -151,13 +151,13 @@ DELETE 1
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
@ -216,7 +216,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -232,7 +232,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
3018243
real 0m19.873s
@ -261,7 +261,7 @@ sys 0m1.979s
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -394,7 +394,7 @@ sys 0m1.979s
<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>
<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -109,11 +109,11 @@
<ul>
<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre><ul>
<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
</ul>
<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2018-07/'>Read more →</a>
</article>
@ -142,12 +142,12 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -273,19 +273,19 @@ sys 2m7.289s
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -381,12 +381,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>
@ -410,7 +410,7 @@ COPY 54701
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -262,7 +262,7 @@
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-04/'>Read more →</a>
</article>
@ -297,7 +297,7 @@
<li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-03/'>Read more →</a>
@ -321,7 +321,7 @@
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -110,7 +110,7 @@
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
@ -170,7 +170,7 @@
</li>
<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2016-10/'>Read more →</a>
</article>
@ -196,7 +196,7 @@
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2016-09/'>Read more →</a>
</article>
@ -224,7 +224,7 @@
<li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
<li>Start working on DSpace 5.15.5 port:</li>
</ul>
<pre><code>$ git checkout -b 55new 5_x-prod
<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
</code></pre>
@ -250,7 +250,7 @@ $ git rebase -i dspace-5.5
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
text_value
@ -308,7 +308,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2016-05/'>Read more →</a>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -160,7 +160,7 @@
<ul>
<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
</ul>
<pre><code># cd /home/dspacetest.cgiar.org/log
<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@ -189,7 +189,7 @@
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2015-11/'>Read more →</a>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -142,7 +142,7 @@
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@ -167,7 +167,7 @@
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
@ -330,7 +330,7 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {


@ -41,7 +41,7 @@
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
@ -57,7 +57,7 @@
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -164,7 +164,7 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
@ -471,7 +471,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# apt update &amp;amp;&amp;amp; apt full-upgrade
# apt-get autoremove &amp;amp;&amp;amp; apt-get autoclean
# dpkg -C
# reboot
@ -492,7 +492,7 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1277694
@ -500,7 +500,7 @@ COPY 20994
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
106781
@ -527,7 +527,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -628,7 +628,7 @@ COPY 20994
&lt;/li&gt;
&lt;li&gt;The item seems to be in a pre-submitted state, so I tried to delete it from there:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;But after this I tried to delete the item from the XMLUI and it is &lt;em&gt;still&lt;/em&gt; present&amp;hellip;&lt;/li&gt;
@ -654,13 +654,13 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
3018243
real 0m19.873s
@ -737,7 +737,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -825,7 +825,7 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;DSpace Test had crashed at some point yesterday morning and I see the following in &lt;code&gt;dmesg&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
@ -848,11 +848,11 @@ sys 0m1.979s
&lt;ul&gt;
&lt;li&gt;I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;During the &lt;code&gt;mvn package&lt;/code&gt; stage on the 5.8 branch I kept getting issues with java running out of memory:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;There is insufficient memory for the Java Runtime Environment to continue.
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -872,12 +872,12 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -958,19 +958,19 @@ sys 2m7.289s
&lt;li&gt;In dspace.log around that time I see many errors like &amp;ldquo;Client closed the connection before file download was complete&amp;rdquo;&lt;/li&gt;
&lt;li&gt;And just before that I see this:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Ah hah! So the pool was actually empty!&lt;/li&gt;
&lt;li&gt;I need to increase that, let&amp;rsquo;s try to bump it up from 50 to 75&lt;/li&gt;
&lt;li&gt;After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&amp;rsquo;t know what the hell Uptime Robot saw&lt;/li&gt;
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1068,7 +1068,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;Peter emailed to point out that many items in the &lt;a href=&#34;https://cgspace.cgiar.org/handle/10568/2703&#34;&gt;ILRI archive collection&lt;/a&gt; have multiple handles:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;There appears to be a pattern but I&amp;rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine&lt;/li&gt;
&lt;li&gt;Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections&lt;/li&gt;
@ -1182,7 +1182,7 @@ COPY 54701
&lt;li&gt;Remove redundant/duplicate text in the DSpace submission license&lt;/li&gt;
&lt;li&gt;Testing the CMYK patch on a collection with 650 items:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1208,7 +1208,7 @@ COPY 54701
&lt;li&gt;Discovered that the ImageMagic &lt;code&gt;filter-media&lt;/code&gt; plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK&lt;/li&gt;
&lt;li&gt;Interestingly, it seems DSpace 4.x&amp;rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&amp;rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see &lt;a href=&#34;https://cgspace.cgiar.org/handle/10568/51999&#34;&gt;10568/51999&lt;/a&gt;):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ identify ~/Desktop/alc_contrastes_desafios.jpg
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1223,7 +1223,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;An item was mapped twice erroneously again, so I had to remove one of the mappings manually:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -1263,7 +1263,7 @@ DELETE 1
&lt;li&gt;CGSpace was down for five hours in the morning while I was sleeping&lt;/li&gt;
&lt;li&gt;While looking in the logs for errors, I see tons of warnings about Atmire MQM:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;quot;dc.title&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;quot;THUMBNAIL&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;quot;-1&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
@ -1305,7 +1305,7 @@ DELETE 1
&lt;/li&gt;
&lt;li&gt;I exported a random item&amp;rsquo;s metadata as CSV, deleted &lt;em&gt;all columns&lt;/em&gt; except id and collection, and made a new coloum called &lt;code&gt;ORCID:dc.contributor.author&lt;/code&gt; with the following random ORCIDs from the ORCID registry:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1322,7 +1322,7 @@ DELETE 1
&lt;li&gt;We had been using &lt;code&gt;DC=ILRI&lt;/code&gt; to determine whether a user was ILRI or not&lt;/li&gt;
&lt;li&gt;It looks like we might be able to use OUs now, instead of DCs:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1341,7 +1341,7 @@ DELETE 1
&lt;li&gt;Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of &lt;code&gt;fonts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Start working on DSpace 5.15.5 port:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ git checkout -b 55new 5_x-prod
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
&lt;/code&gt;&lt;/pre&gt;</description>
@ -1358,7 +1358,7 @@ $ git rebase -i dspace-5.5
&lt;li&gt;Add &lt;code&gt;dc.description.sponsorship&lt;/code&gt; to Discovery sidebar facets and make investors clickable in item view (&lt;a href=&#34;https://github.com/ilri/DSpace/issues/232&#34;&gt;#232&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I think this query should find and replace all authors that have &amp;ldquo;,&amp;rdquo; at the end of their names:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
@ -1398,7 +1398,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;I have blocked access to the API now&lt;/li&gt;
&lt;li&gt;There are 3,000 IPs accessing the REST API in a 24-hour period!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1476,7 +1476,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;lzop&lt;/code&gt; with &lt;code&gt;xz&lt;/code&gt; in log compression cron jobs on DSpace Test—it uses less space:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# cd /home/dspacetest.cgiar.org/log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
@ -1496,7 +1496,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;Looks like DSpace exhausted its PostgreSQL connection pool&lt;/li&gt;
&lt;li&gt;Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
&lt;/code&gt;&lt;/pre&gt;</description>
</item>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -210,7 +210,7 @@
</ul>
</li>
</ul>
<pre><code># apt update &amp;&amp; apt full-upgrade
<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
# apt-get autoremove &amp;&amp; apt-get autoclean
# dpkg -C
# reboot
@ -240,7 +240,7 @@
</ul>
</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694
@ -248,7 +248,7 @@
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
@ -293,7 +293,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -116,7 +116,7 @@
</li>
<li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
</ul>
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
</code></pre><ul>
<li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
@ -151,13 +151,13 @@ DELETE 1
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
@ -216,7 +216,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -232,7 +232,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
3018243
real 0m19.873s
@ -261,7 +261,7 @@ sys 0m1.979s
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -394,7 +394,7 @@ sys 0m1.979s
<ul>
<li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
</ul>
<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
[Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
[Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -109,11 +109,11 @@
<ul>
<li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
</ul>
<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre><ul>
<li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
</ul>
<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2018-07/'>Read more →</a>
</article>
@ -142,12 +142,12 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
@ -273,19 +273,19 @@ sys 2m7.289s
<li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
<li>And just before that I see this:</li>
</ul>
<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
</code></pre><ul>
<li>Ah hah! So the pool was actually empty!</li>
<li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
<li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -381,12 +381,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>
@ -410,7 +410,7 @@ COPY 54701
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-09-04T21:16:03+03:00" />
<meta property="og:updated_time" content="2021-09-06T12:31:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -262,7 +262,7 @@
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-04/'>Read more →</a>
</article>
@ -297,7 +297,7 @@
<li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-03/'>Read more →</a>
@ -321,7 +321,7 @@
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
