Add notes for 2021-09-13

2021-09-13 16:21:16 +03:00
parent 8b487a4a77
commit c05c7213c2
109 changed files with 2627 additions and 2530 deletions


@ -72,7 +72,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.87.0" />
<meta name="generator" content="Hugo 0.88.1" />
@ -163,7 +163,7 @@ sys 0m1.979s
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
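The tally pipelines above are easy to rehearse on synthetic data before pointing them at production logs; everything below (the IPs, paths, and the /tmp file name) is invented for illustration:

```shell
# Build a tiny fake access log in nginx combined format (synthetic IPs/paths)
printf '%s\n' \
  '207.46.13.5 - - [01/Feb/2019:17:39:04 +0000] "GET /handle/10568/1 HTTP/1.1" 200 512' \
  '207.46.13.5 - - [01/Feb/2019:18:02:11 +0000] "GET /handle/10568/2 HTTP/1.1" 200 512' \
  '54.70.40.11 - - [01/Feb/2019:19:15:30 +0000] "GET /handle/10568/3 HTTP/1.1" 200 512' \
  > /tmp/access-sample.log

# Same shape as the CGSpace pipeline: filter by hour range, tally hits per client IP
grep -E '01/Feb/2019:(17|18|19|20|21)' /tmp/access-sample.log \
  | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
```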
@ -179,7 +179,7 @@ sys 0m1.979s
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
3018243
real 0m19.873s
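grep's -c flag counts matching lines rather than printing them, and since each request is one log line this yields the request total directly; a tiny example on invented lines:

```shell
# -c prints the number of matching lines; the regex matches any day in Jan 2019
printf '%s\n' \
  '1.2.3.4 - - [01/Jan/2019:00:01:02 +0000] "GET / HTTP/1.1" 200 42' \
  '1.2.3.4 - - [15/Jan/2019:09:10:11 +0000] "GET / HTTP/1.1" 200 42' \
  '1.2.3.4 - - [01/Feb/2019:00:01:02 +0000] "GET / HTTP/1.1" 200 42' \
  > /tmp/dates-sample.log
grep -cE '[0-9]{1,2}/Jan/2019' /tmp/dates-sample.log
```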
@ -198,7 +198,7 @@ sys 0m1.979s
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
@ -219,7 +219,7 @@ sys 0m1.979s
<li>This is seriously getting annoying: Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
@ -234,11 +234,11 @@ sys 0m1.979s
<li><code>45.5.184.2</code> is CIAT, <code>70.32.83.92</code> and <code>205.186.128.185</code> are Macaroni Bros harvesters for CCAFS, I think</li>
<li><code>195.201.104.240</code> is a new IP address in Germany with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre><ul>
<li>This user was making 2060 requests per minute this morning&hellip; seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
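One way to sketch that heuristic: tally requests per minute and flag any minute over a threshold, with no reference to the user agent at all. This is only an illustration on invented timestamps, not the deployed config:

```shell
# Extract minute-resolution timestamps (as the grep -o above does), then flag
# any minute with more than 2 requests (threshold and data are synthetic)
printf '%s\n' \
  '03/Feb/2019:07:12' '03/Feb/2019:07:12' '03/Feb/2019:07:12' '03/Feb/2019:07:13' \
  > /tmp/minutes-sample.log
sort /tmp/minutes-sample.log | uniq -c | awk '$1 > 2 {print "rate limit candidate:", $2}'
```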
@ -262,7 +262,7 @@ sys 0m1.979s
</code></pre><ul>
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
@ -280,14 +280,14 @@ sys 0m1.979s
<ul>
<li>Generate a list of CTA subjects from CGSpace for Peter:</li>
</ul>
<pre><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
<pre tabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
</code></pre><ul>
<li>Skype with Michael Victor about CKM and CGSpace</li>
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
@ -307,7 +307,7 @@ COPY 321
<li>Peter sent me corrections and deletions for the CTA subjects and as usual, there were encoding errors with some accents in his file</li>
<li>In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use <code>toString()</code>:</li>
</ul>
<pre><code>or(
<pre tabindex="0"><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
@ -318,17 +318,17 @@ COPY 321
</code></pre><ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
</code></pre><ul>
<li>I applied them on DSpace Test and CGSpace and started a full Discovery re-index:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
</ul>
<pre><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
<pre tabindex="0"><code>EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
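A quick shell sketch of what each of those corrections expands to (the file name and values are a synthetic sample):

```shell
# Expand "A||B" corrections into one value per line: tr turns each pipe into a
# newline and sed drops the empty line left between the doubled pipes
printf '%s\n' 'FISHERIES||AQUACULTURE' 'MARKETING||TRADE' > /tmp/multivalue-sample.csv
tr '|' '\n' < /tmp/multivalue-sample.csv | sed '/^$/d'
```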
@ -340,21 +340,21 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<ul>
<li>I dumped the CTA community so I can try to fix the subjects with multiple subjects that Peter indicated in his corrections:</li>
</ul>
<pre><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
<pre tabindex="0"><code>$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
</code></pre><ul>
<li>Then I used <code>csvcut</code> to get only the CTA subject columns:</li>
</ul>
<pre><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
<pre tabindex="0"><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
</code></pre><ul>
<li>After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values</li>
<li>Then I imported it back into CGSpace:</li>
</ul>
<pre><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
</code></pre><ul>
<li>Another day, another alert about high load on CGSpace (linode18) from Linode</li>
<li>This time the load average was 370% and the top ten IPs before, during, and after that time were:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Looking closer at the top users, I see <code>45.5.186.2</code> is in Brazil and was making over 100 requests per minute to the REST API:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
10411 200
1 301
7 302
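The status code lives in field 9 of nginx's combined log format, so the same awk-and-uniq tally used for IPs works for responses; a synthetic example:

```shell
# Tally HTTP status codes; in combined format the status is field 9
printf '%s\n' \
  '45.5.186.2 - - [06/Feb/2019:05:46:01 +0000] "GET /rest/items HTTP/1.1" 200 1024' \
  '45.5.186.2 - - [06/Feb/2019:05:46:02 +0000] "GET /rest/items HTTP/1.1" 200 1024' \
  '45.5.186.2 - - [06/Feb/2019:05:46:03 +0000] "GET /old HTTP/1.1" 301 0' \
  > /tmp/status-sample.log
awk '{print $9}' /tmp/status-sample.log | sort | uniq -c
```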
@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
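The reason one pipeline can cover both the current log and the rotated one (which may be gzipped) is zcat --force, which passes non-gzip files through unchanged; a minimal demonstration with invented file names:

```shell
# zcat --force decompresses .gz files and cats plain files as-is,
# so current and rotated logs can be read in a single stream
printf 'current line\n' > /tmp/demo.log
printf 'rotated line\n' | gzip -c > /tmp/demo.log.1.gz
zcat --force /tmp/demo.log /tmp/demo.log.1.gz
```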
@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Linode sent an alert last night that the load on CGSpace (linode18) was over 300%</li>
<li>Here are the top IPs in the web server and API logs before, during, and after that time, respectively:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.209
6 2a01:4f8:210:51ef::2
6 40.77.167.75
@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then again this morning another alert:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.223
8 104.198.9.108
13 110.54.160.222
@ -471,7 +471,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don&rsquo;t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
</code></pre><ul>
<li>Collection 1056 appears to be <a href="https://cgspace.cgiar.org/handle/10568/68741">IITA Posters and Presentations</a> and I see that its workflow step 1 (Accept/Reject) is empty:</li>
</ul>
@ -482,7 +482,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Bizuwork asked about the &ldquo;DSpace Submission Approved and Archived&rdquo; emails that stopped working last month</li>
<li>I tried the <code>test-email</code> command on DSpace and it indeed is not working:</li>
</ul>
<pre><code>$ dspace test-email
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: aorth@mjanja.ch
@ -503,7 +503,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the <code>test-email</code> script:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
</code></pre><ul>
<li>I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password</li>
@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
<li>Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!</li>
<li>This is just for this morning:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
289 35.237.175.180
290 66.249.66.221
296 18.195.78.144
@ -539,7 +539,7 @@ Please see the DSpace documentation for assistance.
<li>I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot</li>
<li>Ooh, but 151.80.203.180 is some malicious bot making requests for <code>/etc/passwd</code> like this:</li>
</ul>
<pre><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
<pre tabindex="0"><code>/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;amp;isAllowed=../etc/passwd
</code></pre><ul>
<li>151.80.203.180 is on OVH so I sent a message to their abuse email&hellip;</li>
</ul>
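Probes like that are straightforward to grep for across the logs; a sketch on an invented log snippet:

```shell
# Look for path-traversal probes (../etc/passwd) and print the offending IPs
printf '%s\n' \
  '151.80.203.180 "GET /bitstream/handle/10568/68981/x.pdf?sequence=1&isAllowed=../etc/passwd HTTP/1.1" 403' \
  '66.249.66.219 "GET /handle/10568/1 HTTP/1.1" 200' \
  > /tmp/probe-sample.log
grep -E '\.\./etc/passwd' /tmp/probe-sample.log | awk '{print $1}'
```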
@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
232 18.195.78.144
238 35.237.175.180
281 66.249.66.221
@ -572,14 +572,14 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>Another interesting thing might be the total number of requests for web and API services during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
15964
</code></pre><ul>
<li>Also, the number of unique IPs served during that time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
95
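Dropping uniq's -c and counting lines (or just using sort -u) turns the same pipeline into a distinct-IP count; a synthetic example:

```shell
# Total requests vs distinct client IPs on an invented snippet
printf '%s\n' '1.2.3.4 a' '1.2.3.4 b' '5.6.7.8 c' > /tmp/uniq-sample.log
awk '{print $1}' /tmp/uniq-sample.log | wc -l           # total requests
awk '{print $1}' /tmp/uniq-sample.log | sort -u | wc -l # distinct IPs
```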
@ -610,7 +610,7 @@ Please see the DSpace documentation for assistance.
</ul>
</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
</code></pre><ul>
<li>I&rsquo;m not sure why I didn&rsquo;t know about this configuration option before, and always maintained multiple configurations for development and production
@ -620,7 +620,7 @@ Please see the DSpace documentation for assistance.
</li>
<li>I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:</li>
</ul>
<pre><code># docker rm nexus
<pre tabindex="0"><code># docker rm nexus
# docker pull sonatype/nexus3
# mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
# chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
@ -628,7 +628,7 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>For some reason my <code>mvn package</code> for DSpace is not working now&hellip; I might go back to <a href="https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/">using Artifactory for caching</a> instead:</li>
</ul>
<pre><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
<pre tabindex="0"><code># docker pull docker.bintray.io/jfrog/artifactory-oss:latest
# mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
# chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
# docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -643,13 +643,13 @@ Please see the DSpace documentation for assistance.
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
</ul>
<pre><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
</code></pre><ul>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making &ldquo;top items&rdquo; endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I could use the following SQL queries very easily to get the top items by views or downloads:</li>
</ul>
<pre><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
<pre tabindex="0"><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads DESC LIMIT 10;
</code></pre><ul>
<li>I&rsquo;d have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code></li>
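The ORDER BY ... DESC LIMIT shape is just sort-and-take-the-top, which is easy to prototype in the shell on an invented id,views CSV before wiring anything into the API:

```shell
# Top 2 "items" by views from a synthetic id,views CSV (numbers are made up)
printf '%s\n' '10568/1,50' '10568/2,175' '10568/3,90' > /tmp/views.csv
sort -t, -k2,2 -rn /tmp/views.csv | head -n 2
```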
@ -660,7 +660,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
</ul>
</li>
</ul>
<pre><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
<pre tabindex="0"><code>$ identify -verbose alc_contrastes_desafios.pdf.jpg
...
Colorspace: sRGB
</code></pre><ul>
@ -671,35 +671,35 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can&rsquo;t get it to send mail from DSpace&rsquo;s <code>test-email</code> utility</li>
<li>I even added extra mail properties to <code>dspace.cfg</code> as suggested by someone on the dspace-tech mailing list:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
</code></pre><ul>
<li>But the result is still:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
</code></pre><ul>
<li>I tried to log into the Outlook 365 web mail and it doesn&rsquo;t work so I&rsquo;ve emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace&rsquo;s mail configuration to be simply:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.enable=true
<pre tabindex="0"><code>mail.extraproperties = mail.smtp.starttls.enable=true
</code></pre><ul>
<li>&hellip; and then I was able to send a mail using my personal account where I know the credentials work</li>
<li>The CGSpace account still gets this error message:</li>
</ul>
<pre><code>Error sending email:
<pre tabindex="0"><code>Error sending email:
- Error: javax.mail.AuthenticationFailedException
</code></pre><ul>
<li>I updated the <a href="https://github.com/ilri/DSpace/pull/410">DSpace SMTP settings in <code>dspace.cfg</code></a> as well as the <a href="https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c">variables in the DSpace role of the Ansible infrastructure scripts</a></li>
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre><code>$ dspace user --delete --email blah@cta.int
<pre tabindex="0"><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
</code></pre><ul>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
<li>Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):</li>
</ul>
<pre><code># podman pull postgres:9.6-alpine
<pre tabindex="0"><code># podman pull postgres:9.6-alpine
# podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
# podman pull docker.bintray.io/jfrog/artifactory-oss
# podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
@ -707,7 +707,7 @@ $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int
<li>Totally works&hellip; awesome!</li>
<li>Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:</li>
</ul>
<pre><code>$ sudo touch /etc/subuid /etc/subgid
<pre tabindex="0"><code>$ sudo touch /etc/subuid /etc/subgid
$ usermod --add-subuids 10000-75535 aorth
$ usermod --add-subgids 10000-75535 aorth
$ sudo sysctl kernel.unprivileged_userns_clone=1
@ -717,7 +717,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Which totally works, but Podman&rsquo;s rootless support doesn&rsquo;t work with port mappings yet&hellip;</li>
<li>Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:</li>
</ul>
<pre><code># systemctl stop tomcat7
<pre tabindex="0"><code># systemctl stop tomcat7
# apt remove tomcat7 tomcat7-admin
# useradd -m -r -s /bin/bash dspace
# mv /usr/share/tomcat7/.m2 /home/dspace
@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
</ul>
<pre><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>The issue last month was the address space limit, which is now set as <code>LimitAS=infinity</code> in <code>tomcat7.service</code>&hellip;</li>
<li>I re-ran the Ansible playbook to make sure all configs etc were the same, then rebooted the server</li>
<li>Still the error persists after reboot</li>
<li>I will try to stop Tomcat and then remove the locks manually:</li>
</ul>
<pre><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
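A find ... -delete like that is worth rehearsing on a throwaway tree first; the /tmp path here is hypothetical:

```shell
# Rehearse the stale-lock cleanup on a disposable directory before touching real Solr data
mkdir -p /tmp/solr-demo/statistics-2018/data/index
touch /tmp/solr-demo/statistics-2018/data/index/write.lock
find /tmp/solr-demo/ -iname "write.lock" -delete
```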
@ -747,19 +747,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<ul>
<li>Tomcat was killed around 3AM by the kernel&rsquo;s OOM killer according to <code>dmesg</code>:</li>
</ul>
<pre><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
<pre tabindex="0"><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
[Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
</code></pre><ul>
<li>The <code>tomcat7</code> service shows:</li>
</ul>
<pre><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
<pre tabindex="0"><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
</code></pre><ul>
<li>I suspect it was related to the media-filter cron job that runs at 3AM but I don&rsquo;t see anything particular in the log files</li>
<li>I want to try to normalize the <code>text_lang</code> values to make working with metadata easier</li>
<li>We currently have a bunch of weird values that DSpace uses like <code>NULL</code>, <code>en_US</code>, and <code>en</code> and others that have been entered manually by editors:</li>
</ul>
<pre><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
| 1069539
@ -778,7 +778,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<li>Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!</li>
<li>I&rsquo;m going to normalized these to <code>NULL</code> at least on DSpace Test for now:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
</code></pre><ul>
<li>I started proofing IITA&rsquo;s 2019-01 records that Sisay uploaded this week
@ -790,7 +790,7 @@ UPDATE 1045410
<li>ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works</li>
<li>Re-create my local PostgreSQL container to for new PostgreSQL version and to use podman&rsquo;s volumes:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
<pre tabindex="0"><code>$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
</ul>
<pre tabindex="0"><code>$ podman volume create artifactory_data
artifactory_data
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
$ buildah unshare
<ul>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(162844) is still referenced from table &quot;bundle&quot;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
UPDATE 1
</code></pre><ul>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
<li>Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):</li>
<li>There seems to have been a lot of activity in XMLUI:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1236 18.212.208.240
1276 54.164.83.99
1277 3.83.14.11
<li>94.71.244.172 is in Greece and uses the user agent &ldquo;Indy Library&rdquo;</li>
<li>At least they are re-using their Tomcat session:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
</code></pre><ul>
<li>
<p>The following IPs were all hitting the server hard simultaneously; they are located on Amazon and use the user agent &ldquo;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&rdquo;:</p>
<p>For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:</p>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
1173 52.91.249.23
1176 107.22.118.106
1178 3.88.173.152
</code></pre><ul>
<li>In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
10 18/Feb/2019:17:20
10 18/Feb/2019:17:22
10 18/Feb/2019:17:31
<li>As this user agent is not recognized as a bot by DSpace, this will definitely fuck up the usage statistics</li>
<li>There were 92,000 requests from these IPs alone today!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
92756
</code></pre><ul>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">&ldquo;badbots&rdquo; rate limiting in our nginx configuration</a></li>
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
11541 18.212.208.240
11560 3.81.136.184
11562 3.88.237.84
</li>
<li>The top requests in the API logs today are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
42 66.249.66.221
44 156.156.81.215
55 3.85.54.129
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
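<li>For a rough idea of how agent-based spider detection works, each pattern in the list is matched case-insensitively against the User-Agent string; the pattern file below is a made-up illustration, not DSpace&rsquo;s shipped list:</li>
</ul>

```shell
# Illustrative spider patterns (NOT DSpace's real list); one pattern per line
cat > /tmp/agents.txt <<'EOF'
bot
crawl
spider
EOF
# This "normal" browser agent matches none of them, so it would be counted as
# a human view in the usage statistics
ua='Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
printf '%s\n' "$ua" | grep -qif /tmp/agents.txt && echo "bot" || echo "not a bot"
# not a bot
```

<ul>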
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
185
</code></pre><ul>
<li>Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
346
</code></pre><ul>
<li>In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
1 139.162.146.60
1 157.55.39.159
1 196.188.127.94
</code></pre><ul>
<li>That is so weird, they are all using this Android user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
</code></pre><ul>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization&rsquo;s name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a></li>
</ul>
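<ul>
<li>For reference, this is roughly what the script does per address; the endpoint path is <code>https://ipapi.co/&lt;ip&gt;/json/</code> per their public docs, and the response body below is a hand-made sample with illustrative values rather than a real lookup:</li>
</ul>

```shell
# Hand-made sample of an ipapi.co response body (illustrative values only);
# the live equivalent would be: curl -s https://ipapi.co/41.190.30.105/json/
cat > /tmp/ipapi-sample.json <<'EOF'
{"ip": "41.190.30.105", "org": "Example Telecom Ltd", "asn": "AS64496", "country_name": "Nigeria"}
EOF
# Pull out the organization, ASN, and country fields the script cares about
grep -oE '"(org|asn|country_name)": "[^"]+"' /tmp/ipapi-sample.json
# "org": "Example Telecom Ltd"
# "asn": "AS64496"
# "country_name": "Nigeria"
```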
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: null}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;en_US&quot;}'
</code></pre><ul>
<li>It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to</li>
<li>I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:</li>
</ul>
<pre tabindex="0"><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
</code></pre><ul>
<li>Then I generated a list of all the unique matched terms:</li>
</ul>
<pre tabindex="0"><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
</code></pre><ul>
<li>And then a list of all the unique <em>unmatched</em> terms, using a utility I&rsquo;d never heard of before called <code>comm</code>, or alternatively with <code>diff</code>:</li>
</ul>
<pre tabindex="0"><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&quot;&quot; --unchanged-line-format=&quot;&quot; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
</code></pre><ul>
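<li>To convince myself of the <code>comm</code> semantics with a throwaway example (both inputs must be sorted; <code>-1</code> suppresses lines only in the first file and <code>-3</code> suppresses lines common to both, so <code>-13</code> leaves the lines unique to the second file):</li>
</ul>

```shell
# Throwaway example: terms AGROVOC matched vs. all terms we looked up
printf 'FISH\nMAIZE\n' > /tmp/matched-example.txt
printf 'CORN\nFISH\nMAIZE\n' > /tmp/all-example.txt
# -13 prints only the lines unique to the second file, i.e. the unmatched terms
comm -13 /tmp/matched-example.txt /tmp/all-example.txt
# CORN
```

<ul>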
<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
<p>I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:</p>
</li>
</ul>
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
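<li>One option would be to map such hits to the <code>prefLabel</code> of the returned result; sketched here against a hand-made, abridged copy of that response:</li>
</ul>

```shell
# Hand-made, abridged copy of an AGROVOC search response for CORN*
cat > /tmp/agrovoc-corn.json <<'EOF'
{"results": [{"altLabel": "corn (maize)", "prefLabel": "maize", "lang": "en"}]}
EOF
# Map the matched altLabel to its preferred term
grep -oE '"prefLabel": "[^"]+"' /tmp/agrovoc-corn.json
# "prefLabel": "maize"
```

<ul>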
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre tabindex="0"><code> &quot;results&quot;: [
{
&quot;altLabel&quot;: &quot;corn (maize)&quot;,
&quot;lang&quot;: &quot;en&quot;,
<li>There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year</li>
<li>I see some errors started recently in Solr (yesterday):</li>
</ul>
<pre tabindex="0"><code>$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
/home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-13.xz:0
<li>But I don&rsquo;t see anything interesting in yesterday&rsquo;s Solr log&hellip;</li>
<li>I see this in the Tomcat 7 logs yesterday:</li>
</ul>
<pre tabindex="0"><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
Feb 25 21:09:29 linode18 tomcat7[1015]: at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
<li>In the Solr admin GUI I see we have the following error: &ldquo;statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:</li>
</ul>
<pre tabindex="0"><code>Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
Feb 25 21:37:49 linode18 tomcat7[28363]: at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
<li>Also, now the Solr admin UI says &ldquo;statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>In the Solr log I see:</li>
</ul>
<pre tabindex="0"><code>2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:873)
at org.apache.solr.core.SolrCore.&lt;init&gt;(SolrCore.java:646)
</code></pre><ul>
<li>I tried to shut down Tomcat and remove the locks:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &quot;*.lock&quot; -delete
# systemctl start tomcat7
</code></pre><ul>
<li>I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the <code>LimitAS</code> setting does work, and the <code>infinity</code> setting in systemd does get translated to &ldquo;unlimited&rdquo; on the service</li>
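<li>One generic way to check what a process actually got is <code>/proc</code>; &ldquo;Max address space&rdquo; is the limit that <code>LimitAS</code> controls, and with <code>LimitAS=infinity</code> a service reports &ldquo;unlimited&rdquo; there:</li>
</ul>

```shell
# Show the address-space limit of the current shell; inside a service started
# with LimitAS=infinity the soft and hard values read "unlimited"
grep 'Max address space' /proc/self/limits
```

<ul>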
<li>I thought it might be open file limit, but it seems we&rsquo;re nowhere near the current limit of 16384:</li>
</ul>
<pre tabindex="0"><code># lsof -u dspace | wc -l
3016
</code></pre><ul>
<li>For what it&rsquo;s worth I see the same errors about <code>solr_update_time_stamp</code> on DSpace Test (linode19)</li>
<li>I sent a mail to the dspace-tech mailing list about the &ldquo;solr_update_time_stamp&rdquo; error</li>
<li>A CCAFS user sent a message saying they got this error when submitting to CGSpace:</li>
</ul>
<pre tabindex="0"><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
</code></pre><ul>
<li>According to the <a href="https://cgspace.cgiar.org/rest/collections/1021">REST API</a> collection 1021 appears to be <a href="https://cgspace.cgiar.org/handle/10568/66581">CCAFS Tools, Maps, Datasets and Models</a></li>
<li>I looked at the <code>WORKFLOW_STEP_1</code> (Accept/Reject) and the group is of course empty</li>
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn&rsquo;t exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file&rsquo;s name:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
</code></pre><ul>
<li>Why don&rsquo;t they just derive the directory from the path to the zip file?</li>
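<li>Deriving both arguments from a single path would be trivial, for example:</li>
</ul>

```shell
# The -s (directory) and -z (zip name) arguments could be derived from one path
zippath=/home/aorth/Downloads/2019-02-27-test/SimpleArchiveFormat.zip
dirname "$zippath"   # /home/aorth/Downloads/2019-02-27-test
basename "$zippath"  # SimpleArchiveFormat.zip
```

<ul>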
<li>Working on Udana&rsquo;s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11, fixing many of the same problems that I fixed then
<ul>
<li>I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface)</li>
</ul>
<pre tabindex="0"><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
</code></pre><ul>
<li>Mails from CGSpace stopped working; it looks like ICT changed the password again or we got locked out <em>sigh</em></li>
<li>Now I&rsquo;m getting this message when trying to use DSpace&rsquo;s <code>test-email</code> script:</li>
</ul>
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: stfu@google.com