Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@ -12,7 +12,7 @@
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -28,7 +28,7 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243
real 0m19.873s
@ -49,7 +49,7 @@ sys 0m1.979s
Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -65,14 +65,14 @@ The top IPs before, during, and after this latest alert tonight were:
The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243
real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -163,7 +163,7 @@ sys 0m1.979s
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -179,7 +179,7 @@ sys 0m1.979s
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s
@ -198,7 +198,7 @@ sys 0m1.979s
<ul>
<li>Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Feb/2019:0(1|2|3|4|5)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Feb/2019:0(1|2|3|4|5)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
284 18.195.78.144
329 207.46.13.32
417 35.237.175.180
@ -219,7 +219,7 @@ sys 0m1.979s
<li>This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!</li>
<li>Here are the top IPs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
325 85.25.237.71
340 45.5.184.72
431 5.143.231.8
@ -238,7 +238,7 @@ sys 0m1.979s
</code></pre><ul>
<li>This user was making 20 to 60 requests per minute this morning&hellip; seems like I should try to block this type of behavior heuristically, regardless of user agent!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Feb/2019&quot; | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;03/Feb/2019&#34; | grep 195.201.104.240 | grep -o -E &#39;03/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 20
19 03/Feb/2019:07:42
20 03/Feb/2019:07:12
21 03/Feb/2019:07:27
@ -262,7 +262,7 @@ sys 0m1.979s
</code></pre><ul>
<li>At least they re-used their Tomcat session!</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240&#39; dspace.log.2019-02-03 | sort | uniq | wc -l
1
</code></pre><ul>
<li>This user was making requests to <code>/browse</code>, which is not currently under the existing rate limiting of dynamic pages in our nginx config
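<li>A quick way to confirm how much of that client&rsquo;s traffic was actually hitting <code>/browse</code> (just a sketch, using the same nginx logs as above):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 195.201.104.240 | grep -c &#39;GET /browse&#39;
</code></pre>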
@ -287,7 +287,7 @@ COPY 321
<li>Discuss the new IITA research theme field with Abenet and decide that we should use <code>cg.identifier.iitatheme</code></li>
<li>This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;04/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;04/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
589 2a01:4f8:140:3192::2
762 66.249.66.219
889 35.237.175.180
@ -318,12 +318,12 @@ COPY 321
</code></pre><ul>
<li>Testing the corrections for sixty-five items and sixteen deletions using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> and <a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a> scripts:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p &#39;fuu&#39; -d
</code></pre><ul>
<li>I applied them on DSpace Test and CGSpace and started a full Discovery re-index:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Peter had marked several terms with <code>||</code> to indicate multiple values in his corrections so I will have to go back and do those manually:</li>
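<li>One way to pull out just the rows that contain <code>||</code> is csvkit&rsquo;s <code>csvgrep</code> (a sketch, assuming the corrections column is named <code>CORRECT</code> as in the <code>fix-metadata-values.py</code> invocation above):</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c CORRECT -r &#39;\|\|&#39; 2019-02-04-Correct-65-CTA-Subjects.csv | csvlook  # column name assumed
</code></pre>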
@ -344,7 +344,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then I used <code>csvcut</code> to get only the CTA subject columns:</li>
</ul>
<pre tabindex="0"><code>$ csvcut -c &quot;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&quot; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
<pre tabindex="0"><code>$ csvcut -c &#34;id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]&#34; /tmp/cta.csv &gt; /tmp/cta-subjects.csv
</code></pre><ul>
<li>After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values</li>
<li>Then I imported it back into CGSpace:</li>
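<li>A minimal sketch of that kind of import with DSpace&rsquo;s <code>metadata-import</code> tool, using a hypothetical CSV path and eperson email:</li>
</ul>
<pre tabindex="0"><code>$ dspace metadata-import -f /tmp/cta-subjects-edited.csv -e user@example.com  # hypothetical path and email
</code></pre>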
@ -354,7 +354,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Another day, another alert about high load on CGSpace (linode18) from Linode</li>
<li>This time the load average was 370% and the top ten IPs before, during, and after that time were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
689 35.237.175.180
1236 5.9.6.51
1305 34.218.226.147
@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Looking closer at the top users, I see <code>45.5.186.2</code> is in Brazil and was making over 100 requests per minute to the REST API:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E &#39;06/Feb/2019:0[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
118 06/Feb/2019:05:46
119 06/Feb/2019:05:37
119 06/Feb/2019:05:47
@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#39;06/Feb/2019&#39; | grep 45.5.186.2 | awk &#39;{print $9}&#39; | sort | uniq -c
10411 200
1 301
7 302
@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
328 220.247.212.35
372 66.249.66.221
380 207.46.13.2
@ -403,7 +403,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
1236 5.9.6.51
1554 66.249.66.219
4942 85.25.237.71
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
10 66.249.66.221
26 66.249.66.219
69 5.143.231.8
@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
<li>Linode sent an alert last night that the load on CGSpace (linode18) was over 300%</li>
<li>Here are the top IPs in the web server and API logs before, during, and after that time, respectively:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.209
6 2a01:4f8:210:51ef::2
6 40.77.167.75
@ -430,7 +430,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
20 95.108.181.88
27 66.249.66.219
2381 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Feb/2019:(17|18|19|20|23)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;06/Feb/2019:(17|18|19|20|23)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
455 45.5.186.2
506 40.77.167.75
559 54.70.40.11
@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</code></pre><ul>
<li>Then again this morning another alert:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
5 66.249.66.223
8 104.198.9.108
13 110.54.160.222
@ -455,7 +455,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
4529 45.5.186.2
4661 205.186.128.185
4661 70.32.83.92
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;07/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;07/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
145 157.55.39.237
154 66.249.66.221
214 34.218.226.147
@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
<li>Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!</li>
<li>This is just for this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
289 35.237.175.180
290 66.249.66.221
296 18.195.78.144
@ -524,7 +524,7 @@ Please see the DSpace documentation for assistance.
742 5.143.231.38
1046 5.9.6.51
1331 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;09/Feb/2019:(07|08|09|10|11)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;09/Feb/2019:(07|08|09|10|11)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
4 66.249.83.30
5 49.149.10.16
8 207.46.13.64
@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
<ul>
<li>Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
232 18.195.78.144
238 35.237.175.180
281 66.249.66.221
@ -558,7 +558,7 @@ Please see the DSpace documentation for assistance.
444 2a01:4f8:140:3192::2
1171 5.9.6.51
1196 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
6 112.203.241.69
7 157.55.39.149
9 40.77.167.178
@ -572,16 +572,16 @@ Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>Another interesting thing might be the total number of requests for web and API services during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &quot;10/Feb/2019:0(5|6|7|8|9)&quot;
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE &#34;10/Feb/2019:0(5|6|7|8|9)&#34;
15964
</code></pre><ul>
<li>Also, the number of unique IPs served during that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;10/Feb/2019:0(5|6|7|8|9)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
95
</code></pre><ul>
<li>It&rsquo;s very clear to me now that the API requests are the heaviest!</li>
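<li>A rough per-client average from the two counts above makes the same point: about ten requests per XMLUI IP versus nearly 170 per API IP, for example:</li>
</ul>
<pre tabindex="0"><code>$ echo &#34;$((16333 / 1622)) requests per XMLUI IP vs $((15964 / 95)) per API IP&#34;
10 requests per XMLUI IP vs 168 per API IP
</code></pre>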
@ -643,7 +643,7 @@ Please see the DSpace documentation for assistance.
<li>On a similar note, I wonder if we could use the performance-focused <a href="https://libvips.github.io/libvips/">libvips</a> and the third-party <a href="https://github.com/codecitizen/jlibvips/">jlibvips Java library</a> in DSpace</li>
<li>Testing the <code>vipsthumbnail</code> command line tool with <a href="https://cgspace.cgiar.org/handle/10568/51999">this CGSpace item that uses CMYK</a>:</li>
</ul>
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
<pre tabindex="0"><code>$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o &#39;%s.jpg[Q=92,optimize_coding,strip]&#39;
</code></pre><ul>
<li>(DSpace 5 appears to use JPEG 92 quality so I do the same)</li>
<li>Thinking about making &ldquo;top items&rdquo; endpoints in my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
@ -693,7 +693,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<li>Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:</li>
</ul>
<pre tabindex="0"><code>$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password &#39;blah&#39;
</code></pre><ul>
<li>On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable <code>webui.user.assumelogin = true</code></li>
<li>I will enable this on CGSpace (<a href="https://github.com/ilri/DSpace/pull/411">#411</a>)</li>
@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:</li>
</ul>
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
<pre tabindex="0"><code>2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2018&#39;: Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre><ul>
<li>The issue last month was address space, which is now set as <code>LimitAS=infinity</code> in <code>tomcat7.service</code>&hellip;</li>
<li>I re-ran the Ansible playbook to make sure all configs etc. were the same, then rebooted the server</li>
<li>Still the error persists after reboot</li>
<li>I will try to stop Tomcat and then remove the locks manually:</li>
</ul>
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &#34;write.lock&#34; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
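<li>For the record, a quick way to see how old such leftover locks are before deleting them (a sketch using GNU find&rsquo;s <code>-printf</code>):</li>
</ul>
<pre tabindex="0"><code># find /home/cgspace.cgiar.org/solr/ -iname &#34;write.lock&#34; -printf &#39;%TY-%Tm-%Td %p\n&#39; | sort
</code></pre>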
@ -795,10 +795,10 @@ $ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest superuser;&#39;
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost dspace_2019-02-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspacetest -c &#39;alter user dspacetest nosuperuser;&#39;
</code></pre><ul>
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
@ -818,12 +818,12 @@ $ podman start artifactory
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(162844) is still referenced from table &quot;bundle&quot;.
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(162844) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
<pre tabindex="0"><code>$ psql dspace -c &#39;update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);&#39;
UPDATE 1
</code></pre><ul>
<li>I merged the Atmire Metadata Quality Module (MQM) changes to the <code>5_x-prod</code> branch and deployed it on CGSpace (<a href="https://github.com/ilri/DSpace/pull/407">#407</a>)</li>
@ -834,7 +834,7 @@ UPDATE 1
<li>Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):</li>
<li>There seems to have been a lot of activity in XMLUI:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1236 18.212.208.240
1276 54.164.83.99
1277 3.83.14.11
@ -845,7 +845,7 @@ UPDATE 1
1327 52.54.252.47
1477 5.9.6.51
1861 94.71.244.172
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
8 42.112.238.64
9 121.52.152.3
9 157.55.39.50
@ -856,15 +856,15 @@ UPDATE 1
28 66.249.66.219
43 34.209.213.122
178 50.116.102.77
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
2727
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq | wc -l
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq | wc -l
186
</code></pre><ul>
<li>94.71.244.172 is in Greece and uses the user agent &ldquo;Indy Library&rdquo;</li>
<li>At least they are re-using their Tomcat session:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
<pre tabindex="0"><code>$ grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172&#39; dspace.log.2019-02-18 | sort | uniq | wc -l
</code></pre><ul>
<li>
<p>The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent &ldquo;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&rdquo;:</p>
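<p>They can be listed with the same log pipeline as above, filtered on that exact user agent string (a sketch):</p>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep &#39;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 30
</code></pre>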
@ -886,7 +886,7 @@ UPDATE 1
<p>For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:</p>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;18/Feb/2019:1(2|3|4|5|6)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;18/Feb/2019:1(2|3|4|5|6)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 30
1173 52.91.249.23
1176 107.22.118.106
1178 3.88.173.152
@ -920,7 +920,7 @@ UPDATE 1
</code></pre><ul>
<li>In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E &#39;18/Feb/2019:1[0-9]:[0-9][0-9]&#39; | uniq -c | sort -n | tail -n 10
10 18/Feb/2019:17:20
10 18/Feb/2019:17:22
10 18/Feb/2019:17:31
@ -935,7 +935,7 @@ UPDATE 1
<li>As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics</li>
<li>There were 92,000 requests from these IPs alone today!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c &#39;Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0&#39;
92756
</code></pre><ul>
<li>I will add this user agent to the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2">&ldquo;badbots&rdquo; rate limiting in our nginx configuration</a></li>
@ -943,7 +943,7 @@ UPDATE 1
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt | grep -oE &#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39; | sort | uniq &gt; /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
11541 18.212.208.240
11560 3.81.136.184
11562 3.88.237.84
@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
<li>The top requests in the API logs today are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &#34;19/Feb/2019:&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
42 66.249.66.221
44 156.156.81.215
55 3.85.54.129
@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate</li>
<li>I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from <a href="https://hdl.handle.net/10568/96140">10568/96140</a> almost 200 times:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.30.105 /var/log/nginx/access.log | grep -c &#39;acgg_progress_report.pdf&#39;
185
</code></pre><ul>
<li>Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:</li>
</ul>
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
<pre tabindex="0"><code># grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c &#39;acgg_progress_report.pdf&#39;
346
</code></pre><ul>
<li>In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v &#39;upstream response is buffered&#39; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n
1 139.162.146.60
1 157.55.39.159
1 196.188.127.94
@ -1042,9 +1042,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: null}'
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;en_US&quot;}'
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;&#34;}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: null}&#39;
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.creator.id&#34;,&#34;value&#34;: &#34;Alan S. Orth: 0000-0002-1735-7458&#34;, &#34;language&#34;: &#34;en_US&#34;}&#39;
</code></pre><ul>
<li>This returns six items for me, which is the <a href="https://cgspace.cgiar.org/discover?filtertype_1=orcid&amp;filter_relational_operator_1=contains&amp;filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&amp;submit_apply_filter=&amp;query=">same I see in a Discovery search</a></li>
<li>Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
@ -1075,7 +1075,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
</ul>
<pre tabindex="0"><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&quot;&quot; --unchanged-line-format=&quot;&quot; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format=&#34;&#34; --unchanged-line-format=&#34;&#34; /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
</code></pre><ul>
<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
</ul>
@ -1129,15 +1129,15 @@ import re
import urllib
import urllib2
pattern = re.compile('^S[A-Z ]+$')
pattern = re.compile(&#39;^S[A-Z ]+$&#39;)
if pattern.match(value):
url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&amp;lang=en'
url = &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=&#39; + urllib.quote_plus(value) + &#39;&amp;lang=en&#39;
get = urllib2.urlopen(url)
data = json.load(get)
if len(data['results']) == 1:
return &quot;matched&quot;
if len(data[&#39;results&#39;]) == 1:
return &#34;matched&#34;
return &quot;unmatched&quot;
return &#34;unmatched&#34;
</code></pre><ul>
<li>You have to make sure to URL encode the value with <code>quote_plus()</code> and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable</li>
<li>There is a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json">good resource discussing OpenRefine, Jython, and web scraping</a></li>
@ -1148,16 +1148,16 @@ return &quot;unmatched&quot;
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre tabindex="0"><code> &quot;results&quot;: [
<pre tabindex="0"><code> &#34;results&#34;: [
{
&quot;altLabel&quot;: &quot;corn (maize)&quot;,
&quot;lang&quot;: &quot;en&quot;,
&quot;prefLabel&quot;: &quot;maize&quot;,
&quot;type&quot;: [
&quot;skos:Concept&quot;
&#34;altLabel&#34;: &#34;corn (maize)&#34;,
&#34;lang&#34;: &#34;en&#34;,
&#34;prefLabel&#34;: &#34;maize&#34;,
&#34;type&#34;: [
&#34;skos:Concept&#34;
],
&quot;uri&quot;: &quot;http://aims.fao.org/aos/agrovoc/c_12332&quot;,
&quot;vocab&quot;: &quot;agrovoc&quot;
&#34;uri&#34;: &#34;http://aims.fao.org/aos/agrovoc/c_12332&#34;,
&#34;vocab&#34;: &#34;agrovoc&#34;
},
</code></pre><ul>
<li>There are dozens of other entries like &ldquo;corn (soft wheat)&rdquo;, &ldquo;corn (zea)&rdquo;, &ldquo;corn bran&rdquo;, &ldquo;Cornales&rdquo;, etc. that could potentially match, and determining programmatically whether they are related is difficult</li>
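<li>One possible heuristic (a sketch, assuming <code>jq</code> is available) is to keep only the hits that did not come in via an <code>altLabel</code>, since those are matches on the preferred term itself:</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en&#39; | jq -r &#39;.results[] | select(.altLabel == null) | .prefLabel&#39;
</code></pre>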
@ -1239,12 +1239,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 33 more
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore &#39;statistics-2015&#39;: Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
</code></pre><ul>
<li>I tried to shutdown Tomcat and remove the locks:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname &quot;*.lock&quot; -delete
# find /home/cgspace.cgiar.org/solr -iname &#34;*.lock&#34; -delete
# systemctl start tomcat7
</code></pre><ul>
<li>&hellip; but the problem still occurs</li>
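<li>A quick way to see which statistics cores actually loaded after a restart is Solr&rsquo;s core admin status endpoint (a sketch; the localhost port is an assumption and should be whatever Tomcat listens on):</li>
</ul>
<pre tabindex="0"><code>$ curl -s &#39;http://localhost:8081/solr/admin/cores?action=STATUS&amp;wt=json&#39; | jq -r &#39;.status | keys[]&#39;  # port is an assumption
</code></pre>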