---
title: "February, 2019"
date: 2019-02-01T21:37:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2019-02-01

- Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
- The top IPs before, during, and after this latest alert tonight were:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71
```

- `85.25.237.71` is the "Linguee Bot" that I first saw last month
- The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
- There were just over 3 million accesses in the nginx logs last month:

```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
```

<!--more-->

- Normally I'd say this was very high, but [about this time last year]({{< relref "2018-02.md" >}}) I remember thinking the same thing when we had 3.1 million...
- I will have to keep an eye on this to see if there is some error in Solr...
- Atmire sent their [pull request to re-enable the Metadata Quality Module (MQM) on our `5_x-dev` branch](https://github.com/ilri/DSpace/pull/407) today
  - I will test it next week and send them feedback

## 2019-02-02

- Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    284 18.195.78.144
    329 207.46.13.32
    417 35.237.175.180
    448 34.218.226.147
    694 2a01:4f8:13b:1296::2
    718 2a01:4f8:140:3192::2
    786 137.108.70.14
   1002 5.9.6.51
   6077 85.25.237.71
   8726 45.5.184.2
```

- `45.5.184.2` is CIAT and `85.25.237.71` is the new Linguee bot that I first noticed a few days ago
- I will increase the Linode alert threshold from 275 to 300% because this is becoming too much!
- I tested the Atmire Metadata Quality Module (MQM)'s duplicate checker on some [WLE items](https://dspacetest.cgiar.org/handle/10568/81268) that I helped Udana with a few months ago on DSpace Test (linode19) and indeed it found many duplicates!

## 2019-02-03

- This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!
- Here are the top IPs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    325 85.25.237.71
    340 45.5.184.72
    431 5.143.231.8
    756 5.9.6.51
   1048 34.218.226.147
   1203 66.249.66.219
   1496 195.201.104.240
   4658 205.186.128.185
   4658 70.32.83.92
   4852 45.5.184.2
```

- `45.5.184.2` is CIAT, `70.32.83.92` and `205.186.128.185` are Macaroni Bros harvesters for CCAFS I think
- `195.201.104.240` is a new IP address in Germany with the following user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
```

- This user was making 20–60 requests per minute this morning... seems like I should try to block this type of behavior heuristically, regardless of user agent!

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
     19 03/Feb/2019:07:42
     20 03/Feb/2019:07:12
     21 03/Feb/2019:07:27
     21 03/Feb/2019:07:28
     25 03/Feb/2019:07:23
     25 03/Feb/2019:07:29
     26 03/Feb/2019:07:33
     28 03/Feb/2019:07:38
     30 03/Feb/2019:07:31
     33 03/Feb/2019:07:35
     33 03/Feb/2019:07:37
     38 03/Feb/2019:07:40
     43 03/Feb/2019:07:24
     43 03/Feb/2019:07:32
     46 03/Feb/2019:07:36
     47 03/Feb/2019:07:34
     47 03/Feb/2019:07:39
     47 03/Feb/2019:07:41
     51 03/Feb/2019:07:26
     59 03/Feb/2019:07:25
```
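
- A minimal sketch of what such a heuristic could look like in Python: bucket one client's requests by minute and keep only the busy minutes. It assumes nginx's combined log format; the function name, the threshold, and the sample lines (which use a documentation-only address) are hypothetical, not something from our nginx or DSpace configuration:

```python
import re
from collections import Counter

# Captures the timestamp of a combined-format log line down to the minute,
# e.g. "[03/Feb/2019:07:25:13 +0100]" -> "03/Feb/2019:07:25"
MINUTE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} ")

def busy_minutes(lines, ip, threshold):
    """Return {minute: count} for minutes where `ip` made at least `threshold` requests."""
    per_minute = Counter()
    for line in lines:
        # The client address is the first space-separated field
        if line.split(" ", 1)[0] != ip:
            continue
        match = MINUTE_RE.search(line)
        if match:
            per_minute[match.group(1)] += 1
    return {minute: count for minute, count in per_minute.items() if count >= threshold}

# Made-up sample lines (192.0.2.0/24 is reserved for documentation)
sample = [
    '192.0.2.10 - - [03/Feb/2019:07:25:01 +0100] "GET /browse HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [03/Feb/2019:07:25:20 +0100] "GET /browse HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [03/Feb/2019:07:25:45 +0100] "GET /browse HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [03/Feb/2019:07:26:02 +0100] "GET /browse HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.20 - - [03/Feb/2019:07:25:10 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(busy_minutes(sample, "192.0.2.10", threshold=3))
```

- Unlike the `uniq -c` pipeline above, this does not depend on the log lines being in order, because it counts with a dictionary rather than adjacent duplicates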

- At least they re-used their Tomcat session!

```
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
1
```

- This user was making requests to `/browse`, which is not currently under the existing rate limiting of dynamic pages in our nginx config
  - I [extended the existing `dynamicpages` (12/m) rate limit to `/browse` and `/discover`](https://github.com/ilri/rmg-ansible-public/commit/36dfb072d6724fb5cdc81ef79cab08ed9ce427ad) with an allowance for bursting of up to five requests for "real" users
- Run all system updates on linode20 and reboot it
  - This will be the new AReS repository explorer server soon
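- For reference on the `dynamicpages` limit above, here is a rough Python model of how a 12 requests/minute limit with a burst of five behaves. It is loosely based on the leaky bucket approach that nginx documents for `limit_req`, but it is a simplified sketch and not nginx's implementation: the class name is hypothetical, and excess requests are simply rejected instead of being delayed:

```python
class LeakyBucket:
    """Simplified model of a rate limit with a burst allowance (no queueing or delay)."""

    def __init__(self, rate_per_minute, burst):
        self.rate = rate_per_minute / 60.0  # requests drained per second
        self.burst = burst
        self.excess = 0.0   # requests accumulated beyond the steady rate
        self.last = None    # time of the last accepted request, in seconds

    def allow(self, now):
        if self.last is None:
            # The first request from a client is always accepted
            self.last = now
            return True
        drained = self.rate * (now - self.last)
        excess = max(0.0, self.excess - drained + 1.0)
        if excess > self.burst:
            return False  # over the burst allowance: reject and leave the state untouched
        self.excess = excess
        self.last = now
        return True

bucket = LeakyBucket(rate_per_minute=12, burst=5)
# Eight requests arriving at the same instant: the first plus a burst of five pass
burst_results = [bucket.allow(0.0) for _ in range(8)]
print(burst_results)
# Ten seconds later the bucket has drained enough to accept requests again
print(bucket.allow(10.0))
```

- In this model a client that stays at or below one request every five seconds is never rejected, while a client hammering `/browse` runs out of burst allowance almost immediately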

## 2019-02-04

- Generate a list of CTA subjects from CGSpace for Peter:

```
dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
COPY 321
```

- Skype with Michael Victor about CKM and CGSpace
- Discuss the new IITA research theme field with Abenet and decide that we should use `cg.identifier.iitatheme`
- This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    589 2a01:4f8:140:3192::2
    762 66.249.66.219
    889 35.237.175.180
   1332 34.218.226.147
   1393 5.9.6.51
   1940 50.116.102.77
   3578 85.25.237.71
   4311 45.5.184.2
   4658 205.186.128.185
   4658 70.32.83.92
```

- At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there's nothing we can do to improve REST API performance!
- Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?

## 2019-02-05

- Peter sent me corrections and deletions for the CTA subjects and as usual, there were encoding errors with some accents in his file
- In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use `toString()`:

```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/)),
  isNotNull(value.match(/.*\u007e.*/))
).toString()
```
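
- Roughly the same check can be sketched outside OpenRefine in Python, for example to pre-screen a CSV before importing it. The function name is hypothetical and the sample strings are made up; only the six code points come from the GREL expression above:

```python
import re

# Same code points as the GREL expression:
#   U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
#   U+2019 right single quotation mark, U+00B4 acute accent, U+007E tilde
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4\u007E]")

def has_suspicious_chars(value):
    """Return True if the value contains any character that often signals an encoding problem."""
    return SUSPICIOUS.search(value) is not None

print(has_suspicious_chars("PE\uFFFDCHES ET AQUACULTURE"))  # contains U+FFFD
print(has_suspicious_chars("PÊCHES ET AQUACULTURE"))        # properly encoded accent
```

- A properly encoded accented letter like `Ê` is not flagged, because the pattern only looks for the specific characters that tend to appear when text has been mangled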

- Testing the corrections for sixty-five items and sixteen deletions using my [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) and [delete-metadata-values.py](https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be) scripts:

```
$ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
$ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
```

- I applied them on DSpace Test and CGSpace and started a full Discovery re-index:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
```

- Peter had marked several terms with `||` to indicate multiple values in his corrections so I will have to go back and do those manually:

```
EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
MARKETING ET COMMERCE,MARKETING||COMMERCE
NATURAL RESOURCES AND ENVIRONMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
PÊCHES ET AQUACULTURE,PÊCHES||AQUACULTURE
PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
```
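
- A small Python sketch of how rows like these could be split into separate values before editing the items. It assumes a two-column "old,new" CSV without a header, as in the excerpt above; the function name is hypothetical and this is not part of `fix-metadata-values.py`:

```python
import csv
import io

def split_corrections(csv_text):
    """Yield (old_value, [new_values]) pairs from two-column 'old,new' correction rows."""
    for old, new in csv.reader(io.StringIO(csv_text)):
        # "||" separates multiple values within the correction column
        yield old, [part.strip() for part in new.split("||")]

rows = """FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
MARKETING AND TRADE,MARKETING||TRADE
"""
result = list(split_corrections(rows))
for old, new_values in result:
    print(old, "->", new_values)
```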

## 2019-02-06

- I dumped the CTA community so I can try to fix the subjects that Peter marked as having multiple values in his corrections:

```
$ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
```

- Then I used `csvcut` to get only the CTA subject columns:

```
$ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
```

- After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values
- Then I imported it back into CGSpace:

```
$ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
```

- Another day, another alert about high load on CGSpace (linode18) from Linode
- This time the load average was 370% and the top ten IPs before, during, and after that time were:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    689 35.237.175.180
   1236 5.9.6.51
   1305 34.218.226.147
   1580 66.249.66.219
   1939 50.116.102.77
   2313 108.212.105.35
   4666 205.186.128.185
   4666 70.32.83.92
   4950 85.25.237.71
   5158 45.5.186.2
```

- Looking closer at the top users, I see `45.5.186.2` is in Brazil and was making over 100 requests per minute to the REST API:

```
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
    118 06/Feb/2019:05:46
    119 06/Feb/2019:05:37
    119 06/Feb/2019:05:47
    120 06/Feb/2019:05:43
    120 06/Feb/2019:05:44
    121 06/Feb/2019:05:38
    122 06/Feb/2019:05:39
    125 06/Feb/2019:05:42
    126 06/Feb/2019:05:40
    126 06/Feb/2019:05:41
```

- I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
  10411 200
      1 301
      7 302
      3 404
     18 499
      2 500
```

- I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    328 220.247.212.35
    372 66.249.66.221
    380 207.46.13.2
    519 2a01:4f8:140:3192::2
    572 5.143.231.8
    689 35.237.175.180
    771 108.212.105.35
   1236 5.9.6.51
   1554 66.249.66.219
   4942 85.25.237.71
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     10 66.249.66.221
     26 66.249.66.219
     69 5.143.231.8
    340 45.5.184.72
   1040 34.218.226.147
   1542 108.212.105.35
   1937 50.116.102.77
   4661 205.186.128.185
   4661 70.32.83.92
   5102 45.5.186.2
```
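
- The web/API split could also be sketched in Python so both groups are counted in one pass. The groupings mirror the brace expansions in the commands above; the function names and sample entries are hypothetical, there is no time filter, and the sketch assumes every line starts with the client address:

```python
from collections import Counter

# Log name groupings taken from the two zcat commands above
WEB_LOGS = {"access", "error", "library-access"}
API_LOGS = {"oai", "rest", "statistics"}

def classify(log_name):
    """Map a file name like 'rest.log.1' to 'web', 'api', or None."""
    base = log_name.split(".log", 1)[0]
    if base in WEB_LOGS:
        return "web"
    if base in API_LOGS:
        return "api"
    return None

def top_ips_by_group(entries, n=10):
    """Count client addresses per group from (log_name, line) pairs."""
    counts = {"web": Counter(), "api": Counter()}
    for log_name, line in entries:
        group = classify(log_name)
        if group is None or not line.strip():
            continue
        counts[group][line.split()[0]] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}

# Made-up entries using documentation-only addresses
entries = [
    ("access.log", '192.0.2.1 - - [06/Feb/2019:05:01:00 +0100] "GET / HTTP/1.1" 200 1'),
    ("access.log.1", '192.0.2.1 - - [06/Feb/2019:05:02:00 +0100] "GET / HTTP/1.1" 200 1'),
    ("rest.log", '192.0.2.2 - - [06/Feb/2019:05:01:00 +0100] "GET /rest/items HTTP/1.1" 200 1'),
    ("oai.log", '192.0.2.2 - - [06/Feb/2019:05:03:00 +0100] "GET /oai/request HTTP/1.1" 200 1'),
    ("rest.log.1", '192.0.2.3 - - [06/Feb/2019:05:04:00 +0100] "GET /rest/items HTTP/1.1" 200 1'),
    ("other.log", "192.0.2.9 ignored because the log name is in neither group"),
]
print(top_ips_by_group(entries))
```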

## 2019-02-07

- Linode sent an alert last night that the load on CGSpace (linode18) was over 300%
- Here are the top IPs in the API and web server logs before, during, and after that time, respectively:

```
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 66.249.66.209
      6 2a01:4f8:210:51ef::2
      6 40.77.167.75
      9 104.198.9.108
      9 157.55.39.192
     10 157.55.39.244
     12 66.249.66.221
     20 95.108.181.88
     27 66.249.66.219
   2381 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    455 45.5.186.2
    506 40.77.167.75
    559 54.70.40.11
    825 157.55.39.244
    871 2a01:4f8:140:3192::2
    938 157.55.39.192
   1058 85.25.237.71
   1416 5.9.6.51
   1606 66.249.66.219
   1718 35.237.175.180
```

- Then again this morning another alert:

```
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 66.249.66.223
      8 104.198.9.108
     13 110.54.160.222
     24 66.249.66.219
     25 175.158.217.98
    214 34.218.226.147
    346 45.5.184.72
   4529 45.5.186.2
   4661 205.186.128.185
   4661 70.32.83.92
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    145 157.55.39.237
    154 66.249.66.221
    214 34.218.226.147
    261 35.237.175.180
    273 2a01:4f8:140:3192::2
    300 169.48.66.92
    487 5.143.231.39
    766 5.9.6.51
    771 85.25.237.71
    848 66.249.66.219
```

- So it seems that the load issue comes from the REST API, not the XMLUI
- I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don't get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)
- Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:

```
Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
```

- Collection 1056 appears to be [IITA Posters and Presentations](https://cgspace.cgiar.org/handle/10568/68741) and I see that its workflow step 1 (Accept/Reject) is empty:

![IITA Posters and Presentations workflow step 1 empty](/cgspace-notes/2019/02/iita-workflow-step1-empty.png)

- IITA editors or approvers should be added to that step (though I'm curious why nobody is in that group currently)
- Abenet says we are not using the "Accept/Reject" step so this group should be deleted
- Bizuwork asked about the "DSpace Submission Approved and Archived" emails that stopped working last month
- I tried the `test-email` command on DSpace and it indeed is not working:

```
$ dspace test-email

About to send test email:
 - To: aorth@mjanja.ch
 - Subject: DSpace test email
 - Server: smtp.serv.cgnet.com

Error sending email:
 - Error: javax.mail.MessagingException: Could not connect to SMTP host: smtp.serv.cgnet.com, port: 25;
  nested exception is:
        java.net.ConnectException: Connection refused (Connection refused)

Please see the DSpace documentation for assistance.
```

- I can't connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what's up
- CGNET said these servers were discontinued in 2018-01 and that I should use [Office 365](https://docs.microsoft.com/en-us/exchange/mail-flow-best-practices/how-to-set-up-a-multifunction-device-or-application-to-send-email-using-office-3)

## 2019-02-08

- I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the `test-email` script:

```
Error sending email:
 - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
```

- I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password

## 2019-02-09

- Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!
- This is just for this morning:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    289 35.237.175.180
    290 66.249.66.221
    296 18.195.78.144
    312 207.46.13.201
    393 207.46.13.64
    526 2a01:4f8:140:3192::2
    580 151.80.203.180
    742 5.143.231.38
   1046 5.9.6.51
   1331 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      4 66.249.83.30
      5 49.149.10.16
      8 207.46.13.64
      9 207.46.13.201
     11 105.63.86.154
     11 66.249.66.221
     31 66.249.66.219
    297 2001:41d0:d:1990::
    908 34.218.226.147
   1947 50.116.102.77
```

- I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot
- Ooh, but 151.80.203.180 is some malicious bot making requests for `/etc/passwd` like this:

```
/bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;isAllowed=../etc/passwd
```

- 151.80.203.180 is on OVH so I sent a message to their abuse email...

## 2019-02-10

- Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    232 18.195.78.144
    238 35.237.175.180
    281 66.249.66.221
    314 151.80.203.180
    319 34.218.226.147
    326 40.77.167.178
    352 157.55.39.149
    444 2a01:4f8:140:3192::2
   1171 5.9.6.51
   1196 66.249.66.219
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      6 112.203.241.69
      7 157.55.39.149
      9 40.77.167.178
     15 66.249.66.219
    368 45.5.184.72
    432 50.116.102.77
    971 34.218.226.147
   4403 45.5.186.2
   4668 205.186.128.185
   4668 70.32.83.92
```

- Another interesting thing might be the total number of requests for web and API services during that time:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
16333
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
15964
```

- Also, the number of unique IPs served during that time:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
1622
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
95
```
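
- A quick back-of-the-envelope calculation with the four numbers above gives the average number of requests per unique IP for each group (averages only; the variable names are mine):

```python
# Totals and unique IP counts from the commands above
web_requests, web_ips = 16333, 1622
api_requests, api_ips = 15964, 95

web_per_ip = web_requests / web_ips
api_per_ip = api_requests / api_ips

print(f"web: {web_per_ip:.1f} requests per IP")  # web: 10.1 requests per IP
print(f"API: {api_per_ip:.1f} requests per IP")  # API: 168.0 requests per IP
```

- So roughly the same number of requests came from about seventeen times fewer clients on the API side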

- It's very clear to me now that the API requests are the heaviest!
- I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it's becoming a bit of *the boy who cried wolf* because it alerts like clockwork twice per day!
- Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository ([#408](https://github.com/ilri/DSpace/pull/408)) so I can track changes and distribute them more formally instead of just keeping them [collected on the wiki](https://github.com/ilri/DSpace/wiki/Scripts)
- Started adding IITA research theme (`cg.identifier.iitatheme`) to CGSpace
  - I'm still waiting for feedback from IITA whether they actually want to use "SOCIAL SCIENCE & AGRIC BUSINESS" because it is listed as ["Social Science and Agribusiness"](http://www.iita.org/project-discipline/social-science-and-agribusiness/) on their website
  - Also, I think they want to do some mappings of items with existing subjects to these new themes
- Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) ([#409](https://github.com/ilri/DSpace/pull/409))
  - I'm still waiting to hear from Bizuwork whether we'll batch update all existing items with the old name style
  - No, there is only one entry and Bizu already fixed it
- Last week Hector Tobon from CCAFS asked me about the Creative Commons 3.0 Intergovernmental Organizations (IGO) license because it is not in the list of SPDX licenses
  - Today I made [a request](http://13.57.134.254/app/license_requests/15/) to the [SPDX using their web form](https://github.com/spdx/license-list-XML/blob/master/CONTRIBUTING.md) to include this [class of Creative Commons licenses](https://wiki.creativecommons.org/wiki/Intergovernmental_Organizations)
- Testing the `mail.server.disabled` property that I noticed in `dspace.cfg` recently
  - Setting it to true results in the following message when I try the `dspace test-email` helper on DSpace Test:

```
Error sending email:
 - Error: cannot test email because mail.server.disabled is set to true
```

- I'm not sure why I didn't know about this configuration option before; until now I have always maintained separate configurations for development and production
  - I will modify the [Ansible DSpace role](https://github.com/ilri/rmg-ansible-public) to use this in its `build.properties` template
- I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:

```
# docker rm nexus
# docker pull sonatype/nexus3
# mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
# chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
# docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus-data -p 8081:8081 sonatype/nexus3
```

- For some reason my `mvn package` for DSpace is not working now... I might go back to [using Artifactory for caching](https://mjanja.ch/2018/02/cache-maven-artifacts-with-artifactory/) instead:

```
# docker pull docker.bintray.io/jfrog/artifactory-oss:latest
# mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
# chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
# docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
```

## 2019-02-11

- Bosede from IITA said we can use "SOCIAL SCIENCE & AGRIBUSINESS" in their new IITA theme field to be consistent with other places they are using it
- Run all system updates on DSpace Test (linode19) and reboot it

## 2019-02-12

- I notice that [DSpace 6 has included a new JAR-based PDF thumbnailer based on PDFBox](https://jira.duraspace.org/browse/DS-3052), I wonder how good its thumbnails are and how it handles CMYK PDFs
- On a similar note, I wonder if we could use the performance-focused [libvips](https://libvips.github.io/libvips/) and the third-party [jlibvips Java library](https://github.com/codecitizen/jlibvips/) in DSpace
- Testing the `vipsthumbnail` command line tool with [this CGSpace item that uses CMYK](https://cgspace.cgiar.org/handle/10568/51999):

```
$ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
```

- (DSpace 5 appears to use JPEG 92 quality so I do the same)
- Thinking about making "top items" endpoints in my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api)
- I could use the following SQL queries very easily to get the top items by views or downloads:

```
dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
```

- I'd have to think about what to make the REST API endpoints, perhaps: `/statistics/top/items?limit=10`
- But how do I do top items by views / downloads separately?
- I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly
  - The quality is JPEG 75 and I don't see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:

```
$ identify -verbose alc_contrastes_desafios.pdf.jpg
...
  Colorspace: sRGB
```

- I will read the PDFBox thumbnailer documentation to see if I can change the size and quality
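- On the "top items" question from earlier: one possible answer is a single endpoint with a `metric` parameter, for example `/statistics/top/items?metric=views&limit=10` (a hypothetical variation on the URL above). A minimal sketch of the query side, using an in-memory SQLite database as a stand-in for PostgreSQL; the `id` column and the function name are assumptions, only `views` and `downloads` come from the queries above:

```python
import sqlite3

# Only these column names may be used as the sort metric
ALLOWED_METRICS = {"views", "downloads"}

def top_items(conn, metric, limit=10):
    """Return the top rows from the items table ordered by the given metric."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric}")
    # The column name comes from the whitelist above, so interpolating it is safe;
    # the limit is passed as a bound parameter
    query = (
        f"SELECT id, views, downloads FROM items "
        f"WHERE {metric} > 0 ORDER BY {metric} DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()

# Stand-in database with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INTEGER, downloads INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [(1, 50, 2), (2, 10, 30), (3, 0, 0)])

print(top_items(conn, "views", limit=2))
print(top_items(conn, "downloads", limit=2))
```

- Column names cannot be bound as query parameters, which is why the sketch validates the metric against a fixed set before building the SQL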

## 2019-02-13

- ILRI ICT reset the password for the CGSpace mail account, but I still can't get it to send mail from DSpace's `test-email` utility
- I even added extra mail properties to `dspace.cfg` as suggested by someone on the dspace-tech mailing list:

```
mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
```

- But the result is still:

```
Error sending email:
 - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
```

- I tried to log into the Outlook 365 web mail and it doesn't work so I've emailed ILRI ICT again
- After reading the [common mistakes in the JavaMail FAQ](https://javaee.github.io/javamail/FAQ#commonmistakes) I reconfigured the extra properties in DSpace's mail configuration to be simply:

```
mail.extraproperties = mail.smtp.starttls.enable=true
```

- ... and then I was able to send a mail using my personal account where I know the credentials work
- The CGSpace account still gets this error message:

```
Error sending email:
 - Error: javax.mail.AuthenticationFailedException
```

- I updated the [DSpace SMTP settings in `dspace.cfg`](https://github.com/ilri/DSpace/pull/410) as well as the [variables in the DSpace role of the Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public/commit/ab5fe4d10e16413cd04ffb1bc3179dc970d6d47c)
- Thierry from CTA is having issues with his account on DSpace Test, and there is no admin password reset function on DSpace (only via email, which is disabled on DSpace Test), so I have to delete and re-create his account:

```
$ dspace user --delete --email blah@cta.int
$ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
```

- On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable `webui.user.assumelogin = true`
- I will enable this on CGSpace ([#411](https://github.com/ilri/DSpace/pull/411))
- Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):

```
# podman pull postgres:9.6-alpine
# podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
# podman pull docker.bintray.io/jfrog/artifactory-oss
# podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
```

- Totally works... awesome!
- Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:

```
$ sudo touch /etc/subuid /etc/subgid
$ sudo usermod --add-subuids 10000-75535 aorth
$ sudo usermod --add-subgids 10000-75535 aorth
$ sudo sysctl kernel.unprivileged_userns_clone=1
$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
```

- Which totally works, but Podman's rootless support doesn't work with port mappings yet...
- Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:

```
# systemctl stop tomcat7
# apt remove tomcat7 tomcat7-admin
# useradd -m -r -s /bin/bash dspace
# mv /usr/share/tomcat7/.m2 /home/dspace
# mv /usr/share/tomcat7/src /home/dspace
# chown -R dspace:dspace /home/dspace
# chown -R dspace:dspace /home/cgspace.cgiar.org
# dpkg -P tomcat7-admin tomcat7-common
```

- After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:

```
2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
```

- The issue last month was address space, which is now set as `LimitAS=infinity` in `tomcat7.service`...
- I re-ran the Ansible playbook to make sure all configs etc. were in place, then rebooted the server
- Still the error persists after reboot
- I will try to stop Tomcat and then remove the locks manually:

```
# find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
```

- After restarting Tomcat the usage statistics are back
- Interestingly, many of the locks were from last month, last year, and even 2015! I'm pretty sure that's not supposed to be how locks work...
- Help Sarah Kasyoka finish an item submission that she was having issues with due to the file size
- I increased the nginx upload limit, but she said she was having problems and couldn't really tell me why
- I logged in as her and completed the submission with no problems...

## 2019-02-15

- Tomcat was killed around 3AM by the kernel's OOM killer according to `dmesg`:

```
[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
[Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

- The `tomcat7` service shows:

```
Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
```

- I suspect it was related to the media-filter cron job that runs at 3AM but I don't see anything particular in the log files
- I want to try to normalize the `text_lang` values to make working with metadata easier
- We currently have a bunch of weird values that DSpace uses like `NULL`, `en_US`, and `en` and others that have been entered manually by editors:

```
dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
           | 1069539
 en_US     |  577110
           |  334768
 en        |  133501
 es        |      12
 *         |      11
 es_ES     |       2
 fr        |       2
 spa       |       2
 E.        |       1
 ethnob    |       1
```

- The majority are `NULL`, `en_US`, the blank string, and `en`—the rest are not enough to be significant
- Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!
- I'm going to normalize these to `NULL` at least on DSpace Test for now:

```
dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
```
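
- The `UPDATE 1045410` count can be cross-checked against the earlier `GROUP BY` output: it should equal the sum of every row except the `NULL` one. A small Python check, assuming the first blank row (1069539) is `NULL` and the second (334768) is the empty string:

```python
# Counts copied from the GROUP BY output above
counts = {
    None: 1069539,   # assumed to be the NULL row
    "en_US": 577110,
    "": 334768,      # assumed to be the blank-string row
    "en": 133501,
    "es": 12,
    "*": 11,
    "es_ES": 2,
    "fr": 2,
    "spa": 2,
    "E.": 1,
    "ethnob": 1,
}

# Rows matched by "text_lang IS NOT NULL"
non_null = sum(n for lang, n in counts.items() if lang is not None)
print(non_null)  # 1045410
```

- The sum only matches the `UPDATE` count with the blank rows assigned this way round, which confirms that the larger blank row is the `NULL` one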

- I started proofing IITA's 2019-01 records that Sisay uploaded this week
  - There were 259 records in IITA's original spreadsheet, but there are 276 in Sisay's collection
  - Also, I found that there are at least twenty duplicates in these records that we will need to address
- ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works
- Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman's volumes:

```
$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest dspace_2019-02-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
```

- And it's all running without root!
- Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:

```
$ podman volume create artifactory_data
artifactory_data
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
$ buildah unshare
$ chown -R 1030:1030 ~/.local/share/containers/storage/volumes/artifactory_data
$ exit
$ podman start artifactory
```

- More on the [subuid permissions issue with rootless containers here](https://podman.io/blogs/2018/10/03/podman-remove-content-homedir.html)

## 2019-02-17

- I ran DSpace's cleanup task on CGSpace (linode18) and there were errors:

```
$ dspace cleanup -v
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
```

- The solution is, as always:

```
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
UPDATE 1
```

- I merged the Atmire Metadata Quality Module (MQM) changes to the `5_x-prod` branch and deployed it on CGSpace ([#407](https://github.com/ilri/DSpace/pull/407))
- Then I ran all system updates on CGSpace server and rebooted it

## 2019-02-18

- Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time)
- There seems to have been a lot of activity in XMLUI:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1236 18.212.208.240
   1276 54.164.83.99
   1277 3.83.14.11
   1282 3.80.196.188
   1296 3.84.172.18
   1299 100.24.48.177
   1299 34.230.15.139
   1327 52.54.252.47
   1477 5.9.6.51
   1861 94.71.244.172
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      8 42.112.238.64
      9 121.52.152.3
      9 157.55.39.50
     10 110.54.151.102
     10 194.246.119.6
     10 66.249.66.221
     15 190.56.193.94
     28 66.249.66.219
     43 34.209.213.122
    178 50.116.102.77
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
2727
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq | wc -l
186
```
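
- For further processing, the `awk | sort | uniq -c | sort -n | tail` counting above can be roughly sketched in Python; this is only an illustration, where the function name `top_ips` and the sample log lines are made up, and it assumes the client IP is the first whitespace-separated field of each line:

```python
from collections import Counter


def top_ips(lines, hour_prefixes, n=10):
    """Count requests per client IP for lines containing any timestamp prefix.

    Assumes the IP is the first whitespace-separated field of each line.
    """
    counts = Counter()
    for line in lines:
        if any(prefix in line for prefix in hour_prefixes):
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    # most_common() sorts by descending count, the reverse of `sort -n | tail`
    return counts.most_common(n)


# Hypothetical sample lines in nginx's combined log format
sample = [
    '94.71.244.172 - - [18/Feb/2019:12:01:02 +0100] "GET /handle/10568/1 HTTP/1.1" 200 512 "-" "Indy Library"',
    '94.71.244.172 - - [18/Feb/2019:13:05:09 +0100] "GET /handle/10568/2 HTTP/1.1" 200 512 "-" "Indy Library"',
    '5.9.6.51 - - [18/Feb/2019:12:30:00 +0100] "GET / HTTP/1.1" 200 512 "-" "bot"',
    '5.9.6.51 - - [18/Feb/2019:20:30:00 +0100] "GET / HTTP/1.1" 200 512 "-" "bot"',
]
prefixes = ["18/Feb/2019:1%d" % hour for hour in range(2, 7)]  # hours 12 to 16
print(top_ips(sample, prefixes))
```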

- 94.71.244.172 is in Greece and uses the user agent "Indy Library"
- At least they are re-using their Tomcat session:

```
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
```

- The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0":
  - 52.54.252.47
  - 34.230.15.139
  - 100.24.48.177
  - 3.84.172.18
  - 3.80.196.188
  - 3.83.14.11
  - 54.164.83.99
  - 18.212.208.240

- Actually, even up to the top 30 IPs are almost all on Amazon and use the same user agent!
- For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
   1173 52.91.249.23
   1176 107.22.118.106
   1178 3.88.173.152
   1179 3.81.136.184
   1183 34.201.220.164
   1183 3.89.134.93
   1184 54.162.66.53
   1187 3.84.62.209
   1188 3.87.4.140
   1189 54.158.27.198
   1190 54.209.39.13
   1192 54.82.238.223
   1208 3.82.232.144
   1209 3.80.128.247
   1214 54.167.64.164
   1219 3.91.17.126
   1220 34.201.108.226
   1221 3.84.223.134
   1222 18.206.155.14
   1231 54.210.125.13
   1236 18.212.208.240
   1276 54.164.83.99
   1277 3.83.14.11
   1282 3.80.196.188
   1296 3.84.172.18
   1299 100.24.48.177
   1299 34.230.15.139
   1327 52.54.252.47
   1477 5.9.6.51
   1861 94.71.244.172
```

- In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
     10 18/Feb/2019:17:20
     10 18/Feb/2019:17:22
     10 18/Feb/2019:17:31
     11 18/Feb/2019:13:21
     11 18/Feb/2019:15:18
     11 18/Feb/2019:16:43
     11 18/Feb/2019:16:57
     11 18/Feb/2019:16:58
     11 18/Feb/2019:18:34
     12 18/Feb/2019:14:37
```
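
- A rough Python equivalent of the per-minute count above, as a sketch only: the function name and sample lines are hypothetical, and it assumes nginx's combined log format with the IP as the first field and the timestamp in square brackets. Unlike `grep 52.54.252.47`, it compares the whole IP field, so similar-looking addresses are not counted:

```python
import re
from collections import Counter

# Captures "dd/Mon/yyyy:HH:MM" from a timestamp like [18/Feb/2019:14:37:01 +0100]
MINUTE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})")


def requests_per_minute(lines, ip):
    """Return a Counter mapping each minute to the request count for one IP."""
    counts = Counter()
    for line in lines:
        # Compare the whole first field so 52.54.252.47 does not also match 152.54.252.47
        fields = line.split(None, 1)
        if not fields or fields[0] != ip:
            continue
        match = MINUTE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines
sample = [
    '52.54.252.47 - - [18/Feb/2019:14:37:01 +0100] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
    '52.54.252.47 - - [18/Feb/2019:14:37:45 +0100] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
    '52.54.252.47 - - [18/Feb/2019:14:38:02 +0100] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
    '152.54.252.47 - - [18/Feb/2019:14:37:09 +0100] "GET / HTTP/1.1" 200 1 "-" "Mozilla/5.0"',
]
per_minute = requests_per_minute(sample, "52.54.252.47")
print(per_minute.most_common(10))
```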

- As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics
- There were more than 92,000 requests from these IPs alone today:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
92756
```

- I will add this user agent to the ["badbots" rate limiting in our nginx configuration](https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/nginx/default.conf.j2)
- I realized that I had effectively only been applying the "badbots" rate limiting to requests at the root, so I added it to the other blocks that match Discovery, Browse, etc as well
- IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
- I will merge them with our existing list and then resolve their names using my `resolve-orcids.py` script:

```
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-02-18-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
```
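
- The extract-and-deduplicate step of that pipeline can be sketched in Python as well; this is a minimal illustration that reuses the same pattern as the `grep -oE` above, with a made-up function name, a made-up vocabulary snippet, and one hypothetical identifier (`0000-0001-2345-6789`):

```python
import re

# Same pattern as the grep above: four dash-separated groups of four characters
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def combine_orcids(*texts):
    """Return the sorted, de-duplicated ORCID identifiers found in the texts."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Illustrative inputs only, not the real vocabulary or IWMI list
existing = '<node id="Alan S. Orth: 0000-0002-1735-7458" label="Alan S. Orth: 0000-0002-1735-7458"/>'
incoming = "0000-0002-1735-7458\n0000-0001-2345-6789\n"
combined = combine_orcids(existing, incoming)
print("\n".join(combined))
```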

- I merged the changes to the `5_x-prod` branch and they will go live the next time we re-deploy CGSpace ([#412](https://github.com/ilri/DSpace/pull/412))

## 2019-02-19

- Linode sent another alert about CPU usage on CGSpace (linode18) averaging 417% this morning
- Unfortunately, I don't see any strange activity in the web server API or XMLUI logs at that time in particular
- So far today the top ten IPs in the XMLUI logs are:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
  11541 18.212.208.240
  11560 3.81.136.184
  11562 3.88.237.84
  11569 34.230.15.139
  11572 3.80.128.247
  11573 3.91.17.126
  11586 54.82.89.217
  11610 54.209.39.13
  11657 54.175.90.13
  14686 143.233.242.130
```

- 143.233.242.130 is in Greece and using the user agent "Indy Library", like the top IP yesterday (94.71.244.172)
- That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don't know whether DSpace recognizes it as a bot, so the logs are probably skewed because of this
- The user is requesting only things like `/handle/10568/56199?show=full` so it's nothing malicious, only annoying
- Otherwise there are still shit loads of IPs from Amazon hammering the server, though I see HTTP 503 errors now after yesterday's nginx rate limiting updates
  - I should really try to script something around [ipapi.co](https://ipapi.co/api/) to look up who owns these IPs quickly and easily
- The top IPs in the API logs today are:

```
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     42 66.249.66.221
     44 156.156.81.215
     55 3.85.54.129
     76 66.249.66.219
     87 34.209.213.122
   1550 34.218.226.147
   2127 50.116.102.77
   4684 205.186.128.185
  11429 45.5.186.2
  12360 2a01:7e00::f03c:91ff:fe0a:d645
```

- `2a01:7e00::f03c:91ff:fe0a:d645` is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester...
- Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I'm so fucking sick of this
- Our usage stats have exploded the last few months:

![Usage stats](/cgspace-notes/2019/02/usage-stats.png)

- I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate
- I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from [10568/96140](https://hdl.handle.net/10568/96140) almost 200 times:

```
# grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
185
```

- Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:

```
# grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
346
```

- In the last two days alone there were more than 1,000 requests for this PDF, mostly from Nigeria!

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
      1 139.162.146.60
      1 157.55.39.159
      1 196.188.127.94
      1 196.190.127.16
      1 197.183.33.222
      1 66.249.66.221
      2 104.237.146.139
      2 175.158.209.61
      2 196.190.63.120
      2 196.191.127.118
      2 213.55.99.121
      2 82.145.223.103
      3 197.250.96.248
      4 196.191.127.125
      4 197.156.77.24
      5 105.112.75.237
    185 41.190.30.105
    346 41.190.3.229
    503 41.190.31.73
```

- That is so weird, they are all using this Android user agent:

```
Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
```

- I wrote a quick and dirty Python script called `resolve-addresses.py` to resolve IP addresses to their owning organization's name, ASN, and country using the [IPAPI.co API](https://ipapi.co)
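- The script itself is not shown here, so the following is only a sketch of how such a lookup might work, not the actual `resolve-addresses.py`; the function names are hypothetical, and the URL pattern and the `org`, `asn`, and `country_name` field names are assumptions about the ipapi.co JSON response that should be checked against its documentation:

```python
import json
import urllib.request


def ipapi_url(ip):
    """Build the (assumed) ipapi.co JSON lookup URL for an IP address."""
    return "https://ipapi.co/%s/json/" % ip


def summarize(record):
    """Pick organization, ASN, and country out of a parsed response dict."""
    return {
        "ip": record.get("ip"),
        "org": record.get("org"),
        "asn": record.get("asn"),
        "country": record.get("country_name"),
    }


def resolve(ip, timeout=10):
    """Fetch and summarize one address (performs a network request)."""
    with urllib.request.urlopen(ipapi_url(ip), timeout=timeout) as response:
        return summarize(json.load(response))


# Hypothetical response using documentation-reserved example values
example = {"ip": "192.0.2.1", "org": "EXAMPLE-ORG", "asn": "AS64500", "country_name": "Exampleland"}
print(summarize(example))
```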

## 2019-02-20

- Ben Hack was asking about getting authors' publications programmatically from CGSpace for the new ILRI website
- I told him that they should probably try to use the REST API's `find-by-metadata-field` endpoint
- The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:

```
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": null}'
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": "en_US"}'
```
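
- Because the language has to match exactly, a client would probably need to try each variant and merge the results; a minimal Python sketch of the same three requests follows, where the function names are made up and the assumption that each returned item has a unique `handle` key is mine:

```python
import json
import urllib.request

ENDPOINT = "https://cgspace.cgiar.org/rest/items/find-by-metadata-field"


def build_request(key, value, language):
    """Build the POST request for one key/value/language combination."""
    payload = json.dumps({"key": key, "value": value, "language": language})
    return urllib.request.Request(
        ENDPOINT,
        data=payload.encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def find_items(key, value, languages=("", None, "en_US")):
    """Query once per language variant and merge results (network requests)."""
    items = {}
    for language in languages:
        with urllib.request.urlopen(build_request(key, value, language)) as response:
            for item in json.load(response):
                # Assumes each returned item carries a unique "handle"
                items[item["handle"]] = item
    return list(items.values())


request = build_request("cg.creator.id", "Alan S. Orth: 0000-0002-1735-7458", None)
print(request.data.decode("utf-8"))
```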

- This returns six items for me, which is the [same I see in a Discovery search](https://cgspace.cgiar.org/discover?filtertype_1=orcid&filter_relational_operator_1=contains&filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&submit_apply_filter=&query=)
- Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api)
- I was playing with [YasGUI](http://yasgui.org/) to query AGROVOC's SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually
- I think I want to stick to the regular [web services](http://aims.fao.org/agrovoc/webservices) to validate AGROVOC terms

![YasGUI querying AGROVOC](/cgspace-notes/2019/02/yasgui-agrovoc.png)

- There seems to be a REST API for AGROVOC here: http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=FISH&lang=en
- See this [issue on the VIVO tracker](https://jira.duraspace.org/browse/VIVO-1655) for more information about this endpoint
- The old-school AGROVOC SOAP WSDL works with the [Zeep Python library](https://python-zeep.readthedocs.io/en/master/), but in my tests the results are way too broad despite trying to use an "exact match" search

## 2019-02-21

- I wrote a script [agrovoc-lookup.py](https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py) to resolve subject terms against the public AGROVOC REST API
- It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
- I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:

```
$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
```

- Then I generated a list of all the unique matched terms:

```
$ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
```

- And then a list of all the unique *unmatched* terms using some utility I've never heard of before called `comm` or with `diff`:

```
$ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
$ diff --new-line-format="" --unchanged-line-format="" /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt > /tmp/2019-02-21-unmatched-subjects.txt
```
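
- The same "in the full list but not in the matched list" logic is a set difference, sketched here in Python with a made-up function name and made-up sample terms; unlike `comm`, it does not need sorted input, and it also collapses duplicate lines:

```python
def unmatched_terms(all_terms, matched_terms):
    """Return the sorted terms that are in all_terms but not in matched_terms."""
    matched = {term.strip() for term in matched_terms}
    return sorted({term.strip() for term in all_terms} - matched)


# Hypothetical sample data, not real lookup results
subjects = ["MAIZE\n", "CORN\n", "LIVESTOCK\n", "CORN\n"]
matched = ["LIVESTOCK\n", "MAIZE\n"]
print(unmatched_terms(subjects, matched))
```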

- Generate a list of countries and regions from CGSpace for Sisay to look through:

```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
```

- I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it's almost ready so I created a pull request ([#413](https://github.com/ilri/DSpace/pull/413))
- I still need to test the batch tagging of IITA items with themes based on their IITA subjects:
  - NATURAL RESOURCE MANAGEMENT research theme to items with NATURAL RESOURCE MANAGEMENT subject
  - BIOTECH & PLANT BREEDING research theme to items with PLANT BREEDING subject
  - SOCIAL SCIENCE & AGRIBUSINESS research theme to items with AGRIBUSINESS subject
  - PLANT PRODUCTION & HEALTH research theme to items with PLANT PRODUCTION subject
  - PLANT PRODUCTION & HEALTH research theme to items with PLANT HEALTH subject
  - NUTRITION & HUMAN HEALTH research theme to items with NUTRITION subject

## 2019-02-22

- Help Udana from WLE with some issues related to CGSpace items on their [Publications website](https://www.wle.cgiar.org/publications)
  - He wanted some IWMI items to show up in their publications website
  - The items were mapped into WLE collections, but still weren't showing up on the publications website
  - I told him that he needs to add the `cg.identifier.wletheme` to the items so that the website indexer finds them
  - A few days ago he added the metadata to [10568/93011](https://cgspace.cgiar.org/handle/10568/93011) and now I see that the item is present on the [WLE publications website](https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income)
- Start looking at IITA's latest round of batch uploads called ["IITA_Feb_14" on DSpace Test](https://dspacetest.cgiar.org/handle/10568/108684)
  - One misspelled authorship type
  - A few dozen incorrect or inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)
  - One issue with smart quotes in countries
  - A few IITA subjects with syntax errors
  - Some whitespace and consistency issues in sponsorships
  - Eight items with invalid ISBN: 0-471-98560-3
  - Two incorrectly formatted ISSNs
  - Lots of incorrect values in subjects, but that's a difficult problem to solve in an automated way
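- For reference, the ISBN flagged above fails the standard ISBN-10 checksum: its weighted digit sum is 234, which is not divisible by 11. A minimal Python sketch of that check, with a made-up function name:

```python
def is_valid_isbn10(isbn):
    """Check an ISBN-10 using the standard weighted mod-11 checksum."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            digit = 10  # "X" is only allowed as the check character
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        # Weights run from 10 for the first character down to 1 for the last
        total += (10 - position) * digit
    return total % 11 == 0


print(is_valid_isbn10("0-471-98560-3"))  # the ISBN flagged above
print(is_valid_isbn10("0-306-40615-2"))  # a commonly cited valid ISBN-10
```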

- I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:

```
import json
import re
import urllib
import urllib2

pattern = re.compile('^S[A-Z ]+$')
if pattern.match(value):
  url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en'
  get = urllib2.urlopen(url)
  data = json.load(get)
  if len(data['results']) == 1:
    return "matched"

return "unmatched"
```

- You have to make sure to URL encode the value with `quote_plus()` and it totally works, but it seems to refresh the facets (and therefore re-query everything) when you select a facet so that makes it basically unusable
- There is a [good resource discussing OpenRefine, Jython, and web scraping](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json)

## 2019-02-24

- I decided to try to validate the AGROVOC subjects in IITA's recent batch upload by dumping all their terms, checking them in en/es/fr with `agrovoc-lookup.py`, then reconciling against the final list using reconcile-csv with OpenRefine
- I'm not sure how to deal with terms like "CORN" that are alternative labels (`altLabel`) in AGROVOC where the preferred label (`prefLabel`) would be "MAIZE"
- For example, [a query](http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&lang=en) for `CORN*` returns:

```
    "results": [
        {
            "altLabel": "corn (maize)",
            "lang": "en",
            "prefLabel": "maize",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
            "vocab": "agrovoc"
        },
```
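
- One possible way to handle such terms is to report whether a term matched a preferred label, only an alternative label, or nothing; the sketch below works on already-parsed `results`, uses a made-up function name, and its rule of ignoring a trailing parenthetical qualifier such as "(maize)" is purely my assumption:

```python
def classify(term, results):
    """Classify a term against parsed AGROVOC search results.

    Returns "prefLabel" for an exact preferred-label match, "altLabel" when the
    term only matches an alternative label, and "unmatched" otherwise.
    """
    wanted = term.strip().lower()
    alt_hit = False
    for result in results:
        if result.get("prefLabel", "").lower() == wanted:
            return "prefLabel"
        alt = result.get("altLabel", "").lower()
        # Treat "corn (maize)" as an alternative label for "corn" (an assumption)
        if alt == wanted or alt.startswith(wanted + " ("):
            alt_hit = True
    return "altLabel" if alt_hit else "unmatched"


# The single result shown in the response above
results = [
    {
        "altLabel": "corn (maize)",
        "lang": "en",
        "prefLabel": "maize",
        "type": ["skos:Concept"],
        "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
        "vocab": "agrovoc",
    }
]
print(classify("CORN", results), classify("MAIZE", results))
```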

- There are dozens of other entries like "corn (soft wheat)", "corn (zea)", "corn bran", "Cornales", etc that could potentially match, and determining programmatically whether they are related is difficult
- Shit, and then there are terms like "GENETIC DIVERSITY" that should [technically be](http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/page/c_33952) "genetic diversity (as resource)"
- I applied all changes to the IITA Feb 14 batch data except the affiliations and sponsorships because I think I made some mistakes with the copying of reconciled values so I will try to look at those again separately
- I went back and re-did the affiliations and sponsorships and then applied them on the IITA Feb 14 collection on DSpace Test
- I did a duplicate check of the IITA Feb 14 records on DSpace Test and there were about fifteen or twenty items reported
  - A few of them are actually in previous IITA batch updates, which means they have not been uploaded to CGSpace yet, so I worry that there would be many more
  - I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I'm not sure I can because the Earlham guys are still testing COPO actively on DSpace Test

## 2019-02-25

- There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year
- I see some errors started recently in Solr (yesterday):

```
$ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
/home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-13.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-14.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-15.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-16.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-17.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-18.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-19.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-20.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-21.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-22.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-23.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-24:34
```
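
- Note that plain `grep` does not decompress `.xz` files, so the zero counts for the rotated logs above may simply reflect the compression (`xzgrep` searches inside them); a small Python sketch that counts matching lines in both plain and compressed logs, with a made-up function name and throwaway sample files:

```python
import lzma
import os
import tempfile


def count_errors(path, needle="ERROR"):
    """Count lines containing needle, transparently reading .xz-compressed logs."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


# Write the same hypothetical log text to a plain file and an .xz file
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "solr.log")
    packed = os.path.join(tmp, "solr.log.xz")
    text = "INFO ok\nERROR one\nERROR two\n"
    with open(plain, "w", encoding="utf-8") as handle:
        handle.write(text)
    with lzma.open(packed, "wt", encoding="utf-8") as handle:
        handle.write(text)
    counts = (count_errors(plain), count_errors(packed))
print(counts)
```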

- But I don't see anything interesting in yesterday's Solr log...
- I see this in the Tomcat 7 logs yesterday:

```
Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger.update(SourceFile:1225)
Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger.update(SourceFile:1220)
Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:103)
...
```

- In the Solr admin GUI I see we have the following error: "statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher"
- I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:

```
Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:561)
Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.lang.Throwable.readObject(Throwable.java:914)
Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
```

- I don't think that's related...
- Also, now the Solr admin UI says "statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher"
- In the Solr log I see:

```
2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
...
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
        ... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:89)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:753)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
        at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
        at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
        at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
        ... 33 more
2019-02-25 21:38:14,250 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2015': Unable to create core [statistics-2015] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2015/data/index/write.lock
```

- I tried to shut down Tomcat and remove the locks:

```
# systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
# systemctl start tomcat7
```

- ... but the problem still occurs
- I can see that there are still hits being recorded for items (in the Solr admin UI as well as my statistics API), so the main stats core is working at least!
- On a hunch I tried adding `ulimit -v unlimited` to the Tomcat `catalina.sh` and now Solr starts up with no core errors and I actually have statistics for January and February on [some communities](https://cgspace.cgiar.org/handle/10568/16814), but not [others](https://cgspace.cgiar.org/handle/10568/1)
- I wonder if the address space limits that I added via `LimitAS=infinity` in the systemd service are somehow not working?
- I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the `LimitAS` setting does work, and the `infinity` setting in systemd does get translated to "unlimited" on the service
- I thought it might be open file limit, but it seems we're nowhere near the current limit of 16384:

```
# lsof -u dspace | wc -l
3016
```

- For what it's worth I see the same errors about `solr_update_time_stamp` on DSpace Test (linode19)
- Update DSpace Test to [Tomcat 7.0.93](https://tomcat.apache.org/tomcat-7.0-doc/changelog.html#Tomcat_7.0.93_(violetagg))
- Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February

![CGSpace statlets working again](/cgspace-notes/2019/02/statlets-working.png)

- I still have not figured out what the *real* reason for the Solr cores failing to load was, though

## 2019-02-26

- I sent a mail to the dspace-tech mailing list about the "solr_update_time_stamp" error
- A CCAFS user sent a message saying they got this error when submitting to CGSpace:

```
Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
```

- According to the [REST API](https://cgspace.cgiar.org/rest/collections/1021) collection 1021 appears to be [CCAFS Tools, Maps, Datasets and Models](https://cgspace.cgiar.org/handle/10568/66581)
- I looked at the `WORKFLOW_STEP_1` (Accept/Reject) and the group is of course empty
- As we've seen several times recently, we are not using this step so it should simply be deleted

## 2019-02-27

- Discuss batch uploads with Sisay
- He's trying to upload some CTA records, but it's not possible to do collection mapping when using the web UI
  - I sent a mail to the dspace-tech mailing list to ask about the inability to perform mappings when uploading via the XMLUI batch upload
- He asked me to upload the files for him via the command line, but the file he referenced (`Thumbnails_feb_2019.zip`) doesn't exist
- I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file's name:

```
$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
```

- Why don't they just derive the directory from the path to the zip file?
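- A tiny wrapper could do that derivation; this is a hypothetical helper that only builds the argument list from the flags used above (`-a`, `-e`, `-m`, `-s`, `-z`) and does not run anything:

```python
import os


def import_args(zip_path, eperson, mapfile):
    """Build `dspace import` arguments, deriving -s and -z from one zip path."""
    source_dir, zip_name = os.path.split(os.path.abspath(zip_path))
    return ["dspace", "import", "-a", "-e", eperson, "-m", mapfile,
            "-s", source_dir, "-z", zip_name]


# The e-mail address here is a placeholder
args = import_args(
    "/home/aorth/Downloads/2019-02-27-test/SimpleArchiveFormat.zip",
    "user@example.com",
    "mapfile",
)
print(" ".join(args))
```
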
- Working on Udana's Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
  - I also added a few regions because they are obvious for the countries
  - Also I added some rights fields that I noticed were easily available from the publications pages
  - I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn't find any
  - I uploaded fifty-two records to the [Restoring Degraded Landscapes collection](https://cgspace.cgiar.org/handle/10568/81592) on CGSpace

## 2019-02-28

- I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface):

```
$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
```

- Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out *sigh*
- Now I'm getting this message when trying to use DSpace's `test-email` script:

```
$ dspace test-email

About to send test email:
 - To: stfu@google.com
 - Subject: DSpace test email
 - Server: smtp.office365.com

Error sending email:
 - Error: javax.mail.AuthenticationFailedException

Please see the DSpace documentation for assistance.
```

- I've tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working
- I sent a mail to ILRI ICT to check whether we're locked out or they reset the password again

<!-- vim: set sw=2 ts=2: -->