cgspace-notes/content/posts/2020-03.md

---
title: "March, 2020"
date: 2020-03-02T12:31:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-03-02

- Update [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) for DSpace 6+ UUIDs
  - Tag version 1.2.0 on GitHub
- Test migrating legacy Solr statistics to UUIDs with the as-of-yet unreleased [SolrUpgradePre6xStatistics.java](https://github.com/DSpace/DSpace/commit/184f2b2153479045fba6239342c63e7f8564b8b6#diff-0350ce2e13b28d5d61252b7a8f50a059)
  - You need to download this into the DSpace 6.x source and compile it

<!--more-->

```
$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
```

## 2020-03-03

- Skype with Peter and Abenet to discuss the CG Core survey
  - We also discussed some other CGSpace issues

## 2020-03-04

- Abenet asked me to add some new ILRI subjects to CGSpace
  - I [updated the input-forms.xml](https://github.com/ilri/DSpace/commit/b51a242e773bd8658d3cab4ac883975708b00386) in our `5_x-prod` branch on GitHub
  - Abenet said we are changing `HEALTH` to `HUMAN HEALTH` so I need to fix those using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
```

- But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it...
- Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs

## 2020-03-05

- I found a very [interesting comment on the Solr 8.1 guide](https://lucene.apache.org/solr/guide/8_1/solr-system-requirements.html#lucene-solr-prior-to-7-0) about Java compatibility:

> Lucene/Solr 7.0 was the first version that successfully passed our tests using Java 9 and higher. You should avoid Java 9 or later for Lucene/Solr 6.x or earlier.

## 2020-03-08

- I want to try to consolidate our yearly Solr statistics cores back into one `statistics` core using the solr-import-export-json tool
- I will try it on DSpace test, doing one year at a time:

```
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
```

- I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
  - Exporting half years might work, using a filter query with months as a regular expression:

```
$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
```

- Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
  - I've been running it for one month in my local environment, and others have reported on the dspace-tech mailing list that they are using 10 and 11

```
# apt install postgresql-10 postgresql-contrib-10
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
```

## 2020-03-09

- Peter noticed that the Solr stats were not showing anything before 2020
  - I had to restart Tomcat three times before all cores loaded properly...

## 2020-03-10

- Fix some logic issues in the nginx config
  - Use generic blocking of `[Bb]ot` and `[Cc]rawl` and `[Ss]pider` in the "badbots" rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages *no matter what* so we punish hard here)
  - We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, which led some locations to log a hit from 127.0.0.1 (because we need to explicitly add the global proxy params when setting other headers in location blocks)
  - Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:

```
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
```

- It seems to only be a problem in the last week:

```
# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
/var/log/nginx/rest.log.1:0
/var/log/nginx/rest.log.2:0
/var/log/nginx/rest.log.3:0
/var/log/nginx/rest.log.4:3625
/var/log/nginx/rest.log.5:27458
/var/log/nginx/rest.log.6:0
/var/log/nginx/rest.log.7:0
/var/log/nginx/rest.log.8:0
/var/log/nginx/rest.log.9:0
```

- In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
- I will purge them from Solr statistics:

```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
```

- Another user agent that seems to be a bot is:

```
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
```

- In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx's logs I see it belongs to three IPs on Online.net in France:

```
# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
  63090 163.172.68.99
 183428 163.172.70.248
 147608 163.172.71.24
```
- It is making 10,000 to 40,000 requests to XMLUI per day...

```
# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
/var/log/nginx/access.log.33.gz:38886
/var/log/nginx/access.log.34.gz:30607
/var/log/nginx/access.log.35.gz:19040
/var/log/nginx/access.log.36.gz:10780
/var/log/nginx/access.log.37.gz:5808
/var/log/nginx/access.log.38.gz:3100
/var/log/nginx/access.log.39.gz:1485
/var/log/nginx/access.log.3.gz:2898
/var/log/nginx/access.log.40.gz:373
/var/log/nginx/access.log.41.gz:3909
/var/log/nginx/access.log.42.gz:4729
/var/log/nginx/access.log.43.gz:3906
```

- I will purge those hits too!

```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
```

- Shit, and something happened and a few thousand hits from user agents with "Bot" in their user agent got through
  - I need to re-run the `check-bot-hits.sh` script with the standard COUNTER-Robots list again, but add my own versions of a few because the script/Solr doesn't support case-insensitive regular expressions:

```
$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: Citoid
Purging 11 hits from Citoid in statistics
(DEBUG) Checking for hits from spider: ecointernet
Purging 375 hits from ecointernet in statistics
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
Purging 1 hits from ^Pattern\/[0-9] in statistics
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
Purging 6 hits from Typhoeus in statistics
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
Purging 3178 hits from Apache-HttpClient in statistics

Total number of bot hits purged: 3571

$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: [Bb]ot
Purging 8317 hits from [Bb]ot in statistics
(DEBUG) Checking for hits from spider: [Cc]rawl
Purging 1314 hits from [Cc]rawl in statistics
(DEBUG) Checking for hits from spider: [Ss]pider
Purging 62 hits from [Ss]pider in statistics
(DEBUG) Checking for hits from spider: Citoid
(DEBUG) Checking for hits from spider: ecointernet
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
```

## 2020-03-11

- Ask Michael Victor for permission to create a new Linode server for DSpace Test

## 2020-3-12

- I'm working on the 170 IITA records on [DSpace Test](https://dspacetest.cgiar.org/handle/10568/106567) from January finally
  - It's been two months since I last looked and I want to do a thorough check to make sure Bosede didn't introduce any new issues, but I want to consolidate all the text languages for these records so it's easier to check them in OpenRefine
  - First I got a list of IDs from `csvcut` and then I updated the text languages for only those records:

```
dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
```

- Then I exported the metadata from DSpace Test and imported it into OpenRefine
  - I corrected one invalid AGROVOC subject using my `csv-metadata-quality` script
- I exported a new list of affiliations from the database, added line numbers with `csvcut`, and then validated them in OpenRefine using `reconcile-csv`:


```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
dspace=# \q
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
$ lein run /tmp/affiliations.csv name id
```

- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`
- I mapped all 170 items to their appropriate collections based on type and uploaded them to CGSpace

## 2020-03-16

- I'm looking at the CPU usage of CGSpace (linode18) over the past year and I see we *rarely* even go over two CPUs on average sustained usage:

![linode18 CPU usage year](/cgspace-notes/2020/03/cgspace-cpu-year.png)

- Also clearly visible is the effect of CPU steal in 2019-03

![linode18 RAM usage year](/cgspace-notes/2020/03/cgspace-memory-year.png)

![linode18 JVM heap usage year](/cgspace-notes/2020/03/cgspace-heap-year.png)

- At max we have committed 10GB of RAM, the rest is used opportunistically by the filesystem cache, likely for Solr
  - There was a huge drop in 2019-07 when I changed the JVM settings
  - I think we should re-evaluate our deployment and perhaps target a different instance type and add block storage for assetstore (as we determined Linode's block storage to be too slow for Solr)

## 2020-03-17

- Update the PostgreSQL JDBC driver to version 42.2.11
- Maria from Bioversity asked me to add a new field for the combined subjects of Bioversity and CIAT, since they merged recently
  - We will use `cg.subject.alliancebiovciat`

## 2020-03-18

- Provision new Linode server (linode26) for DSpace Test to replace the current linode19 server
- Improve DSpace role of [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
  - We should install npm packages in the DSpace user's home directory instead of globally as root

## 2020-03-19

- Finalized migration of DSpace Test to linode26 and removed linode19

## 2020-03-22

- Look over the AReS ToRs sent by Enrico and Moayad and add a few notes about missing GitHub issues
  - Hopefully now they can start working on the development!

## 2020-03-24

- Skype meeting about CGSpace with Peter and Abenet

## 2020-03-25

- I sent Atmire a message to ask if they managed to start working on the DSpace 6 port, as the last communication was twenty-six days ago when they said they were going to secure technical resources to do so
- Start adapting the `dspace` role in our [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public) for DSpace 6 support

## 2020-03-26

- More work adapting the `dspace` role in our Ansible infrastructure scripts to DSpace 6
- Update Tomcat to version 7.0.103 in the Ansible infrastrcutrue playbooks and deploy on DSpace Test (linode26)
- Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my `resolve-orcids.py` script:

```
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
```
- I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:

```
dc.contributor.author,cg.creator.id
"King, Brian","Brian King: 0000-0002-7056-9214"
"Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"
"Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"
"Ekesa, B.","Beatrice Ekesa: 0000-0002-2630-258X"
"Ekesa, B.N.","Beatrice Ekesa: 0000-0002-2630-258X"
"Gullotta, G.","Gaia Gullotta: 0000-0002-2240-3869"
```

- Running the `add-orcid-identifiers-csv.py` script I added 32 ORCID iDs to items on CGSpace!

```
$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
```

- Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
  - [One of them](https://hdl.handle.net/10568/103225) had a link to the paper on Nature, but was missing a DOI
  - [The second item](https://hdl.handle.net/10568/106899) had no donut so I [tweeted its handle](https://twitter.com/mralanorth/status/1243158045540134913)
  - [The third item](https://hdl.handle.net/10568/107258) also had no handle so I [tweeted it](https://twitter.com/mralanorth/status/1243158786392625153) as well
- Abenet pointed out [one item](https://hdl.handle.net/10568/106573) that she had tweeted last week that is missing a donut as well, so I [tweeted it](https://twitter.com/mralanorth/status/1243163710241345536) too

## 2020-03-29

- Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my `add-orcid-identifiers-csv.py` script:

```
dc.contributor.author,cg.creator.id
"Snook, L.K.","Laura Snook: 0000-0002-9168-1301"
"Snook, L.","Laura Snook: 0000-0002-9168-1301"
"Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"
"Zheng, S.","Sijun Zheng: 0000-0003-1550-3738"
```

- Deploy latest Bioversity and CIAT updates on CGSpace (linode18) and DSpace Test (linode26)
- Deploy latest Ansible infrastructure playbooks on CGSpace and DSpace Test to get the latest dspace-statistics-api (v1.1.1) and Tomcat (7.0.103) versions
- Run system updates on CGSpace and DSpace Test and reboot them
  - After reboot all the Solr statistics cores came back up on the first time on both servers (yay)

<!-- vim: set sw=2 ts=2: -->
Add notes for 2020-03-02 2020-03-02 11:38:10 +01:00			`---`
			`title: "March, 2020"`
			`date: 2020-03-02T12:31:30+02:00`
			`author: "Alan Orth"`
			`categories: ["Notes"]`
			`---`

			`## 2020-03-02`

			`- Update [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) for DSpace 6+ UUIDs`
			`- Tag version 1.2.0 on GitHub`
			`- Test migrating legacy Solr statistics to UUIDs with the as-of-yet unreleased [SolrUpgradePre6xStatistics.java](https://github.com/DSpace/DSpace/commit/184f2b2153479045fba6239342c63e7f8564b8b6#diff-0350ce2e13b28d5d61252b7a8f50a059)`
			`- You need to download this into the DSpace 6.x source and compile it`

			`<!--more-->`

			```
			`$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"`
			`$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x`
			```

Add notes for 2020-03-04 2020-03-04 17:02:54 +01:00			`## 2020-03-03`

			`- Skype with Peter and Abenet to discuss the CG Core survey`
			`- We also discussed some other CGSpace issues`

			`## 2020-03-04`

			`- Abenet asked me to add some new ILRI subjects to CGSpace`
			- I [updated the input-forms.xml](https://github.com/ilri/DSpace/commit/b51a242e773bd8658d3cab4ac883975708b00386) in our `5_x-prod` branch on GitHub
			- Abenet said we are changing `HEALTH` to `HUMAN HEALTH` so I need to fix those using my `fix-metadata-values.py` script:

			```
			`$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d`
			```

			`- But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it...`
			`- Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs`

Add notes for 2020-03-08 2020-03-08 12:34:56 +01:00			`## 2020-03-05`

			`- I found a very [interesting comment on the Solr 8.1 guide](https://lucene.apache.org/solr/guide/8_1/solr-system-requirements.html#lucene-solr-prior-to-7-0) about Java compatibility:`

			`> Lucene/Solr 7.0 was the first version that successfully passed our tests using Java 9 and higher. You should avoid Java 9 or later for Lucene/Solr 6.x or earlier.`

			`## 2020-03-08`

			- I want to try to consolidate our yearly Solr statistics cores back into one `statistics` core using the solr-import-export-json tool
			`- I will try it on DSpace test, doing one year at a time:`

			```
			`$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid`
			`$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid`
			`$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"`
			`$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid`
			`$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid`
			`$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"`
			`$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid`
			`$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' \| grep numFound`
			`"response":{"numFound":3761989,"start":0,"docs":[]`
			`$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' \| grep numFound`
			`"response":{"numFound":3761989,"start":0,"docs":[]`
			`$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"`
			```

			`- I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage`
Update notes for 2020-03-08 2020-03-08 13:28:39 +01:00			`- Exporting half years might work, using a filter query with months as a regular expression:`

			```
			`$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'`
			```
Add notes for 2020-03-08 2020-03-08 12:34:56 +01:00
Update notes for 2020-03-08 2020-03-08 14:53:34 +01:00			`- Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)`
			`- I've been running it for one month in my local environment, and others have reported on the dspace-tech mailing list that they are using 10 and 11`

			```
			`# apt install postgresql-10 postgresql-contrib-10`
			`# systemctl stop tomcat7`
			`# pg_ctlcluster 9.6 main stop`
			`# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6`
			`# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6`
			`# pg_ctlcluster 10 main stop`
			`# pg_dropcluster 10 main`
			`# pg_upgradecluster 9.6 main`
			`# pg_dropcluster 9.6 main`
			`# dpkg -l \| grep postgresql \| grep 9.6 \| awk '{print $2}' \| xargs dpkg -r`
			```

Add notes for 2020-03-10 2020-03-10 15:18:20 +01:00			`## 2020-03-09`

			`- Peter noticed that the Solr stats were not showing anything before 2020`
			`- I had to restart Tomcat three times before all cores loaded properly...`

			`## 2020-03-10`

			`- Fix some logic issues in the nginx config`
			- Use generic blocking of `[Bb]ot` and `[Cc]rawl` and `[Ss]pider` in the "badbots" rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages no matter what so we punish hard here)
			`- We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, which led some locations to log a hit from 127.0.0.1 (because we need to explicitly add the global proxy params when setting other headers in location blocks)`
			`- Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:`

			```
			`Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)`
			```

			`- It seems to only be a problem in the last week:`

			```
			`# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}`
			`/var/log/nginx/rest.log.1:0`
			`/var/log/nginx/rest.log.2:0`
			`/var/log/nginx/rest.log.3:0`
			`/var/log/nginx/rest.log.4:3625`
			`/var/log/nginx/rest.log.5:27458`
			`/var/log/nginx/rest.log.6:0`
			`/var/log/nginx/rest.log.7:0`
			`/var/log/nginx/rest.log.8:0`
			`/var/log/nginx/rest.log.9:0`
			```

			`- In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean`
			`- I will purge them from Solr statistics:`

			```
			`$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'`
			```

			`- Another user agent that seems to be a bot is:`

			```
			`Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)`
			```

			`- In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx's logs I see it belongs to three IPs on Online.net in France:`

			```
			`# zcat /var/log/nginx/access.log..gz /var/log/nginx/rest.log..gz \| grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' \| awk '{print $1}' \| sort \| uniq -c`
			`63090 163.172.68.99`
			`183428 163.172.70.248`
			`147608 163.172.71.24`
			```
			`- It is making 10,000 to 40,000 requests to XMLUI per day...`

			```
			`# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}`
			`/var/log/nginx/access.log.30.gz:18687`
			`/var/log/nginx/access.log.31.gz:28936`
			`/var/log/nginx/access.log.32.gz:36402`
			`/var/log/nginx/access.log.33.gz:38886`
			`/var/log/nginx/access.log.34.gz:30607`
			`/var/log/nginx/access.log.35.gz:19040`
			`/var/log/nginx/access.log.36.gz:10780`
			`/var/log/nginx/access.log.37.gz:5808`
			`/var/log/nginx/access.log.38.gz:3100`
			`/var/log/nginx/access.log.39.gz:1485`
			`/var/log/nginx/access.log.3.gz:2898`
			`/var/log/nginx/access.log.40.gz:373`
			`/var/log/nginx/access.log.41.gz:3909`
			`/var/log/nginx/access.log.42.gz:4729`
			`/var/log/nginx/access.log.43.gz:3906`
			```

			`- I will purge those hits too!`

			```
			`$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'`
			```

			`- Shit, and something happened and a few thousand hits from user agents with "Bot" in their user agent got through`
			- I need to re-run the `check-bot-hits.sh` script with the standard COUNTER-Robots list again, but add my own versions of a few because the script/Solr doesn't support case-insensitive regular expressions:

			```
			`$ ./check-spider-hits.sh -f /tmp/bots -d -p`
			`(DEBUG) Using spiders pattern file: /tmp/bots`
			`(DEBUG) Checking for hits from spider: Citoid`
			`Purging 11 hits from Citoid in statistics`
			`(DEBUG) Checking for hits from spider: ecointernet`
			`Purging 375 hits from ecointernet in statistics`
			`(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]`
			`Purging 1 hits from ^Pattern\/[0-9] in statistics`
			`(DEBUG) Checking for hits from spider: sqlmap`
			`(DEBUG) Checking for hits from spider: Typhoeus`
			`Purging 6 hits from Typhoeus in statistics`
			`(DEBUG) Checking for hits from spider: 7siters`
			`(DEBUG) Checking for hits from spider: Apache-HttpClient`
			`Purging 3178 hits from Apache-HttpClient in statistics`

			`Total number of bot hits purged: 3571`

			`$ ./check-spider-hits.sh -f /tmp/bots -d -p`
			`(DEBUG) Using spiders pattern file: /tmp/bots`
			`(DEBUG) Checking for hits from spider: [Bb]ot`
			`Purging 8317 hits from [Bb]ot in statistics`
			`(DEBUG) Checking for hits from spider: [Cc]rawl`
			`Purging 1314 hits from [Cc]rawl in statistics`
			`(DEBUG) Checking for hits from spider: [Ss]pider`
			`Purging 62 hits from [Ss]pider in statistics`
			`(DEBUG) Checking for hits from spider: Citoid`
			`(DEBUG) Checking for hits from spider: ecointernet`
			`(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]`
			`(DEBUG) Checking for hits from spider: sqlmap`
			`(DEBUG) Checking for hits from spider: Typhoeus`
			`(DEBUG) Checking for hits from spider: 7siters`
			`(DEBUG) Checking for hits from spider: Apache-HttpClient`
			```

Add notes for 2020-03-12 2020-03-12 11:58:21 +01:00			`## 2020-03-11`

			`- Ask Michael Victor for permission to create a new Linode server for DSpace Test`

			`## 2020-3-12`

			`- I'm working on the 170 IITA records on [DSpace Test](https://dspacetest.cgiar.org/handle/10568/106567) from January finally`
			`- It's been two months since I last looked and I want to do a thorough check to make sure Bosede didn't introduce any new issues, but I want to consolidate all the text languages for these records so it's easier to check them in OpenRefine`
			- First I got a list of IDs from `csvcut` and then I updated the text languages for only those records:

			```
			dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
			```

			`- Then I exported the metadata from DSpace Test and imported it into OpenRefine`
			- I corrected one invalid AGROVOC subject using my `csv-metadata-quality` script
			- I exported a new list of affiliations from the database, added line numbers with `csvcut`, and then validated them in OpenRefine using `reconcile-csv`:


			```
			dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
			`dspace=# \q`
			`$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv \| sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv`
			`$ lein run /tmp/affiliations.csv name id`
			```

			- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`
			`- I mapped all 170 items to their appropriate collections based on type and uploaded them to CGSpace`

Update docs 2020-03-16 15:19:44 +01:00			`## 2020-03-16`

			`- I'm looking at the CPU usage of CGSpace (linode18) over the past year and I see we rarely even go over two CPUs on average sustained usage:`

			`![linode18 CPU usage year](/cgspace-notes/2020/03/cgspace-cpu-year.png)`

			`- Also clearly visible is the effect of CPU steal in 2019-03`

			`![linode18 RAM usage year](/cgspace-notes/2020/03/cgspace-memory-year.png)`

			`![linode18 JVM heap usage year](/cgspace-notes/2020/03/cgspace-heap-year.png)`

			`- At max we have committed 10GB of RAM, the rest is used opportunistically by the filesystem cache, likely for Solr`
			`- There was a huge drop in 2019-07 when I changed the JVM settings`
			`- I think we should re-evaluate our deployment and perhaps target a different instance type and add block storage for assetstore (as we determined Linode's block storage to be too slow for Solr)`

Add notes for 2020-03-17 2020-03-17 14:27:20 +01:00			`## 2020-03-17`

			`- Update the PostgreSQL JDBC driver to version 42.2.11`
Add notes for 2020-03-19 2020-03-19 11:19:11 +01:00			`- Maria from Bioversity asked me to add a new field for the combined subjects of Bioversity and CIAT, since they merged recently`
			- We will use `cg.subject.alliancebiovciat`

			`## 2020-03-18`

			`- Provision new Linode server (linode26) for DSpace Test to replace the current linode19 server`
			`- Improve DSpace role of [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)`
			`- We should install npm packages in the DSpace user's home directory instead of globally as root`

			`## 2020-03-19`

			`- Finalized migration of DSpace Test to linode26 and removed linode19`
Add notes for 2020-03-17 2020-03-17 14:27:20 +01:00
Add notes for 2020-03-22 2020-03-22 13:35:20 +01:00			`## 2020-03-22`

			`- Look over the AReS ToRs sent by Enrico and Moayad and add a few notes about missing GitHub issues`
			`- Hopefully now they can start working on the development!`

Add notes for 2020-03-24 2020-03-24 14:25:19 +01:00			`## 2020-03-24`

			`- Skype meeting about CGSpace with Peter and Abenet`

Add notes for 2020-03-25 2020-03-25 14:58:01 +01:00			`## 2020-03-25`

			`- I sent Atmire a message to ask if they managed to start working on the DSpace 6 port, as the last communication was twenty-six days ago when they said they were going to secure technical resources to do so`
			- Start adapting the `dspace` role in our [Ansible infrastructure playbooks](https://github.com/ilri/rmg-ansible-public) for DSpace 6 support

Add notes for 2020-03-26 2020-03-26 14:14:14 +01:00			`## 2020-03-26`

			- More work adapting the `dspace` role in our Ansible infrastructure scripts to DSpace 6
			`- Update Tomcat to version 7.0.103 in the Ansible infrastrcutrue playbooks and deploy on DSpace Test (linode26)`
			- Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my `resolve-orcids.py` script:

			```
			`$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids \| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \| sort \| uniq > /tmp/2020-03-26-combined-orcids.txt`
			`$ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d`
			`# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)`
			`$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml`
			```
			`- I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:`

			```
			`dc.contributor.author,cg.creator.id`
			`"King, Brian","Brian King: 0000-0002-7056-9214"`
			`"Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"`
			`"Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"`
			`"Ekesa, B.","Beatrice Ekesa: 0000-0002-2630-258X"`
			`"Ekesa, B.N.","Beatrice Ekesa: 0000-0002-2630-258X"`
			`"Gullotta, G.","Gaia Gullotta: 0000-0002-2240-3869"`
			```

			- Running the `add-orcid-identifiers-csv.py` script I added 32 ORCID iDs to items on CGSpace!

			```
			`$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'`
			```

			`- Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace`
Update notes for 2020-04-02 2020-04-02 11:33:41 +02:00			`- [One of them](https://hdl.handle.net/10568/103225) had a link to the paper on Nature, but was missing a DOI`
Add notes for 2020-03-26 2020-03-26 14:14:14 +01:00			`- [The second item](https://hdl.handle.net/10568/106899) had no donut so I [tweeted its handle](https://twitter.com/mralanorth/status/1243158045540134913)`
			`- [The third item](https://hdl.handle.net/10568/107258) also had no handle so I [tweeted it](https://twitter.com/mralanorth/status/1243158786392625153) as well`
			`- Abenet pointed out [one item](https://hdl.handle.net/10568/106573) that she had tweeted last week that is missing a donut as well, so I [tweeted it](https://twitter.com/mralanorth/status/1243163710241345536) too`

Add notes for 2020-03-29 2020-03-29 11:13:47 +02:00			`## 2020-03-29`

			- Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my `add-orcid-identifiers-csv.py` script:

			```
			`dc.contributor.author,cg.creator.id`
			`"Snook, L.K.","Laura Snook: 0000-0002-9168-1301"`
			`"Snook, L.","Laura Snook: 0000-0002-9168-1301"`
			`"Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"`
			`"Zheng, S.","Sijun Zheng: 0000-0003-1550-3738"`
			```

			`- Deploy latest Bioversity and CIAT updates on CGSpace (linode18) and DSpace Test (linode26)`
			`- Deploy latest Ansible infrastructure playbooks on CGSpace and DSpace Test to get the latest dspace-statistics-api (v1.1.1) and Tomcat (7.0.103) versions`
			`- Run system updates on CGSpace and DSpace Test and reboot them`
			`- After reboot all the Solr statistics cores came back up on the first time on both servers (yay)`

Add notes for 2020-03-02 2020-03-02 11:38:10 +01:00			`<!-- vim: set sw=2 ts=2: -->`