---
title: "August, 2020"
date: 2020-08-02T15:35:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-08-02

- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
  - It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
  - It implements a "force" mode too that will clear existing country codes and re-tag everything
  - It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...

<!--more-->

- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
  - I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
- I forked the [dspace-curation-tasks to ILRI's GitHub](https://github.com/ilri/dspace-curation-tasks) and [submitted the project to Maven Central](https://issues.sonatype.org/browse/OSSRH-59650) so I can integrate it more easily with our DSpace build via dependencies
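
- A minimal Python sketch of the lookup order and "force" mode described above (the real task is Java; the dictionaries, the `LAOS` display name, and the `countries`/`codes` keys here are hypothetical stand-ins, not the task's actual data or API):

```python
# Illustrative subset of ISO 3166-1 names mapped to alpha-2 codes.
ISO_3166_1 = {
    "KENYA": "KE",
    "CABO VERDE": "CV",
    "CÔTE D'IVOIRE": "CI",
}

# Hypothetical CGSpace "display" names, consulted only after ISO 3166-1.
CGSPACE_COUNTRIES = {
    "LAOS": "LA",
}


def tag_country_codes(item, force=False):
    """Set and return the item's alpha-2 codes based on its country names."""
    if item.get("codes") and not force:
        # Already tagged and not forcing: leave the existing codes alone.
        return item["codes"]

    codes = []
    for name in item.get("countries", []):
        # ISO 3166-1 first, then the CGSpace mapping.
        code = ISO_3166_1.get(name) or CGSPACE_COUNTRIES.get(name)
        if code and code not in codes:
            codes.append(code)

    item["codes"] = codes
    return codes


item = {"countries": ["KENYA", "LAOS"], "codes": ["XX"]}
print(tag_country_codes(item))              # existing codes are kept
print(tag_country_codes(item, force=True))  # codes are cleared and re-derived
```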

## 2020-08-03

- Atmire responded to the ticket about the ongoing upgrade issues
  - They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: they now use classes instead of Unicode hex characters, so our JS + SVG works!
  - They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:

```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
```
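
- For reference, a hedged Python sketch of the same Solr delete-by-query request as the `curl` call above; it only builds the request and does not send it (the helper name `build_delete_request` is hypothetical, and the core URL is the one from the command above):

```python
from urllib.request import Request
from xml.sax.saxutils import escape


def build_delete_request(core_url, query):
    """Build (but do not send) a Solr XML delete-by-query request."""
    # Escape the query so characters like < and & cannot break the XML body.
    body = f"<delete><query>{escape(query)}</query></delete>".encode("utf-8")
    return Request(
        f"{core_url}/update?softCommit=true",
        data=body,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


req = build_delete_request(
    "http://localhost:8081/solr/statistics", "id:/.*unmigrated.*/"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

- Passing the request to `urllib.request.urlopen()` would actually run the delete, so that step is deliberately left out here.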

- Andrea from Macaroni Bros emailed me a few days ago to say he's having issues with the CGSpace REST API
  - He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: https://www.rtb.cgiar.org/publications/

## 2020-08-04

- Look into the REST API issues that Macaroni Bros raised last week:
  - The first one was about the `collections` endpoint returning empty items:
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2 (offset=2 is correct)
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3 (offset=3 is empty)
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4 (offset=4 is correct again)
  - I confirm that the second link returns zero items on CGSpace...
    - I tested on my local development instance and it returns one item correctly...
    - I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly...
    - Perhaps an indexing issue?
  - The second issue is the `collections` endpoint returning the wrong number of items:
    - https://cgspace.cgiar.org/rest/collections/1445 (numberItems: 63)
    - https://cgspace.cgiar.org/rest/collections/1445/items (real number of items: 61)
  - I confirm that it is indeed happening on CGSpace...
    - And actually I can replicate the same issue on my local CGSpace 5.8 instance:

```
$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
   "numberItems" : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
```

    - Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:

```
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
   "numberItems" : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
```

- Ah! I exported that collection's metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
  - I dealt with this problem in 2017-01 and the solution is to check the `collection2item` table:

```
dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
   id   | collection_id | item_id
--------+---------------+---------
 133698 |           966 |  107687
 134685 |          1445 |  107687
 134686 |          1445 |  107687
(3 rows)
```

- So for each affected item you can delete one of the duplicate mappings:

```
dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
```
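
- A small Python sketch of the same check, assuming the `collection2item` rows have already been fetched as `(id, collection_id, item_id)` tuples; it reports the surplus row ids for any item mapped to the same collection more than once (the function name is hypothetical, and only the rows shown above are used as sample data):

```python
from collections import defaultdict

# Rows copied from the query output above: (id, collection_id, item_id).
rows = [
    (133698, 966, 107687),
    (134685, 1445, 107687),
    (134686, 1445, 107687),
]


def duplicate_mapping_ids(rows):
    """Return ids of surplus rows, keeping the lowest id per (collection, item)."""
    seen = defaultdict(list)
    for row_id, collection_id, item_id in rows:
        seen[(collection_id, item_id)].append(row_id)

    surplus = []
    for ids in seen.values():
        # Everything after the first (lowest) id is a duplicate mapping.
        surplus.extend(sorted(ids)[1:])
    return sorted(surplus)


print(duplicate_mapping_ids(rows))  # [134686]
```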

- Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter's preferred display names

```
$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
```
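
- A hedged Python sketch of how a corrections file like this can be read into an old-to-new mapping; it only illustrates the CSV handling (note the quoted values containing commas) and is not the actual `fix-metadata-values.py` logic, which updates the database:

```python
import csv
import io

# Same content as 2020-08-04-PB-new-countries.csv above.
CSV_TEXT = '''cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
'''


def load_corrections(text, from_field, to_field):
    """Map each value in from_field to its replacement in to_field."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[from_field]: row[to_field] for row in reader}


def fix_values(values, corrections):
    """Replace values that have a correction; leave all others unchanged."""
    return [corrections.get(value, value) for value in values]


corrections = load_corrections(CSV_TEXT, "cg.coverage.country", "correct")
print(fix_values(["CONGO, DR", "KENYA"], corrections))
```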

- I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
  - I started a full Discovery re-indexing

## 2020-08-05

- Port my [dspace-curation-tasks](https://github.com/ilri/dspace-curation-tasks) to DSpace 6 and tag version `6.0-SNAPSHOT`
- I downloaded the [UN M.49](https://unstats.un.org/unsd/methodology/m49/overview/) CSV file to start working on updating the CGSpace regions
  - First issue is they don't version the file so you have no idea when it was released
  - Second issue is that three rows have errors due to not using quotes around "China, Macao Special Administrative Region"
- Bizu said she was having problems approving tasks on CGSpace
  - I looked at the PostgreSQL locks and they have skyrocketed since yesterday:

![PostgreSQL locks day](/cgspace-notes/2020/08/postgres_locks_ALL-day.png)

![PostgreSQL query length day](/cgspace-notes/2020/08/postgres_querylength_ALL-day.png)

- Seems that something happened yesterday afternoon at around 5PM...
  - For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
  - I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
- I checked the nginx logs around 5PM yesterday to see who was accessing the server:

```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
```

- I see the Macaroni Bros are using their new user agent for harvesting: `RTB website BOT`
  - But that pattern doesn't match in the nginx bot list or Tomcat's crawler session manager valve because we're only checking for `[Bb]ot`!
  - So they have created thousands of Tomcat sessions:

```
$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
```
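
- A Python sketch of the same count, assuming the intent is to count distinct session IDs for a set of IPs (without `grep -o` the shell pipeline above deduplicates whole log lines rather than IDs); the sample log lines are made up and only mimic the `session_id=` field:

```python
import re

SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32})")


def count_sessions(lines, ips):
    """Count distinct session IDs on lines that mention any of the given IPs."""
    sessions = set()
    for line in lines:
        # Plain substring match on the IP, like the grep in the pipeline above.
        if any(ip in line for ip in ips):
            match = SESSION_RE.search(line)
            if match:
                sessions.add(match.group(1))
    return len(sessions)


# Hypothetical log lines; real DSpace log lines carry more fields.
sample = [
    f"2020-08-04 17:00:01 ip_addr=63.32.242.35 session_id={'A' * 32}",
    f"2020-08-04 17:00:02 ip_addr=63.32.242.35 session_id={'A' * 32}",
    f"2020-08-04 17:00:03 ip_addr=64.62.202.71 session_id={'B' * 32}",
    f"2020-08-04 17:00:04 ip_addr=10.0.0.1 session_id={'C' * 32}",
]

print(count_sessions(sample, ["63.32.242.35", "64.62.202.71"]))  # 2
```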

- DSpace itself uses a case-insensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so these clients don't waste server resources
  - Perhaps `[Bb][Oo][Tt]`...
- I see another IP, 104.198.96.245, which is also using the "RTB website BOT" user agent, but it has 70,000 hits in Solr from earlier this year, before they started using that user agent
  - I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:

```
$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
```

- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat. Their user agent is:

```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```

- 64.62.202.71 is using a user agent I've never seen before:

```
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
```

- So now our "bot" regex can't even match that...
  - Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):

```
RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
```
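
- The comparison can also be checked with a short Python sketch using the user agents listed in these notes; Python's `re` is used here as a stand-in, on the assumption that nginx and Tomcat treat these simple patterns the same way:

```python
import re

OLD_PATTERN = re.compile(r"[Bb]ot")
NEW_PATTERN = re.compile(r"[Bb]\.?[Oo]\.?[Tt]\.?")

BOT_AGENTS = [
    "RTB website BOT",
    "Altmetribot",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
]

# This agent contains no form of "bot", so neither pattern should catch it.
LINK_CHECKER = "Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2"


def missed(pattern, agents):
    """Return the agents that the pattern fails to match anywhere."""
    return [agent for agent in agents if not pattern.search(agent)]


print("old pattern misses:", missed(OLD_PATTERN, BOT_AGENTS))
print("new pattern misses:", missed(NEW_PATTERN, BOT_AGENTS))
print("link checker matched:", bool(NEW_PATTERN.search(LINK_CHECKER)))
```

- The old pattern misses the two agents that started this investigation, while the new one matches all five; the link checker still needs its own rule.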

- And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):

```
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
```

- I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well...

<!-- vim: set sw=2 ts=2: -->