cgspace-notes/content/posts/2021-10.md

---
title: "October, 2021"
date: 2021-10-01T11:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-10-01

- Export all affiliations on CGSpace and run them against the latest RoR data dump:

```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
```

- So we have 1879/7100 (26.46%) matching already

<!--more-->

## 2021-10-03

- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
- Start a fresh indexing on AReS
- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
  - He added licenses
  - I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
  - I cloned the column several times and extracted values based on their positions, for example:
    - Volume: `value.partition(":")[0]`
    - Issue: `value.partition("(")[2].partition(")")[0]`
    - Page: `"p. " + value.replace(".", "")`

## 2021-10-04

- Start looking at the last month of Solr statistics on CGSpace
  - I see a number of IPs with "normal" user agents who clearly behave like bots
    - 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
    - 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
    - 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
    - 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
    - 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
  - Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:

```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt 
543 /tmp/mozilla-4.0-ips.txt
```

- Then I resolved the IPs and extracted the ones belonging to Amazon:

```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
```

- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:

```console
   1592 GET /handle/10947/2526
   1592 GET /handle/10947/2527
   1592 GET /handle/10947/34
   1593 GET /handle/10947/6
   1594 GET /handle/10947/1
   1598 GET /handle/10947/2515
   1598 GET /handle/10947/2516
   1599 GET /handle/10568/101335
   1599 GET /handle/10568/91688
   1599 GET /handle/10947/2517
   1599 GET /handle/10947/2518
   1599 GET /handle/10947/2519
   1599 GET /handle/10947/2708
   1599 GET /handle/10947/2871
   1600 GET /handle/10568/89342
   1600 GET /handle/10947/4467
   1607 GET /handle/10568/103816
 290382 GET /handle/10568/83389
```

- Before I purge all those I will ask someone Samuel Stacey from the System Office to hopefully get an insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
  - I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them

```csv
doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
```

- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:

```console
$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
```

## 2021-10-05

- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
  - I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
  - I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...

Total number of bot hits purged: 465119
```

## 2021-10-06

- Thinking about how we could check for duplicates before importing
  - I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):

```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
 metadata_value_id │                                         text_value                                         │           dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
```

- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
  - I think I will check for similar titles, and if I find them I will print out the handles for verification
  - I could also proceed to check other metadata like type because those shouldn't vary too much
- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
  - Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
  - This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives
  - I re-ran it with higher thresholds this eliminated all false positives, but it still took 24 minutes to run for 292 items!
    - 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 24:40.42 total
    - 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.12s user 0.03s system 0% cpu 24:29.15 total
    - 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 25:44.13 total
- Some minor updates to csv-metadata-quality
  - Fix two issues with regular expressions in the duplicate items and experimental language checks
  - Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:

```console
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```

- I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs...
  - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
  - The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
  - I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
  - I found a comment on thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI") where he says:

> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.

- Shit, so that's worth looking into...

## 2021-10-07

- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
  - I started by copying just a handful of fields from the iwmi.csv community export:

```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...

## 2021-10-08

- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:

```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count  
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```

- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```

- I tried to export ILRI's community, but ran into the export bug (DS-4211)
  - After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):

```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l 
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```

- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```

- I found a cool way to select only the items with corrections
  - First, extract a handful of fields from the CSV with csvcut
  - Second, clean the CSV with csv-metadata-quality
  - Third, rename the columns to something obvious in the cleaned CSV
  - Fourth, use csvjoin to merge the cleaned file with the original

```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```

- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:

```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```

- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column
  - After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
- I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
7720
```

- I applied these to the CIAT community, so in total that's over 8,000 duplicate metadata values removed in a handful of fields...

## 2021-10-09

- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
- Also of note, there are some other fixes too, for example in IITA's community:

```console
$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
249
```

- I ran a full Discovery re-indexing on CGSpace
- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:

```console
$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
```

- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs

## 2021-10-10

- Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on `metadata-export` (DS-4211)
- First create a new PostgreSQL 13 container:

```console
$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
```

- Then edit setting in `dspace/config/local.cfg` and build the backend server with Java 11:

```console
$ mvn package
$ cd dspace/target/dspace-installer
$ ant fresh_install
# fix database not being fully ready, causing Tomcat to fail to start the server application
$ ~/dspace7/bin/dspace database migrate
```

- Copy Solr configs and start Solr:

```console
$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
$ ~/src/solr-8.8.2/bin/solr start
```

- Start my local Tomcat 9 instance:

```console
$ systemctl --user start tomcat9@dspace7
```

- This works, so now I will drop the default database and import a dump from CGSpace

```console
$ systemctl --user stop tomcat9@dspace7                                
$ dropdb -h localhost -p 5433 -U postgres dspace7
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
```

- Delete Atmire migrations and some others that were "unresolved":

```console
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
```

- Now DSpace 7 starts with my CGSpace data... nice
  - The Discovery indexing still takes seven hours... fuck
- I tested the `metadata-export` on DSpace 7.1-SNAPSHOT and it still has the duplicate items issue introduced by DS-4211
  - I filed a GitHub issue and notified nwoodward: https://github.com/DSpace/DSpace/issues/7988
- Start a full reindex on AReS

## 2021-10-11

- Start a full Discovery reindex on my local DSpace 6.3 instance:

```console
$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
Loading @mire database changes for module MQM
Changes have been processed
836140:6543.6
```

- So that's 1.8 hours versus 7 on DSpace 7, with the same database!
- Several users wrote to me that CGSpace was slow recently
  - Looking at the PostgreSQL database I see connections look normal, but locks for `dspaceWeb` are high:

```console
$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
53
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
1697
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
1681
```

- Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:

![PostgreSQL locks week](/cgspace-notes/2021/10/postgres_locks_ALL-week.png)

- The only thing I did on 2021-10-07 was import a few thousand metadata corrections...
- I restarted PostgreSQL (instead of restarting Tomcat), so let's see if that helps
- I filed [a bug for the DSpace 6/7 duplicate values metadata import issue](https://github.com/DSpace/DSpace/issues/7989)
- I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow

<!-- vim: set sw=2 ts=2: -->
Add notes for 2021-10 2021-10-04 18:40:13 +02:00			`---`
			`title: "October, 2021"`
			`date: 2021-10-01T11:14:07+03:00`
			`author: "Alan Orth"`
			`categories: ["Notes"]`
			`---`

			`## 2021-10-01`

			`- Export all affiliations on CGSpace and run them against the latest RoR data dump:`

			```console
			`localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;`
			`$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv \| sed 1d > /tmp/2021-10-01-affiliations.txt`
			`$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili`
			`ations-matching.csv`
			`$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv \| sed 1d \| wc -l`
			`1879`
			`$ wc -l /tmp/2021-10-01-affiliations.txt`
			`7100 /tmp/2021-10-01-affiliations.txt`
			```

			`- So we have 1879/7100 (26.46%) matching already`

			`<!--more-->`

			`## 2021-10-03`

			`- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites`
			`- Start a fresh indexing on AReS`
			`- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management`
			`- He added licenses`
			- I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
			`- I cloned the column several times and extracted values based on their positions, for example:`
			- Volume: `value.partition(":")[0]`
			- Issue: `value.partition("(")[2].partition(")")[0]`
			- Page: `"p. " + value.replace(".", "")`

			`## 2021-10-04`

			`- Start looking at the last month of Solr statistics on CGSpace`
			`- I see a number of IPs with "normal" user agents who clearly behave like bots`
			`- 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)`
			`- 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)`
			`- 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)`
			`- 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)`
			- 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
			- 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
			- 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
			- 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
			`- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:`

			```console
			`# zcat --force /var/log/nginx/.log \| grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' \| awk '{print $1}' \| sort \| uniq > /tmp/mozilla-4.0-ips.txt`
			`# wc -l /tmp/mozilla-4.0-ips.txt`
			`543 /tmp/mozilla-4.0-ips.txt`
			```

			`- Then I resolved the IPs and extracted the ones belonging to Amazon:`

			```console
			`$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv`
			`$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv \| csvcut -c ip \| sed 1d \| tee /tmp/amazon-ips.txt \| wc -l`
			```

			`- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon`
			`- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:`

			```console
			`1592 GET /handle/10947/2526`
			`1592 GET /handle/10947/2527`
			`1592 GET /handle/10947/34`
			`1593 GET /handle/10947/6`
			`1594 GET /handle/10947/1`
			`1598 GET /handle/10947/2515`
			`1598 GET /handle/10947/2516`
			`1599 GET /handle/10568/101335`
			`1599 GET /handle/10568/91688`
			`1599 GET /handle/10947/2517`
			`1599 GET /handle/10947/2518`
			`1599 GET /handle/10947/2519`
			`1599 GET /handle/10947/2708`
			`1599 GET /handle/10947/2871`
			`1600 GET /handle/10568/89342`
			`1600 GET /handle/10947/4467`
			`1607 GET /handle/10568/103816`
			`290382 GET /handle/10568/83389`
			```

Add notes for 2021-10-05 2021-10-05 17:54:39 +02:00			`- Before I purge all those I will ask someone Samuel Stacey from the System Office to hopefully get an insight...`
Add notes for 2021-10 2021-10-04 18:40:13 +02:00			`- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR`
			`- Meeting with Michelle from Altmetric about their new CSV upload system`
			`- I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them`

			```csv
			`doi,handle`
			`10.1016/j.agsy.2021.103263,10568/115288`
			`10.3389/fgene.2021.723360,10568/115287`
			`10.3389/fpls.2021.720670,10568/115285`
			```

			`- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:`

			```console
			`$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv \| sed -e 1d -e 's/\|\|/\n/g' -e 's/"//g' \| sort -u > /tmp/agrovoc.txt`
			`$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv`
			`$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv \| csvcut -c 1 > /tmp/invalid-agrovoc.csv`
			```

Add notes for 2021-10-05 2021-10-05 17:54:39 +02:00			`## 2021-10-05`

			`- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs`
			- I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
			`- I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday`

			```console
			`$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p`
			`...`

			`Total number of bot hits purged: 465119`
			```

Add notes for 2021-10-06 2021-10-07 07:27:39 +02:00			`## 2021-10-06`

			`- Thinking about how we could check for duplicates before importing`
			`- I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):`

			```console
			`localhost/dspace63= > CREATE EXTENSION pg_trgm;`
			`localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;`
			`metadata_value_id │ text_value │ dspace_object_id`
			`───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────`
			`3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467`
			`3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e`
			`(2 rows)`
			```

			`- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)`
			`- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items`
			`- I think I will check for similar titles, and if I find them I will print out the handles for verification`
			`- I could also proceed to check other metadata like type because those shouldn't vary too much`
			- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
			`- Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!`
			`- This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives`
			`- I re-ran it with higher thresholds this eliminated all false positives, but it still took 24 minutes to run for 292 items!`
			`- 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total`
			`- 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total`
			`- 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total`
			`- Some minor updates to csv-metadata-quality`
			`- Fix two issues with regular expressions in the duplicate items and experimental language checks`
			`- Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field`
			`- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:`

			```console
			`$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv`
			`$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv \| tee /tmp/out.log`
			`$ xsv split -s 2000 /tmp /tmp/iwmi.csv`
			```

			`- I noticed each CSV only had 10 or 20 corrections, mostly that none of the duplicate metadata values were removed in the CSVs...`
			`- I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"`
			`- The duplicates are definitely removed from the CSV, but DSpace doesn't detect them`
			`- I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!`
			`- I found a comment on thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI") where he says:`

			`> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.`
			`> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.`

			`- Shit, so that's worth looking into...`

Add notes for 2021-10-08 2021-10-08 16:15:17 +02:00			`## 2021-10-07`

			- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
			`- I started by copying just a handful of fields from the iwmi.csv community export:`

			```console
			`$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv`
			`# Copy and blank columns in OpenRefine`
			`$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv \| tee /tmp/out.log`
			`$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv`
			```
			`- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...`

			`## 2021-10-08`

			`- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:`

			```console
			`cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;`
			`text_lang \| count`
			`-----------+---------`
			`en_US \| 2603711`
			`en_Fu \| 115568`
			`en \| 8818`
			`\| 5286`
			`fr \| 2`
			`vn \| 2`
			`\| 0`
			`(7 rows)`
			`cgspace=# BEGIN;`
			`cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');`
			`UPDATE 129673`
			`cgspace=# COMMIT;`
			```

			`- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:`

			```console
			`$ grep -c 'Removing duplicate value' /tmp/out.log`
			`391`
			```

			`- I tried to export ILRI's community, but ran into the export bug (DS-4211)`
			`- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):`

			```console
			`$ csvcut -c id /tmp/ilri-duplicate-metadata.csv \| sed '1d' \| wc -l`
			`32070`
			`$ csvcut -c id /tmp/ilri-duplicate-metadata.csv \| sort -u \| sed '1d' \| wc -l`
			`19315`
			```

			`- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:`

			```console
			`$ grep -c 'Removing duplicate value' /tmp/out.log`
			`220`
			```

			`- I found a cool way to select only the items with corrections`
			`- First, extract a handful of fields from the CSV with csvcut`
			`- Second, clean the CSV with csv-metadata-quality`
			`- Third, rename the columns to something obvious in the cleaned CSV`
			`- Fourth, use csvjoin to merge the cleaned file with the original`

			```console
			`$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv \| csvsort \| uniq > /tmp/ilri-deduplicated-items.csv`
			`$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv \| tee /tmp/out.log`
			`$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv`
			`$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv`
			```

			`- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:`

			```
			`if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")`
			```

			`- For these rows I starred them and then blanked out the original field so DSpace would see it as a removal, and add the new column`
			- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
Add notes for 2021-10-09 2021-10-09 21:00:59 +02:00			`- I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:`

			```console
			`$ grep -c 'Removing duplicate value' /tmp/out.log`
			`7720`
			```

			`- I applied these to the CIAT community, so in total that's over 8,000 duplicate metadata values removed in a handful of fields...`

			`## 2021-10-09`

			`- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there`
			`- Also of note, there are some other fixes too, for example in IITA's community:`

			```console
			`$ grep -c -E '(Fixing\|Removing) (duplicate\|excessive\|invalid)' /tmp/out.log`
			`249`
			```

			`- I ran a full Discovery re-indexing on CGSpace`
			`- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:`

			```console
			`$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv`
			```

			`- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs`
Add notes for 2021-10-08 2021-10-08 16:15:17 +02:00
Add notes for 2021-10-10 2021-10-10 15:01:27 +02:00			`## 2021-10-10`

			- Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on `metadata-export` (DS-4211)
			`- First create a new PostgreSQL 13 container:`

			```console
			`$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine`
			`$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest`
			`$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7`
			`$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'`
			```

			- Then edit setting in `dspace/config/local.cfg` and build the backend server with Java 11:

			```console
			`$ mvn package`
			`$ cd dspace/target/dspace-installer`
			`$ ant fresh_install`
			`# fix database not being fully ready, causing Tomcat to fail to start the server application`
			`$ ~/dspace7/bin/dspace database migrate`
			```

			`- Copy Solr configs and start Solr:`

			```console
			`$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets`
			`$ ~/src/solr-8.8.2/bin/solr start`
			```

			`- Start my local Tomcat 9 instance:`

			```console
			`$ systemctl --user start tomcat9@dspace7`
			```

			`- This works, so now I will drop the default database and import a dump from CGSpace`

			```console
			`$ systemctl --user stop tomcat9@dspace7`
			`$ dropdb -h localhost -p 5433 -U postgres dspace7`
			`$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7`
			`$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'`
			`$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup`
			`$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'`
			```

			`- Delete Atmire migrations and some others that were "unresolved":`

			```console
			`$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"`
			`$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"`
			```

			`- Now DSpace 7 starts with my CGSpace data... nice`
Add notes for 2021-10-11 2021-10-11 19:06:42 +02:00			`- The Discovery indexing still takes seven hours... fuck`
Add notes for 2021-10-10 2021-10-10 15:01:27 +02:00			- I tested the `metadata-export` on DSpace 7.1-SNAPSHOT and it still has the duplicate items issue introduced by DS-4211
			`- I filed a GitHub issue and notified nwoodward: https://github.com/DSpace/DSpace/issues/7988`
			`- Start a full reindex on AReS`

Add notes for 2021-10-11 2021-10-11 19:06:42 +02:00			`## 2021-10-11`

			`- Start a full Discovery reindex on my local DSpace 6.3 instance:`

			```console
			`$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b`
			`Loading @mire database changes for module MQM`
			`Changes have been processed`
			`836140:6543.6`
			```

			`- So that's 1.8 hours versus 7 on DSpace 7, with the same database!`
			`- Several users wrote to me that CGSpace was slow recently`
			- Looking at the PostgreSQL database I see connections look normal, but locks for `dspaceWeb` are high:

			```console
			`$ psql -c 'SELECT * FROM pg_stat_activity' \| wc -l`
			`53`
			`$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" \| wc -l`
			`1697`
			`$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" \| wc -l`
			`1681`
			```

			`- Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:`

			`![PostgreSQL locks week](/cgspace-notes/2021/10/postgres_locks_ALL-week.png)`

			`- The only thing I did on 2021-10-07 was import a few thousand metadata corrections...`
			`- I restarted PostgreSQL (instead of restarting Tomcat), so let's see if that helps`
			`- I filed [a bug for the DSpace 6/7 duplicate values metadata import issue](https://github.com/DSpace/DSpace/issues/7989)`
			`- I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow`

Add notes for 2021-10 2021-10-04 18:40:13 +02:00			`<!-- vim: set sw=2 ts=2: -->`