---
title: "October, 2021"
date: 2021-10-01T11:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-10-01

- Export all affiliations on CGSpace and run them against the latest RoR data dump:

```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
```

- So we have 1879/7100 (26.46%) matching already

<!--more-->

## 2021-10-03

- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
- Start a fresh indexing on AReS
- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
  - He added licenses
  - I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
  - I cloned the column several times and extracted values based on their positions, for example:
    - Volume: `value.partition(":")[0]`
    - Issue: `value.partition("(")[2].partition(")")[0]`
    - Page: `"p. " + value.replace(".", "")`
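
The partition calls above can be sketched in plain Python. This is a minimal sketch only: the notes do not show what the `dcterms.extent` values actually look like, so the `volume(issue):pages` layout, the sample value, and the `split_extent` helper are all assumptions for illustration.

```python
def split_extent(value: str) -> dict:
    """Split a hypothetical "volume(issue):pages" string into its parts."""
    # Everything before the first colon holds the volume and an optional issue
    before_colon, _, after_colon = value.partition(":")
    # The volume is whatever precedes an opening parenthesis, if there is one
    volume = before_colon.partition("(")[0].strip()
    # The issue is the text between the first pair of parentheses
    issue = value.partition("(")[2].partition(")")[0].strip()
    # Pages follow the colon; prefix them with "p. " as in the notes
    pages = "p. " + after_colon.strip() if after_colon.strip() else ""
    return {"volume": volume, "issue": issue, "pages": pages}


# Assumed sample value, not taken from Udana's file
print(split_extent("12(3):45-67"))
# {'volume': '12', 'issue': '3', 'pages': 'p. 45-67'}
```

`str.partition` always returns three parts, so a value without an issue simply yields an empty string instead of raising an error.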

## 2021-10-04

- Start looking at the last month of Solr statistics on CGSpace
  - I see a number of IPs with "normal" user agents who clearly behave like bots
    - 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
    - 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
    - 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
    - 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
    - 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
  - Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:

```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt 
543 /tmp/mozilla-4.0-ips.txt
```
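
The same grep, awk, sort, and uniq pipeline could be sketched in Python. This is a rough equivalent, assuming (as the `awk '{print $1}'` step implies) that the client IP is the first whitespace-separated field of each log line. The `unique_ips_for_agent` helper and the sample log lines, which use documentation-range IPs, are made up for illustration.

```python
def unique_ips_for_agent(lines, user_agent):
    """Return the sorted, de-duplicated client IPs of requests sent with user_agent."""
    ips = set()
    for line in lines:
        # Same idea as grep: a plain substring match on the whole line
        if user_agent in line:
            fields = line.split()
            if fields:
                # Same idea as awk '{print $1}': the first field is the client IP
                ips.add(fields[0])
    return sorted(ips)


AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

# Hypothetical log lines, not copied from the real nginx logs
sample = [
    f'192.0.2.10 - - [04/Oct/2021:10:00:00 +0200] "GET /handle/10947/1 HTTP/1.1" 200 512 "-" "{AGENT}"',
    f'192.0.2.10 - - [04/Oct/2021:10:00:01 +0200] "GET /handle/10947/6 HTTP/1.1" 200 512 "-" "{AGENT}"',
    f'198.51.100.7 - - [04/Oct/2021:10:00:02 +0200] "GET /discover HTTP/1.1" 200 512 "-" "{AGENT}"',
    '203.0.113.5 - - [04/Oct/2021:10:00:03 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

print(len(unique_ips_for_agent(sample, AGENT)))
```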

- Then I resolved the IPs and extracted the ones belonging to Amazon:

```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
```
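
For reference, the ASN filter could also be sketched with Python's `csv` module. The column names `ip` and `asn` come from the commands above. The rest of the CSV layout, the `ips_for_asn` helper, and the sample rows are assumptions. Note that this sketch compares the ASN for exact equality.

```python
import csv
import io


def ips_for_asn(csv_text, asn):
    """Return the values of the "ip" column for rows whose "asn" column equals asn."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["ip"] for row in reader if row["asn"] == str(asn)]


# Hypothetical rows; the real file most likely has more columns than these two
sample_csv = "ip,asn\n192.0.2.10,14618\n198.51.100.7,64500\n192.0.2.11,14618\n"

amazon_ips = ips_for_asn(sample_csv, 14618)
print(len(amazon_ips))
```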

- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:

```console
   1592 GET /handle/10947/2526
   1592 GET /handle/10947/2527
   1592 GET /handle/10947/34
   1593 GET /handle/10947/6
   1594 GET /handle/10947/1
   1598 GET /handle/10947/2515
   1598 GET /handle/10947/2516
   1599 GET /handle/10568/101335
   1599 GET /handle/10568/91688
   1599 GET /handle/10947/2517
   1599 GET /handle/10947/2518
   1599 GET /handle/10947/2519
   1599 GET /handle/10947/2708
   1599 GET /handle/10947/2871
   1600 GET /handle/10568/89342
   1600 GET /handle/10947/4467
   1607 GET /handle/10568/103816
 290382 GET /handle/10568/83389
```
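
The command that produced this tally is not recorded above. One way to get a similar per-request count is sketched below, assuming the request line appears in the logs as a quoted `"METHOD /path HTTP/x"` string. The `count_requests` helper and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a quoted request line such as "GET /handle/10947/1 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)')


def count_requests(lines):
    """Count occurrences of each "METHOD path" pair in the given log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[f"{match['method']} {match['path']}"] += 1
    return counts


# Made-up log lines for illustration
sample = [
    '192.0.2.10 - - [04/Oct/2021:10:00:00 +0200] "GET /handle/10568/83389 HTTP/1.1" 200 512',
    '192.0.2.11 - - [04/Oct/2021:10:00:01 +0200] "GET /handle/10568/83389 HTTP/1.1" 200 512',
    '192.0.2.12 - - [04/Oct/2021:10:00:02 +0200] "GET /handle/10947/1 HTTP/1.1" 200 512',
]

# Print in ascending order of hits, like the listing above
for request, hits in sorted(count_requests(sample).items(), key=lambda kv: kv[1]):
    print(f"{hits:7d} {request}")
```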

- Before I purge all of those I will ask Samuel Stacey from the System Office, to hopefully get some insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
  - I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them

```csv
doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
```

- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:

```console
$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
```
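
The first step, splitting the `||`-separated subjects into a unique sorted list, could be sketched in Python as follows. The column name and separator come from the command above. The `unique_subjects` helper and the sample rows are assumptions, and Python's default sort order may differ from the locale-aware order of `sort -u`.

```python
import csv
import io


def unique_subjects(csv_text, column="dcterms.subject[en_US]", separator="||"):
    """Return the sorted unique terms found in a multi-value metadata column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    terms = set()
    for row in reader:
        # A missing or empty cell is treated as having no terms
        for term in (row.get(column) or "").split(separator):
            term = term.strip()
            if term:
                terms.add(term)
    return sorted(terms)


# Hypothetical rows, not taken from the real export
sample_csv = (
    "dc.title[en_US],dcterms.subject[en_US]\n"
    "First title,WATER MANAGEMENT||IRRIGATION\n"
    "Second title,IRRIGATION||GROUNDWATER\n"
    "Third title,\n"
)

print("\n".join(unique_subjects(sample_csv)))
```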

## 2021-10-05

- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
  - I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
  - I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...

Total number of bot hits purged: 465119
```

## 2021-10-06

- Thinking about how we could check for duplicates before importing
  - I found out that [PostgreSQL has a similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/), provided by the `pg_trgm` extension:

```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
 metadata_value_id │                                         text_value                                         │           dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
```
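
To make the threshold more concrete, here is a rough Python approximation of how trigram similarity is computed: lowercase the text, split it into words, pad each word with two leading spaces and one trailing space, and divide the number of shared trigrams by the number of distinct trigrams overall. This is a sketch for ASCII text only and not the extension's actual implementation. The helper names and the second sample title are made up.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, padded roughly the way pg_trgm pads words."""
    result = set()
    # Lowercase, then treat every run of non-alphanumeric characters as a word break
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, a value between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def potential_duplicates(title, existing_titles, threshold=0.5):
    """Return the existing titles whose similarity to title exceeds threshold."""
    return [t for t in existing_titles if trigram_similarity(title, t) > threshold]


title = "Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines"
existing = [title, "Annual report 2020"]  # the second title is hypothetical
print(potential_duplicates(title, existing))
```

Comparing every new title against every existing title this way would be slow in Python, so doing the comparison inside the database as in the query above is the more sensible place for it.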

- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
  - I think I will check for similar titles, and if I find them I will print out the handles for verification
  - I could also proceed to check other metadata like type because those shouldn't vary too much
- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
  - Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
  - This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives
  - I re-ran it with higher thresholds, which eliminated all the false positives, but each run still took around 25 minutes for 292 items!
    - 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 24:40.42 total
    - 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.12s user 0.03s system 0% cpu 24:29.15 total
    - 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 25:44.13 total
- Some minor updates to csv-metadata-quality
  - Fix two issues with regular expressions in the duplicate items and experimental language checks
  - Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:

```console
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```

- I noticed each CSV only had 10 or 20 corrections, and in particular none of the duplicate metadata values seemed to have been removed...
  - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
  - The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
  - I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
  - I found a comment from helix84 in a 2015 thread on the dspace-tech mailing list (titled _"No changes were detected" when importing metadata via XMLUI_) where he says:

> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.

- Shit, so that's worth looking into...
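
The hypothesis in that quote can be illustrated with a few lines of Python. This is only an illustration of comparing as a set versus as a list, not DSpace's actual comparison code, and the sample values are made up.

```python
def has_changes_as_set(old_values, new_values):
    """Compare as unordered sets: duplicates and ordering are invisible."""
    return set(old_values) != set(new_values)


def has_changes_as_list(old_values, new_values):
    """Compare as ordered lists: duplicates and ordering both count."""
    return list(old_values) != list(new_values)


# Hypothetical multi-value field before and after removing a duplicate value
old = "IRRIGATION||WATER MANAGEMENT||IRRIGATION".split("||")
new = "IRRIGATION||WATER MANAGEMENT".split("||")

print(has_changes_as_set(old, new))   # False
print(has_changes_as_list(old, new))  # True
```

If the comparison really does work like the set version, removing a duplicate value looks like no change at all, which would match the "no changes detected" message.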

<!-- vim: set sw=2 ts=2: -->