---
title: "October, 2021"
date: 2021-10-01T11:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-10-01

- Export all affiliations on CGSpace and run them against the latest RoR data dump:

```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
```
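
The `ror-lookup.py` script itself isn't shown here, but the idea is simple enough to sketch in Python. This is a hypothetical simplification, assuming the RoR JSON dump is a list of records with `name`, `aliases`, and `acronyms` fields:

```python
# Hypothetical sketch of a RoR name lookup, not the actual ror-lookup.py
import csv
import json

with open("2021-09-23-ror-data.json") as f:
    ror_data = json.load(f)

# Build a set of known organization names, aliases, and acronyms (lowercased)
ror_names = set()
for org in ror_data:
    ror_names.add(org["name"].lower())
    ror_names.update(alias.lower() for alias in org.get("aliases", []))
    ror_names.update(acronym.lower() for acronym in org.get("acronyms", []))

with open("/tmp/2021-10-01-affiliations.txt") as f_in, open(
    "/tmp/2021-10-01-affiliations-matching.csv", "w"
) as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["affiliation", "matched"])
    for line in f_in:
        affiliation = line.strip()
        # Write "true"/"false" so csvgrep -c matched -m true works on the output
        writer.writerow([affiliation, str(affiliation.lower() in ror_names).lower()])
```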

- So we have 1879/7100 (26.46%) matching already

<!--more-->

## 2021-10-03

- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
- Start a fresh indexing on AReS
- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
  - He added licenses
  - I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
  - I cloned the column several times and extracted values based on their positions, for example (see the sketch after this list):
    - Volume: `value.partition(":")[0]`
    - Issue: `value.partition("(")[2].partition(")")[0]`
    - Page: `"p. " + value.replace(".", "")`
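
As a sketch of that extraction logic, the same expressions work in Python on hypothetical `dcterms.extent` values (each clone of the column assumed a particular format):

```python
# Python equivalents of the GREL expressions above, on made-up sample values
volume = "12:45-67.".partition(":")[0]                      # '12'
issue = "12(3):45-67.".partition("(")[2].partition(")")[0]  # '3'
pages = "p. " + "45-67.".replace(".", "")                   # 'p. 45-67'
```
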
## 2021-10-04

- Start looking at the last month of Solr statistics on CGSpace
  - I see a number of IPs with "normal" user agents who clearly behave like bots
    - 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
    - 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
    - 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
    - 3.225.28.105: 2,900 requests to the REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
    - 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
  - Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:

```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
```

- Then I resolved the IPs and extracted the ones belonging to Amazon:

```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
```
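
The resolution step is essentially an ASN lookup per IP; a minimal sketch, assuming a local GeoLite2 ASN database (the real script also queries the AbuseIPDB API with the key passed above):

```python
# Sketch of a per-IP ASN lookup with the geoip2 library (not the real script)
import geoip2.database
import geoip2.errors

with geoip2.database.Reader("GeoLite2-ASN.mmdb") as reader:
    with open("/tmp/mozilla-4.0-ips.txt") as f:
        for line in f:
            ip = line.strip()
            try:
                asn = reader.asn(ip)
                print(f"{ip},{asn.autonomous_system_number},{asn.autonomous_system_organization}")
            except geoip2.errors.AddressNotFoundError:
                print(f"{ip},,")
```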

- I am thinking I will purge them all, as I have several indicators that they are bots: a mysterious user agent and IPs owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:

```console
   1592 GET /handle/10947/2526
   1592 GET /handle/10947/2527
   1592 GET /handle/10947/34
   1593 GET /handle/10947/6
   1594 GET /handle/10947/1
   1598 GET /handle/10947/2515
   1598 GET /handle/10947/2516
   1599 GET /handle/10568/101335
   1599 GET /handle/10568/91688
   1599 GET /handle/10947/2517
   1599 GET /handle/10947/2518
   1599 GET /handle/10947/2519
   1599 GET /handle/10947/2708
   1599 GET /handle/10947/2871
   1600 GET /handle/10568/89342
   1600 GET /handle/10947/4467
   1607 GET /handle/10568/103816
 290382 GET /handle/10568/83389
```

- Before I purge all those I will ask Samuel Stacey from the System Office to hopefully get some insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
  - I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them

```csv
doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
```

- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:

```console
$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc-sorted.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
```
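
A term lookup like this can be done against the AGROVOC REST API; a minimal sketch, assuming the public SKOSMOS-style search endpoint and response shape (both are assumptions, and may differ from what agrovoc-lookup.py actually uses):

```python
# Hypothetical sketch of an AGROVOC term lookup via a SKOSMOS REST API
import requests

def agrovoc_matches(term: str, lang: str = "en") -> int:
    """Return the number of AGROVOC concepts matching a term."""
    response = requests.get(
        "https://agrovoc.fao.org/browse/rest/v1/search",  # assumed endpoint
        params={"query": term, "lang": lang},
    )
    response.raise_for_status()
    return len(response.json()["results"])

print(agrovoc_matches("water management"))  # 0 would mean the subject is invalid
```
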
## 2021-10-05

- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
- I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
- I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...

Total number of bot hits purged: 465119
```

## 2021-10-06

- Thinking about how we could check for duplicates before importing
  - I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):

```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
 metadata_value_id │                                         text_value                                          │           dspace_object_id
───────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
           3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines  │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
           3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines  │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
```
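
A query like this can be driven from a script; a minimal sketch with psycopg2, assuming local connection details and a hard-coded threshold:

```python
# Sketch: find fuzzy title matches using PostgreSQL's SIMILARITY() function
import psycopg2

conn = psycopg2.connect("dbname=dspace63 user=dspace host=localhost")

def find_similar_titles(title: str, threshold: float = 0.5):
    with conn.cursor() as cursor:
        cursor.execute(
            """SELECT text_value, dspace_object_id FROM metadatavalue
               WHERE dspace_object_id IN (SELECT uuid FROM item)
               AND metadata_field_id=64
               AND SIMILARITY(text_value, %s) > %s""",
            (title, threshold),
        )
        return cursor.fetchall()

title = "Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines"
for text_value, uuid in find_similar_titles(title):
    print(f"{uuid}: {text_value}")
```
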
- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
  - I think I will check for similar titles, and if I find them I will print out the handles for verification
  - I could also proceed to check other metadata like type, because those shouldn't vary too much
- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
  - Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
  - This is with the similarity threshold at 0.5. I wonder if tweaking it higher will make the script run faster and eliminate some false positives
  - I re-ran it with higher thresholds; this eliminated all false positives, but it still took 24 minutes to run for 292 items!
    - 0.6: `./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 24:40.42 total`
    - 0.7: `./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.12s user 0.03s system 0% cpu 24:29.15 total`
    - 0.8: `./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs  0.09s user 0.03s system 0% cpu 25:44.13 total`
- Some minor updates to csv-metadata-quality
  - Fix two issues with regular expressions in the duplicate items and experimental language checks
  - Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:

```console
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```

- I noticed each CSV only had 10 or 20 corrections; notably, none of the duplicate metadata values were removed in the CSVs...
  - I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
  - The duplicates are definitely removed from the CSVs, but DSpace doesn't detect them
  - I realized this is an issue I've had before, but forgot about it because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
- I found a comment on a thread on the dspace-tech mailing list from helix84 in 2015 (""No changes were detected" when importing metadata via XMLUI") where he says:

> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
>
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
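
That would explain it; a quick illustration of why a set-based comparison would miss removed duplicates (my own example, not DSpace code):

```python
# If DSpace compares multi-value fields as unordered sets, dropping a
# duplicate value produces an identical set, hence "no changes detected"
old_values = ["CGIAR", "CGIAR", "ILRI"]
new_values = ["CGIAR", "ILRI"]
print(set(old_values) == set(new_values))  # True
```
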
- Shit, so that's worth looking into...

## 2021-10-07

- I decided to upload the cleaned IWMI community by moving the cleaned metadata field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]` temporarily, uploading them, then moving them back, and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:

```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
```

- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...

## 2021-10-08

- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:

```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count
-----------+---------
 en_US     | 2603711
 en_Fu     |  115568
 en        |    8818
           |    5286
 fr        |       2
 vn        |       2
           |       0
(7 rows)

cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```

- So all this effort, just to remove ~400 duplicate metadata values in the IWMI community... hmmm:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```

- I tried to export ILRI's community, but ran into the export bug (DS-4211)
  - After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):

```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```

- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```

- I found a cool way to select only the items with corrections
  - First, extract a handful of fields from the CSV with csvcut
  - Second, clean the CSV with csv-metadata-quality
  - Third, rename the columns to something obvious in the cleaned CSV
  - Fourth, use csvjoin to merge the cleaned file with the original

```console
$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
```

- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:

```
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
```
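
The same comparison can be done outside OpenRefine too; a small sketch with pandas against the joined CSV from above:

```python
# Flag rows where the cleaned (en_Fu) column differs from the original (en_US)
import pandas as pd

# fillna("") so empty cells compare equal instead of NaN != NaN
df = pd.read_csv("/tmp/ilri-deduplicated-items-cleaned-joined.csv").fillna("")
changed = df[df["dcterms.subject[en_US]"] != df["dcterms.subject[en_Fu]"]]
print(f"{len(changed)} rows with corrections")
```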

- For these rows I starred them, blanked out the original field so DSpace would see it as a removal, and added the new column
- After these are uploaded I will normalize the `text_lang` fields in PostgreSQL again
- I did the same for CIAT, but there were over 7,000 duplicate metadata values! Hard to believe:

```console
$ grep -c 'Removing duplicate value' /tmp/out.log
7720
```

- I applied these to the CIAT community, so in total that's over 8,000 duplicate metadata values removed in a handful of fields...

## 2021-10-09

- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
- Of note, there are some other fixes too, for example in IITA's community:

```console
$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
249
```

- I ran a full Discovery re-indexing on CGSpace
- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:

```console
$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
```

- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs

<!-- vim: set sw=2 ts=2: -->