---
title: "October, 2021"
date: 2021-10-01T11:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-10-01

- Export all affiliations on CGSpace and run them against the latest RoR data dump:

```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
```

- So we have 1879/7100 (26.46%) matching already

## 2021-10-03

- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
- Start a fresh indexing on AReS
- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
  - He added licenses
  - I want to clean up the `dcterms.extent` field though because it has volume, issue, and pages there
  - I cloned the column several times and extracted values based on their positions, for example:
    - Volume: `value.partition(":")[0]`
    - Issue: `value.partition("(")[2].partition(")")[0]`
    - Page: `"p. " + value.replace(".", "")`

## 2021-10-04

- Start looking at the last month of Solr statistics on CGSpace
  - I see a number of IPs with "normal" user agents who clearly behave like bots
    - 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
    - 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
    - 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
    - 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
    - 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
    - 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:

```console
# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
```

- Then I resolved the IPs and extracted the ones belonging to Amazon:

```console
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
```

- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:

```console
   1592 GET /handle/10947/2526
   1592 GET /handle/10947/2527
   1592 GET /handle/10947/34
   1593 GET /handle/10947/6
   1594 GET /handle/10947/1
   1598 GET /handle/10947/2515
   1598 GET /handle/10947/2516
   1599 GET /handle/10568/101335
   1599 GET /handle/10568/91688
   1599 GET /handle/10947/2517
   1599 GET /handle/10947/2518
   1599 GET /handle/10947/2519
   1599 GET /handle/10947/2708
   1599 GET /handle/10947/2871
   1600 GET /handle/10568/89342
   1600 GET /handle/10947/4467
   1607 GET /handle/10568/103816
 290382 GET /handle/10568/83389
```

- Before I purge all those I will ask Samuel Stacey from the System Office, hopefully to get some insight...
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
  - I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them

```csv
doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
```

- Extract the AGROVOC subjects from IWMI's 292 publications to validate them against AGROVOC:

```console
$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc-sorted.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
```

## 2021-10-05

- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
- I added `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)` to the list of bad bots in nginx
- I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...

Total number of bot hits purged: 465119
```
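
- For reference, the user-agent triage from 2021-10-04 could be sketched in Python roughly like this, assuming nginx's default `combined` log format, where the client IP is the first field and the user agent is the last quoted field (the `hits_by_ip` helper and the sample log lines, which use documentation-range IPs, are made up for illustration and are not taken from the real logs):

```python
import re
from collections import Counter

# Assumption: nginx "combined" log format. The client IP is the first
# whitespace-separated token and the user agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"$')

BAD_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def hits_by_ip(lines, agent=BAD_AGENT):
    """Count requests per client IP for lines whose user agent equals `agent`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("agent") == agent:
            hits[match.group("ip")] += 1
    return hits


# Hypothetical sample lines (documentation-range IPs, not real traffic)
sample_lines = [
    '192.0.2.10 - - [04/Oct/2021:10:00:00 +0000] "GET /handle/10947/1 HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.10 - - [04/Oct/2021:10:00:01 +0000] "GET /handle/10947/6 HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '203.0.113.5 - - [04/Oct/2021:10:00:02 +0000] "GET /discover HTTP/1.1" 200 1024 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '198.51.100.7 - - [04/Oct/2021:10:00:03 +0000] "GET /handle/10568/1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"',
]

# Print the IPs using the suspicious agent, busiest first
for ip, count in hits_by_ip(sample_lines).most_common():
    print(count, ip)
```

- Unlike the `grep` pipeline, this compares the whole user agent field, so a longer agent string that merely contains the MSIE 6.0 text would not be counted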