+
+ 2021-10-01
+
+- Export all affiliations on CGSpace and run them against the latest RoR data dump:
+
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
+$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
+1879
+$ wc -l /tmp/2021-10-01-affiliations.txt
+7100 /tmp/2021-10-01-affiliations.txt
+
+- So we have 1879/7100 (26.46%) matching already
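The same ratio can be computed directly from the lookup output. This is a minimal sketch, assuming the CSV has a `matched` column holding `true`/`false` (as the `csvgrep` call above suggests); the `organization` column name and the sample rows are invented for illustration.

```python
import csv
import io


def match_rate(csv_text):
    """Return (matched, total, percent) for a CSV with a 'matched' column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    matched = sum(1 for row in rows if row["matched"].strip().lower() == "true")
    total = len(rows)
    percent = 100 * matched / total if total else 0.0
    return matched, total, percent


# Invented sample rows; only the "matched" column name comes from the notes.
sample = """organization,matched
Example University,true
Example Institute,false
Another Example Centre,false
Example Research Council,true
"""

matched, total, percent = match_rate(sample)
print(f"{matched}/{total} ({percent:.2f}%)")
```

The notes divide by the line count of the input text file rather than the CSV row count; the two agree as long as the lookup script writes one row per input line.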
+
+2021-10-03
+
+- Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites
+- Start a fresh indexing on AReS
+- Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
+
+- He added licenses
+- I want to clean up the `dcterms.extent` field though, because it has volume, issue, and pages there
+- I cloned the column several times and extracted values based on their positions, for example:
+
+- Volume: `value.partition(":")[0]`
+- Issue: `value.partition("(")[2].partition(")")[0]`
+- Page: `"p. " + value.replace(".", "")`
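The same position-based extraction can be sketched outside OpenRefine. Python's `str.partition` behaves like the GREL function used above. The `volume(issue):pages` layout and the `split_extent` helper are assumptions for illustration; the actual values in Udana's file may be laid out differently.

```python
def split_extent(value):
    """Split an extent such as '10(2):123-145' into volume, issue, and pages.

    Assumes a 'volume(issue):pages' layout, which is hypothetical.
    """
    before, _, pages = value.partition(":")   # text before and after the colon
    volume, _, rest = before.partition("(")   # volume precedes the parenthesis
    issue = rest.partition(")")[0]            # issue sits inside the parentheses
    pages = pages.strip()
    return {
        "volume": volume.strip(),
        "issue": issue.strip(),
        "pages": "p. " + pages if pages else "",
    }


print(split_extent("10(2):123-145"))
```

Values that lack an issue or page range simply yield empty strings, which makes them easy to spot and fix by hand.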
+
+
+
+
+
+2021-10-04
+
+- Start looking at the last month of Solr statistics on CGSpace
+
+- I see a number of IPs with “normal” user agents who clearly behave like bots
+
+- 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
+- 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
+- 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
+- 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
+- 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
+- 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
+- 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
+- 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
+
+
+- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:
+
+
+
+# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
+# wc -l /tmp/mozilla-4.0-ips.txt
+543 /tmp/mozilla-4.0-ips.txt
+
+- Then I resolved the IPs and extracted the ones belonging to Amazon:
+
+$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
+$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
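For reference, the ASN filter can also be sketched in Python. The `ip` and `asn` column names come from the `csvgrep`/`csvcut` calls above; the `ips_for_asn` helper is hypothetical, and the sample rows reuse IP and ASN pairs already listed in these notes.

```python
import csv
import io


def ips_for_asn(csv_text, asn):
    """Return the 'ip' values of rows whose 'asn' column equals asn exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["ip"] for row in reader if row["asn"].strip() == asn]


# Sample rows built from IP/ASN pairs mentioned earlier in the notes.
sample = """ip,asn
198.15.130.18,11282
3.225.28.105,14618
34.228.236.6,14618
"""

print(ips_for_asn(sample, "14618"))
```

Unlike `csvgrep -m`, which matches substrings, this compares the whole ASN, so a longer ASN that happens to contain `14618` would not be picked up by accident.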
+
+- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
+- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:
+
+ 1592 GET /handle/10947/2526
+ 1592 GET /handle/10947/2527
+ 1592 GET /handle/10947/34
+ 1593 GET /handle/10947/6
+ 1594 GET /handle/10947/1
+ 1598 GET /handle/10947/2515
+ 1598 GET /handle/10947/2516
+ 1599 GET /handle/10568/101335
+ 1599 GET /handle/10568/91688
+ 1599 GET /handle/10947/2517
+ 1599 GET /handle/10947/2518
+ 1599 GET /handle/10947/2519
+ 1599 GET /handle/10947/2708
+ 1599 GET /handle/10947/2871
+ 1600 GET /handle/10568/89342
+ 1600 GET /handle/10947/4467
+ 1607 GET /handle/10568/103816
+ 290382 GET /handle/10568/83389
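A tally like the one above can be reproduced with a short Python sketch. It assumes nginx's combined log format, where the request line appears in double quotes. The `count_requests` helper and the sample log lines are invented for illustration and are not the command that was actually run.

```python
import re
from collections import Counter

# Matches the quoted request line, e.g. "GET /handle/10947/1 HTTP/1.1"
REQUEST_RE = re.compile(r'"([A-Z]+) (\S+) HTTP/[\d.]+"')


def count_requests(lines):
    """Count 'METHOD path' pairs and return them sorted by ascending count."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Invented log lines in combined format.
UA = '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
sample = [
    f'3.81.123.72 - - [04/Oct/2021:10:00:00 +0200] "GET /handle/10947/1 HTTP/1.1" 200 512 "-" {UA}',
    f'3.81.123.72 - - [04/Oct/2021:10:00:01 +0200] "GET /handle/10568/83389 HTTP/1.1" 200 512 "-" {UA}',
    f'3.227.16.188 - - [04/Oct/2021:10:00:02 +0200] "GET /handle/10568/83389 HTTP/1.1" 200 512 "-" {UA}',
]

for request, count in count_requests(sample):
    print(f"{count:>7} {request}")
```

Sorting by ascending count mirrors the listing above, so the most requested handle ends up on the last line.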
+
+- Before I purge all of those I will ask Samuel Stacey from the System Office, hopefully to get some insight…
+- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
+- Meeting with Michelle from Altmetric about their new CSV upload system
+
+- I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them
+
+
+
+doi,handle
+10.1016/j.agsy.2021.103263,10568/115288
+10.3389/fgene.2021.723360,10568/115287
+10.3389/fpls.2021.720670,10568/115285
+
+- Extract the AGROVOC subjects from IWMI’s 292 publications to validate them against AGROVOC:
+
+$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
+$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc.txt -o /tmp/agrovoc-matches.csv
+$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
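The first step of that pipeline, splitting multi-value cells on `||` and de-duplicating, can be sketched in Python as follows. The `unique_subjects` helper and the sample terms are illustrative only, not values from the actual file.

```python
def unique_subjects(cells):
    """Split '||'-separated cells into a sorted list of unique terms."""
    subjects = set()
    for cell in cells:
        for term in cell.split("||"):
            term = term.strip().strip('"')  # drop whitespace and stray quotes
            if term:
                subjects.add(term)
    return sorted(subjects)


# Invented cell values for illustration.
sample = ["WATER MANAGEMENT||IRRIGATION", '"IRRIGATION||GROUNDWATER"', ""]
print(unique_subjects(sample))
```

Python sorts by code point, whereas `sort -u` follows the shell locale, so the order may differ for terms with non-ASCII characters. The set of terms is the same either way.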
+
+
+
+
+
+
+