Add notes for 2020-07-20

2025-01-27 05:49:12 +01:00 · 2020-07-20 22:14:45 +03:00
parent 49d08e2db9
commit 501c282ecb
20 changed files with 93 additions and 25 deletions
--- a/content/posts/2020-07.md
+++ b/content/posts/2020-07.md
@@ -540,4 +540,34 @@ $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dsp
  - I said I would try to do a migration on DSpace Test with more of CGSpace's Solr data to approximate how much of our data would be affected
  - I also asked them about the Tomcat 8.5 issue with CUA as well as the CUA group name issue that I had asked originally in April

+## 2020-07-20
+
+- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
+
+```
+217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
+```
+
+- I still see 12,000 records in Solr from this user agent, though.
+  - I wonder why the DSpace bot list didn't catch those, because the user agent contains "bot", which should cause Solr to not log the hit
+- I purged ~30,000 hits from Solr statistics based on the IPs above, and also for some user agents like Drupal (which isn't in the list yet) and OgScrper (which has been in the list since 2020-03)
+- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
+  - I closed the [old pull request](https://github.com/atmire/COUNTER-Robots/pull/34) and created a [new one](https://github.com/atmire/COUNTER-Robots/pull/36)
+  - Then I updated the lists in the `5_x-prod` and 6.x branches
+- I re-ran the `check-spider-hits.sh` script with the new lists and purged around 14,000 more stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 in total
+- I looked at the [CLARISA](https://clarisa.cgiar.org/) institutions list again, since I hadn't looked at it in over six months:
+
+```
+$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
+```
+
+- The API still needs a key unless you query from the Swagger web interface
+  - They currently have 3,469 institutions...
+  - Also, they still combine multiple text names into one string along with acronyms and countries:
+    - Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
+    - Ministerio del Ambiente / Ministry of Environment (Peru)
+    - Carthage University / Université de Carthage
+    - Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)
+  - I think the ROR is much better in every possible way
+
 <!-- vim: set sw=2 ts=2: -->
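
The `jq | grep | awk | sed | csvcut` pipeline in the 2020-07-20 notes turns the CLARISA JSON response into a two-column CSV (`id`, `name`). As a rough alternative, here is a minimal Python sketch of the same conversion. It assumes the response is a JSON array of objects that each have a `name` key, which is inferred from the `jq` filter rather than from the API documentation; the function name `institutions_to_csv` is hypothetical. One possible advantage is that names containing a colon or a comma are kept intact, whereas `awk -F:` would cut a name at its first colon.

```python
import csv
import io
import json


def institutions_to_csv(json_text: str) -> str:
    """Convert a JSON array of institution objects to CSV text.

    Assumption: every object has a "name" key, as implied by the
    jq filter '.[] | {name: .name}' in the notes.
    """
    institutions = json.loads(json_text)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")

    # Same header the shell pipeline ends up with after renaming
    # csvcut's "line_number" column to "id".
    writer.writerow(["id", "name"])

    # csvcut -l numbers rows starting from 1, so do the same here.
    for row_id, institution in enumerate(institutions, start=1):
        writer.writerow([row_id, institution["name"]])

    return buffer.getvalue()


if __name__ == "__main__":
    # Two institution names taken from the notes, used as sample input.
    sample = json.dumps(
        [
            {"name": "Ministerio del Ambiente / Ministry of Environment (Peru)"},
            {"name": "Carthage University / Université de Carthage"},
        ]
    )
    print(institutions_to_csv(sample), end="")
```

To use it on a real download, the file contents could be read with `open(path, encoding="utf-8").read()` and the result written to something like `/tmp/clarisa-institutions.csv`.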