diff --git a/content/posts/2020-07.md b/content/posts/2020-07.md
index 04a18c2b2..701933c38 100644
--- a/content/posts/2020-07.md
+++ b/content/posts/2020-07.md
@@ -540,4 +540,34 @@ $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dsp
- I said I would try to do a migration on DSpace Test with more of CGSpace's Solr data to try and approximate how much of our data would be affected
- I also asked them about the Tomcat 8.5 issue with CUA as well as the CUA group name issue that I had asked originally in April
+## 2020-07-20
+
+- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
+
+```
+217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
+```
+
+- I still see 12,000 records in Solr from this user agent, though
+  - I wonder why the DSpace bot list didn't catch those, because the user agent contains "bot", which should cause Solr to not log the hit
+- I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn't in the list yet) and OgScrper (which is in the list as of 2020-03)
+- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
+  - I closed the [old pull request](https://github.com/atmire/COUNTER-Robots/pull/34) and created a [new one](https://github.com/atmire/COUNTER-Robots/pull/36)
+  - Then I updated the lists in the `5_x-prod` and 6.x branches
+- I re-ran the `check-spider-hits.sh` script with the new lists and purged another 14,000 stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 in total
+- I looked at the [CLARISA](https://clarisa.cgiar.org/) institutions list again, since I hadn't looked at it in over six months:
+
+```
+$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
+```
+
+- The API still needs a key unless you query from the Swagger web interface
+  - They currently have 3,469 institutions...
+  - Also, they still combine multiple text names into one string along with acronyms and countries:
+    - Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
+    - Ministerio del Ambiente / Ministry of Environment (Peru)
+    - Carthage University / Université de Carthage
+    - Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)
+  - I think the ROR is much better in every possible way
+
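
The CLARISA extraction in the hunk above can also be sketched in Python. This is a minimal sketch, not the command that was actually run: it assumes the saved response is a JSON array of objects that each carry a `name` key (which is what the `jq` filter implies), and the function name `institutions_to_csv` is hypothetical. Unlike the `awk -F:` step, it does not cut off names that contain a colon, and the `csv` module quotes names that contain commas.

```python
import csv
import io
import json


def institutions_to_csv(raw_json: str) -> str:
    """Return CSV text with a 1-based id column and a name column.

    Assumes raw_json is a JSON array of objects with a "name" key.
    """
    institutions = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "name"])
    # Number rows from 1, like the line_number column that csvcut -l adds
    for line_number, institution in enumerate(institutions, start=1):
        writer.writerow([line_number, institution["name"]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Two names taken from the notes above, used here as sample input
    sample = json.dumps([
        {"name": "Carthage University / Université de Carthage"},
        {"name": "Ministerio del Ambiente / Ministry of Environment (Peru)"},
    ])
    print(institutions_to_csv(sample), end="")
```

To use it on the real file, read `~/Downloads/response_1595270924560.json` as UTF-8 text and write the returned string to `/tmp/clarisa-institutions.csv`.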
diff --git a/docs/2020-07/index.html b/docs/2020-07/index.html
index ed84a5696..15554426d 100644
--- a/docs/2020-07/index.html
+++ b/docs/2020-07/index.html
@@ -20,7 +20,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
-
+
@@ -45,9 +45,9 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
"@type": "BlogPosting",
"headline": "July, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-07/",
- "wordCount": "3347",
+ "wordCount": "3664",
"datePublished": "2020-07-01T10:53:54+03:00",
- "dateModified": "2020-07-14T10:57:49+03:00",
+ "dateModified": "2020-07-15T15:42:23+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -659,6 +659,44 @@ COPY 186
+
2020-07-20
+
+- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
+
+217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
+
+- I still see 12,000 records in Solr from this user agent, though
+
+- I wonder why the DSpace bot list didn’t catch those, because the user agent contains “bot”, which should cause Solr to not log the hit
+
+
+- I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn’t in the list yet) and OgScrper (which is in the list as of 2020-03)
+- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
+
+
+- I re-ran the check-spider-hits.sh script with the new lists and purged another 14,000 stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 in total
+- I looked at the CLARISA institutions list again, since I hadn’t looked at it in over six months:
+
+$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
+
+- The API still needs a key unless you query from the Swagger web interface
+
+- They currently have 3,469 institutions…
+- Also, they still combine multiple text names into one string along with acronyms and countries:
+
+- Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
+- Ministerio del Ambiente / Ministry of Environment (Peru)
+- Carthage University / Université de Carthage
+- Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)
+
+
+- I think the ROR is much better in every possible way
+
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 50a0a6a59..c18ab7b84 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index f3c14617b..93ac6b827 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index ae931dd8e..0d898cfbb 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 8ae151cdc..71f3f2b8a 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index b510b6aff..1b9dc9088 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index f63c5e5db..da15368a0 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index d049dc5c7..919fff200 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 8539c6d5a..ef1c6c249 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index ca9de4b14..db49ec40d 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 988e6be24..400d2f916 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index a012b98ee..f06cb4d5b 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 3bbb9d666..2d1cd53c0 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 97da8ffe3..b61016e1c 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 38eae8e17..8f426925c 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 5bb5106e1..221b3cc02 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index aa3cb73ba..277677a6d 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 53867d52a..9ccc58089 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d0bf6b929..83291c2b1 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
https://alanorth.github.io/cgspace-notes/categories/
- 2020-07-14T10:57:49+03:00
+ 2020-07-15T15:42:23+03:00
https://alanorth.github.io/cgspace-notes/
- 2020-07-14T10:57:49+03:00
+ 2020-07-15T15:42:23+03:00
https://alanorth.github.io/cgspace-notes/2020-07/
- 2020-07-14T10:57:49+03:00
+ 2020-07-15T15:42:23+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-07-14T10:57:49+03:00
+ 2020-07-15T15:42:23+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2020-07-14T10:57:49+03:00
+ 2020-07-15T15:42:23+03:00