diff --git a/content/posts/2021-06.md b/content/posts/2021-06.md index e58adeaf3..c5394dcee 100644 --- a/content/posts/2021-06.md +++ b/content/posts/2021-06.md @@ -68,4 +68,21 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos - Dump and re-create indexes on AReS (as above) so I can do a harvest +## 2021-06-16 + +- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS + - For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.` + - I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706 + - I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075) + - Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic + - I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time) +- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV + - It will hopefully also fix the duplicate and missing items issues + - I had a Skype with him to discuss + - I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run: + +```console +$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data +``` + diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html index ab249f437..ca6d5e00d 100644 --- a/docs/2021-06/index.html +++ b/docs/2021-06/index.html @@ -20,7 +20,7 @@ I simply started it and AReS was running again: - + @@ -46,9 +46,9 @@ I simply started it and AReS was running again: "@type": "BlogPosting", "headline": "June, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-06/", - "wordCount": "415", + "wordCount": "627", "datePublished": "2021-06-01T10:51:07+03:00", - "dateModified": "2021-06-10T21:41:44+03:00", + "dateModified": "2021-06-14T15:09:07+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -189,7 +189,27 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
with DNS msnbot-131-253-25-91.search.msn.com.
dns:*msnbot* AND dns:*.msn.com.
) and found 457,706resolve-addresses.py
script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)ilri/check-spider-ip-hits.sh
(though I had to do it in 3 batches because I forgot to increase the facet.limit
so I was only getting them 100 at a time)$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 78ebc57d1..d4560cb1b 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 092f4d8df..ab6f3e8d5 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 5402488a8..b6788b861 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index bcdb8899a..b52bb83cc 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 188f306d7..ed239c8dd 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index deba2eabb..4bf88c868 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index b899868f9..f5b38e687 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 97c46af0c..bab75bbba 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index f14f5ca21..4474fa79e 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 992440147..a6feb840e 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index ed6d468b7..71f167c96 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 412ad3f61..b7df5cd3c 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 5d104035c..52333d1a1 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index b156059db..fc09cb62e 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 962ac5ea5..a7c3a35ea 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 6bc425f54..abc00dc6b 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index ec5ea6555..ab332783d 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 6c0b36a53..de276e530 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 1234ba3a9..0ea551a45 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 45727658b..62093e967 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 34d64ad17..2fe1d161f 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 48c684d90..7d8824a5c 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d9f7fcbde..7609d5e8c 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">