diff --git a/content/posts/2022-11.md b/content/posts/2022-11.md
index 90964e961..638180c16 100644
--- a/content/posts/2022-11.md
+++ b/content/posts/2022-11.md
@@ -117,5 +117,83 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagic
```
- But looking at the items it processed, I'm not sure it's working as expected
+ - I looked at a few dozen
+- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
+
+```console
+$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
+GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: www.bioversityinternational.org
+User-Agent: HTTPie/3.2.1
+
+HTTP/1.1 302 Found
+Connection: Keep-Alive
+Content-Length: 275
+Content-Type: text/html; charset=iso-8859-1
+Date: Mon, 07 Nov 2022 16:35:21 GMT
+Keep-Alive: timeout=15, max=100
+Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
+Server: Apache
+```
+
+- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500
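
The hostname in the `Location` header above has lost the slash that should separate it from the path. A minimal Python sketch of one way to spot this kind of broken redirect, using only the URLs shown in the output above (the helper name `redirect_host_matches` is hypothetical, not an existing tool):

```python
from urllib.parse import urlsplit


def redirect_host_matches(request_url: str, location: str) -> bool:
    """Return True if the redirect target keeps the hostname of the request."""
    return urlsplit(request_url).hostname == urlsplit(location).hostname


# Values copied from the HTTPie output above
request_url = "http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html"
location = "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html"

print(urlsplit(location).hostname)  # www.bioversityinternational.orgnc
print(redirect_host_matches(request_url, location))  # False
```

A legitimate redirect to a different host would also return `False`, so this is only a first-pass filter for a list of links that are expected to stay on the same site.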
+
+## 2022-11-08
+
+- Looking at the Solr statistics hits on CGSpace for 2022-11
+ - I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
+ - I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
+ - I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
+ - I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
+ - I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
+ - I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
+ - I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
+ - I will purge all these hits and probably add China Unicom's subnet to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there
+
+```console
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 8975 hits from 221.219.100.42 in statistics
+Purging 7577 hits from 122.10.101.60 in statistics
+Purging 6536 hits from 135.125.21.38 in statistics
+Purging 23950 hits from 163.237.216.11 in statistics
+Purging 4093 hits from 51.254.154.148 in statistics
+Purging 2797 hits from 221.219.103.211 in statistics
+Purging 2618 hits from 216.218.223.53 in statistics
+
+Total number of bot hits purged: 56546
+```
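
Before tagging a whole network as bots it helps to know which range the offending addresses share. A rough Python sketch, assuming only the two China Unicom addresses listed above: it computes the smallest network covering both, which is not necessarily the provider's real allocation (that would need a WHOIS lookup). The function name is hypothetical.

```python
import ipaddress


def smallest_common_network(addresses):
    """Return the smallest IPv4 network that contains every given address."""
    ips = [ipaddress.ip_address(a) for a in addresses]
    # Start from a /32 around the first address and widen one bit at a time
    network = ipaddress.ip_network(f"{ips[0]}/32")
    while not all(ip in network for ip in ips):
        network = network.supernet()
    return network


# The two China Unicom addresses from the list above
china_unicom_ips = ["221.219.100.42", "221.219.103.211"]
print(smallest_common_network(china_unicom_ips))  # 221.219.100.0/22
```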
+
+- Also interesting to see a few new user agents:
+ - `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
+ - `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
+ - `MEL`
+ - `Gov employment data scraper ([[your email]])`
+ - `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
+- I will purge all these:
+
+```console
+$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 6155 hits from RStudio in statistics
+Purging 1929 hits from rstudio in statistics
+Purging 1454 hits from MEL in statistics
+Purging 1094 hits from Gov employment data scraper in statistics
+
+Total number of bot hits purged: 10632
+```
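
The reported total can be cross-checked against the per-agent lines. A small Python sketch, assuming the output format is exactly what is shown above (the regular expressions and the function name are assumptions based on that sample, not part of the script itself):

```python
import re

PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")
TOTAL_LINE = re.compile(r"^Total number of bot hits purged: (\d+)$")


def check_purge_total(output: str) -> bool:
    """Return True if the per-source counts add up to the reported total."""
    per_source = 0
    reported = None
    for line in output.splitlines():
        line = line.strip()
        if match := PURGE_LINE.match(line):
            per_source += int(match.group(1))
        elif match := TOTAL_LINE.match(line):
            reported = int(match.group(1))
    return reported is not None and per_source == reported


# Output copied from the user agent purge above
sample = """\
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics

Total number of bot hits purged: 10632
"""
print(check_purge_total(sample))  # True
```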
+
+- Work on the CIAT Library items a bit again in OpenRefine
+ - I flagged items with:
+ - URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
+ - Same URL used by more than one item ("Duplicates" facet in OpenRefine; this is a corner case I don't want to handle right now)
+ - URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
+ - I want to try to handle the simple cases that should cover most of the items first
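
The three flagging rules above could also be expressed outside OpenRefine. A minimal Python sketch under stated assumptions: the `example.org` URLs are made up for illustration, the flag names are hypothetical, and a "#page" link is taken to mean a URL fragment starting with `page`.

```python
from collections import Counter
from urllib.parse import urlsplit


def flag_urls(urls):
    """Map each URL to the list of reasons it should be set aside for now."""
    counts = Counter(urls)
    flags = {}
    for url in urls:
        parts = urlsplit(url)
        reasons = []
        if parts.fragment.startswith("page"):
            reasons.append("page-fragment")  # link to a chapter inside a PDF
        if counts[url] > 1:
            reasons.append("duplicate-url")  # same URL used by several items
        if parts.port == 8080:
            reasons.append("old-dspace-port")  # points at the retired server
        flags[url] = reasons
    return flags


# Hypothetical URLs for illustration only
urls = [
    "http://example.org/books/manual.pdf#page=12",
    "http://example.org:8080/bitstream/123/report.pdf",
    "http://example.org/articles/a.pdf",
    "http://example.org/articles/shared.pdf",
    "http://example.org/articles/shared.pdf",
]
for url, reasons in flag_urls(urls).items():
    print(url, reasons)
```

URLs with an empty reason list are the simple cases that can be handled first.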
+
+## 2022-11-09
+
+- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
+ - I got the basic functionality working
diff --git a/docs/2022-11/index.html b/docs/2022-11/index.html
index f664e17a1..a4e1d650f 100644
--- a/docs/2022-11/index.html
+++ b/docs/2022-11/index.html
@@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
-
+
@@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
"@type": "BlogPosting",
"headline": "November, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
- "wordCount": "863",
+ "wordCount": "1392",
"datePublished": "2022-11-01T09:11:36+03:00",
- "dateModified": "2022-11-01T22:12:24+03:00",
+ "dateModified": "2022-11-07T17:18:14+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -263,7 +263,97 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
-- But looking at the items it processed, I’m not sure it’s working as expected
+- But looking at the items it processed, I’m not sure it’s working as expected
+
+- I looked at a few dozen
+
+
+- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
+
+$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
+GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: www.bioversityinternational.org
+User-Agent: HTTPie/3.2.1
+
+HTTP/1.1 302 Found
+Connection: Keep-Alive
+Content-Length: 275
+Content-Type: text/html; charset=iso-8859-1
+Date: Mon, 07 Nov 2022 16:35:21 GMT
+Keep-Alive: timeout=15, max=100
+Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
+Server: Apache
+
+- The Location header is clearly wrong, and if I try https directly I get an HTTP 500
+
+2022-11-08
+
+- Looking at the Solr statistics hits on CGSpace for 2022-11
+
+- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
+- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
+- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
+- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
+- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
+- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
+- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
+- I will purge all these hits and probably add China Unicom’s subnet to my nginx bot-network.conf file to tag them as bots since there are SO many bad and malicious requests coming from there
+
+
+
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 8975 hits from 221.219.100.42 in statistics
+Purging 7577 hits from 122.10.101.60 in statistics
+Purging 6536 hits from 135.125.21.38 in statistics
+Purging 23950 hits from 163.237.216.11 in statistics
+Purging 4093 hits from 51.254.154.148 in statistics
+Purging 2797 hits from 221.219.103.211 in statistics
+Purging 2618 hits from 216.218.223.53 in statistics
+
+Total number of bot hits purged: 56546
+
+- Also interesting to see a few new user agents:
+
+RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)
+rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)
+MEL
+Gov employment data scraper ([[your email]])
+RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)
+
+
+- I will purge all these:
+
+$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 6155 hits from RStudio in statistics
+Purging 1929 hits from rstudio in statistics
+Purging 1454 hits from MEL in statistics
+Purging 1094 hits from Gov employment data scraper in statistics
+
+Total number of bot hits purged: 10632
+
+- Work on the CIAT Library items a bit again in OpenRefine
+
+- I flagged items with:
+
+- URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)
+- Same URL used by more than one item (“Duplicates” facet in OpenRefine; this is a corner case I don’t want to handle right now)
+- URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)
+
+
+- I want to try to handle the simple cases that should cover most of the items first
+
+
+
+2022-11-09
+
+- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
+
+- I got the basic functionality working
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 5f8fb221d..677e740c0 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 7c02837a3..f530544e2 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index ba7d503f8..2bb04b631 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index c4a8cd60d..44b2ccdff 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 29b3ce18e..e49a9332d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 15274e250..471ab53b3 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index cf91f1183..ca9493742 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index 77bed3fd0..0fccfa3e1 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index fa8037be2..1c6ce4b28 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 4f8f56616..35855081d 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 7354d0c90..f29fee228 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index a76c90bdc..83812dfec 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 41e866f99..2ae8eed83 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 8daa48921..6978e88ab 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 35ebcdb57..bf662829a 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index b96b14d48..e99a3b937 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index c2028ee32..f6d98a1bc 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 3e6815854..8a3765b99 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index fd020fe03..2f09da92d 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 4f19318a6..f06916555 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 9eb9a6641..b0df8d234 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 9b02cb17a..a018f1ff9 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 57df92c29..150decc0c 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 0e3dc3ec5..d881caaae 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 0991494e5..07d34703e 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 759a02d97..6e61c3c82 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 7a6516f7a..2fbd1d159 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2022-11-01T22:12:24+03:00
+ 2022-11-07T17:18:14+03:00
https://alanorth.github.io/cgspace-notes/
- 2022-11-01T22:12:24+03:00
+ 2022-11-07T17:18:14+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2022-11-01T22:12:24+03:00
+ 2022-11-07T17:18:14+03:00
https://alanorth.github.io/cgspace-notes/2022-11/
- 2022-11-01T22:12:24+03:00
+ 2022-11-07T17:18:14+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2022-11-01T22:12:24+03:00
+ 2022-11-07T17:18:14+03:00
https://alanorth.github.io/cgspace-notes/2022-10/
2022-10-31T16:59:47+03:00