From 4e6a8ec51b1428fbb9670ae531fd3668e0fec348 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Thu, 10 Nov 2022 15:45:04 +0300 Subject: [PATCH] Add notes for 2022-11-09 --- content/posts/2022-11.md | 78 ++++++++++++++++++++ docs/2022-11/index.html | 98 ++++++++++++++++++++++++- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/categories/notes/page/6/index.html | 2 +- docs/categories/notes/page/7/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/page/9/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/posts/page/9/index.html | 2 +- docs/sitemap.xml | 10 +-- 29 files changed, 203 insertions(+), 35 deletions(-) diff --git a/content/posts/2022-11.md b/content/posts/2022-11.md index 90964e961..638180c16 100644 --- a/content/posts/2022-11.md +++ b/content/posts/2022-11.md @@ -117,5 +117,83 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagic ``` - But looking at the items it processed, I'm not sure it's working as expected + - I looked at a few dozen +- I found some links to the Bioversity website on CGSpace that are not redirecting properly: + +```console +$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html +GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1 +Accept: */* +Accept-Encoding: gzip, deflate 
+Connection: keep-alive +Host: www.bioversityinternational.org +User-Agent: HTTPie/3.2.1 + +HTTP/1.1 302 Found +Connection: Keep-Alive +Content-Length: 275 +Content-Type: text/html; charset=iso-8859-1 +Date: Mon, 07 Nov 2022 16:35:21 GMT +Keep-Alive: timeout=15, max=100 +Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html +Server: Apache +``` + +- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500 + +## 2022-11-08 + +- Looking at the Solr statistics hits on CGSpace for 2022-11 + - I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent + - I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent + - I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection + - I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent + - I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection + - I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent + - I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent + - I will purge all these hits and probably add China Unicom's subnet mask to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there + +```console +$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p +Purging 8975 hits from 221.219.100.42 in statistics +Purging 7577 hits from 122.10.101.60 in statistics +Purging 6536 hits from 135.125.21.38 in statistics +Purging 23950 hits from 163.237.216.11 in statistics +Purging 4093 hits from 51.254.154.148 in statistics +Purging 2797 hits from 221.219.103.211 in statistics +Purging 2618 hits from 
216.218.223.53 in statistics + +Total number of bot hits purged: 56546 +``` + +- Also interesting to see a few new user agents: + - `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)` + - `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)` + - `MEL` + - `Gov employment data scraper ([[your email]])` + - `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)` +- I will purge all these: + +```console +$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p +Purging 6155 hits from RStudio in statistics +Purging 1929 hits from rstudio in statistics +Purging 1454 hits from MEL in statistics +Purging 1094 hits from Gov employment data scraper in statistics + +Total number of bot hits purged: 10632 +``` + +- Work on the CIAT Library items a bit again in OpenRefine + - I flagged items with: + - URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times) + - Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are corner cases I don't want to handle right now) + - URL containing ":8080" to CIAT's old DSpace (this server is no longer live) + - I want to try to handle the simple cases that should cover most of the items first + +## 2022-11-09 + +- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace + - I got the basic functionality working diff --git a/docs/2022-11/index.html b/docs/2022-11/index.html index f664e17a1..a4e1d650f 100644 --- a/docs/2022-11/index.html +++ b/docs/2022-11/index.html @@ -24,7 +24,7 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe - + @@ -54,9 +54,9 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe "@type": "BlogPosting", "headline": "November, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-11/", - "wordCount": "863", + "wordCount": "1392", "datePublished": 
"2022-11-01T09:11:36+03:00", - "dateModified": "2022-11-01T22:12:24+03:00", + "dateModified": "2022-11-07T17:18:14+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -263,7 +263,97 @@ I reverted the Cocoon autosave change because it was more of a nuissance that Pe
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
 
+
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html 
+GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: www.bioversityinternational.org
+User-Agent: HTTPie/3.2.1
+
+HTTP/1.1 302 Found
+Connection: Keep-Alive
+Content-Length: 275
+Content-Type: text/html; charset=iso-8859-1
+Date: Mon, 07 Nov 2022 16:35:21 GMT
+Keep-Alive: timeout=15, max=100
+Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
+Server: Apache
+
+
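The broken redirect in the response above can be spotted mechanically: the `Location` value has lost the `/` between the host and the path, so the host parses as `www.bioversityinternational.orgnc`. The following is only a rough sketch of such a check, assuming the redirect was meant to be a plain HTTP-to-HTTPS upgrade that keeps the same host and path. The function name `check_redirect` is hypothetical and not part of the notes or of any script mentioned in them.

```python
from urllib.parse import urlsplit


def check_redirect(request_url: str, location: str) -> list[str]:
    """Return a list of problems found in a redirect's Location header.

    Assumption: the redirect is expected to keep the same host and path
    and only change the scheme (http -> https).
    """
    src = urlsplit(request_url)
    dst = urlsplit(location)

    # A usable redirect target here must be an absolute URL
    if not dst.scheme or not dst.hostname:
        return ["Location is not an absolute URL"]

    problems = []
    if dst.hostname != src.hostname:
        problems.append(f"host changed: {src.hostname} -> {dst.hostname}")
    if dst.path != src.path:
        problems.append(f"path changed: {src.path} -> {dst.path}")
    return problems


if __name__ == "__main__":
    # URLs copied from the request and response shown above
    for problem in check_redirect(
        "http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html",
        "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html",
    ):
        print(problem)
```

For the response above, both checks fire, because the missing slash shifts `nc` from the path into the host name.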

2022-11-08

+ +
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 8975 hits from 221.219.100.42 in statistics
+Purging 7577 hits from 122.10.101.60 in statistics
+Purging 6536 hits from 135.125.21.38 in statistics
+Purging 23950 hits from 163.237.216.11 in statistics
+Purging 4093 hits from 51.254.154.148 in statistics
+Purging 2797 hits from 221.219.103.211 in statistics
+Purging 2618 hits from 216.218.223.53 in statistics
+
+Total number of bot hits purged: 56546
+
+
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+Purging 6155 hits from RStudio in statistics
+Purging 1929 hits from rstudio in statistics
+Purging 1454 hits from MEL in statistics
+Purging 1094 hits from Gov employment data scraper in statistics
+
+Total number of bot hits purged: 10632
+
+
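The totals reported by the two purge runs can be cross-checked against the per-line counts. This is a minimal sketch, assuming the output format is exactly the `Purging N hits from X in statistics` and `Total number of bot hits purged: N` lines shown above. The helper `check_purge_log` is hypothetical and is not part of the `ilri` scripts.

```python
import re
from typing import Optional, Tuple

# Patterns assumed from the console output quoted in the notes
PURGE_LINE = re.compile(r"^Purging (\d+) hits from .+ in statistics$", re.MULTILINE)
TOTAL_LINE = re.compile(r"^Total number of bot hits purged: (\d+)$", re.MULTILINE)


def check_purge_log(log: str) -> Tuple[int, Optional[int]]:
    """Return (sum of per-line purge counts, reported total or None)."""
    summed = sum(int(count) for count in PURGE_LINE.findall(log))
    match = TOTAL_LINE.search(log)
    reported = int(match.group(1)) if match else None
    return summed, reported


if __name__ == "__main__":
    # Output of the user agent purge, copied from the notes
    sample = """\
Purging 6155 hits from RStudio in statistics
Purging 1929 hits from rstudio in statistics
Purging 1454 hits from MEL in statistics
Purging 1094 hits from Gov employment data scraper in statistics

Total number of bot hits purged: 10632
"""
    summed, reported = check_purge_log(sample)
    print(summed, reported, summed == reported)
```

The same check applies to the IP-based run: its seven per-line counts add up to the reported 56546.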

2022-11-09

+ diff --git a/docs/categories/index.html b/docs/categories/index.html index 5f8fb221d..677e740c0 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 7c02837a3..f530544e2 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index ba7d503f8..2bb04b631 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index c4a8cd60d..44b2ccdff 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 29b3ce18e..e49a9332d 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 15274e250..471ab53b3 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index cf91f1183..ca9493742 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index 77bed3fd0..0fccfa3e1 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index fa8037be2..1c6ce4b28 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 
+10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 4f8f56616..35855081d 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 7354d0c90..f29fee228 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index a76c90bdc..83812dfec 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 41e866f99..2ae8eed83 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 8daa48921..6978e88ab 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 35ebcdb57..bf662829a 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index b96b14d48..e99a3b937 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index c2028ee32..f6d98a1bc 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index 3e6815854..8a3765b99 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index fd020fe03..2f09da92d 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 4f19318a6..f06916555 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 9eb9a6641..b0df8d234 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 9b02cb17a..a018f1ff9 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 57df92c29..150decc0c 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 0e3dc3ec5..d881caaae 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index 0991494e5..07d34703e 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index 759a02d97..6e61c3c82 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 7a6516f7a..2fbd1d159 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2022-11-01T22:12:24+03:00 + 2022-11-07T17:18:14+03:00 https://alanorth.github.io/cgspace-notes/ - 2022-11-01T22:12:24+03:00 + 2022-11-07T17:18:14+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-11-01T22:12:24+03:00 + 2022-11-07T17:18:14+03:00 https://alanorth.github.io/cgspace-notes/2022-11/ - 2022-11-01T22:12:24+03:00 + 2022-11-07T17:18:14+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-11-01T22:12:24+03:00 + 2022-11-07T17:18:14+03:00 https://alanorth.github.io/cgspace-notes/2022-10/ 2022-10-31T16:59:47+03:00