diff --git a/content/posts/2022-07.md b/content/posts/2022-07.md
index 8ab9be97e..de38c26d0 100644
--- a/content/posts/2022-07.md
+++ b/content/posts/2022-07.md
@@ -354,4 +354,53 @@ $ wc -l /tmp/bot-ips.txt
 1946968 /tmp/bot-ips.txt
 ```

+- I started running `check-spider-ip-hits.sh` with the 1946968 IPs and left it running in dry run mode
+
+## 2022-07-19
+
+- Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
+  - It's one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API
+- Add ORCID identifiers for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:
+
+```console
+dc.contributor.author,cg.creator.identifier
+"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
+"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
+"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
+"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
+"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
+"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
+"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
+"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
+"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
+"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
+$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+```
+
+- Review the AfricaRice records from earlier this month again
+  - I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
+- I took all the ~560 IPs that had hits so far in `check-spider-ip-hits.sh` above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
+  - This purged 199,032 hits from Solr, very many of which were from Qualys, but also from that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago, which I blocked in nginx but never purged the hits from
+  - Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script
+
+## 2022-07-20
+
+- Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
+  - Once it worked well I did an import to CGSpace:
+
+```console
+$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
+```
+
+- Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
+- Extract another ~1,600 IPs that had hits since I started the second round of `check-spider-ip-hits.sh` yesterday and purge another 303,594 hits
+  - This is about 999846 into the original list of 1946968 from yesterday
+  - A metric fuck ton of the IPs in this batch were from Hetzner
+
+## 2022-07-21
+
+- Extract another ~2,100 IPs that had hits since I started the third round of `check-spider-ip-hits.sh` last night and purge another 763,843 hits
+  - This is about 1441221 into the original list of 1946968 from two days ago
+  - Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)
+
diff --git a/docs/2022-07/index.html b/docs/2022-07/index.html
index 848c43d96..67250e7cb 100644
--- a/docs/2022-07/index.html
+++ b/docs/2022-07/index.html
@@ -19,7 +19,7 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
-
+
@@ -44,9 +44,9 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
   "@type": "BlogPosting",
   "headline": "July, 2022",
   "url": "https://alanorth.github.io/cgspace-notes/2022-07/",
-  "wordCount": "2266",
+  "wordCount": "2679",
   "datePublished": "2022-07-02T14:07:36+03:00",
-  "dateModified": "2022-07-18T12:32:23+03:00",
+  "dateModified": "2022-07-18T16:45:55+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -521,7 +521,71 @@ Also, the trgm functions I’ve used before are case insensitive, but Levens
 $ while read -r line; do prips "$line" | sed -e '1d; $d'; done < /tmp/bot-networks.conf > /tmp/bot-ips.txt
 $ wc -l /tmp/bot-ips.txt
 1946968 /tmp/bot-ips.txt
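
The `prips` loop above expands each network into individual addresses, and `sed -e '1d; $d'` drops the first and last line of each expansion (for an ordinary subnet, the network and broadcast addresses). A minimal Python sketch of the same idea follows. It is not the script used in these notes, and it assumes the input holds one IPv4 CIDR block per line; the function name `expand_networks` and the sample networks are hypothetical.

```python
import ipaddress
from typing import Iterable, Iterator


def expand_networks(cidrs: Iterable[str]) -> Iterator[str]:
    """Yield every address in each CIDR block except the first and last."""
    for raw in cidrs:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # strict=False tolerates host bits in the prefix (an assumption here)
        network = ipaddress.ip_network(line, strict=False)
        # Mirror `sed -e '1d; $d'`: skip index 0 and the final index.
        # Blocks with one or two addresses (/32, /31) therefore yield nothing.
        for offset in range(1, network.num_addresses - 1):
            yield str(network[offset])


if __name__ == "__main__":
    # Documentation-range networks, used purely for illustration
    sample = ["192.0.2.0/30", "198.51.100.8/29"]
    for ip in expand_networks(sample):
        print(ip)
```

Because the function is a generator, a list on the order of the 1946968 addresses above can be streamed to a file without holding it all in memory.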