From 3a1264b8582a44312c64f97899f4ec6f55058120 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Mon, 9 Aug 2021 08:38:44 +0300
Subject: [PATCH] Add notes for 2021-08-08

---
 content/posts/2021-08.md | 90 +++++++++++++++++++++
 docs/2021-02/index.html | 8 +-
 docs/2021-08/index.html | 102 +++++++++++++++++++++++-
 docs/categories/index.html | 2 +-
 docs/categories/notes/index.html | 2 +-
 docs/categories/notes/page/2/index.html | 2 +-
 docs/categories/notes/page/3/index.html | 2 +-
 docs/categories/notes/page/4/index.html | 2 +-
 docs/categories/notes/page/5/index.html | 2 +-
 docs/index.html | 2 +-
 docs/page/2/index.html | 2 +-
 docs/page/3/index.html | 2 +-
 docs/page/4/index.html | 2 +-
 docs/page/5/index.html | 2 +-
 docs/page/6/index.html | 2 +-
 docs/page/7/index.html | 2 +-
 docs/page/8/index.html | 2 +-
 docs/posts/index.html | 2 +-
 docs/posts/page/2/index.html | 2 +-
 docs/posts/page/3/index.html | 2 +-
 docs/posts/page/4/index.html | 2 +-
 docs/posts/page/5/index.html | 2 +-
 docs/posts/page/6/index.html | 2 +-
 docs/posts/page/7/index.html | 2 +-
 docs/posts/page/8/index.html | 2 +-
 docs/sitemap.xml | 12 +--
 26 files changed, 220 insertions(+), 36 deletions(-)

diff --git a/content/posts/2021-08.md b/content/posts/2021-08.md
index d62632a3c..71d7cd00d 100644
--- a/content/posts/2021-08.md
+++ b/content/posts/2021-08.md
@@ -89,4 +89,94 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
 0 /tmp/2021-08-05-all-ips-to-purge.csv
 ```
 
+## 2021-08-08
+
+- Advise IWMI colleagues on best practices for thumbnails
+- Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
+  - I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
+    - WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
+    - WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
+- Looking at the July Solr statistics to find the top IPs and user agents, looking for anything strange
+  - 35.174.144.154 made 11,000 requests last month with the following user agent:
+
+```console
+Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
+```
+
+- That IP is on Amazon, and from looking at the DSpace logs I don't see them logging in at all, only scraping... so I will purge hits from that IP
+- I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
+  - Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
+  - Same deal with 93.158.90.91, also in Sweden
+- 3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart
+- 61.143.40.50 is in China and uses this hilarious user agent:
+
+```console
+Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
+```
+
+- 47.252.80.214 is owned by Alibaba in the US and has the same user agent
+- 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
+- 95.87.154.12 seems to be a new bot with the following user agent:
+
+```console
+Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
+```
+
+- They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
+  - I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots
+- I see a new bot using this user agent:
+
+```console
+nettle (+https://www.nettle.sk)
+```
+
+- 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
+- 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
+- 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
+- There are probably more, but those are most of the IPs with over 1,000 hits last month, so I will purge them:
+
+```console
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 10796 hits from 35.174.144.154 in statistics
+Purging 9993 hits from 93.158.90.30 in statistics
+Purging 6092 hits from 130.255.162.173 in statistics
+Purging 24863 hits from 3.225.28.105 in statistics
+Purging 2988 hits from 93.158.90.91 in statistics
+Purging 2497 hits from 61.143.40.50 in statistics
+Purging 13866 hits from 159.138.131.15 in statistics
+Purging 2721 hits from 95.87.154.12 in statistics
+Purging 2786 hits from 47.252.80.214 in statistics
+Purging 1485 hits from 129.0.211.251 in statistics
+Purging 8952 hits from 217.182.21.193 in statistics
+Purging 3446 hits from 103.135.104.139 in statistics
+
+Total number of bot hits purged: 90485
+```
+
+- Then I purged a few thousand more by user agent:
+
+```console
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
+Found 2707 hits from MaCoCu in statistics
+Found 1785 hits from nettle in statistics
+
+Total number of hits from bots: 4492
+```
+
+- I found some CGSpace metadata in the wrong fields
+  - Seven metadata in dc.subject (57) should be in dcterms.subject (187)
+  - Twelve metadata in cg.title.journal (202) should be in cg.journal (251)
+  - Three dc.identifier.isbn (20) should be in cg.isbn (252)
+  - Three dc.identifier.issn (21) should be in cg.issn (253)
+  - I re-ran the `migrate-fields.sh` script on CGSpace
+- I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
+
+```console
+$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
+```
+
+- Then in OpenRefine I merged all null, blank, and en fields into the `en_US` one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
+  - In total it was a few thousand metadata entries or so, so I had to split the CSV with `xsv split` in order to process it
+  - I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes...
+
diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html
index baf7ecfce..de6dbad1e 100644
--- a/docs/2021-02/index.html
+++ b/docs/2021-02/index.html
@@ -32,7 +32,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
-
+
@@ -72,7 +72,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
   "url": "https://alanorth.github.io/cgspace-notes/2021-02/",
   "wordCount": "4143",
   "datePublished": "2021-02-01T10:13:54+02:00",
-  "dateModified": "2021-05-18T23:21:39+03:00",
+  "dateModified": "2021-08-08T17:07:54+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -431,9 +431,9 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
 30354
-$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l         
+$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
 18555
-$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail     
+$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
       5 c21a79e5-e24e-4861-aa07-e06703d1deb7
       5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
       5 d73fb3ae-9fac-4f7e-990f-e394f344246c
diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html
index fb7cbbd81..ebc91e4f8 100644
--- a/docs/2021-08/index.html
+++ b/docs/2021-08/index.html
@@ -18,7 +18,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
 
 
 
-
+
 
 
 
@@ -42,9 +42,9 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
   "@type": "BlogPosting",
   "headline": "August, 2021",
   "url": "https://alanorth.github.io/cgspace-notes/2021-08/",
-  "wordCount": "535",
+  "wordCount": "1288",
   "datePublished": "2021-08-01T09:01:07+03:00",
-  "dateModified": "2021-08-02T16:00:42+03:00",
+  "dateModified": "2021-08-06T09:08:15+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -204,7 +204,101 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
 $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
 $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
 0 /tmp/2021-08-05-all-ips-to-purge.csv
-
+

2021-08-08

+ +
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
+
+
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
+
+
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
+
+
nettle (+https://www.nettle.sk)
+
+
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 10796 hits from 35.174.144.154 in statistics
+Purging 9993 hits from 93.158.90.30 in statistics
+Purging 6092 hits from 130.255.162.173 in statistics
+Purging 24863 hits from 3.225.28.105 in statistics
+Purging 2988 hits from 93.158.90.91 in statistics
+Purging 2497 hits from 61.143.40.50 in statistics
+Purging 13866 hits from 159.138.131.15 in statistics
+Purging 2721 hits from 95.87.154.12 in statistics
+Purging 2786 hits from 47.252.80.214 in statistics
+Purging 1485 hits from 129.0.211.251 in statistics
+Purging 8952 hits from 217.182.21.193 in statistics
+Purging 3446 hits from 103.135.104.139 in statistics
+
+Total number of bot hits purged: 90485
+
+
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
+Found 2707 hits from MaCoCu in statistics
+Found 1785 hits from nettle in statistics
+
+Total number of hits from bots: 4492
+
+
$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 0121dc8ac..4fa09f197 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index b0f303a4f..556b420fd 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index c9cfeb5d2..5559c9939 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 6e330ed44..0243175df 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 1d3a9ac0f..11aec734d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index bd40304e9..8f4aecc4f 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 7aa301eca..7ded0ed70 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 8cd47dd7d..97e792e4e 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6aa07fc31..4eadf6e62 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 45267cca7..142ffe24c 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index fb9ae67cf..792ed18ee 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 7e988e1db..b491cc9a6 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 048e78693..43c98a304 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index ca8d3e08f..02baea481 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 6d8871042..86a073fd7 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index e1fadad73..f7d7a6563 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 2c6fc3840..9fcc59d92 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index ae0eeba17..6ab2d655b 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index c21d13e66..2b4bb1302 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index d8ac57878..80b87818e 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 95f7c2335..44a9af68f 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index f54e4da58..0492ff413 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 17c883da2..78e923056 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/2021-08/
-2021-08-02T16:00:42+03:00
+2021-08-06T09:08:15+03:00
 https://alanorth.github.io/cgspace-notes/categories/
-2021-08-02T16:00:42+03:00
+2021-08-08T17:07:54+03:00
 https://alanorth.github.io/cgspace-notes/
-2021-08-02T16:00:42+03:00
+2021-08-08T17:07:54+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2021-08-02T16:00:42+03:00
+2021-08-08T17:07:54+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2021-08-02T16:00:42+03:00
+2021-08-08T17:07:54+03:00
 https://alanorth.github.io/cgspace-notes/2021-07/
 2021-08-01T16:19:05+03:00
@@ -42,7 +42,7 @@
 2021-03-30T09:56:38+03:00
 https://alanorth.github.io/cgspace-notes/2021-02/
-2021-05-18T23:21:39+03:00
+2021-08-08T17:07:54+03:00
 https://alanorth.github.io/cgspace-notes/2021-01/
 2021-01-31T16:32:16+02:00