diff --git a/content/posts/2021-12.md b/content/posts/2021-12.md
index a2fd898f3..23ce6ac2b 100644
--- a/content/posts/2021-12.md
+++ b/content/posts/2021-12.md
@@ -96,4 +96,32 @@ $ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity p
 - I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc and then uploaded the forty-six items to CGSpace
 - I started looking at the seventy CAS records that Abenet has been working on for the past few months
 
+## 2021-12-07
+
+- I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
+  - Also, I ran the `check-duplicates.py` script on them and found that they might ALL be duplicates!!!
+  - I tweaked the script a bit more to use the issue dates as a third criterion and now there are fewer duplicates, but it's still at least twenty or so...
+  - The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)
+  - For example, many items like "Annual Report 2020" can have similar title and type to previous annual reports, but are not duplicates
+- I noticed a strange user agent in the XMLUI logs on CGSpace:
+
+```console
+20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
+20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
+```
+
+- I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
+  - It could be someone on Azure?
+  - I opened [a pull request to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/49) and I'll add this user agent to our local override until they decide to include it or not
+- I purged 34,000 hits from this user agent in our Solr statistics:
+
+```console
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 34458 hits from HeadlessChrome in statistics
+
+Total number of bot hits purged: 34458
+```
+
+- Meeting with partners about repositories in the One CGIAR
+
diff --git a/docs/2021-12/index.html b/docs/2021-12/index.html
index 09fba7a45..4da569fad 100644
--- a/docs/2021-12/index.html
+++ b/docs/2021-12/index.html
@@ -22,7 +22,7 @@ Total number of bot hits purged: 3679
-
+
@@ -50,9 +50,9 @@ Total number of bot hits purged: 3679
   "@type": "BlogPosting",
   "headline": "December, 2021",
   "url": "https://alanorth.github.io/cgspace-notes/2021-12/",
-  "wordCount": "711",
+  "wordCount": "968",
   "datePublished": "2021-12-01T16:07:07+02:00",
-  "dateModified": "2021-12-05T17:55:47+02:00",
+  "dateModified": "2021-12-06T16:40:50+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -238,6 +238,36 @@ Purging 455 hits from WhatsApp in statistics
 • I started looking at the seventy CAS records that Abenet has been working on for the past few months
+
+2021-12-07
+
+20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
+20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
+
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 34458 hits from HeadlessChrome in statistics
+
+Total number of bot hits purged: 34458
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e358367ec..fc5769c4b 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 3a77c22b6..b8c04e317 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 2717de7bd..fdf01eab3 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 23c8d2a06..c94ed67e7 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 7f23db97a..d5cd23e4e 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 1ca977894..d5b59d2bc 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index b8e0daf83..7277626c9 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index a74ca3fc8..d051bf8ab 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 2c68940e5..0cf04fd8d 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 8d0a14f63..8eeb5f9e4 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 12dfeea2c..cb6cd09e3 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 484a25b93..8d699639e 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 37aa52d7b..207c7f537 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index c31f1628d..e512f585a 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 85923eea6..93b5ed0fb 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 7d159b7be..975b3bd08 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index ce0f7ad63..1a2c635cf 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 83029e2fa..ad0e3fd69 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 57369d65e..d9d05bcac 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 5442f4f9c..cefe0beac 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 702d91087..e3ba723f5 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index c29849d85..934e00c23 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 6be274690..9f5fbf01b 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 8e3f9133f..b73f90467 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
   xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
-2021-12-05T17:55:47+02:00
+2021-12-06T16:40:50+02:00
 https://alanorth.github.io/cgspace-notes/
-2021-12-05T17:55:47+02:00
+2021-12-06T16:40:50+02:00
 https://alanorth.github.io/cgspace-notes/2021-12/
-2021-12-05T17:55:47+02:00
+2021-12-06T16:40:50+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2021-12-05T17:55:47+02:00
+2021-12-06T16:40:50+02:00
 https://alanorth.github.io/cgspace-notes/posts/
-2021-12-05T17:55:47+02:00
+2021-12-06T16:40:50+02:00
 https://alanorth.github.io/cgspace-notes/2021-11/
 2021-11-30T16:44:30+02:00
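
The 2021-12-07 notes in this commit describe adding the issue date as a third duplicate criterion, next to title and type: two items only count as possible duplicates when their issue dates are less than 365 days apart by default. The following is a minimal sketch of that idea, not the actual `check-duplicates.py` code. The function names, the dictionary keys (`title`, `type`, `issued`), the padding of partial dates, and the use of `difflib` with a 0.6 similarity cutoff are all assumptions made for illustration; the real script may compare titles quite differently.

```python
from datetime import date, timedelta
from difflib import SequenceMatcher


def parse_issue_date(value: str) -> date:
    """Parse YYYY, YYYY-MM, or YYYY-MM-DD, padding missing parts with 01 (assumption)."""
    parts = value.strip().split("-")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    return date(year, month, day)


def dates_within_threshold(csv_date: str, db_date: str, days: int = 365) -> bool:
    """Return True if the two issue dates are less than `days` apart."""
    delta = abs(parse_issue_date(csv_date) - parse_issue_date(db_date))
    return delta < timedelta(days=days)


def titles_similar(a: str, b: str, cutoff: float = 0.6) -> bool:
    """Hypothetical stand-in for the title comparison: case-insensitive ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def is_possible_duplicate(csv_item: dict, db_item: dict, days: int = 365) -> bool:
    """Apply all three criteria: similar title, same type, close issue dates."""
    return (
        titles_similar(csv_item["title"], db_item["title"])
        and csv_item["type"] == db_item["type"]
        and dates_within_threshold(csv_item["issued"], db_item["issued"], days)
    )


if __name__ == "__main__":
    report_2020 = {"title": "Annual Report 2020", "type": "Report", "issued": "2020-03-01"}
    report_2019 = {"title": "Annual Report 2019", "type": "Report", "issued": "2019-03-01"}
    # Similar title and same type, but the issue dates are too far apart
    print(is_possible_duplicate(report_2020, report_2019))
```

With only title and type, the two annual reports in the usage example would be flagged; the date window is what separates them, which matches the "Annual Report 2020" case mentioned in the notes.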