From 27287aec4fe428cd7b5975fd7b3567ef09695854 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Wed, 6 Nov 2019 09:35:51 +0200
Subject: [PATCH] Update notes for 2019-11-05

---
 content/posts/2019-11.md | 18 ++++++++++++++++++
 docs/2019-11/index.html  | 33 +++++++++++++++++++++++++--------
 docs/sitemap.xml         | 10 +++++-----
 3 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/content/posts/2019-11.md b/content/posts/2019-11.md
index 5aa26a41e..48e6036bc 100644
--- a/content/posts/2019-11.md
+++ b/content/posts/2019-11.md
@@ -153,5 +153,23 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
 - So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
   - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
+- I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
+
+```
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
+```
+
+- Then commit changes to Solr so we don't have to wait:
+
+```
+$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+
+```
+
+- So the blocking seems to be working because "www\.gnip\.com" is one of the new patterns added to the spiders file...
diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html
index abb83d001..9eb7a690d 100644
--- a/docs/2019-11/index.html
+++ b/docs/2019-11/index.html
@@ -34,7 +34,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
-
+
@@ -73,9 +73,9 @@ Let’s see how many of the REST API requests were for bitstreams (because t
  "@type": "BlogPosting",
  "headline": "November, 2019",
  "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-11\/",
- "wordCount": "931",
+ "wordCount": "1038",
  "datePublished": "2019-11-04T12:20:30+02:00",
- "dateModified": "2019-11-04T16:41:19+02:00",
+ "dateModified": "2019-11-05T10:37:16+02:00",
  "author": {
  "@type": "Person",
  "name": "Alan Orth"
@@ -302,15 +302,32 @@
 $ http --print Hh 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
 $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+

+ +

+- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
+  - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
+- I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't:
+
+
+ +

$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
+$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"

+ +

+- Then commit changes to Solr so we don't have to wait:
+
+
+ +

$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
+ ```

-  • So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
-
-
-    • Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
-
+  • So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 7a4348348..d85157082 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
- 2019-11-04T16:41:19+02:00
+ 2019-11-05T10:37:16+02:00
 https://alanorth.github.io/cgspace-notes/
- 2019-11-04T16:41:19+02:00
+ 2019-11-05T10:37:16+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2019-11-04T16:41:19+02:00
+ 2019-11-05T10:37:16+02:00
 https://alanorth.github.io/cgspace-notes/2019-11/
- 2019-11-04T16:41:19+02:00
+ 2019-11-05T10:37:16+02:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2019-11-04T16:41:19+02:00
+ 2019-11-05T10:37:16+02:00
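
The notes in this patch compare "www.gnip.com" with "www.gnyp.com", which differ by a letter and so don't exercise the escaped dots in the `www\.gnip\.com` pattern. A rough local sketch of what the escaping changes, assuming the spiders file entries are regular expressions searched anywhere in the user agent string: `classify_agent` is a hypothetical helper, `wwwXgnipXcom` is a made-up user agent, and `grep -E` only approximates whatever DSpace does internally.

```shell
#!/bin/sh
# Hypothetical helper, not part of DSpace: report whether a user agent
# matches a single spider pattern.
# Assumption: patterns are extended regular expressions that may match
# anywhere in the user agent string; DSpace's real matching may differ.
classify_agent() {
    pattern=$1
    agent=$2
    if printf '%s\n' "$agent" | grep -Eq -- "$pattern"; then
        echo "match"
    else
        echo "no-match"
    fi
}

# Compare the escaped pattern from the spiders file with an unescaped variant.
# 'wwwXgnipXcom' is an invented agent that only an unescaped dot would catch.
for agent in 'www.gnip.com' 'www.gnyp.com' 'wwwXgnipXcom'; do
    printf '%s: escaped=%s unescaped=%s\n' "$agent" \
        "$(classify_agent 'www\.gnip\.com' "$agent")" \
        "$(classify_agent 'www.gnip.com' "$agent")"
done
```

Under these assumptions, only the third agent separates the two patterns, so sending a request with that kind of user agent is what would test the special character handling.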