From 89a4212e2bc8d6b508e2866702d50f64a414dc26 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Sun, 7 Apr 2019 11:45:34 +0300
Subject: [PATCH] Add notes for 2019-04-07

---
 content/posts/2019-04.md | 124 ++++++++++++++++++++++++++++++++++
 docs/2019-04/index.html  | 139 ++++++++++++++++++++++++++++++++++++++-
 docs/sitemap.xml         |  10 +--
 3 files changed, 265 insertions(+), 8 deletions(-)

diff --git a/content/posts/2019-04.md b/content/posts/2019-04.md
index 46b955bee..0bdca3b3f 100644
--- a/content/posts/2019-04.md
+++ b/content/posts/2019-04.md
@@ -170,4 +170,128 @@ GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_re
 - Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports
   - I made a pull request to update this and merged it to the `5_x-prod` branch ([#418](https://github.com/ilri/DSpace/pull/418))
 
+## 2019-04-07
+
+- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
+- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
+  - I am trying to see if these are registered as downloads in Solr or not
+  - I see 96,925 downloads from their AWS gateway IPs in 2019-03:
+
+```
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
+{
+    "response": {
+        "docs": [],
+        "numFound": 96925,
+        "start": 0
+    },
+    "responseHeader": {
+        "QTime": 1,
+        "params": {
+            "fq": [
+                "statistics_type:view",
+                "bundleName:ORIGINAL",
+                "dateYearMonth:2019-03"
+            ],
+            "indent": "true",
+            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
+            "rows": "0",
+            "wt": "json"
+        },
+        "status": 0
+    }
+}
+```
+
+- Strangely I don't see many hits in 2019-04:
+
+```
+$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
+{
+    "response": {
+        "docs": [],
+        "numFound": 38,
+        "start": 0
+    },
+    "responseHeader": {
+        "QTime": 1,
+        "params": {
+            "fq": [
+                "statistics_type:view",
+                "bundleName:ORIGINAL",
+                "dateYearMonth:2019-04"
+            ],
+            "indent": "true",
+            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
+            "rows": "0",
+            "wt": "json"
+        },
+        "status": 0
+    }
+}
+```
+
+- Making some tests on GET vs HEAD requests on the [CTA Spore 192 item](https://dspacetest.cgiar.org/handle/10568/100289) on DSpace Test:
+
+```
+$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
+GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: dspacetest.cgiar.org
+User-Agent: HTTPie/1.0.2
+
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Language: en-US
+Content-Length: 2069158
+Content-Type: application/pdf;charset=ISO-8859-1
+Date: Sun, 07 Apr 2019 08:38:34 GMT
+Expires: Sun, 07 Apr 2019 09:38:34 GMT
+Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
+Vary: User-Agent
+X-Cocoon-Version: 2.2.0
+X-Content-Type-Options: nosniff
+X-Frame-Options: SAMEORIGIN
+X-Robots-Tag: none
+X-XSS-Protection: 1; mode=block
+
+$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
+HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: dspacetest.cgiar.org
+User-Agent: HTTPie/1.0.2
+
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Language: en-US
+Content-Length: 2069158
+Content-Type: application/pdf;charset=ISO-8859-1
+Date: Sun, 07 Apr 2019 08:39:01 GMT
+Expires: Sun, 07 Apr 2019 09:39:01 GMT
+Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
+Vary: User-Agent
+X-Cocoon-Version: 2.2.0
+X-Content-Type-Options: nosniff
+X-Frame-Options: SAMEORIGIN
+X-Robots-Tag: none
+X-XSS-Protection: 1; mode=block
+```
+
+- And from the server side, the nginx logs show:
+
+```
+78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
+78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
+```
+
+- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
+
diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html
index 55c9621b6..9ec25d89a 100644
--- a/docs/2019-04/index.html
+++ b/docs/2019-04/index.html
@@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
-
+
@@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
   "@type": "BlogPosting",
   "headline": "April, 2019",
   "url": "https://alanorth.github.io/cgspace-notes/2019-04/",
-  "wordCount": "1056",
+  "wordCount": "1457",
   "datePublished": "2019-04-01T09:00:43+03:00",
-  "dateModified": "2019-04-06T12:01:09+03:00",
+  "dateModified": "2019-04-06T12:06:14+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -359,6 +359,139 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
+

2019-04-07

+ + + +
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
+{
+    "response": {
+        "docs": [],
+        "numFound": 96925,
+        "start": 0
+    },
+    "responseHeader": {
+        "QTime": 1,
+        "params": {
+            "fq": [
+                "statistics_type:view",
+                "bundleName:ORIGINAL",
+                "dateYearMonth:2019-03"
+            ],
+            "indent": "true",
+            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
+            "rows": "0",
+            "wt": "json"
+        },
+        "status": 0
+    }
+}
+
+ + + +
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
+{
+    "response": {
+        "docs": [],
+        "numFound": 38,
+        "start": 0
+    },
+    "responseHeader": {
+        "QTime": 1,
+        "params": {
+            "fq": [
+                "statistics_type:view",
+                "bundleName:ORIGINAL",
+                "dateYearMonth:2019-04"
+            ],
+            "indent": "true",
+            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
+            "rows": "0",
+            "wt": "json"
+        },
+        "status": 0
+    }
+}
+
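
The two monthly queries above differ only in the `dateYearMonth` filter. As a rough sketch (not part of the original notes), the same request could be assembled in Python so that the month and IP list become parameters. The helper names `build_downloads_query` and `num_found` are hypothetical; the Solr URL, field names, and IPs are copied from the commands above. Note that `urlencode` also percent-encodes the parentheses, which decodes to the same query string.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Values copied from the queries above
SOLR_SELECT_URL = "http://localhost:8081/solr/statistics/select"
CTA_GATEWAY_IPS = ["18.196.196.108", "18.195.78.144", "18.195.218.6"]


def build_downloads_query(base_url, ips, year_month):
    """Build the select URL counting bitstream views for the given IPs in one month."""
    ip_clause = " OR ".join(f"ip:{ip}" for ip in ips)
    params = [
        ("q", f"type:0 AND ({ip_clause})"),
        ("fq", "statistics_type:view"),
        ("fq", "bundleName:ORIGINAL"),
        ("fq", f"dateYearMonth:{year_month}"),
        ("rows", "0"),  # only the hit count is needed, not the documents
        ("wt", "json"),
        ("indent", "true"),
    ]
    return f"{base_url}?{urlencode(params)}"


def num_found(solr_response):
    """Extract the hit count from a decoded Solr JSON response (a dict)."""
    return solr_response["response"]["numFound"]


if __name__ == "__main__":
    for month in ("2019-03", "2019-04"):
        url = build_downloads_query(SOLR_SELECT_URL, CTA_GATEWAY_IPS, month)
        # Show the decoded query rather than sending a request
        print(month, parse_qs(urlsplit(url).query)["q"][0])
```

The sketch deliberately stops short of sending the request; the resulting URL can be passed to `http` as in the commands above, and `num_found` applied to the decoded JSON body.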
+ + + +
$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
+GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: dspacetest.cgiar.org
+User-Agent: HTTPie/1.0.2
+
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Language: en-US
+Content-Length: 2069158
+Content-Type: application/pdf;charset=ISO-8859-1
+Date: Sun, 07 Apr 2019 08:38:34 GMT
+Expires: Sun, 07 Apr 2019 09:38:34 GMT
+Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
+Vary: User-Agent
+X-Cocoon-Version: 2.2.0
+X-Content-Type-Options: nosniff
+X-Frame-Options: SAMEORIGIN
+X-Robots-Tag: none
+X-XSS-Protection: 1; mode=block
+
+$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
+HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
+Accept: */*
+Accept-Encoding: gzip, deflate
+Connection: keep-alive
+Host: dspacetest.cgiar.org
+User-Agent: HTTPie/1.0.2
+
+HTTP/1.1 200 OK
+Connection: keep-alive
+Content-Language: en-US
+Content-Length: 2069158
+Content-Type: application/pdf;charset=ISO-8859-1
+Date: Sun, 07 Apr 2019 08:39:01 GMT
+Expires: Sun, 07 Apr 2019 09:39:01 GMT
+Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
+Server: nginx
+Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
+Vary: User-Agent
+X-Cocoon-Version: 2.2.0
+X-Content-Type-Options: nosniff
+X-Frame-Options: SAMEORIGIN
+X-Robots-Tag: none
+X-XSS-Protection: 1; mode=block
+
+ + + +
78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
+78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
+
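
To compare the two log entries above without reading them by eye, the response size field can be pulled out with a regular expression. This is a minimal sketch that assumes nginx's default `combined` log format (the notes do not show the `log_format` actually in use); `LOG_PATTERN` and `parse_line` are illustrative names, and the sample lines are the two entries above.

```python
import re

# Assumed layout: nginx "combined" log format
LOG_PATTERN = re.compile(
    r'^(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

SAMPLE_LINES = [
    '78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"',
    '78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"',
]


def parse_line(line):
    """Return (method, status, body_bytes_sent) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return (
        match.group("method"),
        int(match.group("status")),
        int(match.group("body_bytes_sent")),
    )


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(*parse_line(line))
    # GET 200 68078
    # HEAD 200 0
```

The GET entry reports 68078 bytes even though the response advertised `Content-Length: 2069158`, probably because HTTPie was asked to print only headers and closed the connection before reading the whole body.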
+
+
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 2d5476818..a7f3f668a 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2019-04/
-2019-04-06T12:01:09+03:00
+2019-04-06T12:06:14+03:00
@@ -219,7 +219,7 @@
 https://alanorth.github.io/cgspace-notes/
-2019-04-06T12:01:09+03:00
+2019-04-06T12:06:14+03:00
 0
@@ -230,7 +230,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
-2019-04-06T12:01:09+03:00
+2019-04-06T12:06:14+03:00
 0
@@ -242,13 +242,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
-2019-04-06T12:01:09+03:00
+2019-04-06T12:06:14+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
-2019-04-06T12:01:09+03:00
+2019-04-06T12:06:14+03:00
 0