From c9685770ab2e3b7c285e58fab83db53ca346cee9 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Sun, 7 Apr 2019 18:08:38 +0300 Subject: [PATCH] Update notes for 2019-04-07 --- content/posts/2019-04.md | 60 ++++++++++++++++++++++++++- docs/2019-04/index.html | 89 +++++++++++++++++++++++++++++++++++++--- docs/robots.txt | 2 +- docs/sitemap.xml | 20 ++++----- 4 files changed, 154 insertions(+), 17 deletions(-) diff --git a/content/posts/2019-04.md b/content/posts/2019-04.md index 0bdca3b3f..1e509fee2 100644 --- a/content/posts/2019-04.md +++ b/content/posts/2019-04.md @@ -173,7 +173,7 @@ GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_re ## 2019-04-07 - Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day* -- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams +- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams - I am trying to see if these are registered as downloads in Solr or not - I see 96,925 downloads from their AWS gateway IPs in 2019-03: @@ -293,5 +293,63 @@ X-XSS-Protection: 1; mode=block ``` - So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr + - After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log: + +``` +2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818 +2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818 +``` + +- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned + - Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log: + +``` +2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} +``` + +- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird + - I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18) + - I will try to re-deploy the `5_x-dev` branch and test again +- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`) +- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal): + +``` +$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true' +{ + "response": { + "docs": [], + "numFound": 909, + "start": 0 + }, + "responseHeader": { + "QTime": 0, + "params": { + "fq": [ + "statistics_type:view", + "isInternal:true" + ], + "indent": "true", + "q": "type:0 AND time:2019-04-07*", + "rows": "0", + "wt": "json" + }, + "status": 0 + } +} +``` + +- I confirmed the same on CGSpace itself after making one HEAD request +- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week + - I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace +- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded + - This leads me to believe there is something specifically wrong with DSpace Test (linode19) + - The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC +- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat` + - For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error! + - See: [DS-3986](https://jira.duraspace.org/browse/DS-3986) + - See: [DS-4020](https://jira.duraspace.org/browse/DS-4020) + - See: [DS-3832](https://jira.duraspace.org/browse/DS-3832) + - DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01 + - Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html index 9ec25d89a..0fc45940e 100644 --- a/docs/2019-04/index.html +++ b/docs/2019-04/index.html @@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace - + @@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace "@type": "BlogPosting", "headline": "April, 2019", "url": "https://alanorth.github.io/cgspace-notes/2019-04/", - "wordCount": "1457", + "wordCount": "1954", "datePublished": "2019-04-01T09:00:43+03:00", - "dateModified": "2019-04-06T12:06:14+03:00", + "dateModified": "2019-04-07T11:45:34+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -363,7 +363,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace