Update notes for 2019-04-07

This commit is contained in:
Alan Orth 2019-04-07 18:08:38 +03:00
parent 89a4212e2b
commit c9685770ab
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 154 additions and 17 deletions

View File

@ -173,7 +173,7 @@ GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_re
## 2019-04-07
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
- I am trying to see if these are registered as downloads in Solr or not
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
@ -293,5 +293,63 @@ X-XSS-Protection: 1; mode=block
```
- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
```
2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
```
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:
```
2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
```
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
- I will try to re-deploy the `5_x-dev` branch and test again
- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`)
- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal):
```
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
{
"response": {
"docs": [],
"numFound": 909,
"start": 0
},
"responseHeader": {
"QTime": 0,
"params": {
"fq": [
"statistics_type:view",
"isInternal:true"
],
"indent": "true",
"q": "type:0 AND time:2019-04-07*",
"rows": "0",
"wt": "json"
},
"status": 0
}
}
```
- I confirmed the same on CGSpace itself after making one HEAD request
- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat`
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
- See: [DS-3986](https://jira.duraspace.org/browse/DS-3986)
- See: [DS-4020](https://jira.duraspace.org/browse/DS-4020)
- See: [DS-3832](https://jira.duraspace.org/browse/DS-3832)
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
<!-- vim: set sw=2 ts=2: -->

View File

@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43&#43;03:00"/>
<meta property="article:modified_time" content="2019-04-06T12:06:14&#43;03:00"/>
<meta property="article:modified_time" content="2019-04-07T11:45:34&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2019"/>
@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
"@type": "BlogPosting",
"headline": "April, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
"wordCount": "1457",
"wordCount": "1954",
"datePublished": "2019-04-01T09:00:43&#43;03:00",
"dateModified": "2019-04-06T12:06:14&#43;03:00",
"dateModified": "2019-04-07T11:45:34&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -363,7 +363,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<ul>
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
<ul>
<li>I am trying to see if these are registered as downloads in Solr or not</li>
@ -489,7 +489,86 @@ X-XSS-Protection: 1; mode=block
</code></pre>
<ul>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr</li>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
<ul>
<li>After twenty minutes of waiting I still don&rsquo;t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
</ul></li>
</ul>
<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre>
<ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
<ul>
<li>Strangely, the statistics Solr core says it hasn&rsquo;t been modified in 24 hours, so I tried to start the &ldquo;optimize&rdquo; process from the Admin UI and I see this in the Solr log:</li>
</ul></li>
</ul>
<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre>
<ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>&hellip; very weird
<ul>
<li>I don&rsquo;t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
</ul></li>
<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;response&quot;: {
&quot;docs&quot;: [],
&quot;numFound&quot;: 909,
&quot;start&quot;: 0
},
&quot;responseHeader&quot;: {
&quot;QTime&quot;: 0,
&quot;params&quot;: {
&quot;fq&quot;: [
&quot;statistics_type:view&quot;,
&quot;isInternal:true&quot;
],
&quot;indent&quot;: &quot;true&quot;,
&quot;q&quot;: &quot;type:0 AND time:2019-04-07*&quot;,
&quot;rows&quot;: &quot;0&quot;,
&quot;wt&quot;: &quot;json&quot;
},
&quot;status&quot;: 0
}
}
</code></pre>
<ul>
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
<li>So I&rsquo;m pretty sure it&rsquo;s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
<ul>
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
</ul></li>
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
<ul>
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
</ul></li>
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>
<ul>
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind&rsquo;s server as of 2018-04-01</li>
<li>Now I made 100 requests and I see them in the Solr statistics&hellip; fuck my life for wasting five hours debugging this</li>
</ul></li>
</ul>
<!-- vim: set sw=2 ts=2: -->

View File

@ -46,7 +46,7 @@ Disallow: /cgspace-notes/2015-12/
Disallow: /cgspace-notes/2015-11/
Disallow: /cgspace-notes/
Disallow: /cgspace-notes/categories/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/posts/
Disallow: /cgspace-notes/tags/

View File

@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-04/</loc>
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
</url>
<url>
@ -219,7 +219,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
<priority>0</priority>
</url>
@ -228,27 +228,27 @@
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
<priority>0</priority>
</url>