mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 14:45:03 +01:00
Update notes for 2019-04-07
This commit is contained in:
parent
89a4212e2b
commit
c9685770ab
@ -173,7 +173,7 @@ GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_re
|
||||
## 2019-04-07
|
||||
|
||||
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
|
||||
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
|
||||
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
|
||||
- I am trying to see if these are registered as downloads in Solr or not
|
||||
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
|
||||
|
||||
@ -293,5 +293,63 @@ X-XSS-Protection: 1; mode=block
|
||||
```
|
||||
|
||||
- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
||||
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
|
||||
|
||||
```
|
||||
2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||
```
|
||||
|
||||
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
|
||||
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:
|
||||
|
||||
```
|
||||
2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|
||||
```
|
||||
|
||||
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird
|
||||
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
|
||||
- I will try to re-deploy the `5_x-dev` branch and test again
|
||||
- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`)
|
||||
- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal):
|
||||
|
||||
```
|
||||
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
|
||||
{
|
||||
"response": {
|
||||
"docs": [],
|
||||
"numFound": 909,
|
||||
"start": 0
|
||||
},
|
||||
"responseHeader": {
|
||||
"QTime": 0,
|
||||
"params": {
|
||||
"fq": [
|
||||
"statistics_type:view",
|
||||
"isInternal:true"
|
||||
],
|
||||
"indent": "true",
|
||||
"q": "type:0 AND time:2019-04-07*",
|
||||
"rows": "0",
|
||||
"wt": "json"
|
||||
},
|
||||
"status": 0
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- I confirmed the same on CGSpace itself after making one HEAD request
|
||||
- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
|
||||
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
|
||||
- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
|
||||
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
|
||||
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
|
||||
- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat`
|
||||
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
|
||||
- See: [DS-3986](https://jira.duraspace.org/browse/DS-3986)
|
||||
- See: [DS-4020](https://jira.duraspace.org/browse/DS-4020)
|
||||
- See: [DS-3832](https://jira.duraspace.org/browse/DS-3832)
|
||||
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01
|
||||
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
||||
<meta property="og:type" content="article" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
|
||||
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00"/>
|
||||
<meta property="article:modified_time" content="2019-04-06T12:06:14+03:00"/>
|
||||
<meta property="article:modified_time" content="2019-04-07T11:45:34+03:00"/>
|
||||
|
||||
<meta name="twitter:card" content="summary"/>
|
||||
<meta name="twitter:title" content="April, 2019"/>
|
||||
@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
||||
"@type": "BlogPosting",
|
||||
"headline": "April, 2019",
|
||||
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
|
||||
"wordCount": "1457",
|
||||
"wordCount": "1954",
|
||||
"datePublished": "2019-04-01T09:00:43+03:00",
|
||||
"dateModified": "2019-04-06T12:06:14+03:00",
|
||||
"dateModified": "2019-04-07T11:45:34+03:00",
|
||||
"author": {
|
||||
"@type": "Person",
|
||||
"name": "Alan Orth"
|
||||
@ -363,7 +363,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
||||
|
||||
<ul>
|
||||
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
|
||||
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
|
||||
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
|
||||
|
||||
<ul>
|
||||
<li>I am trying to see if these are registered as downloads in Solr or not</li>
|
||||
@ -489,7 +489,86 @@ X-XSS-Protection: 1; mode=block
|
||||
</code></pre>
|
||||
|
||||
<ul>
|
||||
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr</li>
|
||||
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
||||
|
||||
<ul>
|
||||
<li>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
|
||||
<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||
</code></pre>
|
||||
|
||||
<ul>
|
||||
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
|
||||
|
||||
<ul>
|
||||
<li>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
|
||||
<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|
||||
</code></pre>
|
||||
|
||||
<ul>
|
||||
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird
|
||||
|
||||
<ul>
|
||||
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
|
||||
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
|
||||
</ul></li>
|
||||
<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
|
||||
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
|
||||
</ul>
|
||||
|
||||
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
|
||||
{
|
||||
"response": {
|
||||
"docs": [],
|
||||
"numFound": 909,
|
||||
"start": 0
|
||||
},
|
||||
"responseHeader": {
|
||||
"QTime": 0,
|
||||
"params": {
|
||||
"fq": [
|
||||
"statistics_type:view",
|
||||
"isInternal:true"
|
||||
],
|
||||
"indent": "true",
|
||||
"q": "type:0 AND time:2019-04-07*",
|
||||
"rows": "0",
|
||||
"wt": "json"
|
||||
},
|
||||
"status": 0
|
||||
}
|
||||
}
|
||||
</code></pre>
|
||||
|
||||
<ul>
|
||||
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
|
||||
<li>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
|
||||
|
||||
<ul>
|
||||
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
|
||||
</ul></li>
|
||||
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
|
||||
|
||||
<ul>
|
||||
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
|
||||
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
|
||||
</ul></li>
|
||||
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>
|
||||
|
||||
<ul>
|
||||
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
|
||||
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
|
||||
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
|
||||
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
|
||||
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
|
||||
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
@ -46,7 +46,7 @@ Disallow: /cgspace-notes/2015-12/
|
||||
Disallow: /cgspace-notes/2015-11/
|
||||
Disallow: /cgspace-notes/
|
||||
Disallow: /cgspace-notes/categories/
|
||||
Disallow: /cgspace-notes/tags/notes/
|
||||
Disallow: /cgspace-notes/categories/notes/
|
||||
Disallow: /cgspace-notes/tags/notes/
|
||||
Disallow: /cgspace-notes/posts/
|
||||
Disallow: /cgspace-notes/tags/
|
||||
|
@ -4,7 +4,7 @@
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/2019-04/</loc>
|
||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
||||
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
@ -219,7 +219,7 @@
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
||||
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
@ -228,27 +228,27 @@
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
||||
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
||||
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
||||
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||
<priority>0</priority>
|
||||
</url>
|
||||
|
||||
|
Loading…
Reference in New Issue
Block a user