mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 14:45:03 +01:00
Update notes for 2019-04-07
parent 89a4212e2b
commit c9685770ab
@@ -173,7 +173,7 @@ GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_re
|
|||||||
## 2019-04-07
|
## 2019-04-07
|
||||||
|
|
||||||
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
|
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
|
||||||
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
|
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
|
||||||
- I am trying to see if these are registered as downloads in Solr or not
|
- I am trying to see if these are registered as downloads in Solr or not
|
||||||
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
|
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
|
||||||
|
|
||||||
@@ -293,5 +293,63 @@ X-XSS-Protection: 1; mode=block
|
|||||||
```
|
```
|
||||||
|
|
||||||
- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
||||||
|
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
|
||||||
|
|
||||||
|
```
|
||||||
|
2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||||
|
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||||
|
```
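To compare these log entries against what ends up in Solr, it can help to pull the event type and bitstream ID out of each line. The following is only a sketch: the regular expression is inferred from the two sample lines above, the name `parse_usage_event` is made up for illustration, and the pattern assumes IPv4-style addresses with no colons in them.

```python
import re

# Pattern inferred from the two sample log lines above (an assumption);
# other DSpace versions or event types may use a different layout.
USAGE_EVENT = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"INFO\s+org\.dspace\.usage\.LoggerUsageEventListener @ "
    r"(?P<user>[^:]+):session_id=(?P<session_id>[0-9A-F]+):"
    r"ip_addr=(?P<ip_addr>[^:]+):(?P<action>\w+):bitstream_id=(?P<bitstream_id>\d+)$"
)


def parse_usage_event(line):
    """Return the fields of a usage event log line as a dict, or None if it does not match."""
    match = USAGE_EVENT.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ "
    "anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:"
    "ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818"
)
event = parse_usage_event(sample)
print(event["action"], event["bitstream_id"])
```

Counting the parsed `view_bitstream` events per bitstream gives a number to check against the Solr statistics core once it commits.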
|
||||||
|
|
||||||
|
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
|
||||||
|
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:
|
||||||
|
|
||||||
|
```
|
||||||
|
2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|
||||||
|
```
|
||||||
|
|
||||||
|
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird
|
||||||
|
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
|
||||||
|
- I will try to re-deploy the `5_x-dev` branch and test again
|
||||||
|
- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`)
|
||||||
|
- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal):
|
||||||
|
|
||||||
|
```
|
||||||
|
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
|
||||||
|
{
|
||||||
|
"response": {
|
||||||
|
"docs": [],
|
||||||
|
"numFound": 909,
|
||||||
|
"start": 0
|
||||||
|
},
|
||||||
|
"responseHeader": {
|
||||||
|
"QTime": 0,
|
||||||
|
"params": {
|
||||||
|
"fq": [
|
||||||
|
"statistics_type:view",
|
||||||
|
"isInternal:true"
|
||||||
|
],
|
||||||
|
"indent": "true",
|
||||||
|
"q": "type:0 AND time:2019-04-07*",
|
||||||
|
"rows": "0",
|
||||||
|
"wt": "json"
|
||||||
|
},
|
||||||
|
"status": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
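The same query can be built programmatically when checking several days or toggling the `isInternal` filter. This is a rough sketch using only the Python standard library: the helper names `build_stats_query` and `count_hits` are hypothetical, the Solr URL is the local one from the command above, and the sample response is a trimmed copy of the output shown there rather than a live result.

```python
import json
from urllib.parse import urlencode

# Local Solr statistics core, as used in the httpie command above.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"


def build_stats_query(day, internal=True):
    """Build a select URL that counts bitstream views (type:0) for one day."""
    params = [
        ("q", f"type:0 AND time:{day}*"),
        ("fq", "statistics_type:view"),
        ("fq", f"isInternal:{str(internal).lower()}"),
        ("rows", "0"),  # only the count is needed, not the documents
        ("wt", "json"),
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


def count_hits(response_text):
    """Extract numFound from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


url = build_stats_query("2019-04-07")
# Trimmed copy of the response shown above, used here instead of a live request.
sample_response = '{"response": {"docs": [], "numFound": 909, "start": 0}}'
print(url)
print(count_hits(sample_response))  # 909
```

Passing `internal=False` would count requests from external clients instead, which is the figure that matters for the HEAD versus GET question.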
|
||||||
|
|
||||||
|
- I confirmed the same on CGSpace itself after making one HEAD request
|
||||||
|
- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
|
||||||
|
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
|
||||||
|
- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
|
||||||
|
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
|
||||||
|
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
|
||||||
|
- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat`
|
||||||
|
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
|
||||||
|
- See: [DS-3986](https://jira.duraspace.org/browse/DS-3986)
|
||||||
|
- See: [DS-4020](https://jira.duraspace.org/browse/DS-4020)
|
||||||
|
- See: [DS-3832](https://jira.duraspace.org/browse/DS-3832)
|
||||||
|
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01
|
||||||
|
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
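Since the missing database produced no error at all, a small pre-flight check before deployment might catch this earlier. The sketch below only tests that a file exists and is not empty. The function name `geoip_db_problem` and the path in the example are hypothetical, because the notes do not say where DSpace expects `GeoLiteCity.dat` on these servers.

```python
from pathlib import Path


def geoip_db_problem(db_path):
    """Return a short description of what is wrong with the GeoIP file, or None if it looks usable."""
    path = Path(db_path)
    if not path.is_file():
        return f"missing: {path}"
    if path.stat().st_size == 0:
        return f"empty: {path}"
    return None


# Hypothetical location; substitute the path configured for the DSpace instance.
problem = geoip_db_problem("/path/does/not/exist/GeoLiteCity.dat")
if problem:
    print(f"GeoIP database check failed ({problem}); usage statistics may not be recorded")
```

This does not validate the contents of the database, only that something is present at the expected location.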
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
|||||||
<meta property="og:type" content="article" />
|
<meta property="og:type" content="article" />
|
||||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
|
||||||
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00"/>
|
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00"/>
|
||||||
<meta property="article:modified_time" content="2019-04-06T12:06:14+03:00"/>
|
<meta property="article:modified_time" content="2019-04-07T11:45:34+03:00"/>
|
||||||
|
|
||||||
<meta name="twitter:card" content="summary"/>
|
<meta name="twitter:card" content="summary"/>
|
||||||
<meta name="twitter:title" content="April, 2019"/>
|
<meta name="twitter:title" content="April, 2019"/>
|
||||||
@@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
|||||||
"@type": "BlogPosting",
|
"@type": "BlogPosting",
|
||||||
"headline": "April, 2019",
|
"headline": "April, 2019",
|
||||||
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
|
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
|
||||||
"wordCount": "1457",
|
"wordCount": "1954",
|
||||||
"datePublished": "2019-04-01T09:00:43+03:00",
|
"datePublished": "2019-04-01T09:00:43+03:00",
|
||||||
"dateModified": "2019-04-06T12:06:14+03:00",
|
"dateModified": "2019-04-07T11:45:34+03:00",
|
||||||
"author": {
|
"author": {
|
||||||
"@type": "Person",
|
"@type": "Person",
|
||||||
"name": "Alan Orth"
|
"name": "Alan Orth"
|
||||||
@@ -363,7 +363,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
|
|||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
|
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
|
||||||
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
|
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
|
||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
<li>I am trying to see if these are registered as downloads in Solr or not</li>
|
<li>I am trying to see if these are registered as downloads in Solr or not</li>
|
||||||
@@ -489,7 +489,86 @@ X-XSS-Protection: 1; mode=block
|
|||||||
</code></pre>
|
</code></pre>
|
||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr</li>
|
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
|
||||||
|
</ul></li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||||
|
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||||||
|
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</li>
|
||||||
|
</ul></li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|
||||||
|
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
|
||||||
|
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
|
||||||
|
</ul></li>
|
||||||
|
<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
|
||||||
|
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
|
||||||
|
{
|
||||||
|
"response": {
|
||||||
|
"docs": [],
|
||||||
|
"numFound": 909,
|
||||||
|
"start": 0
|
||||||
|
},
|
||||||
|
"responseHeader": {
|
||||||
|
"QTime": 0,
|
||||||
|
"params": {
|
||||||
|
"fq": [
|
||||||
|
"statistics_type:view",
|
||||||
|
"isInternal:true"
|
||||||
|
],
|
||||||
|
"indent": "true",
|
||||||
|
"q": "type:0 AND time:2019-04-07*",
|
||||||
|
"rows": "0",
|
||||||
|
"wt": "json"
|
||||||
|
},
|
||||||
|
"status": 0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
|
||||||
|
<li>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
|
||||||
|
</ul></li>
|
||||||
|
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
|
||||||
|
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
|
||||||
|
</ul></li>
|
||||||
|
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
|
||||||
|
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
|
||||||
|
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
|
||||||
|
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
|
||||||
|
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
|
||||||
|
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
|
||||||
|
</ul></li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@@ -46,7 +46,7 @@ Disallow: /cgspace-notes/2015-12/
|
|||||||
Disallow: /cgspace-notes/2015-11/
|
Disallow: /cgspace-notes/2015-11/
|
||||||
Disallow: /cgspace-notes/
|
Disallow: /cgspace-notes/
|
||||||
Disallow: /cgspace-notes/categories/
|
Disallow: /cgspace-notes/categories/
|
||||||
Disallow: /cgspace-notes/tags/notes/
|
|
||||||
Disallow: /cgspace-notes/categories/notes/
|
Disallow: /cgspace-notes/categories/notes/
|
||||||
|
Disallow: /cgspace-notes/tags/notes/
|
||||||
Disallow: /cgspace-notes/posts/
|
Disallow: /cgspace-notes/posts/
|
||||||
Disallow: /cgspace-notes/tags/
|
Disallow: /cgspace-notes/tags/
|
||||||
|
@@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2019-04/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2019-04/</loc>
|
||||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
@@ -219,7 +219,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@@ -228,27 +228,27 @@
|
|||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
|
||||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
|
||||||
<priority>0</priority>
|
|
||||||
</url>
|
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
||||||
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
|
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
|
<url>
|
||||||
|
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||||
|
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||||
|
<priority>0</priority>
|
||||||
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
||||||
<lastmod>2019-04-06T12:06:14+03:00</lastmod>
|
<lastmod>2019-04-07T11:45:34+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
|