mirror of https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-27
@ -20,7 +20,7 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
|
||||
1277694
|
||||
|
||||
So 4.6 million from XMLUI and another 1.2 million from API requests
|
||||
Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
@ -48,14 +48,14 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
|
||||
1277694
|
||||
|
||||
So 4.6 million from XMLUI and another 1.2 million from API requests
|
||||
Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
||||
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
||||
106781
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.62.2" />
|
||||
<meta name="generator" content="Hugo 0.63.1" />
|
||||
|
||||
|
||||
|
||||
@ -85,7 +85,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
|
||||
|
||||
<!-- combined, minified CSS -->
|
||||
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy+piAwENoVPTw=" crossorigin="anonymous">
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I+LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
|
||||
|
||||
|
||||
<!-- RSS 2.0 feed -->
|
||||
@ -132,7 +132,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
|
||||
<header>
|
||||
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2>
|
||||
<p class="blog-post-meta"><time datetime="2019-11-04T12:20:30+02:00">Mon Nov 04, 2019</time> by Alan Orth in
|
||||
<i class="fa fa-folder" aria-hidden="true"></i> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
|
||||
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
|
||||
|
||||
|
||||
</p>
|
||||
@ -151,7 +151,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
|
||||
1277694
|
||||
</code></pre><ul>
|
||||
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
|
||||
<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
|
||||
<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
|
||||
</ul>
|
||||
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
||||
1183456
|
||||
@ -173,7 +173,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
|
||||
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
|
||||
365288
|
||||
</code></pre><ul>
|
||||
<li>Their user agent is one I've never seen before:</li>
|
||||
<li>Their user agent is one I’ve never seen before:</li>
|
||||
</ul>
|
||||
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
|
||||
</code></pre><ul>
|
||||
@ -196,7 +196,7 @@ Let's see how many of the REST API requests were for bitstreams (because the
|
||||
</ul>
|
||||
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
|
||||
</code></pre><ul>
|
||||
<li>On the topic of spiders, I have been wanting to update DSpace's default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire's COUNTER-Robots</a> project
|
||||
<li>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project
|
||||
<ul>
|
||||
<li>First I checked for a user agent that is in COUNTER-Robots, but NOT in the current <code>dspace/config/spiders/example</code> list</li>
|
||||
<li>Then I made some item and bitstream requests on DSpace Test using that user agent:</li>
|
||||
@ -215,25 +215,25 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
|
||||
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
|
||||
</response>
|
||||
</code></pre><ul>
|
||||
<li>Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:</li>
|
||||
<li>Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:</li>
|
||||
</ul>
|
||||
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
|
||||
</code></pre><ul>
|
||||
<li>After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list…
|
||||
<li>After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
|
||||
<ul>
|
||||
<li>What's strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn't exist…</li>
|
||||
<li>What’s strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn’t exist…</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
|
||||
</code></pre><ul>
|
||||
<li>Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file…</li>
|
||||
<li>Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…</li>
|
||||
<li>I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
|
||||
<ul>
|
||||
<li>Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace's “mark spiders” feature to change them to “isBot:true” in Solr</li>
|
||||
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don't see any items in Solr with <code>isBot:true</code></li>
|
||||
<li>Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr</li>
|
||||
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don’t see any items in Solr with <code>isBot:true</code></li>
|
||||
<li>According to <code>dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java</code> the patterns for user agents are loaded from any file in the <code>config/spiders/agents</code> directory</li>
|
||||
<li>I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran <code>dspace stats-util -m</code> and still there were no new items marked as being bots in Solr, so I think there is still something wrong</li>
|
||||
<li>Jesus, the code in <code>./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java</code> says that <code>stats-util -m</code> marks spider requests by their IPs, not by their user agents… WTF:</li>
|
||||
@ -267,17 +267,17 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
<result name="response" numFound="0" start="0"/>
|
||||
</code></pre><ul>
|
||||
<li>So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
|
||||
<li>So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
|
||||
<ul>
|
||||
<li>Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I'm curious how the special character matching is in Solr, so I will test two requests: one with “<a href="http://www.gnip.com%22">www.gnip.com"</a> which is in the spider list, and one with “<a href="http://www.gnyp.com%22">www.gnyp.com"</a> which isn't:</li>
|
||||
<li>I’m curious how the special character matching is in Solr, so I will test two requests: one with “<a href="http://www.gnip.com">www.gnip.com</a>” which is in the spider list, and one with “<a href="http://www.gnyp.com">www.gnyp.com</a>” which isn’t:</li>
|
||||
</ul>
|
||||
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
|
||||
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
|
||||
</code></pre><ul>
|
||||
<li>Then commit changes to Solr so we don't have to wait:</li>
|
||||
<li>Then commit changes to Solr so we don’t have to wait:</li>
|
||||
</ul>
|
||||
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
||||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
||||
@ -352,7 +352,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
|
||||
</lst>
|
||||
</lst>
|
||||
</code></pre><ul>
|
||||
<li>That answers Peter's question about why the stats jumped in October…</li>
|
||||
<li>That answers Peter’s question about why the stats jumped in October…</li>
|
||||
</ul>
|
||||
<h2 id="2019-11-08">2019-11-08</h2>
|
||||
<ul>
|
||||
@ -409,12 +409,12 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
|
||||
</ul>
|
||||
<h2 id="2019-11-13">2019-11-13</h2>
|
||||
<ul>
|
||||
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn't linked with the DOI's score
|
||||
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn’t linked with the DOI’s score
|
||||
<ul>
|
||||
<li>I tweeted it again with the Handle and the DOI</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr's regex search can't use those</li>
|
||||
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr’s regex search can’t use those</li>
|
||||
</ul>
|
||||
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
|
||||
$ http "http://localhost:8081/solr/statistics/update?commit=true"
|
||||
@ -424,19 +424,19 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
|
||||
<result name="response" numFound="1" start="0">
|
||||
</code></pre><ul>
|
||||
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
|
||||
<li>I realized that it's easier to search Solr from curl via POST using this syntax:</li>
|
||||
<li>I realized that it’s easier to search Solr from curl via POST using this syntax:</li>
|
||||
</ul>
|
||||
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
|
||||
</code></pre><ul>
|
||||
<li>If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
|
||||
<ul>
|
||||
<li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr's regex search:</li>
|
||||
<li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr’s regex search:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
|
||||
</code></pre><ul>
|
||||
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I'm evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
|
||||
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
|
||||
</ul>
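The two ideas in the 2019-11-13 notes above (rewriting `\d` as `[0-9]` for Solr's regex search, and sending the query as a POST body) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of `check-spider-hits.sh`: the function names `pcre_digits_to_class` and `solr_regex_body` are hypothetical, the input pattern is a made-up example of the kind discussed above, and a `\d` inside a bracket expression is deliberately not handled.

```python
import re
from urllib.parse import urlencode


def pcre_digits_to_class(pattern: str) -> str:
    r"""Rewrite the PCRE shorthand \d as the explicit class [0-9].

    Escaped backslashes (\\) are matched first and kept unchanged, so a
    literal backslash followed by "d" is not rewritten. A \d that sits
    inside a bracket expression is NOT handled (simplifying assumption).
    """
    return re.sub(
        r"\\\\|\\d",
        lambda m: "[0-9]" if m.group(0) == r"\d" else m.group(0),
        pattern,
    )


def solr_regex_body(field: str, pattern: str, rows: int = 0) -> str:
    """Build a form-encoded POST body for a /regex/ query on one field.

    Hypothetical helper: it only assembles the request body and does not
    contact any Solr server.
    """
    query = f"{field}:/{pcre_digits_to_class(pattern)}/"
    return urlencode({"q": query, "rows": rows})


# Hypothetical input pattern, chosen only for illustration
print(pcre_digits_to_class(r"Scrapoo\/\d"))
print(solr_regex_body("userAgent", r"Scrapoo\/\d"))
```

Because the body is built with `urlencode`, brackets and slashes are percent-encoded before they leave the client, which sidesteps the curl range expansion of `[0-9]` mentioned above without needing `-g`.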
|
||||
<h2 id="2019-11-14">2019-11-14</h2>
|
||||
<ul>
|
||||
@ -456,14 +456,14 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<li>Greatly improve my <code>check-spider-hits.sh</code> script to handle regular expressions in the spider agents patterns file
|
||||
<ul>
|
||||
<li>This allows me to detect and purge many more hits from the Solr statistics core</li>
|
||||
<li>I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores</li>
|
||||
<li>I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<h2 id="2019-11-15">2019-11-15</h2>
|
||||
<ul>
|
||||
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
|
||||
<li>But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
|
||||
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
|
||||
<li>But then I noticed that some (all?) of the hits weren’t actually getting purged, all of which were using regular expressions like:
|
||||
<ul>
|
||||
<li><code>MetaURI[\+\s]API\/[0-9]\.[0-9]</code></li>
|
||||
<li><code>FDM(\s|\+)[0-9]</code></li>
|
||||
@ -474,10 +474,10 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
</ul>
|
||||
</li>
|
||||
<li>Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!</li>
|
||||
<li>Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
|
||||
<li>Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
|
||||
<ul>
|
||||
<li>I tried to do URL encoding of the +, double escaping, etc… but nothing worked</li>
|
||||
<li>I'm going to ignore regular expressions that have pluses for now</li>
|
||||
<li>I’m going to ignore regular expressions that have pluses for now</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I think I might also have to ignore patterns that have percent signs, like <code>^\%?default\%?$</code></li>
|
||||
@ -495,7 +495,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<li>statistics: 1043373</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>That's 1.4 million hits in addition to the 2 million I purged earlier this week…</li>
|
||||
<li>That’s 1.4 million hits in addition to the 2 million I purged earlier this week…</li>
|
||||
<li>For posterity, the major contributors to the hits on the statistics core were:
|
||||
<ul>
|
||||
<li>Purging 812429 hits from curl/ in statistics</li>
|
||||
@ -512,7 +512,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
</ul>
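The 2019-11-15 notes above point out that plus signs are special in regular expressions, URLs, and the Lucene query parser. The sketch below illustrates only the URL layer: in a form-encoded body a raw `+` decodes to a space, while `%2B` survives as a plus. It is not a diagnosis of the failing delete (the notes say URL encoding alone did not help) and it does not talk to Solr; the variable names are hypothetical, and the pattern is one of those listed above.

```python
from urllib.parse import parse_qs, urlencode

# One of the patterns from the notes that contains an escaped plus sign
pattern = r"FDM(\s|\+)[0-9]"
query = f"userAgent:/{pattern}/"

# Body sent as-is, with no percent-encoding applied by the client
raw_body = f"q={query}"
decoded_raw = parse_qs(raw_body)["q"][0]

# Body where every reserved character, including "+", is percent-encoded
encoded_body = urlencode({"q": query})
decoded_encoded = parse_qs(encoded_body)["q"][0]

print("raw body decodes to:    ", decoded_raw)
print("encoded body decodes to:", decoded_encoded)
print("encoded body on the wire:", encoded_body)
```

If the server-side form decoding behaves like `parse_qs`, the unencoded variant reaches the query parser with an escaped space in place of the escaped plus, which is a different regular expression. Whether this is what happened in the delete requests is not established by the notes.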
|
||||
<h2 id="2019-11-17">2019-11-17</h2>
|
||||
<ul>
|
||||
<li>Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE's collection) was added recently and might have not been in the last harvesting yet
|
||||
<li>Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet
|
||||
<ul>
|
||||
<li>I told her no, that the department is several years old, and the item was added in 2017</li>
|
||||
<li>Then I looked again at the dashboard for each department and I see the item in both departments now… shit.</li>
|
||||
@ -538,7 +538,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
</ul>
|
||||
<h2 id="2019-11-19">2019-11-19</h2>
|
||||
<ul>
|
||||
<li>Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
|
||||
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
|
||||
<ul>
|
||||
<li>I had previously sent them an export in 2019-04</li>
|
||||
</ul>
|
||||
@ -555,15 +555,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<li>Found 4429 hits from ^User-Agent in statistics-2016</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Buck is one I've never heard of before, its user agent is:</li>
|
||||
<li>Buck is one I’ve never heard of before, its user agent is:</li>
|
||||
</ul>
|
||||
<pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
|
||||
</code></pre><ul>
|
||||
<li>All in all that's about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
|
||||
<li>All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
|
||||
</ul>
|
||||
<h2 id="2019-11-20">2019-11-20</h2>
|
||||
<ul>
|
||||
<li>Email Usman Muchlish from CIFOR to see what he's doing with their DSpace lately</li>
|
||||
<li>Email Usman Muchlish from CIFOR to see what he’s doing with their DSpace lately</li>
|
||||
</ul>
|
||||
<h2 id="2019-11-21">2019-11-21</h2>
|
||||
<ul>
|
||||
@ -599,8 +599,8 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
|
||||
<ul>
|
||||
<li>I rebooted DSpace Test (linode19) and it kernel panicked at boot
|
||||
<ul>
|
||||
<li>I looked on the console and saw that it can't mount the root filesystem</li>
|
||||
<li>I switched the boot configuration to use the OS's kernel via GRUB2 instead of Linode's kernel and then it came up after reboot…</li>
|
||||
<li>I looked on the console and saw that it can’t mount the root filesystem</li>
|
||||
<li>I switched the boot configuration to use the OS’s kernel via GRUB2 instead of Linode’s kernel and then it came up after reboot…</li>
|
||||
<li>I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE
|
||||
<ul>
|
||||
<li>The migration is going very slowly, so I assume the network issues from earlier this year are still not fixed</li>
|
||||
|