Add notes for 2020-01-27

2020-01-27 16:20:44 +02:00
parent 207ace0883
commit 8feb93be39
112 changed files with 11466 additions and 5158 deletions


@ -20,7 +20,7 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
@ -48,14 +48,14 @@ I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 milli
1277694
So 4.6 million from XMLUI and another 1.2 million from API requests
Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
106781
"/>
<meta name="generator" content="Hugo 0.62.2" />
<meta name="generator" content="Hugo 0.63.1" />
@ -85,7 +85,7 @@ Let&#39;s see how many of the REST API requests were for bitstreams (because the
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy&#43;piAwENoVPTw=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I&#43;LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
@ -132,7 +132,7 @@ Let&#39;s see how many of the REST API requests were for bitstreams (because the
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-11-04T12:20:30&#43;02:00">Mon Nov 04, 2019</time> by Alan Orth in
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
</p>
@ -151,7 +151,7 @@ Let&#39;s see how many of the REST API requests were for bitstreams (because the
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let's see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456
@ -173,7 +173,7 @@ Let&#39;s see how many of the REST API requests were for bitstreams (because the
<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
</code></pre><ul>
<li>Their user agent is one I've never seen before:</li>
<li>Their user agent is one I&rsquo;ve never seen before:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
</code></pre><ul>
@ -196,7 +196,7 @@ Let&#39;s see how many of the REST API requests were for bitstreams (because the
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:&quot;Amazonbot/0.1&quot;
</code></pre><ul>
<li>On the topic of spiders, I have been wanting to update DSpace's default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire's COUNTER-Robots</a> project
<li>On the topic of spiders, I have been wanting to update DSpace&rsquo;s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire&rsquo;s COUNTER-Robots</a> project
<ul>
<li>First I checked for a user agent that is in COUNTER-Robots, but NOT in the current <code>dspace/config/spiders/example</code> list</li>
<li>Then I made some item and bitstream requests on DSpace Test using that user agent:</li>
@ -215,25 +215,25 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
&lt;lst name=&quot;responseHeader&quot;&gt;&lt;int name=&quot;status&quot;&gt;0&lt;/int&gt;&lt;int name=&quot;QTime&quot;&gt;1&lt;/int&gt;&lt;lst name=&quot;params&quot;&gt;&lt;str name=&quot;q&quot;&gt;ip:73.178.9.24 AND userAgent:iskanie&lt;/str&gt;&lt;str name=&quot;fq&quot;&gt;dateYearMonth:2019-11&lt;/str&gt;&lt;str name=&quot;rows&quot;&gt;0&lt;/str&gt;&lt;/lst&gt;&lt;/lst&gt;&lt;result name=&quot;response&quot; numFound=&quot;3&quot; start=&quot;0&quot;&gt;&lt;/result&gt;
&lt;/response&gt;
</code></pre><ul>
<li>Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:</li>
<li>Now I want to make similar requests with a user agent that is included in DSpace&rsquo;s current user agent list:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;celestial&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y' User-Agent:&quot;celestial&quot;
</code></pre><ul>
<li>After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list&hellip;
<li>After twenty minutes I didn&rsquo;t see any requests in Solr, so I assume they did not get logged because they matched a bot list&hellip;
<ul>
<li>What's strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn't exist&hellip;</li>
<li>What&rsquo;s strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn&rsquo;t exist&hellip;</li>
</ul>
</li>
</ul>
<pre><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
</code></pre><ul>
<li>Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file&hellip;</li>
<li>Apparently that is part of Atmire&rsquo;s CUA, despite being in a standard DSpace configuration file&hellip;</li>
<li>I tried with some other garbage user agents like &ldquo;fuuuualan&rdquo; and they were visible in Solr
<ul>
<li>Now I want to try adding &ldquo;iskanie&rdquo; and &ldquo;fuuuualan&rdquo; to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace's &ldquo;mark spiders&rdquo; feature to change them to &ldquo;isBot:true&rdquo; in Solr</li>
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don't see any items in Solr with <code>isBot:true</code></li>
<li>Now I want to try adding &ldquo;iskanie&rdquo; and &ldquo;fuuuualan&rdquo; to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace&rsquo;s &ldquo;mark spiders&rdquo; feature to change them to &ldquo;isBot:true&rdquo; in Solr</li>
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don&rsquo;t see any items in Solr with <code>isBot:true</code></li>
<li>According to <code>dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java</code> the patterns for user agents are loaded from any file in the <code>config/spiders/agents</code> directory</li>
<li>I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran <code>dspace stats-util -m</code> and still there were no new items marked as being bots in Solr, so I think there is still something wrong</li>
<li>Jesus, the code in <code>./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java</code> says that <code>stats-util -m</code> marks spider requests by their IPs, not by their user agents&hellip; WTF:</li>
@ -267,17 +267,17 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
&lt;result name=&quot;response&quot; numFound=&quot;0&quot; start=&quot;0&quot;/&gt;
</code></pre><ul>
<li>So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
<li>So basically it seems like a win to update the example file with the latest one from Atmire&rsquo;s COUNTER-Robots list
<ul>
<li>Even though the &ldquo;mark by user agent&rdquo; function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents</li>
</ul>
</li>
<li>I'm curious how the special character matching is in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com%22">www.gnip.com&quot;</a> which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com%22">www.gnyp.com&quot;</a> which isn't:</li>
<li>I&rsquo;m curious how the special character matching is in Solr, so I will test two requests: one with &ldquo;<a href="http://www.gnip.com">www.gnip.com</a>&rdquo; which is in the spider list, and one with &ldquo;<a href="http://www.gnyp.com">www.gnyp.com</a>&rdquo; which isn&rsquo;t:</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnip.com&quot;
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;www.gnyp.com&quot;
</code></pre><ul>
<li>Then commit changes to Solr so we don't have to wait:</li>
<li>Then commit changes to Solr so we don&rsquo;t have to wait:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
@ -352,7 +352,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
&lt;/lst&gt;
&lt;/lst&gt;
</code></pre><ul>
<li>That answers Peter's question about why the stats jumped in October&hellip;</li>
<li>That answers Peter&rsquo;s question about why the stats jumped in October&hellip;</li>
</ul>
<h2 id="2019-11-08">2019-11-08</h2>
<ul>
@ -409,12 +409,12 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
</ul>
<h2 id="2019-11-13">2019-11-13</h2>
<ul>
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn't linked with the DOI's score
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn&rsquo;t linked with the DOI&rsquo;s score
<ul>
<li>I tweeted it again with the Handle and the DOI</li>
</ul>
</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr's regex search can't use those</li>
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of <code>\d</code> digit character type, as Solr&rsquo;s regex search can&rsquo;t use those</li>
</ul>
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:&quot;Scrapoo/1&quot;
$ http &quot;http://localhost:8081/solr/statistics/update?commit=true&quot;
@ -424,19 +424,19 @@ $ http &quot;http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
&lt;result name=&quot;response&quot; numFound=&quot;1&quot; start=&quot;0&quot;&gt;
</code></pre><ul>
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
<li>I realized that it's easier to search Solr from curl via POST using this syntax:</li>
<li>I realized that it&rsquo;s easier to search Solr from curl via POST using this syntax:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:*Scrapoo*&amp;rows=0&quot;
</code></pre><ul>
<li>If the parameters include something like &ldquo;[0-9]&rdquo; then curl interprets it as a range and will make ten requests
<ul>
<li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST; for example, it seems that I have fewer issues with escaping special characters when using Solr's regex search:</li>
<li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST; for example, it seems that I have fewer issues with escaping special characters when using Solr&rsquo;s regex search:</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&amp;rows=2'
</code></pre><ul>
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I'm evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I&rsquo;m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
</ul>
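<p>The <code>\d</code> to <code>[0-9]</code> rewrite and the POST query syntax from the 2019-11-13 notes above could be combined roughly as follows. This is only a minimal sketch: the function names <code>to_solr_regex</code> and <code>solr_agent_query</code> are hypothetical and are not taken from <code>check-spider-hits.sh</code>, and the sketch assumes every <code>\d</code> in a pattern is a digit class rather than an escaped backslash followed by a literal "d".</p>

```shell
#!/bin/sh
# Hypothetical helper (not from check-spider-hits.sh): rewrite the PCRE-style
# \d digit class to [0-9], as the notes do by hand for Solr regex searches.
# Assumes a backslash followed by "d" always means "digit".
to_solr_regex() {
    printf '%s\n' "$1" | sed 's/\\d/[0-9]/g'
}

# Hypothetical helper: build the POST body for a regex search on the
# userAgent field, using the /pattern/ syntax shown in the notes.
solr_agent_query() {
    printf 'q=userAgent:/%s/&rows=0' "$(to_solr_regex "$1")"
}

# Example usage against the local statistics core used in the notes
# (commented out so that the sketch runs without a Solr instance):
#   curl -s 'http://localhost:8081/solr/statistics/select' \
#       -d "$(solr_agent_query 'Scrapoo\/\d')"
```

<p>Sending the query as POST data with <code>-d</code> also sidesteps the curl range globbing of <code>[0-9]</code> mentioned above, since the brackets no longer appear in the URL.</p>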
<h2 id="2019-11-14">2019-11-14</h2>
<ul>
@ -456,14 +456,14 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>Greatly improve my <code>check-spider-hits.sh</code> script to handle regular expressions in the spider agents patterns file
<ul>
<li>This allows me to detect and purge many more hits from the Solr statistics core</li>
<li>I've tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace's Solr cores</li>
<li>I&rsquo;ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace&rsquo;s Solr cores</li>
</ul>
</li>
</ul>
<h2 id="2019-11-15">2019-11-15</h2>
<ul>
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace's Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
<li>But then I noticed that some (all?) of the hits weren't actually getting purged, all of which were using regular expressions like:
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace&rsquo;s Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
<li>But then I noticed that some (all?) of the hits weren&rsquo;t actually getting purged, all of which were using regular expressions like:
<ul>
<li><code>MetaURI[\+\s]API\/[0-9]\.[0-9]</code></li>
<li><code>FDM(\s|\+)[0-9]</code></li>
@ -474,10 +474,10 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
</li>
<li>Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!</li>
<li>Plus signs are special in regular expressions, URLs, and Solr's Lucene query parser, so I'm actually not sure where the issue is
<li>Plus signs are special in regular expressions, URLs, and Solr&rsquo;s Lucene query parser, so I&rsquo;m actually not sure where the issue is
<ul>
<li>I tried to do URL encoding of the +, double escaping, etc&hellip; but nothing worked</li>
<li>I'm going to ignore regular expressions that have pluses for now</li>
<li>I&rsquo;m going to ignore regular expressions that have pluses for now</li>
</ul>
</li>
<li>I think I might also have to ignore patterns that have percent signs, like <code>^\%?default\%?$</code></li>
@ -495,7 +495,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>statistics: 1043373</li>
</ul>
</li>
<li>That's 1.4 million hits in addition to the 2 million I purged earlier this week&hellip;</li>
<li>That&rsquo;s 1.4 million hits in addition to the 2 million I purged earlier this week&hellip;</li>
<li>For posterity, the major contributors to the hits on the statistics core were:
<ul>
<li>Purging 812429 hits from curl/ in statistics</li>
@ -512,7 +512,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
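<p>The workaround from the 2019-11-15 notes above, skipping any spider pattern that contains a plus or percent sign, can be sketched as a small filter. The function name <code>filter_patterns</code> is hypothetical and the sketch assumes one pattern per line on standard input; it is not the actual logic in <code>check-spider-hits.sh</code>.</p>

```shell
#!/bin/sh
# Hypothetical filter (an assumption, not the script's real code): read spider
# agent patterns on stdin, one per line, and drop every pattern containing a
# plus sign or a percent sign, since those were not purged reliably.
filter_patterns() {
    grep -v '[+%]'
}

# Example: of the four patterns below, only the two without + or % remain.
printf '%s\n' \
    'MetaURI[\+\s]API\/[0-9]\.[0-9]' \
    'FDM(\s|\+)[0-9]' \
    '^\%?default\%?$' \
    'curl/' \
    'Buck\/[0-9]' | filter_patterns
```

<p>Filtering on the raw characters is deliberately blunt: it also drops patterns where the plus is escaped, which matches the decision in the notes to ignore all patterns with pluses for now.</p>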
<h2 id="2019-11-17">2019-11-17</h2>
<ul>
<li>Altmetric support responded about our dashboard question, asking if the second &ldquo;department&rdquo; (aka WLE's collection) was added recently and might not have been included in the last harvest yet
<li>Altmetric support responded about our dashboard question, asking if the second &ldquo;department&rdquo; (aka WLE&rsquo;s collection) was added recently and might not have been included in the last harvest yet
<ul>
<li>I told her no, that the department is several years old, and the item was added in 2017</li>
<li>Then I looked again at the dashboard for each department and I see the item in both departments now&hellip; shit.</li>
@ -538,7 +538,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
<h2 id="2019-11-19">2019-11-19</h2>
<ul>
<li>Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
<li>Export IITA&rsquo;s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
<ul>
<li>I had previously sent them an export in 2019-04</li>
</ul>
@ -555,15 +555,15 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<li>Found 4429 hits from ^User-Agent in statistics-2016</li>
</ul>
</li>
<li>Buck is one I've never heard of before, its user agent is:</li>
<li>Buck is one I&rsquo;ve never heard of before, its user agent is:</li>
</ul>
<pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
</code></pre><ul>
<li>All in all that's about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
<li>All in all that&rsquo;s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
</ul>
<h2 id="2019-11-20">2019-11-20</h2>
<ul>
<li>Email Usman Muchlish from CIFOR to see what he's doing with their DSpace lately</li>
<li>Email Usman Muchlish from CIFOR to see what he&rsquo;s doing with their DSpace lately</li>
</ul>
<h2 id="2019-11-21">2019-11-21</h2>
<ul>
@ -599,8 +599,8 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<ul>
<li>I rebooted DSpace Test (linode19) and it kernel panicked at boot
<ul>
<li>I looked on the console and saw that it can't mount the root filesystem</li>
<li>I switched the boot configuration to use the OS's kernel via GRUB2 instead of Linode's kernel and then it came up after reboot&hellip;</li>
<li>I looked on the console and saw that it can&rsquo;t mount the root filesystem</li>
<li>I switched the boot configuration to use the OS&rsquo;s kernel via GRUB2 instead of Linode&rsquo;s kernel and then it came up after reboot&hellip;</li>
<li>I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE
<ul>
<li>The migration is going very slowly, so I assume the network issues from earlier this year are still not fixed</li>