<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="November, 2019" /> <meta property="og:description" content="2019-11-04 Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs: # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-11/" /> <meta property="article:published_time" content="2019-11-04T12:20:30+02:00" /> <meta property="article:modified_time" content="2019-11-06T09:35:51+02:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="November, 2019"/> <meta name="twitter:description" content="2019-11-04 Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs: # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 "/> <meta name="generator" content="Hugo 0.59.1" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "November, 2019", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-11\/", "wordCount": "1088", "datePublished": "2019-11-04T12:20:30+02:00", "dateModified": "2019-11-06T09:35:51+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-11/"> <title>November, 2019 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous"> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2> <p class="blog-post-meta"><time datetime="2019-11-04T12:20:30+02:00">Mon Nov 04, 2019</time> by Alan Orth in <i class="fa fa-folder" aria-hidden="true"></i> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a> </p> </header> <h2 id="2019-11-04">2019-11-04</h2> <ul> <li><p>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics</p> <ul> <li><p>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</p> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 </code></pre></li> </ul></li> <li><p>So 4.6 million from XMLUI and another 1.2 million from API requests</p></li> <li><p>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</p> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 </code></pre></li> </ul> <ul> <li><p>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</p> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n 1 PUT 8 PROPFIND 283 OPTIONS 30102 POST 46581 HEAD 4594967 GET </code></pre></li> <li><p>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</p> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)' 365288 </code></pre></li> <li><p>Their user agent is one I’ve never seen before:</p> <pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) </code></pre></li> <li><p>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</p></li> </ul> <h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-o-e-get-bitstream-discover-handle-sort-uniq-c">zcat –force /var/log/nginx/<em>access.log.</em>.gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -o -E “GET /(bitstream|discover|handle)” | sort | uniq -c</h1> <p>6566 GET /bitstream 351928 GET /handle</p> <h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-e-get-bitstream-discover-handle-grep-c-discover">zcat –force /var/log/nginx/<em>access.log.</em>.gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -E “GET /(bitstream|discover|handle)” | grep -c discover</h1> <p>214209</p> <h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-e-get-bitstream-discover-handle-grep-c-browse">zcat –force /var/log/nginx/<em>access.log.</em>.gz | grep -E “[0-9]{1,2}/Oct/2019” | grep Amazonbot | grep -E “GET /(bitstream|discover|handle)” | grep -c browse</h1> <p>86874</p> <pre><code> - As far as I can tell, none of their requests are counted in the Solr statistics: </code></pre> <p>$ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'">http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'</a></p> <pre><code> - Still, those requests are CPU intensive so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load - After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503): </code></pre> <p>$ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/1/discover'">https://dspacetest.cgiar.org/handle/10568/1/discover'</a> User-Agent:“Amazonbot/0.1”</p> <pre><code> - On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list - Then I made some item and bitstream requests on DSpace Test using that user agent: </code></pre> <p>$ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“iskanie” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“iskanie” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y'">https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y'</a> User-Agent:“iskanie”</p> <pre><code> - A bit later I checked Solr and found three requests from my IP with that user agent this month: </code></pre> <p>$ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'">http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'</a> <?xml version=“1.0” encoding=“UTF-8”?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result> </response></p> <pre><code> - Now I want to make similar requests with a user agent that is included in DSpace's current user agent list: </code></pre> <p>$ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“celestial” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“celestial” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y'">https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y'</a> User-Agent:“celestial”</p> <pre><code> - After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list... - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist... </code></pre> <p>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt</p> <pre><code> - Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file... - I tried with some other garbage user agents like "fuuuualan" and they were visible in Solr - Now I want to try adding "iskanie" and "fuuuualan" to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's "mark spiders" feature to change them to "isBot:true" in Solr - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for awhile, but I still don't see any items in Solr with `isBot:true` - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF: </code></pre> <p>else if (line.hasOption(’m’)) { SolrLogger.markRobotsByIP(); }</p> <pre><code> - WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere! - It appears to be unimplemented... - I sent a message to the dspace-tech mailing list to ask if I should file an issue ## 2019-11-05 - I added "alanfuu2" to the example spiders file, restarted Tomcat, then made two requests to DSpace Test: </code></pre> <p>$ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“alanfuuu1” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“alanfuuu2”</p> <pre><code> - After committing the changes in Solr I saw one request for "alanfuu1" and no requests for "alanfuu2": </code></pre> <p>$ http –print b ‘<a href="http://localhost:8081/solr/statistics/update?commit=true'">http://localhost:8081/solr/statistics/update?commit=true'</a> $ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11'</a> | xmllint –format - | grep numFound <result name="response" numFound="1" start="0"> $ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11'</a> | xmllint –format - | grep numFound <result name="response" numFound="0" start="0"/></p> <pre><code> - So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list - Even though the "mark by user agent" function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents - I'm curious how the special character matching is in Solr, so I will test two requests: one with "www.gnip.com" which is in the spider list, and one with "www.gnyp.com" which isn't: </code></pre> <p>$ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“www.gnip.com” $ http –print Hh ‘<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:“www.gnyp.com”</p> <pre><code> - Then commit changes to Solr so we don't have to wait: </code></pre> <p>$ http –print b ‘<a href="http://localhost:8081/solr/statistics/update?commit=true'">http://localhost:8081/solr/statistics/update?commit=true'</a> $ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11'</a> | xmllint –format - | grep numFound <result name="response" numFound="0" start="0"/> $ http –print b ‘<a href="http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11'</a> | xmllint –format - | grep numFound <result name="response" numFound="1" start="0"> ```</p> <ul> <li>So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…</li> </ul> <h2 id="2019-11-07">2019-11-07</h2> <ul> <li>CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate <ul> <li>They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty</li> <li>I have now merged the changes in to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/432">#432</a>)</li> </ul></li> </ul> <!-- vim: set sw=2 ts=2: --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2019-11/">November, 2019</a></li> <li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li> <li><a href="/cgspace-notes/2019-10/">October, 2019</a></li> <li><a href="/cgspace-notes/2019-09/">September, 2019</a></li> <li><a href="/cgspace-notes/2019-08/">August, 2019</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>