<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="November, 2019" /> <meta property="og:description" content="2019-11-04 Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs: # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-11/" /> <meta property="article:published_time" content="2019-11-04T12:20:30+02:00" /> <meta property="article:modified_time" content="2019-11-28T17:30:45+02:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="November, 2019"/> <meta name="twitter:description" content="2019-11-04 Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs: # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 So 4.6 million from XMLUI and another 1.2 million from API requests Let’s see how many of the REST API requests were 
for bitstreams (because they are counted in Solr stats): # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 "/> <meta name="generator" content="Hugo 0.78.2" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "November, 2019", "url": "https://alanorth.github.io/cgspace-notes/2019-11/", "wordCount": "3457", "datePublished": "2019-11-04T12:20:30+02:00", "dateModified": "2019-11-28T17:30:45+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-11/"> <title>November, 2019 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.d20c61b183eb27beb5b2c48f70a38b91c8bb5fb929e77b447d5f77c7285221ad.css" rel="stylesheet" integrity="sha256-0gxhsYPrJ761ssSPcKOLkci7X7kp53tEfV93xyhSIa0=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.4ed405d7c7002b970d34cbe6026ff44a556b0808cb98a9db4008752110ed964b.js" integrity="sha256-TtQF18cAK5cNNMvmAm/0SlVrCAjLmKnbQAh1IRDtlks=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 
blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2> <p class="blog-post-meta"> <time datetime="2019-11-04T12:20:30+02:00">Mon Nov 04, 2019</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2019-11-04">2019-11-04</h2> <ul> <li>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics <ul> <li>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</li> </ul> </li> </ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 4671942 # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019" 1277694 </code></pre><ul> <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li> <li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li> </ul> <pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019" 1183456 # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 </code></pre><ul> <li>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</li> </ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n 1 PUT 8 PROPFIND 283 OPTIONS 30102 POST 46581 HEAD 4594967 GET </code></pre><ul> <li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li> </ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)' 365288 </code></pre><ul> <li>Their user agent is 
one I’ve never seen before:</li> </ul> <pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) </code></pre><ul> <li>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li> </ul> <pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c 6566 GET /bitstream 351928 GET /handle # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover 214209 # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse 86874 </code></pre><ul> <li>As far as I can tell, none of their requests are counted in the Solr statistics:</li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true' </code></pre><ul> <li>Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load</li> <li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1" </code></pre><ul> <li>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project <ul> <li>First I checked for a user agent that is in COUNTER-Robots, but NOT in the current 
<code>dspace/config/spiders/example</code> list</li> <li>Then I made some item and bitstream requests on DSpace Test using that user agent:</li> </ul> </li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie" </code></pre><ul> <li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0' <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result> </response> </code></pre><ul> <li>Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:</li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial" $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial" </code></pre><ul> <li>After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list… <ul> <li>What’s strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn’t exist…</li> </ul> </li> </ul> <pre><code>spider.agentregex.regexfile = 
${dspace.dir}/config/spiders/Bots-2013-03.txt </code></pre><ul> <li>Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…</li> <li>I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr <ul> <li>Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr</li> <li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don’t see any items in Solr with <code>isBot:true</code></li> <li>According to <code>dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java</code> the patterns for user agents are loaded from any file in the <code>config/spiders/agents</code> directory</li> <li>I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran <code>dspace stats-util -m</code> and still there were no new items marked as being bots in Solr, so I think there is still something wrong</li> <li>Jesus, the code in <code>./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java</code> says that <code>stats-util -m</code> marks spider requests by their IPs, not by their user agents… WTF:</li> </ul> </li> </ul> <pre><code>else if (line.hasOption('m')) { SolrLogger.markRobotsByIP(); } </code></pre><ul> <li>WTF again, there is actually a function called <code>markRobotByUserAgent()</code> that is never called anywhere! 
<ul> <li>It appears to be unimplemented…</li> <li>I sent a message to the dspace-tech mailing list to ask if I should file an issue</li> </ul> </li> </ul> <h2 id="2019-11-05">2019-11-05</h2> <ul> <li>I added “alanfuuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2" </code></pre><ul> <li>After committing the changes in Solr I saw one request for “alanfuuu1” and no requests for “alanfuuu2”:</li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/> </code></pre><ul> <li>So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list <ul> <li>Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list), DSpace will still not log Solr events from these user agents</li> </ul> </li> <li>I’m curious how special character matching works in Solr, so I will test two requests: one with “<a href="http://www.gnip.com">www.gnip.com</a>” which is in the spider list, and one with “<a href="http://www.gnyp.com">www.gnyp.com</a>” which isn’t:</li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com" $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com" </code></pre><ul> <li>Then commit changes to Solr so we don’t have to 
wait:</li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true' $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/> $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> </code></pre><ul> <li>So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…</li> </ul> <h2 id="2019-11-07">2019-11-07</h2> <ul> <li>CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate <ul> <li>They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty</li> <li>I have now merged the changes in to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/432">#432</a>)</li> </ul> </li> <li>I am reconsidering the move of <code>cg.identifier.dataurl</code> to <code>cg.hasMetadata</code> in CG Core v2 <ul> <li>The values of this field are mostly links to data sets on Dataverse and partner sites</li> <li>I opened an <a href="https://github.com/AgriculturalSemantics/cg-core/issues/10">issue on GitHub</a> to ask Marie-Angelique for clarification</li> </ul> </li> <li>Looking into CGSpace statistics again <ul> <li>I searched for hits in Solr from the BUbiNG bot and found 63,000 in the <code>statistics-2018</code> core:</li> </ul> </li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound <result name="response" numFound="62944" start="0"> </code></pre><ul> <li>Similar for com.plumanalytics, Grammarly, and ltx71!</li> </ul> <pre><code>$ http --print b 
'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*com.plumanalytics*' | xmllint --format - | grep numFound <result name="response" numFound="28256" start="0"> $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound <result name="response" numFound="6288" start="0"> $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound <result name="response" numFound="105663" start="0"> </code></pre><ul> <li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li> </ul> <pre><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true' $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound <result name="response" numFound="0" start="0"/> </code></pre><ul> <li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores <ul> <li>For years 2010 until 2019 there are 1.6 million hits from these spider user agents</li> <li>For 2019 alone there are 740,000, over half of which come from Unpaywall!</li> <li>Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:</li> </ul> </li> </ul> <pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=12&q=userAgent:*Unpaywall*' | xmllint --format - | less ... 
<lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="dateYearMonth"> <int name="2019-10">198624</int> <int name="2019-05">88422</int> <int name="2019-06">79911</int> <int name="2019-09">67065</int> <int name="2019-07">39026</int> <int name="2019-08">36889</int> <int name="2019-04">36512</int> <int name="2019-11">760</int> </lst> </lst> </code></pre><ul> <li>That answers Peter’s question about why the stats jumped in October…</li> </ul> <h2 id="2019-11-08">2019-11-08</h2> <ul> <li>I saw a bunch of user agents that have the literal string <code>User-Agent</code> in their user agent HTTP header, for example: <ul> <li><code>User-Agent: Drupal (+http://drupal.org/)</code></li> <li><code>User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31</code></li> <li><code>User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;</code></li> <li><code>User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)</code></li> <li><code>User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;</code></li> <li><code>User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;</code></li> </ul> </li> <li>I filed <a href="https://github.com/atmire/COUNTER-Robots/issues/27">an issue</a> on the COUNTER-Robots project to see if they agree to add <code>User-Agent:</code> to the list of robot user agents</li> </ul> <h2 id="2019-11-09">2019-11-09</h2> <ul> <li>Deploy the latest <code>5_x-prod</code> branch on CGSpace (linode18) <ul> <li>This includes the updated CCAFS phase II project tags and the updated spider user agents</li> </ul> </li> <li>Run all system updates on CGSpace and reboot the server <ul> <li>After rebooting it seems that all Solr statistics cores came back up 
fine…</li> </ul> </li> <li>I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace <ul> <li>The script is called <code>check-spider-hits.sh</code></li> <li>After a bunch of tests and checks I ran it for each statistics shard like so:</li> </ul> </li> </ul> <pre><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 statistics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done </code></pre><ul> <li>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</li> </ul> <h2 id="2019-11-12">2019-11-12</h2> <ul> <li>Udana and Chandima emailed me to ask why <a href="https://hdl.handle.net/10568/81236">one of their WLE items</a> that is mapped from IWMI only shows up in the IWMI “department” on the Altmetric dashboard <ul> <li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li> <li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows no results</a></li> <li>I emailed Altmetric support to ask for help</li> </ul> </li> <li>Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding <code>dc.rights</code>, fixing DOIs, adding ISSNs, etc) and noticed one issue with <a href="https://hdl.handle.net/10568/97087">an item</a> that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score) <ul> <li>I tweeted the Handle to see if the score would get linked once Altmetric noticed it</li> </ul> </li> </ul> <h2 
id="2019-11-13">2019-11-13</h2> <ul> <li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn’t linked with the DOI’s score <ul> <li>I tweeted it again with the Handle and the DOI</li> </ul> </li> <li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of the <code>\d</code> digit character type, as Solr’s regex search can’t use the latter</li> </ul> <pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1" $ http "http://localhost:8081/solr/statistics/update?commit=true" $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound <result name="response" numFound="1" start="0"> </code></pre><ul> <li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li> <li>I realized that it’s easier to search Solr from curl via POST using this syntax:</li> </ul> <pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0" </code></pre><ul> <li>If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests <ul> <li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special parameters when using Solr’s regex search:</li> </ul> </li> </ul> <pre><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2' </code></pre><ul> <li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been 
filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li> </ul> <h2 id="2019-11-14">2019-11-14</h2> <ul> <li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li> <li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li> </ul> <pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents) $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml </code></pre><ul> <li>I created a <a href="https://github.com/ilri/DSpace/pull/437">pull request</a> and merged them into the <code>5_x-prod</code> branch <ul> <li>I will deploy them to CGSpace in the next few days</li> </ul> </li> <li>Greatly improve my <code>check-spider-hits.sh</code> script to handle regular expressions in the spider agents patterns file <ul> <li>This allows me to detect and purge many more hits from the Solr statistics core</li> <li>I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores</li> </ul> </li> </ul> <h2 id="2019-11-15">2019-11-15</h2> <ul> <li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li> <li>But then I noticed that some (all?) 
of the hits weren’t actually getting purged, all of which were using regular expressions like: <ul> <li><code>MetaURI[\+\s]API\/[0-9]\.[0-9]</code></li> <li><code>FDM(\s|\+)[0-9]</code></li> <li><code>Goldfire(\s|\+)Server</code></li> <li><code>^Mozilla\/4\.0\+\(compatible;\)$</code></li> <li><code>^Mozilla\/4\.0\+\(compatible;\+ICS\)$</code></li> <li><code>^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$</code></li> </ul> </li> <li>Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!</li> <li>Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is <ul> <li>I tried to do URL encoding of the +, double escaping, etc… but nothing worked</li> <li>I’m going to ignore regular expressions that have pluses for now</li> </ul> </li> <li>I think I might also have to ignore patterns that have percent signs, like <code>^\%?default\%?$</code></li> <li>After I added the ignores and did some more testing I finally ran <code>check-spider-hits.sh</code> on all CGSpace Solr statistics cores and these are the numbers of hits purged from each core: <ul> <li>statistics-2010: 113</li> <li>statistics-2011: 7235</li> <li>statistics-2012: 0</li> <li>statistics-2013: 0</li> <li>statistics-2014: 316</li> <li>statistics-2015: 16809</li> <li>statistics-2016: 41732</li> <li>statistics-2017: 39207</li> <li>statistics-2018: 295546</li> <li>statistics: 1043373</li> </ul> </li> <li>That’s 1.4 million hits in addition to the 2 million I purged earlier this week…</li> <li>For posterity, the major contributors to the hits on the statistics core were: <ul> <li>Purging 812429 hits from curl/ in statistics</li> <li>Purging 48206 hits from facebookexternalhit/ in statistics</li> <li>Purging 72004 hits from PHP/ in statistics</li> <li>Purging 76072 hits from Yeti/[0-9] in statistics</li> </ul> </li> <li>Most of the curl hits were from CIAT in mid-2019, where they were using <a 
href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</li> </ul> <pre><code>Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION> </code></pre><ul> <li>Run system updates on DSpace Test and reboot the server</li> </ul> <h2 id="2019-11-17">2019-11-17</h2> <ul> <li>Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might have not been in the last harvesting yet <ul> <li>I told her no, that the department is several years old, and the item was added in 2017</li> <li>Then I looked again at the dashboard for each department and I see the item in both departments now… shit.</li> <li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li> <li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows the item</a></li> </ul> </li> <li>I finally decided to revert <code>cg.hasMetadata</code> back to <code>cg.identifier.dataurl</code> in my CG Core v2 branch (see <a href="https://github.com/AgriculturalSemantics/cg-core/issues/10">#10</a>)</li> <li>Regarding the <a href="https://hdl.handle.net/10568/97087">WLE item</a> that has a much lower score than its DOI… <ul> <li>I tweeted the item twice last week and the score never got linked</li> <li>Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times…</li> <li>I will ask Altmetric support for help with that</li> </ul> </li> <li>Finally deploy <code>5_x-cgcorev2</code> branch on DSpace Test</li> </ul> <h2 id="2019-11-18">2019-11-18</h2> <ul> <li>I sent a mail to the CGSpace partners in Addis about the CG Core 
v2 changes on DSpace Test</li> <li>Then I filed an <a href="https://github.com/AgriculturalSemantics/cg-core/issues/11">issue on the CG Core GitHub</a> to let the metadata people know about our progress</li> <li>It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)</li> </ul> <h2 id="2019-11-19">2019-11-19</h2> <ul> <li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something <ul> <li>I had previously sent them an export in 2019-04</li> </ul> </li> <li>Atmire merged my <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request regarding unnecessary escaping of dashes</a> in regular expressions, as well as <a href="https://github.com/atmire/COUNTER-Robots/issues/27">my suggestion of adding “User-Agent” to the list of patterns</a></li> <li>I made another <a href="https://github.com/atmire/COUNTER-Robots/pull/29">pull request to fix invalid escaping of one of their new patterns</a></li> <li>I ran my <code>check-spider-hits.sh</code> script again with these new patterns and found a bunch more statistics requests that match, for example: <ul> <li>Found 39560 hits from ^Buck/[0-9] in statistics</li> <li>Found 5471 hits from ^User-Agent in statistics</li> <li>Found 2994 hits from ^Buck/[0-9] in statistics-2018</li> <li>Found 14076 hits from ^User-Agent in statistics-2018</li> <li>Found 16310 hits from ^User-Agent in statistics-2017</li> <li>Found 4429 hits from ^User-Agent in statistics-2016</li> </ul> </li> <li>Buck is one I’ve never heard of before, its user agent is:</li> </ul> <pre><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html) </code></pre><ul> <li>All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li> </ul> <h2 id="2019-11-20">2019-11-20</h2> <ul> <li>Email Usman Muchlish from CIFOR to see what he’s doing 
with their DSpace lately</li> </ul> <h2 id="2019-11-21">2019-11-21</h2> <ul> <li>Discuss bugs and issues with AReS v2 that are limiting its adoption <ul> <li>BUG: If you search for items between year 2012 and 2019, then remove some years from the “info product analysis”, they are still present in the search results and export</li> <li>FEATURE: Ability to add month to date filter?</li> <li>FEATURE: Add “review status”, “series”, and “usage rights” to search filters</li> <li>FEATURE: Downloads and views are not included in exports</li> <li>FEATURE: Add more fields to exports (Abenet will clarify)</li> </ul> </li> <li>As for the larger features to focus on in the future ToRs: <ul> <li>FEATURE: Unique, linkable URL for a set of search results (discussed with Moayad, he has a plan for this)</li> <li>FEATURE: Reporting that we talked about in Amman in January, 2019.</li> </ul> </li> <li>We have a meeting about AReS future developments with Jane, Abenet, Peter, and Enrico tomorrow</li> </ul> <h2 id="2019-11-22">2019-11-22</h2> <ul> <li>Skype with Jane, Abenet, Peter, and Enrico about AReS v2 future development <ul> <li>We want to move AReS v2 from dspacetest.cgiar.org/explorer to cgspace.cgiar.org/explorer</li> <li>We want to maintain a public demo of the vanilla OpenRXV with a subset of data, for example a non-CG community</li> <li>We want to try to move all issues and milestones to GitHub</li> <li>I need to try to work with ILRI Finance to pre-pay the AReS Linode server (linode11779072) for 2020</li> </ul> </li> </ul> <h2 id="2019-11-24">2019-11-24</h2> <ul> <li>I rebooted DSpace Test (linode19) and it kernel panicked at boot <ul> <li>I looked on the console and saw that it can’t mount the root filesystem</li> <li>I switched the boot configuration to use the OS’s kernel via GRUB2 instead of Linode’s kernel and then it came up after reboot…</li> <li>I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE <ul> <li>The migration is going very 
slowly, so I assume the network issues from earlier this year are still not fixed</li> <li>I opened a new ticket (13056701) with Linode support, with reference to my previous ticket (11804943)</li> </ul> </li> </ul> </li> </ul> <h2 id="2019-11-25">2019-11-25</h2> <ul> <li>The migration of DSpace Test from Fremont, CA (USA) to Frankfurt (DE) region completed <ul> <li>The IP address of the server changed so I need to email CGNET to ask them to update the DNS</li> </ul> </li> </ul> <h2 id="2019-11-26">2019-11-26</h2> <ul> <li>Visit CodeObia to discuss future of OpenRXV and AReS <ul> <li>I started working on categorizing and validating the feedback that Jane collated into a spreadsheet last week</li> <li>I added GitHub issues for eight of the items so far, tagging them by “bug”, “search”, “feature”, “graphics”, “low-priority”, etc</li> <li>I moved AReS v2 to be available on CGSpace</li> </ul> </li> </ul> <h2 id="2019-11-27">2019-11-27</h2> <ul> <li>Minor updates on the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> <ul> <li>Introduce isort for import sorting</li> <li>Introduce black for code formatting according to PEP8</li> <li>Fix some minor issues raised by flake8</li> <li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.1.1">version 1.1.1</a> and deploy to DSpace Test (linode19)</li> <li>I realize that I never deployed version 1.1.0 (with falcon 2.0.0) on CGSpace (linode18) so I did that as well</li> </ul> </li> <li>File a ticket (242418) with Altmetric about DCTERMS migration to see if there is anything we need to be careful about</li> <li>Make a pull request against cg-core schema to fix inconsistent references to <code>cg.embargoDate</code> (<a href="https://github.com/AgriculturalSemantics/cg-core/pull/13">#13</a>)</li> <li>Review the AReS feedback again after Peter made some comments <ul> <li>I standardized the GitHub issue labels in both OpenRXV and AReS issue trackers, using labels like “P-low” 
for priority</li> <li>I filed another handful of issues in both trackers and added them to the spreadsheet</li> </ul> </li> <li>I need to ask Marie-Angelique about the <code>cg.peer-reviewed</code> field <ul> <li>We currently use <code>dc.description.version</code> with values like “Internal Review” and “Peer Review”, and CG Core v2 currently recommends using “True” if the item is peer reviewed</li> </ul> </li> </ul> <h2 id="2019-11-28">2019-11-28</h2> <ul> <li>File an issue with the CG Core v2 project to ask Marie-Angelique about expanding the scope of <code>cg.peer-reviewed</code> to include other types of review, and possibly to change the field name to something more generic like <code>cg.review-status</code> (<a href="https://github.com/AgriculturalSemantics/cg-core/issues/14">#14</a>)</li> <li>More review of AReS feedback <ul> <li>I clarified some of the feedback</li> <li>I added status of “Issue Filed”, “Duplicate” and “No Action Required” to several items</li> <li>I filed a handful more issues in the AReS and OpenRXV GitHub trackers</li> </ul> </li> </ul> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2020-12/">December, 2020</a></li> <li><a href="/cgspace-notes/cgspace-dspace6-upgrade/">CGSpace DSpace 6 Upgrade</a></li> <li><a href="/cgspace-notes/2020-11/">November, 2020</a></li> <li><a href="/cgspace-notes/2020-10/">October, 2020</a></li> <li><a href="/cgspace-notes/2020-09/">September, 2020</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> 
<p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>