mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-27 00:48:19 +01:00
747 lines
40 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="November, 2019" />
|
|
<meta property="og:description" content="2019-11-04
|
|
|
|
Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
|
|
|
|
I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
|
|
|
|
|
|
|
|
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
4671942
|
|
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
1277694
|
|
|
|
So 4.6 million from XMLUI and another 1.2 million from API requests
|
|
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
|
|
|
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
|
1183456
|
|
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
|
106781
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-11/" />
|
|
<meta property="article:published_time" content="2019-11-04T12:20:30+02:00" />
|
|
<meta property="article:modified_time" content="2019-11-28T17:30:45+02:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="November, 2019"/>
|
|
<meta name="twitter:description" content="2019-11-04
|
|
|
|
Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
|
|
|
|
I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:
|
|
|
|
|
|
|
|
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
4671942
|
|
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
1277694
|
|
|
|
So 4.6 million from XMLUI and another 1.2 million from API requests
|
|
Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
|
|
|
|
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
|
1183456
|
|
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
|
106781
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.92.2" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "November, 2019",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2019-11/",
|
|
"wordCount": "3457",
|
|
"datePublished": "2019-11-04T12:20:30+02:00",
|
|
"dateModified": "2019-11-28T17:30:45+02:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-11/">
|
|
|
|
<title>November, 2019 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2019-11-04T12:20:30+02:00">Mon Nov 04, 2019</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2019-11-04">2019-11-04</h2>
|
|
<ul>
|
|
<li>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics
|
|
<ul>
|
|
<li>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
4671942
|
|
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
|
|
1277694
|
|
</code></pre><ul>
|
|
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
|
|
<li>Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
|
|
1183456
|
|
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
|
|
106781
|
|
</code></pre><ul>
|
|
<li>The types of requests in the access logs are as follows (lazily extracted from the sixth field of each nginx log line):</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
|
|
1 PUT
|
|
8 PROPFIND
|
|
283 OPTIONS
|
|
30102 POST
|
|
46581 HEAD
|
|
4594967 GET
|
|
</code></pre><ul>
|
|
<li>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
|
|
365288
|
|
</code></pre><ul>
|
|
<li>Their user agent is one I’ve never seen before:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
|
|
</code></pre><ul>
|
|
<li>Most of their requests seem to be for community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
|
|
6566 GET /bitstream
|
|
351928 GET /handle
|
|
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
|
|
214209
|
|
# zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
|
|
86874
|
|
</code></pre><ul>
|
|
<li>As far as I can tell, none of their requests are counted in the Solr statistics:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
|
|
</code></pre><ul>
|
|
<li>Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load</li>
|
|
<li>After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
|
|
</code></pre><ul>
|
|
<li>On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in <code>config/spiders/agents</code>, perhaps by dropping a new list in from <a href="https://github.com/atmire/COUNTER-Robots">Atmire’s COUNTER-Robots</a> project
|
|
<ul>
|
|
<li>First I checked for a user agent that is in COUNTER-Robots, but NOT in the current <code>dspace/config/spiders/example</code> list</li>
|
|
<li>Then I made some item and bitstream requests on DSpace Test using that user agent:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
|
|
</code></pre><ul>
|
|
<li>A bit later I checked Solr and found three requests from my IP with that user agent this month:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
<response>
|
|
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
|
|
</response>
|
|
</code></pre><ul>
|
|
<li>Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
|
|
</code></pre><ul>
|
|
<li>After twenty minutes I didn’t see any requests in Solr, so I assume they did not get logged because they matched a bot list…
|
|
<ul>
|
|
<li>What’s strange is that the Solr spider agent configuration in <code>dspace/config/modules/solr-statistics.cfg</code> points to a file that doesn’t exist…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
|
|
</code></pre><ul>
|
|
<li>Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…</li>
|
|
<li>I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr
|
|
<ul>
|
|
<li>Now I want to try adding “iskanie” and “fuuuualan” to the list of spider regexes in <code>dspace/config/spiders/example</code> and then try to use DSpace’s “mark spiders” feature to change them to “isBot:true” in Solr</li>
|
|
<li>I restarted Tomcat and ran <code>dspace stats-util -m</code> and it did some stuff for a while, but I still don’t see any items in Solr with <code>isBot:true</code></li>
|
|
<li>According to <code>dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java</code> the patterns for user agents are loaded from any file in the <code>config/spiders/agents</code> directory</li>
|
|
<li>I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran <code>dspace stats-util -m</code> and still there were no new items marked as being bots in Solr, so I think there is still something wrong</li>
|
|
<li>Jesus, the code in <code>./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java</code> says that <code>stats-util -m</code> marks spider requests by their IPs, not by their user agents… WTF:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>else if (line.hasOption('m'))
|
|
{
|
|
SolrLogger.markRobotsByIP();
|
|
}
|
|
</code></pre><ul>
|
|
<li>WTF again, there is actually a function called <code>markRobotByUserAgent()</code> that is never called anywhere!
|
|
<ul>
|
|
<li>It appears to be unimplemented…</li>
|
|
<li>I sent a message to the dspace-tech mailing list to ask if I should file an issue</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
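<ul>
<li>The request method tally from the nginx logs earlier in this entry depends on the method being the sixth whitespace-separated field. A slightly more defensive sketch splits on the double quotes around the request line instead; <code>count_methods</code> is a hypothetical helper name and the log lines are made up, assuming nginx’s “combined” log format:</li>
</ul>

```shell
#!/bin/sh
# Tally HTTP request methods from nginx "combined" format log lines on stdin.
# Assumption: the request line is the first double-quoted field of each line.
count_methods() {
    awk -F'"' '
        { split($2, req, " "); n[req[1]]++ }   # req[1] is the method, e.g. GET
        END { for (m in n) print n[m], m }
    ' | sort -n
}

# Example with made-up log lines (not taken from CGSpace):
count_methods <<'EOF'
192.0.2.1 - - [04/Oct/2019:10:00:00 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512 "-" "curl/7.58.0"
192.0.2.2 - - [04/Oct/2019:10:00:01 +0200] "POST /discover HTTP/1.1" 200 128 "-" "Mozilla/5.0"
192.0.2.1 - - [04/Oct/2019:10:00:02 +0200] "GET /handle/10568/2 HTTP/1.1" 200 640 "-" "curl/7.58.0"
EOF
```

<ul>
<li>On real logs the same function could sit at the end of the existing pipeline, for example <code>zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | count_methods</code></li>
</ul>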
|
|
<h2 id="2019-11-05">2019-11-05</h2>
|
|
<ul>
|
|
<li>I added “alanfuuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
|
|
</code></pre><ul>
|
|
<li>After committing the changes in Solr I saw one request for “alanfuuu1” and no requests for “alanfuuu2”:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
|
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="1" start="0">
|
|
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="0" start="0"/>
|
|
</code></pre><ul>
|
|
<li>So basically it seems like a win to update the example file with the latest one from Atmire’s COUNTER-Robots list
|
|
<ul>
|
|
<li>Even though the “mark by user agent” function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents</li>
|
|
</ul>
|
|
</li>
|
|
<li>I’m curious how matching of special characters works in Solr, so I will test two requests: one with “<a href="http://www.gnip.com">www.gnip.com</a>”, which is in the spider list, and one with “<a href="http://www.gnyp.com">www.gnyp.com</a>”, which isn’t:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
|
|
$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
|
|
</code></pre><ul>
|
|
<li>Then commit changes to Solr so we don’t have to wait:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
|
|
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="0" start="0"/>
|
|
$ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="1" start="0">
|
|
</code></pre><ul>
|
|
<li>So the blocking seems to be working because “www.gnip.com” is one of the new patterns added to the spiders file…</li>
|
|
</ul>
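<ul>
<li>The same kind of check can be approximated offline before touching DSpace. This is only a sketch: <code>is_spider</code> is a hypothetical helper, the two-line patterns file is made up, and <code>grep -E</code> is not guaranteed to behave exactly like the Java regex matching that DSpace uses:</li>
</ul>

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the user agent matches any pattern in the file.
# Assumptions: one extended regular expression per line, no blank lines
# (an empty pattern would match everything), case-sensitive matching.
is_spider() {
    printf '%s\n' "$1" | grep -qE -f "$2"
}

# Made-up two-line patterns file for illustration.
patterns=$(mktemp)
printf '%s\n' 'www\.gnip\.com' 'iskanie' > "$patterns"

for ua in 'www.gnip.com' 'www.gnyp.com'; do
    if is_spider "$ua" "$patterns"; then
        echo "$ua: matches a spider pattern"
    else
        echo "$ua: no match"
    fi
done
rm -f "$patterns"
```

<ul>
<li>With the escaped dots in the pattern, only “www.gnip.com” is reported as a match, which mirrors the result seen in Solr above</li>
</ul>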
|
|
<h2 id="2019-11-07">2019-11-07</h2>
|
|
<ul>
|
|
<li>CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate
|
|
<ul>
|
|
<li>They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty</li>
|
|
<li>I have now merged the changes in to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/432">#432</a>)</li>
|
|
</ul>
|
|
</li>
|
|
<li>I am reconsidering the move of <code>cg.identifier.dataurl</code> to <code>cg.hasMetadata</code> in CG Core v2
|
|
<ul>
|
|
<li>The values of this field are mostly links to data sets on Dataverse and partner sites</li>
|
|
<li>I opened an <a href="https://github.com/AgriculturalSemantics/cg-core/issues/10">issue on GitHub</a> to ask Marie-Angelique for clarification</li>
|
|
</ul>
|
|
</li>
|
|
<li>Looking into CGSpace statistics again
|
|
<ul>
|
|
<li>I searched for hits in Solr from the BUbiNG bot and found 63,000 in the <code>statistics-2018</code> core:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="62944" start="0">
|
|
</code></pre><ul>
|
|
<li>Similar for com.plumanalytics, Grammarly, and ltx71!</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*com.plumanalytics*' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="28256" start="0">
|
|
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="6288" start="0">
|
|
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="105663" start="0">
|
|
</code></pre><ul>
|
|
<li>Deleting these seems to work, for example the 105,000 ltx71 records from 2018:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
|
|
$ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
|
|
<result name="response" numFound="0" start="0"/>
|
|
</code></pre><ul>
|
|
<li>I wrote a quick bash script to check all these user agents against the CGSpace Solr statistics cores
|
|
<ul>
|
|
<li>For years 2010 until 2019 there are 1.6 million hits from these spider user agents</li>
|
|
<li>For 2019 alone there are 740,000, over half of which come from Unpaywall!</li>
|
|
<li>Looking at the facets I see there were about 200,000 hits from Unpaywall in 2019-10:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=12&q=userAgent:*Unpaywall*' | xmllint --format - | less
|
|
...
|
|
<lst name="facet_counts">
|
|
<lst name="facet_queries"/>
|
|
<lst name="facet_fields">
|
|
<lst name="dateYearMonth">
|
|
<int name="2019-10">198624</int>
|
|
<int name="2019-05">88422</int>
|
|
<int name="2019-06">79911</int>
|
|
<int name="2019-09">67065</int>
|
|
<int name="2019-07">39026</int>
|
|
<int name="2019-08">36889</int>
|
|
<int name="2019-04">36512</int>
|
|
<int name="2019-11">760</int>
|
|
</lst>
|
|
</lst>
|
|
</code></pre><ul>
|
|
<li>That answers Peter’s question about why the stats jumped in October…</li>
|
|
</ul>
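<ul>
<li>As a sanity check on the “over half” claim, the monthly Unpaywall counts shown in the facet output can be added up. A minimal sketch, where <code>sum_hits</code> is a hypothetical helper and 740,000 is the rounded 2019 total quoted above rather than an exact figure:</li>
</ul>

```shell
#!/bin/sh
# Hypothetical helper: add up the integer arguments and print the total.
sum_hits() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
}

# Monthly Unpaywall counts copied from the dateYearMonth facet above.
unpaywall=$(sum_hits 198624 88422 79911 67065 39026 36889 36512 760)
echo "Unpaywall hits in 2019: $unpaywall"
# Integer percentage of the rounded 740,000 figure for all spider hits in 2019.
echo "Share of 2019 spider hits: $((unpaywall * 100 / 740000))%"
```

<ul>
<li>This prints 547209 hits and a share of 73%, so the months listed account for well over half of the roughly 740,000 spider hits in 2019</li>
</ul>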
|
|
<h2 id="2019-11-08">2019-11-08</h2>
|
|
<ul>
|
|
<li>I saw a bunch of user agents that have the literal string <code>User-Agent</code> in their user agent HTTP header, for example:
|
|
<ul>
|
|
<li><code>User-Agent: Drupal (+http://drupal.org/)</code></li>
|
|
<li><code>User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31</code></li>
|
|
<li><code>User-Agent:Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) IKU/7.0.5.9226;IKUCID/IKU;</code></li>
|
|
<li><code>User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)</code></li>
|
|
<li><code>User-Agent:User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.5; .NET4.0C)IKU/6.7.6.12189;IKUCID/IKU;IKU/6.7.6.12189;IKUCID/IKU;</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>I filed <a href="https://github.com/atmire/COUNTER-Robots/issues/27">an issue</a> on the COUNTER-Robots project to see if they agree to add <code>User-Agent:</code> to the list of robot user agents</li>
|
|
</ul>
|
|
<h2 id="2019-11-09">2019-11-09</h2>
|
|
<ul>
|
|
<li>Deploy the latest <code>5_x-prod</code> branch on CGSpace (linode19)
|
|
<ul>
|
|
<li>This includes the updated CCAFS phase II project tags and the updated spider user agents</li>
|
|
</ul>
|
|
</li>
|
|
<li>Run all system updates on CGSpace and reboot the server
|
|
<ul>
|
|
<li>After rebooting it seems that all Solr statistics cores came back up fine…</li>
|
|
</ul>
|
|
</li>
|
|
<li>I did some work to clean up my bot processing script and removed about 2 million hits from the statistics cores on CGSpace
|
|
<ul>
|
|
<li>The script is called <code>check-spider-hits.sh</code></li>
|
|
<li>After a bunch of tests and checks I ran it for each statistics shard like so:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 statistics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
|
|
</code></pre><ul>
|
|
<li>Open a <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request</a> against COUNTER-Robots to remove unnecessary escaping of dashes</li>
|
|
</ul>
|
|
<h2 id="2019-11-12">2019-11-12</h2>
|
|
<ul>
|
|
<li>Udana and Chandima emailed me to ask why <a href="https://hdl.handle.net/10568/81236">one of their WLE items</a> that is mapped from IWMI only shows up in the IWMI “department” on the Altmetric dashboard
|
|
<ul>
|
|
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li>
|
|
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows no results</a></li>
|
|
<li>I emailed Altmetric support to ask for help</li>
|
|
</ul>
|
|
</li>
|
|
<li>Also, while analysing this, I looked through some of the other top WLE items and fixed some metadata issues (adding <code>dc.rights</code>, fixing DOIs, adding ISSNs, etc) and noticed one issue with <a href="https://hdl.handle.net/10568/97087">an item</a> that has an Altmetric score for its Handle (lower) despite it having a correct DOI (with a higher score)
|
|
<ul>
|
|
<li>I tweeted the Handle to see if the score would get linked once Altmetric noticed it</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2019-11-13">2019-11-13</h2>
|
|
<ul>
|
|
<li>The <a href="https://hdl.handle.net/10568/97087">item with a low Altmetric score for its Handle</a> that I tweeted yesterday still hasn’t linked with the DOI’s score
|
|
<ul>
|
|
<li>I tweeted it again with the Handle and the DOI</li>
|
|
</ul>
|
|
</li>
|
|
<li>Testing modifying some of the COUNTER-Robots patterns to use <code>[0-9]</code> instead of the <code>\d</code> digit character type, as Solr’s regex search doesn’t support the latter:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
|
|
$ http "http://localhost:8081/solr/statistics/update?commit=true"
|
|
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
|
|
<result name="response" numFound="1" start="0">
|
|
$ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/[0-9]/" | xmllint --format - | grep numFound
|
|
<result name="response" numFound="1" start="0">
|
|
</code></pre><ul>
|
|
<li>Nice, so searching with regex in Solr with <code>//</code> syntax works for those digits!</li>
|
|
<li>I realized that it’s easier to search Solr from curl via POST using this syntax:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0"
|
|
</code></pre><ul>
|
|
<li>If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
|
|
<ul>
|
|
<li>You can disable this using the <code>-g</code> option, but there are other benefits to searching with POST, for example it seems that I have fewer issues with escaping special characters when using Solr’s regex search:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
|
|
</code></pre><ul>
|
|
<li>I updated the <code>check-spider-hits.sh</code> script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling</li>
|
|
</ul>
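<ul>
<li>The <code>\d</code> to <code>[0-9]</code> rewrite tested above could be applied to a whole patterns file with a small filter. A minimal sketch, where <code>to_solr_regex</code> is a hypothetical name, the input pattern is only illustrative, and nothing here is claimed to be what <code>check-spider-hits.sh</code> actually does:</li>
</ul>

```shell
#!/bin/sh
# Hypothetical filter: rewrite the \d shorthand as [0-9] in patterns read from stdin.
# Limitation: this is a plain textual substitution, so an escaped backslash
# followed by "d" (\\d) would be rewritten too.
to_solr_regex() {
    sed 's/\\d/[0-9]/g'
}

# Illustrative input; prints the pattern with \d replaced by [0-9].
printf '%s\n' 'Scrapoo\/\d' | to_solr_regex
```

<ul>
<li>Other PCRE-only constructs would need their own handling, so this covers only the digit shorthand discussed in this entry</li>
</ul>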
|
|
<h2 id="2019-11-14">2019-11-14</h2>
|
|
<ul>
|
|
<li>IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary</li>
|
|
<li>I will merge them with our existing list and then resolve their names using my <code>resolve-orcids.py</code> script:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
|
|
$ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
|
|
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
|
|
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
|
|
</code></pre><ul>
|
|
<li>I created a <a href="https://github.com/ilri/DSpace/pull/437">pull request</a> and merged them into the <code>5_x-prod</code> branch
|
|
<ul>
|
|
<li>I will deploy them to CGSpace in the next few days</li>
|
|
</ul>
|
|
</li>
|
|
<li>Greatly improve my <code>check-spider-hits.sh</code> script to handle regular expressions in the spider agents patterns file
|
|
<ul>
|
|
<li>This allows me to detect and purge many more hits from the Solr statistics core</li>
|
|
<li>I’ve tested it quite a bit on DSpace Test, but I need to do a little more before I feel comfortable running the new code on CGSpace’s Solr cores</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
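<ul>
<li>The ORCID extraction step from earlier in this entry can be wrapped in a small function. This sketch uses a stricter pattern than the one above (digits only, with an optional <code>X</code> as the final check character); <code>extract_orcids</code> is a hypothetical name and the identifiers in the example are placeholders, not real people:</li>
</ul>

```shell
#!/bin/sh
# Hypothetical helper: print the unique ORCID identifiers found on stdin, sorted.
# Assumption: identifiers look like 0000-0000-0000-000X, where only the last
# character may be an X.
extract_orcids() {
    grep -oE '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]' | sort -u
}

# Placeholder input mixing an XML-style line, a bare identifier, and a URL.
extract_orcids <<'EOF'
<node id="Example Person: 0000-0000-0000-0002" label="Example Person: 0000-0000-0000-0002"/>
0000-0000-0000-0002
https://orcid.org/0000-0000-0000-001X
EOF
```

<ul>
<li>Because it reads standard input, it can replace the <code>grep -oE ... | sort | uniq</code> part of the pipeline above without other changes</li>
</ul>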
|
|
<h2 id="2019-11-15">2019-11-15</h2>
|
|
<ul>
|
|
<li>Run the new version of <code>check-spider-hits.sh</code> on CGSpace’s Solr statistics cores one by one, starting from the oldest just in case something goes wrong</li>
|
|
<li>But then I noticed that some (all?) of the hits weren’t actually getting purged; the affected patterns were all regular expressions like:
|
|
<ul>
|
|
<li><code>MetaURI[\+\s]API\/[0-9]\.[0-9]</code></li>
|
|
<li><code>FDM(\s|\+)[0-9]</code></li>
|
|
<li><code>Goldfire(\s|\+)Server</code></li>
|
|
<li><code>^Mozilla\/4\.0\+\(compatible;\)$</code></li>
|
|
<li><code>^Mozilla\/4\.0\+\(compatible;\+ICS\)$</code></li>
|
|
<li><code>^Mozilla\/4\.5\+\[en]\+\(Win98;\+I\)$</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>Upon closer inspection, the plus signs seem to be getting misinterpreted somehow in the delete, but not in the select!</li>
|
|
<li>Plus signs are special in regular expressions, URLs, and Solr’s Lucene query parser, so I’m actually not sure where the issue is
|
|
<ul>
|
|
<li>I tried to do URL encoding of the +, double escaping, etc… but nothing worked</li>
|
|
<li>I’m going to ignore regular expressions that have pluses for now</li>
|
|
</ul>
|
|
</li>
|
|
<li>I think I might also have to ignore patterns that have percent signs, like <code>^\%?default\%?$</code></li>
|
|
<li>After I added the ignores and did some more testing I finally ran <code>check-spider-hits.sh</code> on all CGSpace Solr statistics cores, and this is the number of hits purged from each core:
|
|
<ul>
|
|
<li>statistics-2010: 113</li>
|
|
<li>statistics-2011: 7235</li>
|
|
<li>statistics-2012: 0</li>
|
|
<li>statistics-2013: 0</li>
|
|
<li>statistics-2014: 316</li>
|
|
<li>statistics-2015: 16809</li>
|
|
<li>statistics-2016: 41732</li>
|
|
<li>statistics-2017: 39207</li>
|
|
<li>statistics-2018: 295546</li>
|
|
<li>statistics: 1043373</li>
|
|
</ul>
|
|
</li>
|
|
<li>That’s 1.4 million hits in addition to the 2 million I purged earlier this week…</li>
|
|
<li>For posterity, the major contributors to the hits on the statistics core were:
|
|
<ul>
|
|
<li>Purging 812429 hits from curl/ in statistics</li>
|
|
<li>Purging 48206 hits from facebookexternalhit/ in statistics</li>
|
|
<li>Purging 72004 hits from PHP/ in statistics</li>
|
|
<li>Purging 76072 hits from Yeti/[0-9] in statistics</li>
|
|
</ul>
|
|
</li>
|
|
<li>Most of the curl hits were from CIAT in mid-2019, where they were using <a href="https://guzzle3.readthedocs.io/http-client/client.html">GuzzleHttp</a> from PHP, which uses something like this for its user agent:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
|
|
</code></pre><ul>
|
|
<li>Run system updates on DSpace Test and reboot the server</li>
|
|
</ul>
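<ul>
<li>The workaround of skipping patterns that contain plus or percent signs can be expressed as a simple filter. A minimal sketch, where <code>usable_patterns</code> is a hypothetical name and this is not necessarily how <code>check-spider-hits.sh</code> implements the ignores:</li>
</ul>

```shell
#!/bin/sh
# Hypothetical filter: drop any pattern that contains a plus or percent sign,
# since those were not purged reliably in the Solr delete requests.
usable_patterns() {
    grep -vE '[+%]'
}

# Patterns taken from this entry; only the last two survive the filter.
usable_patterns <<'EOF'
FDM(\s|\+)[0-9]
^\%?default\%?$
Yeti/[0-9]
curl/
EOF
```

<ul>
<li>This is deliberately conservative: it also drops patterns where the plus sign would have been harmless, in exchange for never sending a pattern that silently fails to delete</li>
</ul>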
|
|
<h2 id="2019-11-17">2019-11-17</h2>
|
|
<ul>
|
|
<li>Altmetric support responded about our dashboard question, asking if the second “department” (aka WLE’s collection) was added recently and might not have been included in the last harvest yet
|
|
<ul>
|
|
<li>I told her no, that the department is several years old, and the item was added in 2017</li>
|
|
<li>Then I looked again at the dashboard for each department and I see the item in both departments now… shit.</li>
|
|
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_16814&q=Towards%20sustainable%20sanitation%20management">search in the IWMI department shows the item</a></li>
|
|
<li>A <a href="https://www.altmetric.com/explorer/outputs?department_id%5B%5D=CGSpace%3Agroup%3Acom_10568_34494&q=Towards%20sustainable%20sanitation%20management">search in the WLE department shows the item</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>I finally decided to revert <code>cg.hasMetadata</code> back to <code>cg.identifier.dataurl</code> in my CG Core v2 branch (see <a href="https://github.com/AgriculturalSemantics/cg-core/issues/10">#10</a>)</li>
|
|
<li>Regarding the <a href="https://hdl.handle.net/10568/97087">WLE item</a> that has a much lower score than its DOI…
|
|
<ul>
|
|
<li>I tweeted the item twice last week and the score never got linked</li>
|
|
<li>Then I noticed that I had already made a note about the same issue in 2019-04, when I also tweeted it several times…</li>
|
|
<li>I will ask Altmetric support for help with that</li>
|
|
</ul>
|
|
</li>
|
|
<li>Finally deploy <code>5_x-cgcorev2</code> branch on DSpace Test</li>
|
|
</ul>
<h2 id="2019-11-18">2019-11-18</h2>
<ul>
<li>I sent a mail to the CGSpace partners in Addis about the CG Core v2 changes on DSpace Test</li>
<li>Then I filed an <a href="https://github.com/AgriculturalSemantics/cg-core/issues/11">issue on the CG Core GitHub</a> to let the metadata people know about our progress</li>
<li>It seems like I will do a session about CG Core v2 implementation and limitations in DSpace for the data workshop in December in Nairobi (?)</li>
</ul>
<h2 id="2019-11-19">2019-11-19</h2>
<ul>
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for testing
<ul>
<li>I had previously sent them an export in 2019-04</li>
</ul>
</li>
<li>Atmire merged my <a href="https://github.com/atmire/COUNTER-Robots/pull/28">pull request regarding unnecessary escaping of dashes</a> in regular expressions, as well as <a href="https://github.com/atmire/COUNTER-Robots/issues/27">my suggestion of adding “User-Agent” to the list of patterns</a></li>
<li>I made another <a href="https://github.com/atmire/COUNTER-Robots/pull/29">pull request to fix invalid escaping of one of their new patterns</a></li>
<li>I ran my <code>check-spider-hits.sh</code> script again with these new patterns and found a bunch more statistics requests that match, for example:
<ul>
<li>Found 39560 hits from ^Buck/[0-9] in statistics</li>
<li>Found 5471 hits from ^User-Agent in statistics</li>
<li>Found 2994 hits from ^Buck/[0-9] in statistics-2018</li>
<li>Found 14076 hits from ^User-Agent in statistics-2018</li>
<li>Found 16310 hits from ^User-Agent in statistics-2017</li>
<li>Found 4429 hits from ^User-Agent in statistics-2016</li>
</ul>
</li>
<li>Buck is one I’ve never heard of before; its user agent is:</li>
</ul>
<pre tabindex="0"><code>Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
</code></pre>
<ul>
<li>All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week</li>
</ul>
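<p>The pattern matching above can be sketched in Python. This is only an illustration, not the <code>check-spider-hits.sh</code> script itself: the two patterns and the Buck user agent come from the notes above, while the other sample user agents and the <code>count_spider_hits</code> helper are hypothetical.</p>

```python
import re
from collections import Counter

# Patterns quoted in the notes above.
PATTERNS = [r"^Buck/[0-9]", r"^User-Agent"]

# Hypothetical sample of user agent strings; only the Buck one is from the notes.
USER_AGENTS = [
    "Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)",
    "User-Agent: Mozilla/5.0",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def count_spider_hits(user_agents, patterns):
    """Count how many user agents match each regular expression."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    hits = Counter()
    for agent in user_agents:
        for pattern, regex in compiled:
            if regex.search(agent):
                hits[pattern] += 1
    return hits


hits = count_spider_hits(USER_AGENTS, PATTERNS)
for pattern in PATTERNS:
    print(f"Found {hits[pattern]} hits from {pattern}")
```

<p>Because both patterns are anchored with <code>^</code>, they only match at the start of the user agent string, so a normal browser user agent is not counted.</p>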
<h2 id="2019-11-20">2019-11-20</h2>
<ul>
<li>Email Usman Muchlish from CIFOR to see what he’s doing with their DSpace lately</li>
</ul>
<h2 id="2019-11-21">2019-11-21</h2>
<ul>
<li>Discuss bugs and issues with AReS v2 that are limiting its adoption
<ul>
<li>BUG: If you search for items between year 2012 and 2019, then remove some years from the “info product analysis”, they are still present in the search results and export</li>
<li>FEATURE: Ability to add month to date filter?</li>
<li>FEATURE: Add “review status”, “series”, and “usage rights” to search filters</li>
<li>FEATURE: Downloads and views are not included in exports</li>
<li>FEATURE: Add more fields to exports (Abenet will clarify)</li>
</ul>
</li>
<li>As for the larger features to focus on in the future ToRs:
<ul>
<li>FEATURE: Unique, linkable URL for a set of search results (discussed with Moayad, he has a plan for this)</li>
<li>FEATURE: Reporting that we talked about in Amman in January, 2019.</li>
</ul>
</li>
<li>We have a meeting about AReS future developments with Jane, Abenet, Peter, and Enrico tomorrow</li>
</ul>
<h2 id="2019-11-22">2019-11-22</h2>
<ul>
<li>Skype with Jane, Abenet, Peter, and Enrico about AReS v2 future development
<ul>
<li>We want to move AReS v2 from dspacetest.cgiar.org/explorer to cgspace.cgiar.org/explorer</li>
<li>We want to maintain a public demo of the vanilla OpenRXV with a subset of data, for example a non-CG community</li>
<li>We want to try to move all issues and milestones to GitHub</li>
<li>I need to try to work with ILRI Finance to pre-pay the AReS Linode server (linode11779072) for 2020</li>
</ul>
</li>
</ul>
<h2 id="2019-11-24">2019-11-24</h2>
<ul>
<li>I rebooted DSpace Test (linode19) and it kernel panicked at boot
<ul>
<li>I looked on the console and saw that it can’t mount the root filesystem</li>
<li>I switched the boot configuration to use the OS’s kernel via GRUB2 instead of Linode’s kernel and then it came up after reboot…</li>
<li>I initiated a migration of the server from the Fremont, CA region to Frankfurt, DE
<ul>
<li>The migration is going very slowly, so I assume the network issues from earlier this year are still not fixed</li>
<li>I opened a new ticket (13056701) with Linode support, with reference to my previous ticket (11804943)</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="2019-11-25">2019-11-25</h2>
<ul>
<li>The migration of DSpace Test from Fremont, CA (USA) to Frankfurt (DE) region completed
<ul>
<li>The IP address of the server changed so I need to email CGNET to ask them to update the DNS</li>
</ul>
</li>
</ul>
<h2 id="2019-11-26">2019-11-26</h2>
<ul>
<li>Visit CodeObia to discuss future of OpenRXV and AReS
<ul>
<li>I started working on categorizing and validating the feedback that Jane collated into a spreadsheet last week</li>
<li>I added GitHub issues for eight of the items so far, tagging them by “bug”, “search”, “feature”, “graphics”, “low-priority”, etc</li>
<li>I moved AReS v2 to be available on CGSpace</li>
</ul>
</li>
</ul>
<h2 id="2019-11-27">2019-11-27</h2>
<ul>
<li>Minor updates on the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>
<ul>
<li>Introduce isort for import sorting</li>
<li>Introduce black for code formatting according to PEP8</li>
<li>Fix some minor issues raised by flake8</li>
<li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.1.1">version 1.1.1</a> and deploy to DSpace Test (linode19)</li>
<li>I realize that I never deployed version 1.1.0 (with falcon 2.0.0) on CGSpace (linode18) so I did that as well</li>
</ul>
</li>
<li>File a ticket (242418) with Altmetric about DCTERMS migration to see if there is anything we need to be careful about</li>
<li>Make a pull request against cg-core schema to fix inconsistent references to <code>cg.embargoDate</code> (<a href="https://github.com/AgriculturalSemantics/cg-core/pull/13">#13</a>)</li>
<li>Review the AReS feedback again after Peter made some comments
<ul>
<li>I standardized the GitHub issue labels in both OpenRXV and AReS issue trackers, using labels like “P-low” for priority</li>
<li>I filed another handful of issues in both trackers and added them to the spreadsheet</li>
</ul>
</li>
<li>I need to ask Marie-Angelique about the <code>cg.peer-reviewed</code> field
<ul>
<li>We currently use <code>dc.description.version</code> with values like “Internal Review” and “Peer Review”, and CG Core v2 currently recommends using “True” if the item is peer reviewed</li>
</ul>
</li>
</ul>
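<p>To illustrate why a boolean-style field is an awkward fit for our current values, here is a minimal Python sketch. The helper <code>to_cg_peer_reviewed</code> is hypothetical, and returning nothing for values other than “Peer Review” is my assumption, not something CG Core v2 specifies.</p>

```python
# Value currently stored in dc.description.version that means "peer reviewed",
# as quoted in the notes above.
PEER_REVIEWED_VALUES = {"Peer Review"}


def to_cg_peer_reviewed(description_version):
    """Map a dc.description.version value to a hypothetical cg.peer-reviewed value.

    Returns "True" for peer reviewed items, following the recommendation
    mentioned above. Returning None for anything else is an assumption made
    for illustration only.
    """
    if description_version in PEER_REVIEWED_VALUES:
        return "True"
    return None


for value in ["Peer Review", "Internal Review"]:
    print(value, "maps to", to_cg_peer_reviewed(value))
```

<p>In this sketch “Internal Review” has no equivalent in the new field, so that information would be lost in a straight migration.</p>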
<h2 id="2019-11-28">2019-11-28</h2>
<ul>
<li>File an issue with CG Core v2 project to ask Marie-Angelique about expanding the scope of <code>cg.peer-reviewed</code> to include other types of review, and possibly to change the field name to something more generic like <code>cg.review-status</code> (<a href="https://github.com/AgriculturalSemantics/cg-core/issues/14">#14</a>)</li>
<li>More review of AReS feedback
<ul>
<li>I clarified some of the feedback</li>
<li>I added status of “Issue Filed”, “Duplicate” and “No Action Required” to several items</li>
<li>I filed a handful more GitHub issues in AReS and OpenRXV GitHub trackers</li>
</ul>
</li>
</ul>
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>