<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="November, 2019" />
<meta property="og:description" content="2019-11-04


Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics


I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694



So 4.6 million from XMLUI and another 1.2 million from API requests

Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456 
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781

" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-11/" />
<meta property="article:published_time" content="2019-11-04T12:20:30+02:00" />
<meta property="article:modified_time" content="2019-11-06T09:35:51+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2019"/>
<meta name="twitter:description" content="2019-11-04


Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics


I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:

# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694



So 4.6 million from XMLUI and another 1.2 million from API requests

Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456 
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781

"/>
<meta name="generator" content="Hugo 0.59.1" />


    
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "November, 2019",
  "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-11\/",
  "wordCount": "1088",
  "datePublished": "2019-11-04T12:20:30+02:00",
  "dateModified": "2019-11-06T09:35:51+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>



    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-11/">

    <title>November, 2019 | CGSpace Notes</title>

    
    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
    

    <!-- RSS 2.0 feed -->
    

    

    

  </head>

  <body>

    
    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>
    

    
    
    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>
    
    

    
    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          


<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-11/">November, 2019</a></h2>
    <p class="blog-post-meta"><time datetime="2019-11-04T12:20:30&#43;02:00">Mon Nov 04, 2019</time> by Alan Orth in 
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2019-11-04">2019-11-04</h2>

<ul>
<li><p>Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics</p>

<ul>
<li><p>I looked in the nginx logs and see 4.6 million in the access logs, and 1.2 million in the API logs:</p>

<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
1277694
</code></pre></li>
</ul></li>

<li><p>So 4.6 million from XMLUI and another 1.2 million from API requests</p></li>

<li><p>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</p>

<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
1183456 
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
</code></pre></li>
</ul>

<ul>
<li><p>The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)</p>

<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | awk '{print $6}' | sed 's/&quot;//' | sort | uniq -c | sort -n
  1 PUT
  8 PROPFIND
283 OPTIONS
30102 POST
46581 HEAD
4594967 GET
</code></pre></li>

<li><p>Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:</p>

<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
365288
</code></pre></li>

<li><p>Their user agent is one I&rsquo;ve never seen before:</p>

<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
</code></pre></li>

<li><p>Most of them seem to be to community or collection discover and browse results pages like <code>/handle/10568/103/discover</code>:</p></li>
</ul>

<h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-o-e-get-bitstream-discover-handle-sort-uniq-c">zcat &ndash;force /var/log/nginx/<em>access.log.</em>.gz | grep -E &ldquo;[0-9]{1,2}/Oct/2019&rdquo; | grep Amazonbot | grep -o -E &ldquo;GET /(bitstream|discover|handle)&rdquo; | sort | uniq -c</h1>

<p>6566 GET /bitstream
 351928 GET /handle</p>

<h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-e-get-bitstream-discover-handle-grep-c-discover">zcat &ndash;force /var/log/nginx/<em>access.log.</em>.gz | grep -E &ldquo;[0-9]{1,2}/Oct/2019&rdquo; | grep Amazonbot | grep -E &ldquo;GET /(bitstream|discover|handle)&rdquo; | grep -c discover</h1>

<p>214209</p>

<h1 id="zcat-force-var-log-nginx-access-log-gz-grep-e-0-9-1-2-oct-2019-grep-amazonbot-grep-e-get-bitstream-discover-handle-grep-c-browse">zcat &ndash;force /var/log/nginx/<em>access.log.</em>.gz | grep -E &ldquo;[0-9]{1,2}/Oct/2019&rdquo; | grep Amazonbot | grep -E &ldquo;GET /(bitstream|discover|handle)&rdquo; | grep -c browse</h1>

<p>86874</p>

<pre><code>
- As far as I can tell, none of their requests are counted in the Solr statistics:

</code></pre>

<p>$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'">http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&amp;rows=0&amp;wt=json&amp;indent=true'</a></p>

<pre><code>
- Still, those requests are CPU intensive so I will add their user agent to the &quot;badbots&quot; rate limiting in nginx to reduce the impact on server load
- After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

</code></pre>

<p>$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/1/discover'">https://dspacetest.cgiar.org/handle/10568/1/discover'</a> User-Agent:&ldquo;Amazonbot/0.1&rdquo;</p>

<pre><code>
- On the topic of spiders, I have been wanting to update DSpace's default list of spiders in `config/spiders/agents`, perhaps by dropping a new list in from [Atmire's COUNTER-Robots](https://github.com/atmire/COUNTER-Robots) project
  - First I checked for a user agent that is in COUNTER-Robots, but NOT in the current `dspace/config/spiders/example` list
  - Then I made some item and bitstream requests on DSpace Test using that user agent:

</code></pre>

<p>$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;iskanie&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;iskanie&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y'">https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y'</a> User-Agent:&ldquo;iskanie&rdquo;</p>

<pre><code>
- A bit later I checked Solr and found three requests from my IP with that user agent this month:

</code></pre>

<p>$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'">http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&amp;fq=dateYearMonth%3A2019-11&amp;rows=0'</a>
&lt;?xml version=&ldquo;1.0&rdquo; encoding=&ldquo;UTF-8&rdquo;?&gt;
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
</response></p>

<pre><code>
- Now I want to make similar requests with a user agent that is included in DSpace's current user agent list:

</code></pre>

<p>$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;celestial&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;celestial&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y'">https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&amp;isAllowed=y'</a> User-Agent:&ldquo;celestial&rdquo;</p>

<pre><code>
- After twenty minutes I didn't see any requests in Solr, so I assume they did not get logged because they matched a bot list...
  - What's strange is that the Solr spider agent configuration in `dspace/config/modules/solr-statistics.cfg` points to a file that doesn't exist...

</code></pre>

<p>spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt</p>

<pre><code>
- Apparently that is part of Atmire's CUA, despite being in a standard DSpace configuration file...
- I tried with some other garbage user agents like &quot;fuuuualan&quot; and they were visible in Solr
  - Now I want to try adding &quot;iskanie&quot; and &quot;fuuuualan&quot; to the list of spider regexes in `dspace/config/spiders/example` and then try to use DSpace's &quot;mark spiders&quot; feature to change them to &quot;isBot:true&quot; in Solr
  - I restarted Tomcat and ran `dspace stats-util -m` and it did some stuff for awhile, but I still don't see any items in Solr with `isBot:true`
  - According to `dspace-api/src/main/java/org/dspace/statistics/util/SpiderDetector.java` the patterns for user agents are loaded from any file in the `config/spiders/agents` directory
  - I downloaded the COUNTER-Robots list to DSpace Test and overwrote the example file, then ran `dspace stats-util -m` and still there were no new items marked as being bots in Solr, so I think there is still something wrong
  - Jesus, the code in `./dspace-api/src/main/java/org/dspace/statistics/util/StatisticsClient.java` says that `stats-util -m` marks spider requests by their IPs, not by their user agents... WTF:

</code></pre>

<p>else if (line.hasOption(&rsquo;m&rsquo;))
{
    SolrLogger.markRobotsByIP();
}</p>

<pre><code>
- WTF again, there is actually a function called `markRobotByUserAgent()` that is never called anywhere!
  - It appears to be unimplemented...
  - I sent a message to the dspace-tech mailing list to ask if I should file an issue

## 2019-11-05

- I added &quot;alanfuu2&quot; to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:

</code></pre>

<p>$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;alanfuuu1&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;alanfuuu2&rdquo;</p>

<pre><code>
- After committing the changes in Solr I saw one request for &quot;alanfuu1&quot; and no requests for &quot;alanfuu2&quot;:

</code></pre>

<p>$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/update?commit=true'">http://localhost:8081/solr/statistics/update?commit=true'</a>
$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&amp;fq=dateYearMonth%3A2019-11'</a> | xmllint &ndash;format - | grep numFound
  <result name="response" numFound="1" start="0">
$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&amp;fq=dateYearMonth%3A2019-11'</a> | xmllint &ndash;format - | grep numFound
  <result name="response" numFound="0" start="0"/></p>

<pre><code>
- So basically it seems like a win to update the example file with the latest one from Atmire's COUNTER-Robots list
  - Even though the &quot;mark by user agent&quot; function is not working (see email to dspace-tech mailing list) DSpace will still not log Solr events from these user agents
- I'm curious how the special character matching is in Solr, so I will test two requests: one with &quot;www.gnip.com&quot; which is in the spider list, and one with &quot;www.gnyp.com&quot; which isn't:

</code></pre>

<p>$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;www.gnip.com&rdquo;
$ http &ndash;print Hh &lsquo;<a href="https://dspacetest.cgiar.org/handle/10568/105487'">https://dspacetest.cgiar.org/handle/10568/105487'</a> User-Agent:&ldquo;www.gnyp.com&rdquo;</p>

<pre><code>
- Then commit changes to Solr so we don't have to wait:

</code></pre>

<p>$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/update?commit=true'">http://localhost:8081/solr/statistics/update?commit=true'</a>
$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&amp;fq=dateYearMonth%3A2019-11'</a> | xmllint &ndash;format - | grep numFound
  <result name="response" numFound="0" start="0"/>
$ http &ndash;print b &lsquo;<a href="http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11'">http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&amp;fq=dateYearMonth%3A2019-11'</a> | xmllint &ndash;format - | grep numFound
  <result name="response" numFound="1" start="0">
```</p>

<ul>
<li>So the blocking seems to be working because &ldquo;www.gnip.com&rdquo; is one of the new patterns added to the spiders file&hellip;</li>
</ul>

<h2 id="2019-11-07">2019-11-07</h2>

<ul>
<li>CCAFS finally confirmed that they do indeed need the confusing new project tag that looks like a duplicate

<ul>
<li>They had proposed a batch of new tags in 2019-09 and we never merged them due to this uncertainty</li>
<li>I have now merged the changes in to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/432">#432</a>)</li>
</ul></li>
</ul>

<!-- vim: set sw=2 ts=2: -->

  

  

</article> 



        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

  
        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>

<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>

<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>

<li><a href="/cgspace-notes/2019-09/">September, 2019</a></li>

<li><a href="/cgspace-notes/2019-08/">August, 2019</a></li>

    </ol>
  </section>

  

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->
    

    
    <footer class="blog-footer">
      <p dir="auto">
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>
    

  </body>

</html>