<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2019" />
<meta property="og:description" content="2019-04-01
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00" />
<meta property="article:modified_time" content="2021-08-18T15:29:31+03:00" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="April, 2019" />
<meta name="twitter:description" content="2019-04-01
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.92.2" />
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "BlogPosting",
    "headline": "April, 2019",
    "url": "https://alanorth.github.io/cgspace-notes/2019-04/",
    "wordCount": "6778",
    "datePublished": "2019-04-01T09:00:43+03:00",
    "dateModified": "2021-08-18T15:29:31+03:00",
    "author": {
        "@type": "Person",
        "name": "Alan Orth"
    },
    "keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-04/">
<title>April, 2019 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">

<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>

<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-04/">April, 2019</a></h2>
<p class="blog-post-meta">
<time datetime="2019-04-01T09:00:43+03:00">Mon Apr 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2019-04-01">2019-04-01</h2>
<ul>
<li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
<ul>
<li>They asked if we had plans to enable RDF support in CGSpace</li>
</ul>
</li>
<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
<ul>
<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
</code></pre><h2 id="2019-04-02">2019-04-02</h2>
<ul>
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
<li>I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
<ul>
<li>If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!</li>
<li>I ended up finding him by searching for his email address</li>
</ul>
</li>
</ul>
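<ul>
<li>For next time, querying the database directly is a reliable fallback when the XMLUI group search misbehaves; a sketch (this assumes the stock DSpace 5 <code>eperson</code> table, and the email pattern here is illustrative):</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -c "SELECT eperson_id, email FROM eperson WHERE LOWER(email) LIKE '%shaw%';"
</code></pre>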
<h2 id="2019-04-03">2019-04-03</h2>
<ul>
<li>Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
<ul>
<li>First I need to extract the ones that are unique from their list compared to our existing one:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
</code></pre><ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
</code></pre><ul>
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user’s name has changed so I will update those using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
</code></pre><ul>
<li>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:</li>
</ul>
<pre tabindex="0"><code>2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>
<pre tabindex="0"><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…</li>
</ul>
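<ul>
<li>An easy way to keep an eye on it is to count the “Updating” messages in the current day’s log; a sketch (hypothetical one-liner, reusing the same grep pattern and log path as above):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.$(date +%Y-%m-%d)
</code></pre>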
<h2 id="2019-04-05">2019-04-05</h2>
<ul>
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb
</code></pre><ul>
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
</ul>
<pre tabindex="0"><code>2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week.png" alt="CPU usage week"></p>
<ul>
<li>The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…</li>
<li>I ran all system updates and rebooted the server
<ul>
<li>The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I restarted it again and all the Solr cores came up properly…</li>
</ul>
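<ul>
<li>For reference, the steal figure is the <code>%steal</code> column of <code>iostat</code>’s CPU report; a quick sketch to watch just that column (assumes the standard sysstat <code>avg-cpu</code> layout, where steal is the fifth field on the line after the header):</li>
</ul>
<pre tabindex="0"><code>$ iostat -c 1 10 | awk 'prev ~ /avg-cpu/ {print $5} {prev=$0}'
</code></pre>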
<h2 id="2019-04-06">2019-04-06</h2>
<ul>
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278">10568/91278</a> didn’t have an Altmetric badge on CGSpace, but on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a> it does
<ul>
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
<li>Twenty minutes later I see the same Altmetric score (9) on CGSpace</li>
</ul>
</li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185
</code></pre><ul>
<li><code>45.5.184.72</code> is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:</li>
</ul>
<pre tabindex="0"><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
</code></pre><ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  22077 /handle/10568/72970/discover
</code></pre><ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
    142 200
  43489 503
</code></pre><ul>
<li>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</li>
<li>Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports
<ul>
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
</ul>
</li>
</ul>
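<ul>
<li>The nginx badbots block is essentially a user agent match that returns HTTP 503, which lines up with the 43,489 blocked requests above; a sketch of the idea (illustrative only, not our exact configuration — the map variable name and pattern here are made up):</li>
</ul>
<pre tabindex="0"><code># in the http {} context: flag matching user agents as bad
map $http_user_agent $is_badbot {
    default        0;
    ~*GuzzleHttp   1;
}

# in a server {} or location {} context: refuse them
if ($is_badbot) {
    return 503;
}
</code></pre>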
<h2 id="2019-04-07">2019-04-07</h2>
<ul>
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
<ul>
<li>I am trying to see if these are registered as downloads in Solr or not</li>
<li>I see 96,925 downloads from their AWS gateway IPs in 2019-03:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>Strangely I don’t see many hits in 2019-04:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>And from the server side, the nginx logs show:</li>
</ul>
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
</code></pre><ul>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
<ul>
<li>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre><ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
<ul>
<li>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre><ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird
<ul>
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
</ul>
</li>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 909,
        "start": 0
    },
    "responseHeader": {
        "QTime": 0,
        "params": {
            "fq": [
                "statistics_type:view",
                "isInternal:true"
            ],
            "indent": "true",
            "q": "type:0 AND time:2019-04-07*",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
<li>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
<ul>
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
</ul>
</li>
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
<ul>
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
</ul>
</li>
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>
<ul>
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
</ul>
</li>
<li>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10–30 percent right now…</li>
<li>The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:</li>
</ul>
<pre tabindex="0"><code>$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
</code></pre><ul>
<li>According to the server logs there is actually not much going on right now:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    118 18.195.78.144
    128 207.46.13.219
    129 167.114.64.100
    159 207.46.13.129
    179 207.46.13.33
    188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    195 66.249.79.59
    363 40.77.167.21
    740 2a01:4f8:140:3192::2
   4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      3 66.249.79.62
      3 66.249.83.196
      4 207.46.13.86
      5 82.145.222.150
      6 2a01:4f9:2b:1263::2
      6 41.204.190.40
      7 35.174.176.49
     10 40.77.167.21
     11 194.246.119.6
     11 66.249.79.59
</code></pre><ul>
<li><code>45.5.184.72</code> is CIAT, who I already blocked and am waiting to hear from</li>
<li><code>2a01:4f8:140:3192::2</code> is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve</li>
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     23 dspaceWeb
</code></pre><ul>
<li>It seems that the issue with CGSpace being “down” is actually because of CPU steal again!!!</li>
<li>I opened a ticket with support and asked them to migrate the VM to a less busy host</li>
</ul>
<h2 id="2019-04-08">2019-04-08</h2>
<ul>
<li>Start checking IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
<ul>
<li>Lots of problems with affiliations; I had to correct about sixty of them</li>
<li>I used lein to run reconcile-csv, hosting the latest CSV of our affiliations for OpenRefine to reconcile against:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
</code></pre><ul>
<li>After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
<ul>
<li>The matched values can be accessed with <code>cell.recon.match.name</code>, but some of the new values don’t appear, perhaps because I edited the original cell values?</li>
<li>I ended up using this GREL expression to copy all values to a new column:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>if(cell.recon.matched, cell.recon.match.name, value)
</code></pre><ul>
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre><ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Fix a few more messed up affiliations that have carriage return characters in them (use Ctrl-V Ctrl-M to re-create the control character):</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
</code></pre><ul>
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre><ul>
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
</code></pre><ul>
<li>UptimeRobot said that CGSpace (linode18) went down tonight
<ul>
<li>The load average is at <code>9.42, 8.87, 7.87</code></li>
<li>I looked at PostgreSQL and see shitloads of connections there:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
    250 dspaceWeb
</code></pre><ul>
<li>On a related note I see connection pool errors in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>But still I see 10 to 30% CPU steal in <code>iostat</code> that is also reflected in the Munin graphs:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week2.png" alt="CPU usage week"></p>
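<p>For reference, the steal number is the <code>%steal</code> column of the <code>avg-cpu</code> report from <code>iostat</code>; a minimal sketch of pulling it out with awk (the sample line here is made up for illustration, not real CGSpace data):</p>

```shell
# avg-cpu columns are: %user %nice %system %iowait %steal %idle
# Parse a captured sample line so the pipeline is testable without a hypervisor
sample='12.10  0.00  2.30  0.50  28.40  56.70'
steal=$(echo "$sample" | awk '{print $5}')
echo "steal: ${steal}%"
```

<p>The same awk invocation works on the live report, e.g. <code>iostat -c 1 5</code> piped through it, filtering for the numeric data lines.</p>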
<ul>
<li>Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    124 40.77.167.135
    135 95.108.181.88
    139 157.55.39.206
    190 66.249.79.133
    202 45.5.186.2
    284 207.46.13.95
    359 18.196.196.108
    457 157.55.39.164
    457 40.77.167.132
   3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 129.0.79.206
      5 41.205.240.21
      7 207.46.13.95
      7 66.249.79.133
      7 66.249.79.135
      7 95.108.181.88
      8 40.77.167.111
     19 157.55.39.164
     20 40.77.167.132
    370 51.254.16.223
</code></pre><h2 id="2019-04-09">2019-04-09</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was at 440% CPU for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     18 66.249.79.139
     21 157.55.39.160
     29 66.249.79.137
     38 66.249.79.135
     50 34.200.212.137
     54 66.249.79.133
    100 102.128.190.18
   1166 45.5.184.72
   4251 45.5.186.2
   4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    200 144.48.242.108
    202 207.46.13.185
    206 18.194.46.84
    239 66.249.79.139
    246 18.196.196.108
    274 31.6.77.23
    289 66.249.79.137
    312 157.55.39.160
    441 66.249.79.135
    856 66.249.79.133
</code></pre><ul>
<li><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
</code></pre><ul>
<li>Database connection usage looks fine:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     11 dspaceWeb
</code></pre><ul>
<li>Ironically I do still see some 2 to 10% of CPU steal in <code>iostat 1 10</code></li>
<li>Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
<ul>
<li>I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items</li>
</ul>
</li>
<li>In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night</li>
</ul>
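<p>The enumeration flow I suggested can be sketched in a few lines of Python. The endpoint paths follow the DSpace 5 REST API (<code>/rest/collections/{id}/items</code> with <code>limit</code>/<code>offset</code> paging), but the collection ID and item count here are hypothetical; the real ID comes from resolving the collection Handle via <code>/rest/handle/&lt;handle&gt;</code> first:</p>

```python
def item_page_urls(base, collection_id, total, limit=100):
    """Build the paginated /items request URLs for one collection."""
    return [
        '%s/rest/collections/%d/items?limit=%d&offset=%d' % (base, collection_id, limit, offset)
        for offset in range(0, total, limit)
    ]

# Hypothetical collection ID and size, for illustration only
urls = item_page_urls('https://cgspace.cgiar.org', 1445, total=250, limit=100)
for url in urls:
    print(url)  # each page would be fetched with an HTTP GET returning JSON
```

<p>Fetching each URL and concatenating the JSON arrays gives the full item list, which avoids the harvester hammering the XMLUI pages one item at a time.</p>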
<h2 id="2019-04-10">2019-04-10</h2>
<ul>
<li>Abenet pointed out the possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked</li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
</code></pre><ul>
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF formats</a></li>
<li>I did a quick test of the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to do some manual checking and informed decision making…</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>
<pre tabindex="0"><code>from habanero import Crossref
cr = Crossref(mailto="me@cgiar.org")
x = cr.funders(query="mercator")
</code></pre><h2 id="2019-04-11">2019-04-11</h2>
<ul>
<li>Continue proofing IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
<ul>
<li>One misspelled country</li>
<li>Three incorrect regions</li>
<li>Potential duplicates (same DOI, similar title, same authors):
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100580">10568/100580</a> and <a href="https://dspacetest.cgiar.org/handle/10568/100579">10568/100579</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100444">10568/100444</a> and <a href="https://dspacetest.cgiar.org/handle/10568/100423">10568/100423</a></li>
</ul>
</li>
<li>Two DOIs with incorrect URL formatting</li>
<li>Two misspelled IITA subjects</li>
<li>Two authors with smart quotes</li>
<li>One misspelled “Peer review”</li>
<li>One invalid ISBN that I fixed by Googling the title</li>
<li>Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc.)</li>
<li>I validated all the AGROVOC subjects against our latest list with reconcile-csv
<ul>
<li>About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually</li>
</ul>
</li>
</ul>
</li>
<li>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
</code></pre><ul>
<li>Answer more questions about DOIs and Altmetric scores from WLE</li>
<li>Answer more questions about DOIs and Altmetric scores from IWMI
<ul>
<li>They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs</li>
<li>To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link!</li>
<li>Then, a bunch of their items don’t have scores because they never tweeted them!</li>
<li>They added a DOI to this old item <a href="https://cgspace.cgiar.org/handle/10568/97087">10568/97087</a> this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically</li>
<li>We should check in a week to see if Altmetric will make the association when they harvest again</li>
</ul>
</li>
</ul>
<h2 id="2019-04-13">2019-04-13</h2>
<ul>
<li>I copied the <code>statistics</code> and <code>statistics-2018</code> Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing.png" alt="Java GC during Solr indexing with CMS"></p>
<ul>
<li>It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection</li>
<li>I am curious whether the GC pattern would be different if I switched from <code>-XX:+UseConcMarkSweepGC</code> to G1GC</li>
<li>I switched to G1GC and restarted Tomcat, but for some reason I couldn’t see the Tomcat PID in VisualVM…
<ul>
<li>Anyways, the indexing process took much longer, perhaps twice as long!</li>
</ul>
</li>
<li>I tried again with the GC tuning settings from the Solr 4.10.4 release:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png" alt="Java GC during Solr indexing with Solr 4.10.4 settings"></p>
< h2 id = "2019-04-14" > 2019-04-14< / h2 >
2019-04-14 15:59:47 +02:00
< ul >
2021-08-18 15:24:40 +02:00
< li > Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > GC_TUNE=" -XX:NewRatio=3 \
2019-11-28 16:30:45 +01:00
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"
< / code > < / pre > < ul >
< li > I need to remember to check the Munin JVM graphs in a few days< / li >
< li > It might be placebo, but the site < em > does< / em > feel snappier… < / li >
2019-04-14 15:59:47 +02:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2019-04-15" > 2019-04-15< / h2 >
2019-04-15 11:58:07 +02:00
< ul >
< li > Rework the dspace-statistics-api to use the vanilla Python requests library instead of Solr client
< ul >
< li > < a href = "https://github.com/ilri/dspace-statistics-api/releases/tag/v1.0.0" > Tag version 1.0.0< / a > and deploy it on DSpace Test< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
2020-01-27 15:20:44 +01:00
< li > Pretty annoying to see CGSpace (linode18) with 20– 50% CPU steal according to < code > iostat 1 10< / code > , though I haven’ t had any Linode alerts in a few days< / li >
< li > Abenet sent me a list of ILRI items that don’ t have CRPs added to them
2019-04-15 16:42:53 +02:00
< ul >
2020-01-27 15:20:44 +01:00
< li > The spreadsheet only had Handles (no IDs), so I’ m experimenting with using Python in OpenRefine to get the IDs< / li >
2019-11-28 16:30:45 +01:00
< li > I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:< / li >
< / ul >
< / li >
< / ul >
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2

# This runs in OpenRefine's Jython (Python 2) environment, where "value"
# is the current cell and the top-level "return" sets the new cell value
handle = re.findall('[0-9]+/[0-9]+', value)

url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']

return item_id
</code></pre><ul>
<li>Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed
<ul>
<li>It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs…</li>
</ul>
</li>
<li>I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    82m45.324s
user    7m33.446s
sys     2m13.463s
</code></pre><h2 id="2019-04-16">2019-04-16</h2>
<ul>
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something</li>
</ul>
< h2 id = "2019-04-17" > 2019-04-17< / h2 >
2019-04-17 12:38:50 +02:00
< ul >
< li > Reading an interesting < a href = "https://teaspoon-consulting.com/articles/solr-cache-tuning.html" > blog post about Solr caching< / a > < / li >
< li > Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (< code > statistics< / code > and < code > statistics-2018< / code > ) and monitored the memory usage of Tomcat in VisualVM< / li >
< li > 4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.11s user 0.44s system 0% cpu 13:45.07 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.04 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.23s user 0.43s system 0% cpu 13:46.10 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.06 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.23s user 0.42s system 0% cpu 13:14.70 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.13 GiB< / li >
< li > < code > filterCache< / code > size: 482, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 167903, < code > cumulative_hitratio< / code > : 0.02< / li >
< li > queryResultCache size: 2< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.92s user 0.39s system 0% cpu 12:33.08 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.16 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.10s user 0.39s system 0% cpu 12:25.32 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.07 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.29s user 0.36s system 0% cpu 11:53.47 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.08 GiB< / li >
< li > < code > filterCache< / code > size: 951, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 254379, < code > cumulative_hitratio< / code > : 0.04< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.90s user 0.48s system 0% cpu 10:37.31 total< / li >
< li > Tomcat max JVM heap usage: 1.96 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 2354237, < code > cumulative_hits< / code > : 180111, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.97s user 0.39s system 0% cpu 10:40.06 total< / li >
< li > Tomcat max JVM heap usage: 2.09 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 4708473, < code > cumulative_hits< / code > : 360068, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.28s user 0.37s system 0% cpu 10:49.56 total< / li >
< li > Tomcat max JVM heap usage: 2.05 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 540020, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.88s user 0.35s system 0% cpu 8:29.55 total< / li >
< li > Tomcat max JVM heap usage: 2.15 GiB< / li >
< li > < code > filterCache< / code > size: 3770, < code > cumulative_lookups< / code > : 2354237, < code > cumulative_hits< / code > : 414512, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.01s user 0.38s system 0% cpu 9:15.65 total< / li >
< li > Tomcat max JVM heap usage: 2.17 GiB< / li >
< li > < code > filterCache< / code > size: 3945, < code > cumulative_lookups< / code > : 4708473, < code > cumulative_hits< / code > : 829093, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.01s user 0.40s system 0% cpu 9:01.31 total< / li >
< li > Tomcat max JVM heap usage: 2.07 GiB< / li >
< li > < code > filterCache< / code > size: 3770, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 1243632, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > The biggest takeaway I have is that this workload benefits from a larger < code > filterCache< / code > (for Solr fq parameter), but barely uses the < code > queryResultCache< / code > (for Solr q parameter) at all
< ul >
2020-01-27 15:20:44 +01:00
< li > The number of hits goes up and the time taken decreases when we increase the < code > filterCache< / code > , and total JVM heap memory doesn’ t seem to increase much at all< / li >
< li > I guess the < code > queryResultCache< / code > size is always 2 because I’ m only doing two queries: < code > type:0< / code > and < code > type:2< / code > (downloads and views, respectively)< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:< / li >
< / ul >
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-4096-filterCache.png" alt="VisualVM Tomcat 4096 filterCache"></p>
<ul>
<li>I ran one test with a <code>filterCache</code> of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it <em>drastically</em> increased the performance and lowered the memory usage of the dspace-statistics-api indexer</li>
<li>4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
<ul>
<li>Run 1:
<ul>
<li>Time: 2.85s user 0.42s system 2% cpu 2:28.92 total</li>
<li>Tomcat max JVM heap usage: 1.90 GiB</li>
<li><code>filterCache</code> size: 14851, <code>cumulative_lookups</code>: 2354237, <code>cumulative_hits</code>: 2331186, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
<li>Run 2:
<ul>
<li>Time: 2.90s user 0.37s system 2% cpu 2:23.50 total</li>
<li>Tomcat max JVM heap usage: 1.27 GiB</li>
<li><code>filterCache</code> size: 15834, <code>cumulative_lookups</code>: 4708476, <code>cumulative_hits</code>: 4664762, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
<li>Run 3:
<ul>
<li>Time: 2.93s user 0.39s system 2% cpu 2:26.17 total</li>
<li>Tomcat max JVM heap usage: 1.05 GiB</li>
<li><code>filterCache</code> size: 15248, <code>cumulative_lookups</code>: 7062715, <code>cumulative_hits</code>: 6998267, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
</ul>
</li>
<li>The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-16384-filterCache.png" alt="VisualVM Tomcat 16384 filterCache"></p>
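<p>Deploying the larger cache is a one-line change in each statistics core's <code>solrconfig.xml</code>; a sketch for Solr 4.x, where the cache class and the <code>initialSize</code>/<code>autowarmCount</code> values here are illustrative rather than the exact ones I deployed:</p>

```xml
<!-- Larger filterCache for the statistics cores; size matches the test above,
     initialSize and autowarmCount are illustrative and should be reviewed -->
<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="4096"/>
```

<p>The core needs a reload (or Tomcat restart) for the new cache size to take effect.</p>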
<ul>
<li>I will deploy this <code>filterCache</code> setting on DSpace Test (linode19)</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Lots of CPU steal still going on on CGSpace (linode18):</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week3.png" alt="CPU usage week"></p>
<h2 id="2019-04-18">2019-04-18</h2>
<ul>
<li>I’ve been trying to copy the <code>statistics-2018</code> Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
<ul>
<li>I opened a support ticket to ask Linode to investigate</li>
<li>They asked me to send an <code>mtr</code> report from Fremont to Frankfurt and vice versa</li>
</ul>
</li>
<li>Deploy Tomcat 7.0.94 on DSpace Test (linode19)
<ul>
<li>Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of how Ansible formatted the options string</li>
<li>I needed to use the “folded” YAML variable format <code>&gt;-</code> (with the dash so it doesn’t add a newline at the end)</li>
</ul>
</li>
<li>UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with <code>iostat 1 10</code> and it’s in the 50s and 60s
<ul>
<li>The Munin graph shows a lot of CPU steal (red) currently (and overall during the week):</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week4.png" alt="CPU usage week"></p>
<ul>
<li>I opened a ticket with Linode to migrate us somewhere
<ul>
<li>They agreed to migrate us to a quieter host</li>
</ul>
</li>
</ul>
<h2 id="2019-04-20">2019-04-20</h2>
<ul>
<li>Linode agreed to move CGSpace (linode18) to a new machine shortly after I filed my ticket about CPU steal two days ago and now the load is much more sane:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week5.png" alt="CPU usage week"></p>
2019-04-20 10:48:26 +02:00
< ul >
< li > For future reference, Linode mentioned that they consider CPU steal above 8% to be significant< / li >
2019-11-28 16:30:45 +01:00
< li > Regarding the other Linode issue about speed, I did a test with < code > iperf< / code > between linode18 and linode19:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # iperf -s
2019-04-20 11:20:12 +02:00
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 45.79.x.x port 5001 connected with 139.162.x.x port 51378
------------------------------------------------------------
Client connecting to 139.162.x.x, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 5] local 45.79.x.x port 36440 connected with 139.162.x.x port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.2 sec 172 MBytes 142 Mbits/sec
[ 4] 0.0-10.5 sec 202 MBytes 162 Mbits/sec
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Even with the software firewalls disabled the rsync speed was low, so it’s not a rate limiting issue< / li >
< li > I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
< ul >
< li > I updated the Linode issue with this information< / li >
< / ul >
< / li >
< li > I’m going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode’s latest x86_64
< ul >
< li > Nope, still 20KiB/sec< / li >
< / ul >
< / li >
< / ul >
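A quick way to sanity-check single-stream HTTPS throughput from the client side is curl's `speed_download` write-out variable (a sketch; the bitstream URL below is a placeholder, not a real file):

```shell
# Print the average download speed (bytes/sec) of a single HTTPS transfer;
# the URL is a placeholder standing in for any large CGSpace bitstream
curl -s -o /dev/null -w '%{speed_download}\n' https://cgspace.cgiar.org/bitstream/example.pdf
```

A result around 20480 bytes/sec would confirm the 20KiB/sec cap independently of rsync.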
< h2 id = "2019-04-21" > 2019-04-21< / h2 >
< ul >
< li > Deploy Solr 4.10.4 on CGSpace (linode18)< / li >
< li > Deploy Tomcat 7.0.94 on CGSpace< / li >
< li > Deploy dspace-statistics-api v1.0.0 on CGSpace< / li >
< li > Linode support replicated the results I had from the network speed testing and said they don’t know why it’s so slow
< ul >
< li > They offered to live migrate the instance to another host to see if that helps< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-04-22" > 2019-04-22< / h2 >
< ul >
< li > Abenet pointed out < a href = "https://hdl.handle.net/10568/97912" > an item< / a > that doesn’t have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
< ul >
< li > I tweeted the Handle to see if it will pick it up… < / li >
< li > Like clockwork, after fifteen minutes there was a donut showing on CGSpace< / li >
< / ul >
< / li >
< li > I want to get rid of this annoying warning that is constantly in our DSpace logs:< / li >
< / ul >
< pre tabindex = "0" > < code > 2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
< / code > < / pre > < ul >
< li > Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):< / li >
< / ul >
< pre tabindex = "0" > < code > $ grep -c 'Falling back to request address' dspace.log.2019-04-20
dspace.log.2019-04-20:1515
< / code > < / pre > < ul >
< li > I will fix it in < code > dspace/config/modules/oai.cfg< / code > < / li >
< li > Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host
< ul >
< li > I told them to migrate it at 04:00:00AM Frankfurt time, when nobody in East Africa, Europe, or South America should be using the server< / li >
< / ul >
< / li >
< / ul >
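The oai.cfg fix mentioned above is presumably just a matter of defining the missing property explicitly (a sketch; the `dspace.baseUrl` interpolation is assumed from the main DSpace configuration):

```
# dspace/config/modules/oai.cfg
dspace.oai.url = ${dspace.baseUrl}/oai
```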
< h2 id = "2019-04-23" > 2019-04-23< / h2 >
< ul >
< li > One blog post says that there is < a href = "https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/" > no overprovisioning in Azure< / a > :< / li >
< / ul >
< ul >
< li > Perhaps that’s why the < a href = "https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/" > Azure pricing< / a > is so expensive!< / li >
< li > Add a privacy page to CGSpace
< ul >
< li > The work was mostly similar to the About page at < code > /page/about< / code > , but in addition to adding i18n strings etc, I had to add the logic for the trail to < code > dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl< / code > < / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-04-24" > 2019-04-24< / h2 >
< ul >
< li > Linode migrated CGSpace (linode18) to a new host, but I am still getting poor performance when copying data to DSpace Test (linode19)
< ul >
< li > I asked them if we can migrate DSpace Test to a new host< / li >
< li > They migrated DSpace Test to a new host and the rsync speed from Frankfurt was still capped at 20KiB/sec… < / li >
< li > I booted DSpace Test to a rescue CD and tried the rsync from CGSpace there too, but it was still capped at 20KiB/sec… < / li >
< li > I copied the 18GB < code > statistics-2018< / code > Solr core from Frankfurt to a Linode in London at 15MiB/sec, then from the London one to DSpace Test in Fremont at 15MiB/sec… so WTF is up with Frankfurt→Fremont?!< / li >
< / ul >
< / li >
< li > Finally upload the 218 IITA items from March to CGSpace
< ul >
< li > Abenet and I had to do a little bit more work to correct the metadata of one item that appeared to be a duplicate, but really just had the wrong DOI< / li >
< / ul >
< / li >
< li > While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (< code > dc.identifier.uri< / code > )
< ul >
< li > According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn’t… < / li >
< li > I exported the IITA community as a CSV then used < code > csvcut< / code > to extract the two URI columns and identify and fix the records:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
< / code > < / pre > < ul >
< li > Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
< ul >
< li > I told him we never finished it, and that he should try to use the < code > /items/find-by-metadata-field< / code > endpoint, with the caveat that you need to match the language attribute exactly (ie “en”, “en_US”, null, etc)< / li >
< li > I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)< / li >
< li > He says he’s getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
curl: (22) The requested URL returned error: 401
< / code > < / pre > < ul >
< li > Note that curl only shows the HTTP 401 error if you use < code > -f< / code > (fail), and only then if you < em > don’t< / em > include < code > -s< / code >
< ul >
< li > I see there are about 1,000 items using CPWF subject “WATER MANAGEMENT” in the database, so there should definitely be results< / li >
< li > The breakdown of < code > text_lang< / code > fields used in those 942 items is:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
count
-------
376
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
count
-------
149
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
count
-------
417
(1 row)
< / code > < / pre > < ul >
< li > I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:< / li >
< / ul >
< pre tabindex = "0" > < code > 2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
< / code > < / pre > < ul >
< li > Nevertheless, if I request using the < code > null< / code > language I get 1020 results, plus 179 for a blank language attribute:< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
1020
$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
179
< / code > < / pre > < ul >
< li > This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
count
-------
942
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
count
-------
1156
(1 row)
< / code > < / pre > < ul >
< li > I sent a message to the dspace-tech mailing list to ask for help< / li >
2019-04-24 17:49:55 +02:00
< / ul >
< h2 id = "2019-04-25" > 2019-04-25< / h2 >
< ul >
< li > Peter pointed out that we need to remove Delicious and Google+ from our social sharing links
< ul >
< li > Also, it would be nice if we could include the item title in the shared link< / li >
< li > I created an issue on GitHub to track this (< a href = "https://github.com/ilri/DSpace/issues/419" > #419< / a > )< / li >
< / ul >
< / li >
< li > I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
$ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
$ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
< / code > < / pre > < ul >
< li > I created a normal user for Carlos to try as an unprivileged user:< / li >
< / ul >
< pre tabindex = "0" > < code > $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
< / code > < / pre > < ul >
< li > But still I get the HTTP 401 and I have no idea which item is causing it< / li >
< li > I enabled more verbose logging in < code > ItemsResource.java< / code > and now I can at least see the item ID that causes the failure…
< ul >
< li > The item is not even in the archive, but somehow it is discoverable< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2016-03-30 09:00:52.131+00 | | t
(1 row)
< / code > < / pre > < ul >
< li > I tried to delete the item in the web interface, and it seemed to succeed, but I can still access the item in the admin interface, and nothing changed in PostgreSQL< / li >
< li > Meet with CodeObia to see progress on AReS version 2< / li >
< li > Marissa Van Epp asked me to add a few new metadata values to their Phase II Project Tags field (cg.identifier.ccafsprojectpii)
< ul >
< li > I created a < a href = "https://github.com/ilri/DSpace/pull/420" > pull request< / a > for it and will do it the next time I run updates on CGSpace< / li >
< / ul >
< / li >
< li > Communicate with Carlos Tejo from the Land Portal about the < code > /items/find-by-metadata-value< / code > endpoint< / li >
< li > Run all system updates on DSpace Test (linode19) and reboot it< / li >
< / ul >
< h2 id = "2019-04-26" > 2019-04-26< / h2 >
< ul >
< li > Export a list of authors for Peter to look through:< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
COPY 65752
< / code > < / pre > < h2 id = "2019-04-28" > 2019-04-28< / h2 >
< ul >
< li > Still trying to figure out the issue with the items that cause the REST API’s < code > /items/find-by-metadata-value< / code > endpoint to throw an exception
< ul >
< li > I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
2019-11-28 16:30:45 +01:00
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
2019-04-28 18:07:51 +02:00
(1 row)
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > And I tried the < code > curl< / code > command from above again, but I still get the HTTP 401 and and the same error in the DSpace log:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > 2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-04-13 16:24:05 +02:00
< li > I even tried to “ expunge” the item using an < a href = "https://wiki.lyrasis.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems" > action in CSV< / a > , and it said “ EXPUNGED!” but the item is still there… < / li >
2019-04-28 18:07:51 +02:00
< / ul >
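For reference, the batch metadata editing CSV for that attempt looked roughly like this (a sketch; expunging also has to be enabled in the bulk edit module configuration, otherwise the action is ignored):

```
id,collection,action
74648,,expunge
```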
< h2 id = "2019-04-30" > 2019-04-30< / h2 >
< ul >
< li > Send mail to the dspace-tech mailing list to ask about the item expunge issue< / li >
< li > Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:< / li >
< / ul >
< pre tabindex = "0" > < code > $ podman rm -f dspacedb
$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
< / code > < / pre > < ul >
< li > Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV
< ul >
< li > In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
text_lang | count
-----------+---------
| 358647
* | 11
E. | 1
en | 1635
en_US | 602312
es | 12
es_ES | 2
ethnob | 1
fr | 2
spa | 2
| 1074345
(11 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
UPDATE 14
< / code > < / pre > < ul >
< li > Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos< / li >
< li > In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields… we should check, for example:
< ul >
< li > issue dates< / li >
< li > items missing handles< / li >
< li > authorship types< / li >
< / ul >
< / li >
< / ul >
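Those checks could be scripted against the exported CSV; a minimal sketch, assuming the column names `dc.date.issued` and `dc.identifier.uri` from CGSpace's metadata schema (the sample rows below are hypothetical):

```python
import csv
import re
from io import StringIO

def check_rows(rows):
    """Return ids with suspect issue dates and ids with missing handles."""
    bad_dates, missing_handles = [], []
    # accept ISO-ish dates: YYYY, YYYY-MM, or YYYY-MM-DD
    date_re = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
    for row in rows:
        issued = row.get("dc.date.issued", "")
        if issued and not date_re.match(issued):
            bad_dates.append(row["id"])
        if not row.get("dc.identifier.uri"):
            missing_handles.append(row["id"])
    return bad_dates, missing_handles

# hypothetical sample rows standing in for the real repository export
sample = StringIO(
    "id,dc.date.issued,dc.identifier.uri\n"
    "1,2019-04-30,https://hdl.handle.net/10568/12345\n"
    "2,30/04/2019,https://hdl.handle.net/10568/12346\n"
    "3,2019,\n"
)
print(check_rows(csv.DictReader(sample)))  # (['2'], ['3'])
```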
< / article >
< / div > <!-- /.blog - main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2022-03/" > March, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-02/" > February, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-01/" > January, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2021-12/" > December, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-11/" > November, 2021< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >