mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-22 11:43:00 +01:00
905 lines
36 KiB
HTML
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="April, 2019" />
<meta property="og:description" content="2019-04-01

Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc

They asked if we had plans to enable RDF support in CGSpace

There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200

In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:

$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00"/>
<meta property="article:modified_time" content="2019-04-10T09:48:40+03:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2019"/>
<meta name="generator" content="Hugo 0.54.0" />

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "April, 2019",
  "url": "https://alanorth.github.io/cgspace-notes/2019-04/",
  "wordCount": "3246",
  "datePublished": "2019-04-01T09:00:43+03:00",
  "dateModified": "2019-04-10T09:48:40+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-04/">

<title>April, 2019 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-04/">April, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-04-01T09:00:43+03:00">Mon Apr 01, 2019</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
</header>
<h2 id="2019-04-01">2019-04-01</h2>

<ul>
<li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc

<ul>
<li>They asked if we had plans to enable RDF support in CGSpace</li>
</ul></li>
<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today

<ul>
<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li>
</ul></li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre>
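
<p>The same tally can be sketched in Python. This is only an illustration, assuming nginx's default combined log format, where the client IP is the first whitespace-separated field and the status code is the ninth (the field that <code>awk '{print $9}'</code> picks out). The name <code>count_statuses</code> is hypothetical.</p>

```python
from collections import Counter


def count_statuses(lines, needle, ips):
    """Count HTTP status codes for log lines that mention `needle` and whose client IP is in `ips`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Combined log format: fields[0] is the client IP, fields[8] is the status code
        if len(fields) < 9 or needle not in line:
            continue
        if fields[0] in ips:
            counts[fields[8]] += 1
    return counts


if __name__ == "__main__":
    amazon_ips = {"18.196.196.108", "18.195.78.144", "18.195.218.6"}
    with open("/var/log/nginx/access.log") as log:
        for status, total in count_statuses(log, "Spore-192-EN-web.pdf", amazon_ips).items():
            print(total, status)
```

<p>Unlike the <code>grep -E</code> version, where the unescaped dots match any character anywhere in the line, this sketch compares the client IP field exactly.</p>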

<ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
</code></pre>

<h2 id="2019-04-02">2019-04-02</h2>

<ul>
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
<li>I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page

<ul>
<li>If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!</li>
<li>I ended up finding him via searching for his email address</li>
</ul></li>
</ul>

<h2 id="2019-04-03">2019-04-03</h2>

<ul>
<li>Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary

<ul>
<li>First I need to extract the ones that are unique from their list compared to our existing one:</li>
</ul></li>
</ul>

<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
</code></pre>
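
<p>For reference, a minimal Python sketch of the same extraction, reusing the regular expression from the <code>grep -oE</code> call above. The function name <code>unique_orcids</code> and the identifiers in the usage note are made up for illustration.</p>

```python
import re

# Same pattern as the grep -oE call: four dash-separated groups of four characters
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Return the sorted, de-duplicated ORCID identifiers found in any of the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)
```

<p>Calling <code>unique_orcids(vocabulary_xml, new_list)</code> with the contents of both files plays the role of <code>cat … | grep -oE … | sort -u</code>. Like the original pattern it is deliberately loose and does not validate the ORCID checksum.</p>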

<ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
</ul>

<pre><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
</code></pre>

<ul>
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user’s name has changed so I will update it using my <code>fix-metadata-values.py</code> script:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
</code></pre>

<ul>
<li>I created a pull request and merged the changes to the 5_x-prod branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:</li>
</ul>

<pre><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
</code></pre>

<ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>

<pre><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
1
3 http://localhost:8081/solr//statistics-2017
5662 http://localhost:8081/solr//statistics-2018
</code></pre>

<ul>
<li>I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…</li>
</ul>

<h2 id="2019-04-05">2019-04-05</h2>

<ul>
<li>UptimeRobot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
10 dspaceCli
250 dspaceWeb
</code></pre>

<ul>
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
</ul>

<pre><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
</code></pre>

<ul>
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
</ul>

<p><img src="/cgspace-notes/2019/04/cpu-week.png" alt="CPU usage week" /></p>

<ul>
<li>The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…</li>
<li>I ran all system updates and rebooted the server

<ul>
<li>The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:</li>
</ul></li>
</ul>

<pre><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre>

<ul>
<li>I restarted it again and all the Solr cores came up properly…</li>
</ul>

<h2 id="2019-04-06">2019-04-06</h2>

<ul>
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278">10568/91278</a> didn’t have an Altmetric badge on CGSpace, but on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a> it does

<ul>
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
<li>Twenty minutes later I see the same Altmetric score (9) on CGSpace</li>
</ul></li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
328 66.249.79.33
564 207.46.13.210
566 66.249.79.62
575 40.77.167.66
1803 66.249.79.59
2834 2a01:4f8:140:3192::2
9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
31 66.249.79.62
41 207.46.13.210
42 40.77.167.66
54 42.113.50.219
132 66.249.79.59
785 2001:41d0:d:1990::
1164 45.5.184.72
2014 50.116.102.77
4267 45.5.186.2
4893 205.186.128.185
</code></pre>

<ul>
<li><code>45.5.184.72</code> is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:</li>
</ul>

<pre><code>GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
</code></pre>

<ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
22077 /handle/10568/72970/discover
</code></pre>

<ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
142 200
43489 503
</code></pre>

<ul>
<li>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</li>
<li>Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports

<ul>
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
</ul></li>
</ul>

<h2 id="2019-04-07">2019-04-07</h2>

<ul>
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams

<ul>
<li>I am trying to see if these are registered as downloads in Solr or not</li>
<li>I see 96,925 downloads from their AWS gateway IPs in 2019-03:</li>
</ul></li>
</ul>

<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre>

<ul>
<li>Strangely I don’t see many hits in 2019-04:</li>
</ul>

<pre><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre>
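
<p>Hand-encoding these Solr URLs is error-prone, so here is a hedged sketch of building them with Python's standard library instead. The helper name <code>build_stats_query</code> is hypothetical; the query and filter values are the ones used in the two requests above.</p>

```python
from urllib.parse import urlencode


def build_stats_query(base_url, ips, year_month):
    """Build a Solr select URL counting bitstream views from the given IPs in one month."""
    ip_clause = " OR ".join(f"ip:{ip}" for ip in ips)
    params = [
        ("q", f"type:0 AND ({ip_clause})"),
        # Repeated fq parameters are each applied as a separate filter query
        ("fq", "statistics_type:view"),
        ("fq", "bundleName:ORIGINAL"),
        ("fq", f"dateYearMonth:{year_month}"),
        ("rows", "0"),  # only numFound is needed, not the documents
        ("wt", "json"),
        ("indent", "true"),
    ]
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    gateways = ["18.196.196.108", "18.195.78.144", "18.195.218.6"]
    print(build_stats_query("http://localhost:8081/solr/statistics/select", gateways, "2019-04"))
```

<p>The only difference from the hand-written URL is that <code>urlencode</code> also percent-encodes the parentheses (as <code>%28</code> and <code>%29</code>), which decodes to the same query.</p>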

<ul>
<li>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</li>
</ul>

<pre><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block
</code></pre>

<ul>
<li>And from the server side, the nginx logs show:</li>
</ul>

<pre><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
</code></pre>
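
<p>To compare the two requests programmatically, the method, status and body size can be pulled out of each log line. This is a small sketch that assumes the combined log format shown in the two lines above; <code>parse_request</code> is a hypothetical helper name.</p>

```python
import re

# client IP, two ignored fields, [timestamp], "METHOD path protocol", status, body bytes sent
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)


def parse_request(line):
    """Return (method, status, bytes_sent) for one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match["method"], int(match["status"]), int(match["bytes"])
```

<p>Applied to the two lines above, this gives 68078 body bytes for the GET and 0 for the HEAD, both with status 200.</p>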

<ul>
<li>So the transfer is definitely smaller with a HEAD request, but I need to wait to see if these requests show up in Solr

<ul>
<li>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
</ul></li>
</ul>

<pre><code>2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre>

<ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned

<ul>
<li>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</li>
</ul></li>
</ul>

<pre><code>2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre>

<ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird

<ul>
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
</ul></li>
<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>

<pre><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 909,
        "start": 0
    },
    "responseHeader": {
        "QTime": 0,
        "params": {
            "fq": [
                "statistics_type:view",
                "isInternal:true"
            ],
            "indent": "true",
            "q": "type:0 AND time:2019-04-07*",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre>

<ul>
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
<li>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week

<ul>
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
</ul></li>
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded

<ul>
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
</ul></li>
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>

<ul>
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
</ul></li>
<li>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10–30 percent right now…</li>
<li>The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:</li>
</ul>

<pre><code>$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
</code></pre>
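
<p>For anyone unfamiliar with the format: the first three fields of <code>/proc/loadavg</code> are the 1, 5 and 15 minute load averages, the fourth is runnable/total scheduling entities, and the fifth is the most recently assigned PID. A minimal Python sketch of reading it (the function and key names are my own):</p>

```python
def parse_loadavg(text):
    """Split the single line of /proc/loadavg into named values."""
    one, five, fifteen, entities, last_pid = text.split()
    running, total = entities.split("/")
    return {
        "load_1m": float(one),
        "load_5m": float(five),
        "load_15m": float(fifteen),
        "running": int(running),  # currently runnable scheduling entities
        "total": int(total),      # scheduling entities that exist on the system
        "last_pid": int(last_pid),
    }


if __name__ == "__main__":
    with open("/proc/loadavg") as handle:
        print(parse_loadavg(handle.read()))
```

<p>Comparing <code>load_1m</code> against the number of CPU cores gives a rough sense of whether a load like the one above means saturation.</p>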

<ul>
<li>According to the server logs there is actually not much going on right now:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
118 18.195.78.144
128 207.46.13.219
129 167.114.64.100
159 207.46.13.129
179 207.46.13.33
188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
195 66.249.79.59
363 40.77.167.21
740 2a01:4f8:140:3192::2
4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
3 66.249.79.62
3 66.249.83.196
4 207.46.13.86
5 82.145.222.150
6 2a01:4f9:2b:1263::2
6 41.204.190.40
7 35.174.176.49
10 40.77.167.21
11 194.246.119.6
11 66.249.79.59
</code></pre>

<ul>
<li><code>45.5.184.72</code> is CIAT, who I already blocked and am waiting to hear from</li>
<li><code>2a01:4f8:140:3192::2</code> is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve</li>
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
7 dspaceCli
23 dspaceWeb
</code></pre>

<ul>
<li>It seems that the issue with CGSpace being “down” is actually because of CPU steal again!!!</li>
<li>I opened a ticket with support and asked them to migrate the VM to a less busy host</li>
</ul>

<h2 id="2019-04-08">2019-04-08</h2>

<ul>
<li>Start checking IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)

<ul>
<li>Lots of problems with affiliations, I had to correct about sixty of them</li>
<li>I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:</li>
</ul></li>
</ul>

<pre><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
</code></pre>

<ul>
<li>After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column

<ul>
<li>The matched values can be accessed with <code>cell.recon.match.name</code>, but some of the new values don’t appear, perhaps because I edited the original cell values?</li>
<li>I ended up using this GREL expression to copy all values to a new column:</li>
</ul></li>
</ul>

<pre><code>if(cell.recon.matched, cell.recon.match.name, value)
</code></pre>

<ul>
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre>

<ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
</ul>

<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
</code></pre>

<ul>
<li>Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):</li>
</ul>

<pre><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
</code></pre>

<ul>
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>

<pre><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre>
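
<p>The cleanup itself is a plain character substitution. A hedged Python sketch of it follows, as an alternative to doing the replacement interactively; the function names are hypothetical and it only handles the right single quotation mark (U+2019) that the queries above search for.</p>

```python
def normalize_apostrophes(value):
    """Replace the stylized apostrophe (U+2019) with a plain ASCII apostrophe."""
    return value.replace("\u2019", "'")


def corrections(values):
    """Return (original, corrected) pairs for the values that actually change."""
    pairs = []
    for value in values:
        fixed = normalize_apostrophes(value)
        if fixed != value:
            pairs.append((value, fixed))
    return pairs
```

<p>The resulting pairs are the kind of old and new values that a corrections CSV needs, with one row per value that changed.</p>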
|
||
|
||
<ul>
|
||
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
|
||
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
|
||
</code></pre>
|
||
|
||
<ul>
|
||
<li>UptimeRobot said that CGSpace (linode18) went down tonight
|
||
|
||
<ul>
|
||
<li>The load average is at <code>9.42, 8.87, 7.87</code></li>
|
||
<li>I looked at PostgreSQL and see shitloads of connections there:</li>
|
||
</ul></li>
|
||
</ul>
|
||
|
||
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
||
5 dspaceApi
|
||
7 dspaceCli
|
||
250 dspaceWeb
|
||
</code></pre>

<ul>
<li>On a related note I see connection pool errors in the DSpace log:</li>
</ul>

<pre><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre>
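
<p>The bracketed pool state at the end of that message is easy to pull out if I ever want to graph or alert on it. A small sketch, assuming the message format stays exactly as in the log line above (the function names are hypothetical):</p>

```python
import re

# Matches the "[size:250; busy:250; idle:0; lastwait:5000]" suffix.
POOL_STATE = re.compile(
    r"\[size:(\d+); busy:(\d+); idle:(\d+); lastwait:(\d+)\]"
)


def parse_pool_state(line):
    """Return the pool counters from a PoolExhaustedException line, or None."""
    match = POOL_STATE.search(line)
    if match is None:
        return None
    size, busy, idle, lastwait = (int(group) for group in match.groups())
    return {"size": size, "busy": busy, "idle": idle, "lastwait": lastwait}


def is_exhausted(state):
    """True when every connection in the pool is busy and none are idle."""
    return state["busy"] >= state["size"] and state["idle"] == 0
```

<p>For the line above this reports a pool of 250 connections with all 250 busy, which matches the 250 <code>dspaceWeb</code> connections seen in PostgreSQL.</p>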

<ul>
<li>But still I see 10 to 30% CPU steal in <code>iostat</code> that is also reflected in the Munin graphs:</li>
</ul>

<p><img src="/cgspace-notes/2019/04/cpu-week2.png" alt="CPU usage week" /></p>
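
<p>For reference, steal can also be computed directly from two samples of the aggregate <code>cpu</code> line in <code>/proc/stat</code>, where it is the eighth counter after the label. This is a minimal sketch with hypothetical function names, not how <code>iostat</code> or Munin are implemented:</p>

```python
def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of elapsed CPU time counted as steal between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Sum user, nice, system, idle, iowait, irq, softirq and steal only;
    # guest time is already included in user and nice.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    # Index 7 is the steal counter.
    return 100.0 * deltas[7] / total
```

<p>Reading the first line of <code>/proc/stat</code> twice, a few seconds apart, and passing both strings to <code>steal_percent()</code> would give a number comparable to the <code>%steal</code> column.</p>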

<ul>
<li>Linode Support still hasn’t responded to my ticket from yesterday, so I attached fresh output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
124 40.77.167.135
135 95.108.181.88
139 157.55.39.206
190 66.249.79.133
202 45.5.186.2
284 207.46.13.95
359 18.196.196.108
457 157.55.39.164
457 40.77.167.132
3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
5 129.0.79.206
5 41.205.240.21
7 207.46.13.95
7 66.249.79.133
7 66.249.79.135
7 95.108.181.88
8 40.77.167.111
19 157.55.39.164
20 40.77.167.132
370 51.254.16.223
</code></pre>
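
<p>The same "top clients in a time window" count could be done in Python, which is handier once I want to do more than print the list. This is a sketch under the same assumption as the <code>awk '{print $1}'</code> step, namely that the client IP is the first whitespace-separated field of each log line. The function names are hypothetical, and <code>read_lines()</code> only approximates <code>zcat --force</code> by looking at the file extension:</p>

```python
import gzip
import re
from collections import Counter


def read_lines(paths):
    """Yield lines from plain or gzip-compressed log files."""
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            yield from handle


def top_clients(lines, window_pattern, limit=10):
    """Count requests per client for lines matching window_pattern.

    Returns (ip, count) pairs with the busiest client first, which is the
    reverse of the order printed by "sort -n | tail".
    """
    window = re.compile(window_pattern)
    counts = Counter()
    for line in lines:
        if window.search(line):
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts.most_common(limit)
```

<p>For example, <code>top_clients(read_lines(paths), r"08/Apr/2019:(17|18|19)")</code> with <code>paths</code> set to the nginx log files would cover the same three hours as the commands above.</p>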

<h2 id="2019-04-09">2019-04-09</h2>

<ul>
<li>Linode sent an alert that CGSpace (linode18) was at 440% CPU usage for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
18 66.249.79.139
21 157.55.39.160
29 66.249.79.137
38 66.249.79.135
50 34.200.212.137
54 66.249.79.133
100 102.128.190.18
1166 45.5.184.72
4251 45.5.186.2
4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
200 144.48.242.108
202 207.46.13.185
206 18.194.46.84
239 66.249.79.139
246 18.196.196.108
274 31.6.77.23
289 66.249.79.137
312 157.55.39.160
441 66.249.79.135
856 66.249.79.133
</code></pre>

<ul>
<li><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</li>
</ul>

<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
</code></pre>

<ul>
<li>Database connection usage looks fine:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
7 dspaceCli
11 dspaceWeb
</code></pre>

<ul>
<li>Ironically, I still see 2 to 10% CPU steal in <code>iostat 1 10</code></li>
<li>Leroy from CIAT contacted me to say he knows the team that is making all those requests to CGSpace

<ul>
<li>I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items</li>
</ul></li>
<li>In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night</li>
</ul>
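
<p>For the record, enumerating a collection's items over the REST API can be sketched roughly like this. It assumes a DSpace 5-style REST API that serves <code>/collections/{id}/items</code> with <code>limit</code> and <code>offset</code> parameters and returns a JSON list, and that the API lives under <code>/rest</code> on CGSpace. The function names are hypothetical and the collection ID is deliberately left to the caller:</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the REST API; adjust if it is served elsewhere.
REST_BASE = "https://cgspace.cgiar.org/rest"


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def collection_items(collection_id, page_size=100, fetch=fetch_json):
    """Yield every item in a collection, one page of results at a time."""
    offset = 0
    while True:
        query = urlencode({"limit": page_size, "offset": offset})
        page = fetch(f"{REST_BASE}/collections/{collection_id}/items?{query}")
        if not page:
            break
        yield from page
        # A short page means there is nothing left to request.
        if len(page) < page_size:
            break
        offset += page_size
```

<p>Paging like this with a modest <code>page_size</code> keeps each request small, which should be much gentler on the server than thousands of individual requests.</p>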

<h2 id="2019-04-10">2019-04-10</h2>

<ul>
<li>Abenet pointed out the possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked</li>
</ul>

<pre><code>$ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
</code></pre>

<ul>
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF format</a></li>
<li>I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need someone to check them manually and make an informed decision…</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>

<pre><code>from habanero import Crossref

# Passing a contact address identifies the client to Crossref
cr = Crossref(mailto="me@cgiar.org")
# Search the funder registry for names matching "mercator"
x = cr.funders(query="mercator")
</code></pre>
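
<p>Building on that, a first pass at validating a funder name against the response could look like the sketch below. It assumes the funders response is a dictionary with the matches under <code>message</code> and <code>items</code>, each with an <code>id</code>, a <code>name</code> and a list of <code>alt-names</code>. The function names are hypothetical, and the comparison is only a case-insensitive exact match, so near misses would still need a human:</p>

```python
def funder_names(response):
    """Yield (id, name) pairs for every primary and alternative name."""
    for item in response.get("message", {}).get("items", []):
        funder_id = item.get("id")
        yield funder_id, item.get("name", "")
        for alt_name in item.get("alt-names", []):
            yield funder_id, alt_name


def match_funder(candidate, response):
    """Return the ID of the first funder whose name equals candidate.

    The comparison ignores case and surrounding whitespace; None is
    returned when nothing matches exactly.
    """
    wanted = candidate.strip().casefold()
    for funder_id, name in funder_names(response):
        if name.strip().casefold() == wanted:
            return funder_id
    return None
```

<p>With the snippet above, <code>match_funder(name, x)</code> would check one of our funder values against the result of the <code>mercator</code> query.</p>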

<!-- vim: set sw=2 ts=2: -->

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-04/">April, 2019</a></li>
<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>
<li><a href="/cgspace-notes/2019-02/">February, 2019</a></li>
<li><a href="/cgspace-notes/2019-01/">January, 2019</a></li>
<li><a href="/cgspace-notes/2018-12/">December, 2018</a></li>
</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>