mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-18 19:22:18 +01:00
434 lines
16 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en">
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
<meta property="og:title" content="April, 2019" />
|
|
<meta property="og:description" content="2019-04-01
|
|
|
|
|
|
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
|
|
|
|
|
|
They asked if we had plans to enable RDF support in CGSpace
|
|
|
|
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
|
|
|
|
|
|
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
|
|
|
|
|
|
|
|
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
|
|
4432 200
|
|
|
|
|
|
|
|
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
|
|
Apply country and region corrections and deletions on DSpace Test and CGSpace:
|
|
|
|
|
|
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
|
|
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
|
|
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00"/>
|
|
<meta property="article:modified_time" content="2019-04-06T11:47:45+03:00"/>
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="April, 2019"/>
|
|
<meta name="twitter:description" content="2019-04-01
|
|
|
|
|
|
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
|
|
|
|
|
|
They asked if we had plans to enable RDF support in CGSpace
|
|
|
|
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
|
|
|
|
|
|
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
|
|
|
|
|
|
|
|
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
|
|
4432 200
|
|
|
|
|
|
|
|
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
|
|
Apply country and region corrections and deletions on DSpace Test and CGSpace:
|
|
|
|
|
|
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
|
|
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.54.0" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "April, 2019",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2019-04/",
|
|
"wordCount": "1044",
|
|
"datePublished": "2019-04-01T09:00:43+03:00",
|
|
"dateModified": "2019-04-06T11:47:45+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-04/">
|
|
|
|
<title>April, 2019 | CGSpace Notes</title>
|
|
|
|
<!-- combined, minified CSS -->
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-04/">April, 2019</a></h2>
|
|
<p class="blog-post-meta"><time datetime="2019-04-01T09:00:43+03:00">Mon Apr 01, 2019</time> by Alan Orth in
|
|
|
|
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2019-04-01">2019-04-01</h2>
|
|
|
|
<ul>
|
|
<li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
|
|
|
|
<ul>
|
|
<li>They asked if we had plans to enable RDF support in CGSpace</li>
|
|
</ul></li>
|
|
<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
|
|
|
|
<ul>
|
|
<li>I suspected that some might not be successful, because the stats show fewer, but today they were all HTTP 200!</li>
|
|
</ul></li>
|
|
</ul>
|
|
|
|
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
|
|
4432 200
|
|
</code></pre>
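<p>The same tally can be sketched in Python. This is a minimal illustration, not the command that was actually run: the sample log lines, their paths, and the <code>status_counts</code> helper are all made up, and it assumes nginx's "combined" log format. Unlike the <code>grep -E</code> pattern above, it compares the client address exactly instead of treating the dots as regex wildcards.</p>

```python
from collections import Counter

# The three client addresses named in the shell pipeline above.
SUSPECT_IPS = {'18.196.196.108', '18.195.78.144', '18.195.218.6'}

# Made-up sample lines in nginx's "combined" log format (paths and sizes are illustrative).
SAMPLE_LINES = [
    '18.196.196.108 - - [01/Apr/2019:06:00:01 +0200] "GET /example/Spore-192-EN-web.pdf HTTP/1.1" 200 1024 "-" "agent"',
    '18.195.78.144 - - [01/Apr/2019:06:00:02 +0200] "GET /example/Spore-192-EN-web.pdf HTTP/1.1" 200 1024 "-" "agent"',
    '18.195.78.144 - - [01/Apr/2019:06:00:03 +0200] "GET /example/other.pdf HTTP/1.1" 200 1024 "-" "agent"',
    '203.0.113.7 - - [01/Apr/2019:06:00:04 +0200] "GET /example/Spore-192-EN-web.pdf HTTP/1.1" 404 0 "-" "agent"',
]


def status_counts(lines, ips, needle):
    """Count HTTP status codes for requests from `ips` whose line contains `needle`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Field 1 is the client address and field 9 the status code (awk's $1 and $9).
        if len(fields) >= 9 and fields[0] in ips and needle in line:
            counts[fields[8]] += 1
    return counts


print(status_counts(SAMPLE_LINES, SUSPECT_IPS, 'Spore-192-EN-web.pdf'))
```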
|
|
|
|
<ul>
|
|
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
|
|
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
|
|
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
|
|
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
|
|
</code></pre>
|
|
|
|
<h2 id="2019-04-02">2019-04-02</h2>
|
|
|
|
<ul>
|
|
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
|
|
<li>I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
|
|
|
|
<ul>
|
|
<li>If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!</li>
|
|
<li>I ended up finding him via searching for his email address</li>
|
|
</ul></li>
|
|
</ul>
|
|
|
|
<h2 id="2019-04-03">2019-04-03</h2>
|
|
|
|
<ul>
|
|
<li>Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
|
|
|
|
<ul>
|
|
<li>First I need to extract the ones that are unique from their list compared to our existing one:</li>
|
|
</ul></li>
|
|
</ul>
|
|
|
|
<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
|
|
</code></pre>
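<p>For reference, the extract-and-deduplicate step above could look roughly like this in Python. It reuses the same regular expression as the <code>grep -oE</code> call; the identifiers in the sample strings are invented placeholders, and <code>unique_orcids</code> is a hypothetical helper rather than part of any existing script.</p>

```python
import re

# Same pattern as the grep -oE call: four dash-separated groups of four characters.
ORCID_RE = re.compile(r'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}')


def unique_orcids(*texts):
    """Return the sorted set of ORCID-shaped identifiers found in all the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Invented placeholder data standing in for the vocabulary file and the emailed list.
existing = 'Example Author: 0000-0000-0000-0001\nOther Author: 0000-0000-0000-0002'
incoming = '0000-0000-0000-0002\n0000-0000-0000-000X'

combined = unique_orcids(existing, incoming)
print(len(combined), 'unique identifiers')
```

<p>Comparing <code>len(unique_orcids(existing))</code> with <code>len(combined)</code> gives the number of identifiers that are genuinely new.</p>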
|
|
|
|
<ul>
|
|
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
|
|
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
|
|
<li>One user’s name has changed, so I will update the affected metadata values using my <code>fix-metadata-values.py</code> script:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>I created a pull request and merged the changes to the 5_x-prod branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
|
|
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:</li>
|
|
</ul>
|
|
|
|
<pre><code>2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
|
|
1
|
|
3 http://localhost:8081/solr//statistics-2017
|
|
5662 http://localhost:8081/solr//statistics-2018
|
|
</code></pre>
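<p>A rough Python equivalent of that per-core count is sketched below. The first sample line is the log message quoted earlier; the other two are made up for illustration, and <code>updates_per_core</code> is a hypothetical helper. As with <code>awk '{print $11}'</code>, a matching line with fewer than eleven fields is counted under an empty key, which would explain the lone unlabelled count in the output above.</p>

```python
from collections import Counter

MARKER = 'org.dspace.statistics.SolrLogger @ Updating'

SAMPLE_LINES = [
    # Quoted from the DSpace log message shown earlier in this post.
    '2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018',
    # Made-up lines for illustration only.
    '2019-04-03 16:35:00,000 INFO org.dspace.statistics.SolrLogger @ Updating : 10/20 docs in http://localhost:8081/solr//statistics-2017',
    '2019-04-03 16:36:00,000 INFO org.dspace.statistics.SolrLogger @ Updating',
]


def updates_per_core(lines):
    """Count 'Updating' messages per Solr core URL (the 11th whitespace-separated field)."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        fields = line.split()
        # Lines that are too short get an empty key, like awk printing an empty $11.
        core = fields[10] if len(fields) > 10 else ''
        counts[core] += 1
    return counts


for core, count in sorted(updates_per_core(SAMPLE_LINES).items()):
    print(count, core)
```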
|
|
|
|
<ul>
|
|
<li>I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…</li>
|
|
</ul>
|
|
|
|
<h2 id="2019-04-05">2019-04-05</h2>
|
|
|
|
<ul>
|
|
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
|
|
<li>I see there are lots of PostgreSQL connections:</li>
|
|
</ul>
|
|
|
|
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
|
5 dspaceApi
|
|
10 dspaceCli
|
|
250 dspaceWeb
|
|
</code></pre>
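<p>The pool tally above can also be reproduced from any saved dump of <code>pg_stat_activity</code>. The sketch below is only illustrative: the sample rows are invented, <code>connections_per_pool</code> is a hypothetical helper, and it simply counts every occurrence of the three pool names, as <code>grep -o</code> does.</p>

```python
import re
from collections import Counter

# The three connection pool names matched by the grep above.
POOL_RE = re.compile(r'dspaceWeb|dspaceApi|dspaceCli')

# Invented rows standing in for the psql output.
SAMPLE_OUTPUT = (
    'dspace | dspaceWeb | idle\n'
    'dspace | dspaceWeb | active\n'
    'dspace | dspaceCli | idle\n'
)


def connections_per_pool(text):
    """Count how many times each pool name appears in the given text."""
    return Counter(POOL_RE.findall(text))


pools = connections_per_pool(SAMPLE_OUTPUT)
print(dict(pools), 'total:', sum(pools.values()))
```

<p>Applied to the real figures above, the total would be 5 + 10 + 250 = 265 connections.</p>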
|
|
|
|
<ul>
|
|
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
|
|
</ul>
|
|
|
|
<pre><code>2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
|
|
</ul>
|
|
|
|
<p><img src="/cgspace-notes/2019/04/cpu-week.png" alt="CPU usage week" /></p>
|
|
|
|
<ul>
|
|
<li>The other thing visible there is that over the past few days the load has spiked to 500%, and I don’t think it’s a coincidence that the Solr updating thing is happening…</li>
|
|
<li>I ran all system updates and rebooted the server
|
|
|
|
<ul>
|
|
<li>The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:</li>
|
|
</ul></li>
|
|
</ul>
|
|
|
|
<pre><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>I restarted it again and all the Solr cores came up properly…</li>
|
|
</ul>
|
|
|
|
<h2 id="2019-04-06">2019-04-06</h2>
|
|
|
|
<ul>
|
|
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278">10568/91278</a> didn’t have an Altmetric badge on CGSpace, but on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a> it does
|
|
|
|
<ul>
|
|
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
|
|
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
|
|
</ul></li>
|
|
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
|
|
</ul>
|
|
|
|
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
222 18.195.78.144
|
|
245 207.46.13.58
|
|
303 207.46.13.194
|
|
328 66.249.79.33
|
|
564 207.46.13.210
|
|
566 66.249.79.62
|
|
575 40.77.167.66
|
|
1803 66.249.79.59
|
|
2834 2a01:4f8:140:3192::2
|
|
9623 45.5.184.72
|
|
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
|
31 66.249.79.62
|
|
41 207.46.13.210
|
|
42 40.77.167.66
|
|
54 42.113.50.219
|
|
132 66.249.79.59
|
|
785 2001:41d0:d:1990::
|
|
1164 45.5.184.72
|
|
2014 50.116.102.77
|
|
4267 45.5.186.2
|
|
4893 205.186.128.185
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li><code>45.5.184.72</code> is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:</li>
|
|
</ul>
|
|
|
|
<pre><code>GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”</li>
|
|
<li>They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):</li>
|
|
</ul>
|
|
|
|
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
|
|
22077 /handle/10568/72970/discover
|
|
</code></pre>
|
|
|
|
<ul>
|
|
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
|
|
</ul>
|
|
|
|
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
|
|
43631 /handle/10568/72970/discover
|
|
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
|
|
142 200
|
|
43489 503
|
|
</code></pre>
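<p>To put "most of them" into numbers, the two status counts above can be turned into a blocked share with a few lines of Python. The figures are copied from the output above; the variable names are only illustrative.</p>

```python
# Status code counts copied from the output above.
status_counts = {'200': 142, '503': 43489}

total = sum(status_counts.values())
blocked_share = status_counts['503'] / total

# The total matches the 43631 Discover requests counted in the first command.
print(total, 'requests,', round(blocked_share * 100, 2), 'percent answered with HTTP 503')
```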
|
|
|
|
<ul>
|
|
<li>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</li>
|
|
<li>Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports
|
|
|
|
<ul>
|
|
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
|
|
</ul></li>
|
|
</ul>
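<p>As a starting point for the CIAT conversation about using the REST API, here is a hedged sketch of how a harvester could page through a collection instead of crawling Discover. It only builds URLs and makes no requests. It assumes the DSpace 5 REST endpoint <code>/collections/{id}/items</code> with its <code>expand</code>, <code>limit</code>, and <code>offset</code> parameters; the base URL, the collection ID, the item total, and the <code>collection_item_urls</code> helper are hypothetical placeholders.</p>

```python
from urllib.parse import urlencode


def collection_item_urls(base_url, collection_id, total_items, page_size=100):
    """Build one REST URL per page of items in a collection."""
    urls = []
    for offset in range(0, total_items, page_size):
        query = urlencode({'expand': 'metadata', 'limit': page_size, 'offset': offset})
        urls.append(f'{base_url}/collections/{collection_id}/items?{query}')
    return urls


# Hypothetical values: the real base URL, collection ID, and item count would differ.
for url in collection_item_urls('https://example.org/rest', 42, 250):
    print(url)
```

<p>Three paged requests like these would replace the many individual Discover hits for a collection of that size.</p>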
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2019-04/">April, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-02/">February, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-01/">January, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2018-12/">December, 2018</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p>
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|