<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="October, 2018" />
<meta property="og:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 933 40." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-10/" /><meta property="article:published_time" content="2018-10-01T22:31:54+03:00"/>
<meta property="article:modified_time" content="2018-10-04T11:05:59+03:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2018"/>
<meta name="twitter:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 933 40."/>
<meta name="generator" content="Hugo 0.49" />

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-10/",
"wordCount": "809",
"datePublished": "2018-10-01T22:31:54+03:00",
"dateModified": "2018-10-04T11:05:59+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-10/">

<title>October, 2018 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM+EQpLBzO9U+9fMVmWEt+TTlGrWQ" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-10/">October, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-10-01T22:31:54+03:00">Mon Oct 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>

<h2 id="2018-10-01">2018-10-01</h2>

<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li>
</ul>

<h2 id="2018-10-03">2018-10-03</h2>

<ul>
<li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
1454 157.55.39.54
1538 207.46.13.69
1719 66.249.64.61
2048 50.116.102.77
4639 66.249.64.59
4736 35.237.175.180
150362 34.218.226.147
</code></pre>

<ul>
<li>Of the 150,362 requests from <code>34.218.226.147</code>, about 21% were HTTP 500 responses (!):</li>
</ul>

<pre><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
118927 200
31435 500
</code></pre>
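
<p>The two shell pipelines above can be approximated in Python. This is only a minimal sketch: the sample log lines are made up, and it assumes nginx's default "combined" log format, where the client IP is the first whitespace-separated field and the status code is the ninth (the same fields the <code>awk</code> commands print).</p>

```python
from collections import Counter

# Hypothetical sample lines in nginx "combined" format (not real CGSpace logs)
SAMPLE_LOG = """\
34.218.226.147 - - [02/Oct/2018:10:00:01 +0200] "GET /rest/items/1 HTTP/1.1" 200 512 "-" "curl/7.58.0"
34.218.226.147 - - [02/Oct/2018:10:00:02 +0200] "GET /rest/items/2 HTTP/1.1" 500 0 "-" "curl/7.58.0"
66.249.64.59 - - [02/Oct/2018:10:00:03 +0200] "GET /handle/10568/1 HTTP/1.1" 200 1024 "-" "Googlebot"
34.218.226.147 - - [03/Oct/2018:00:00:01 +0200] "GET /rest/items/3 HTTP/1.1" 200 512 "-" "curl/7.58.0"
"""


def tally(lines, day):
    """Count requests per client IP, and per (IP, status), for one day."""
    per_ip = Counter()
    per_ip_status = Counter()
    for line in lines:
        if day not in line:  # same idea as: grep "02/Oct/2018"
            continue
        fields = line.split()
        if len(fields) < 9:  # skip lines that are too short to parse
            continue
        ip, status = fields[0], fields[8]  # awk's $1 and $9
        per_ip[ip] += 1
        per_ip_status[(ip, status)] += 1
    return per_ip, per_ip_status


per_ip, per_ip_status = tally(SAMPLE_LOG.splitlines(), "02/Oct/2018")
top_ip, total = per_ip.most_common(1)[0]
errors = per_ip_status[(top_ip, "500")]
print(f"{top_ip}: {total} requests, {100 * errors / total:.1f}% HTTP 500")

# The same arithmetic applied to the real counts shown above
real_share = round(100 * 31435 / (118927 + 31435), 1)
print(f"Share of HTTP 500 responses on 2018-10-02: {real_share}%")
```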

<ul>
<li>I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>

<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
</code></pre>
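
<p>For reference, the extraction step (the <code>grep -oE … | sort | uniq</code> part) could be sketched in Python as follows. The regular expression is the one from the command above; the XML snippet is a hypothetical stand-in, since the real layout of <code>cg-creator-id.xml</code> is not shown here.</p>

```python
import re

# Same pattern as the grep command: four dash-separated groups of four characters
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")

# Hypothetical vocabulary fragment; the element and attribute names are assumptions
SAMPLE_XML = """\
<node id="Philip Thornton: 0000-0002-1854-0182" label="Philip Thornton: 0000-0002-1854-0182"/>
<node id="Sonal Henson: 0000-0002-2002-5462" label="Sonal Henson: 0000-0002-2002-5462"/>
"""


def extract_orcids(text):
    """Return the unique ORCID iDs found in text, sorted (like sort | uniq)."""
    return sorted(set(ORCID_PATTERN.findall(text)))


orcids = extract_orcids(SAMPLE_XML)
print("\n".join(orcids))
```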

<ul>
<li>I found a new corner case that I need to check, where both the given <em>and</em> family names are deactivated:</li>
</ul>

<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre>

<ul>
<li>It appears to be Jim Lorenzen… I need to check that later!</li>
<li>I merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/390">#390</a>)</li>
<li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li>
<li>It seems that Moayad is making quite a lot of requests today:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
4228 35.237.175.180
4497 70.32.83.92
4856 66.249.64.59
7120 50.116.102.77
12518 138.201.49.199
87646 34.218.226.147
111729 213.139.53.62
</code></pre>

<ul>
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it’s MUCH faster than using Atmire CUA’s internal “restlet” API</li>
<li>I don’t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
</ul>

<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
</code></pre>

<ul>
<li>Suspiciously, it’s almost exclusively grabbing the CGIAR System Office community (handle prefix 10947):</li>
</ul>

<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
</code></pre>
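
<p>A rough Python equivalent of the handle-prefix breakdown is sketched below, using made-up log lines. One difference worth noting: it only matches the IP at the start of the line, whereas the <code>grep</code> above matches it anywhere in the line and treats the dots as wildcards.</p>

```python
import re
from collections import Counter

# Same pattern as the second grep, with the five-digit prefix captured
HANDLE_PREFIX = re.compile(r"GET /handle/([0-9]{5})")

# Hypothetical sample lines (not real CGSpace logs)
SAMPLE_LOG = """\
138.201.49.199 - - [03/Oct/2018:12:00:01 +0200] "GET /handle/10947/100 HTTP/1.1" 200 900 "-" "Mozilla/5.0"
138.201.49.199 - - [03/Oct/2018:12:00:02 +0200] "GET /handle/10947/101 HTTP/1.1" 200 900 "-" "Mozilla/5.0"
138.201.49.199 - - [03/Oct/2018:12:00:03 +0200] "GET /handle/10568/5 HTTP/1.1" 200 900 "-" "Mozilla/5.0"
138.201.49.199 - - [03/Oct/2018:12:00:04 +0200] "GET /bitstream/handle/10947/100/a.pdf HTTP/1.1" 200 900 "-" "Mozilla/5.0"
66.249.64.59 - - [03/Oct/2018:12:00:05 +0200] "GET /handle/10568/6 HTTP/1.1" 200 900 "-" "Googlebot"
"""


def handle_prefix_counts(lines, ip):
    """Count 'GET /handle/NNNNN' requests per handle prefix for one client IP."""
    counts = Counter()
    for line in lines:
        if not line.startswith(ip + " "):
            continue
        match = HANDLE_PREFIX.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = handle_prefix_counts(SAMPLE_LOG.splitlines(), "138.201.49.199")
for prefix, hits in sorted(counts.items()):
    print(hits, prefix)
```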

<ul>
<li>The user agent is suspicious too:</li>
</ul>

<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre>

<ul>
<li>It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list</li>
<li>I looked in Solr’s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)… hmmm</li>
<li>I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>

<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li>
</ul>

<pre><code>dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
</code></pre>
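
<p>To illustrate how a file like this maps author name variants to a single ORCID label, here is a small sketch using Python's <code>csv</code> module. It only shows how the CSV could be read into a lookup table; it does not reproduce what the actual script does against the database, and the function name is hypothetical.</p>

```python
import csv
import io

# A subset of the rows shown above
CSV_TEXT = '''dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
'''


def load_author_orcids(handle):
    """Map each author name variant to its 'Name: ORCID iD' label."""
    reader = csv.DictReader(handle)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


mapping = load_author_orcids(io.StringIO(CSV_TEXT))
for author, label in mapping.items():
    print(f"{author!r} -> {label}")
```

<p>The quoting matters here: the author names contain commas, so they have to be wrapped in double quotes for the file to parse as two columns.</p>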

<h2 id="2018-10-04">2018-10-04</h2>

<ul>
<li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li>
<li>Every item has at least a <code>LICENSE</code> bundle, and some have a <code>THUMBNAIL</code> bundle, but the indexing code is specifically checking for downloads from the <code>ORIGINAL</code> bundle

<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/97460">10568/97460</a> (100550): has a thumbnail bitstream</li>
<li><a href="https://cgspace.cgiar.org/handle/10568/96112">10568/96112</a> (96736): has only a LICENSE bitstream</li>
</ul></li>
<li>I see there are other bundles we might need to pay attention to: <code>TEXT</code>, <code>@_LOGO-COLLECTION_@</code>, <code>@_LOGO-COMMUNITY_@</code>, etc…</li>
<li>On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads</li>
<li>So it’s fixed, but I’m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li>
</ul>

<pre><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
</code></pre>

<ul>
<li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li>
<li>I tagged version 0.4.2 of the tool and redeployed it on CGSpace</li>
</ul>
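
<p>The distinction between views and downloads discussed above can be sketched as follows. This is not the code from <code>indexer.py</code>; it is an illustration under stated assumptions: that each statistics record carries a <code>type</code> field (2 for items, 0 for bitstreams), a <code>bundleName</code> list, and an <code>owningItem</code> ID. The sample records are invented, apart from re-using the two item IDs mentioned above.</p>

```python
from collections import Counter

# Assumed DSpace object type constants
BITSTREAM, ITEM = 0, 2

# Hypothetical statistics records (field names are assumptions)
RECORDS = [
    {"type": ITEM, "id": 100550, "isBot": False},
    {"type": BITSTREAM, "owningItem": 100550, "bundleName": ["THUMBNAIL"], "isBot": False},
    {"type": BITSTREAM, "owningItem": 96736, "bundleName": ["LICENSE"], "isBot": False},
    {"type": BITSTREAM, "owningItem": 1, "bundleName": ["ORIGINAL"], "isBot": False},
    {"type": BITSTREAM, "owningItem": 1, "bundleName": ["ORIGINAL"], "isBot": True},
]


def tally_views_and_downloads(records):
    """Count item views and ORIGINAL-bundle downloads separately, skipping bots."""
    views, downloads = Counter(), Counter()
    for record in records:
        if record.get("isBot"):
            continue
        if record["type"] == ITEM:
            views[record["id"]] += 1
        elif record["type"] == BITSTREAM and "ORIGINAL" in record.get("bundleName", []):
            downloads[record["owningItem"]] += 1
    return views, downloads


views, downloads = tally_views_and_downloads(RECORDS)
print("views:", dict(views))
print("downloads:", dict(downloads))
```

<p>Keeping the two counters separate makes it harder for an item view to end up in the downloads tally, and thumbnail or license hits never count as downloads.</p>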

<h2 id="2018-10-05">2018-10-05</h2>

<ul>
<li>Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay’s work plan</li>
<li>We agreed that Sisay would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms</li>
<li>Add a link to <a href="https://cgspace.cgiar.org/explorer/">AReS explorer</a> to the CGSpace homepage introduction text</li>
</ul>

<h2 id="2018-10-06">2018-10-06</h2>

<ul>
<li>Follow up with AgriKnowledge about including Handle links (<code>dc.identifier.uri</code>) on their item pages</li>
<li>In July, 2018 they had said their programmers would include the field in the next update of their website software</li>
<li><a href="https://repository.cimmyt.org/">CIMMYT’s DSpace repository</a> is now running DSpace 5.x!</li>
<li>It’s running OAI, but not REST, so I need to talk to Richard about that!</li>
</ul>

<!-- vim: set sw=2 ts=2: -->

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-10/">October, 2018</a></li>
<li><a href="/cgspace-notes/2018-09/">September, 2018</a></li>
<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>
<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>
<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>
</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>