cgspace-notes/docs/2018-10/index.html

742 lines
36 KiB
HTML
Raw Normal View History

2018-10-01 21:33:15 +02:00
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2018" />
2018-10-03 10:52:48 +02:00
<meta property="og:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk &#39;{print $1} &#39; | sort | uniq -c | sort -n | tail -n 10 933 40." />
2018-10-01 21:33:15 +02:00
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-10/" /><meta property="article:published_time" content="2018-10-01T22:31:54&#43;03:00"/>
2018-10-22 11:33:49 +02:00
<meta property="article:modified_time" content="2018-10-21T12:22:33&#43;03:00"/>
2018-10-01 21:33:15 +02:00
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2018"/>
2018-10-03 10:52:48 +02:00
<meta name="twitter:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk &#39;{print $1} &#39; | sort | uniq -c | sort -n | tail -n 10 933 40."/>
2018-10-13 20:17:20 +02:00
<meta name="generator" content="Hugo 0.49.2" />
2018-10-01 21:33:15 +02:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-10/",
2018-10-22 11:33:49 +02:00
"wordCount": "3644",
2018-10-01 21:33:15 +02:00
"datePublished": "2018-10-01T22:31:54&#43;03:00",
2018-10-22 11:33:49 +02:00
"dateModified": "2018-10-21T12:22:33&#43;03:00",
2018-10-01 21:33:15 +02:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-10/">
<title>October, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM&#43;EQpLBzO9U&#43;9fMVmWEt&#43;TTlGrWQ" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-10/">October, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-10-01T22:31:54&#43;03:00">Mon Oct 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-10-01">2018-10-01</h2>
<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I&rsquo;m super busy in Nairobi right now</li>
</ul>
2018-10-03 10:52:48 +02:00
<h2 id="2018-10-03">2018-10-03</h2>
<ul>
<li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1}
' | sort | uniq -c | sort -n | tail -n 10
933 40.77.167.90
971 95.108.181.88
1043 41.204.190.40
1454 157.55.39.54
1538 207.46.13.69
1719 66.249.64.61
2048 50.116.102.77
4639 66.249.64.59
4736 35.237.175.180
150362 34.218.226.147
</code></pre>
<ul>
<li>Of those, about 20% were HTTP 500 responses (!):</li>
</ul>
<pre><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
118927 200
31435 500
</code></pre>
2018-10-03 16:54:58 +02:00
<ul>
<li>I added Phil Thornton and Sonal Henson&rsquo;s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
</ul>
<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq &gt; 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
</code></pre>
<ul>
<li>I found a new corner case error that I need to check, given <em>and</em> family names deactivated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre>
<ul>
<li>It appears to be Jim Lorenzen&hellip; I need to check that later!</li>
<li>I merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/390">#390</a>)</li>
2018-10-03 20:52:12 +02:00
<li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li>
<li>It seems that Moayad is making quite a lot of requests today:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;03/Oct/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1594 157.55.39.160
1627 157.55.39.173
1774 136.243.6.84
4228 35.237.175.180
4497 70.32.83.92
4856 66.249.64.59
7120 50.116.102.77
12518 138.201.49.199
87646 34.218.226.147
111729 213.139.53.62
</code></pre>
<ul>
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it&rsquo;s MUCH faster than using Atmire CUA&rsquo;s internal &ldquo;restlet&rdquo; API</li>
<li>I don&rsquo;t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
8324 GET /bitstream
4193 GET /handle
</code></pre>
<ul>
<li>Suspiciously, it&rsquo;s only grabbing the CGIAR System Office community (handle prefix 10947):</li>
</ul>
<pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
7 GET /handle/10568
4186 GET /handle/10947
</code></pre>
<ul>
<li>The user agent is suspicious too:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre>
<ul>
<li>It&rsquo;s clearly a bot and it&rsquo;s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list</li>
<li>I looked in Solr&rsquo;s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)&hellip; hmmm</li>
2018-10-03 21:35:11 +02:00
<li>I tagged all of Sonal and Phil&rsquo;s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li>
2018-10-03 16:54:58 +02:00
</ul>
2018-10-03 21:35:11 +02:00
<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
&quot;Henson, Sonal P.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Henson, S.&quot;,Sonal Henson: 0000-0002-2002-5462
&quot;Thornton, P.K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phil&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Philip K.&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip&quot;,Philip Thornton: 0000-0002-1854-0182
&quot;Thornton, Phillip K.&quot;,Philip Thornton: 0000-0002-1854-0182
</code></pre>
2018-10-04 10:05:59 +02:00
<h2 id="2018-10-04">2018-10-04</h2>
<ul>
<li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li>
<li>Every item has at least a <code>LICENSE</code> bundle, and some have a <code>THUMBNAIL</code> bundle, but the indexing code is specifically checking for downloads from the <code>ORIGINAL</code> bundle
<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/97460"><sup>10568</sup>&frasl;<sub>97460</sub></a> (100550): has a thumbnail bitstream</li>
<li><a href="https://cgspace.cgiar.org/handle/10568/96112"><sup>10568</sup>&frasl;<sub>96112</sub></a> (96736): has only a LICENSE bitstream</li>
</ul></li>
<li>I see there are other bundles we might need to pay attention to: <code>TEXT</code>, <code>@_LOGO-COLLECTION_@</code>, <code>@_LOGO-COMMUNITY_@</code>, etc&hellip;</li>
<li>On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads</li>
<li>So it&rsquo;s fixed, but I&rsquo;m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (exluding statlet requests):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
</code></pre>
2018-10-06 11:30:07 +02:00
<ul>
<li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li>
<li>I tagged version 0.4.2 of the tool and redeployed it on CGSpace</li>
</ul>
<h2 id="2018-10-05">2018-10-05</h2>
<ul>
<li>Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay&rsquo;s work plan</li>
<li>We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms</li>
<li>Add a link to <a href="https://cgspace.cgiar.org/explorer/">AReS explorer</a> to the CGSpace homepage introduction text</li>
</ul>
<h2 id="2018-10-06">2018-10-06</h2>
<ul>
<li>Follow up with AgriKnowledge about including Handle links (<code>dc.identifier.uri</code>) on their item pages</li>
<li>In July, 2018 they had said their programmers would include the field in the next update of their website software</li>
<li><a href="https://repository.cimmyt.org/">CIMMYT&rsquo;s DSpace repository</a> is now running DSpace 5.x!</li>
<li>It&rsquo;s running OAI, but not REST, so I need to talk to Richard about that!</li>
</ul>
2018-10-10 12:33:00 +02:00
<h2 id="2018-10-08">2018-10-08</h2>
<ul>
<li>AgriKnowledge says they&rsquo;re going to add the <code>dc.identifier.uri</code> to their item view in November when they update their website software</li>
</ul>
<h2 id="2018-10-10">2018-10-10</h2>
<ul>
<li>Peter noticed that some recently added PDFs don&rsquo;t have thumbnails</li>
<li>When I tried to force them to be generated I got an error that I&rsquo;ve never seen before:</li>
</ul>
<pre><code>$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
</code></pre>
<ul>
<li>I see there was an update to Ubuntu&rsquo;s ImageMagick on 2018-10-05, so maybe something changed or broke?</li>
<li>I get the same error when forcing <code>filter-media</code> to run on DSpace Test too, so it&rsquo;s gotta be an ImageMagic bug</li>
<li>The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an <a href="https://usn.ubuntu.com/3785-1/">Ubuntu Security Notice from 2018-10-04</a></li>
<li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick acount!)</li>
<li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li>
</ul>
<pre><code> &lt;!--&lt;policy domain=&quot;coder&quot; rights=&quot;none&quot; pattern=&quot;PDF&quot; /&gt;--&gt;
</code></pre>
<ul>
<li>This works, but I&rsquo;m not sure what ImageMagick&rsquo;s long-term plan is if they are going to disable ALL image formats&hellip;</li>
<li>I suppose I need to enable a workaround for this in Ansible?</li>
</ul>
2018-10-11 08:32:54 +02:00
<h2 id="2018-10-11">2018-10-11</h2>
<ul>
<li>I emailed DuraSpace to update <a href="https://duraspace.org/registry/entry/4188/?gvid=178">our entry in their DSpace registry</a> (the data was still on DSpace 3, JSPUI, etc)</li>
<li>Generate a list of the top 1500 values for <code>dc.subject</code> so Sisay can start making a controlled vocabulary for it:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre>
2018-10-11 10:17:07 +02:00
<ul>
<li>Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!</li>
<li>Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format &ldquo;handle:<sup>10568</sup>&frasl;<sub>80775</sub>&rdquo; because I noticed that the <a href="https://landportal.org/library/resources/handle1056880775/unlocking-farming-potential-bangladesh%E2%80%99-polders">Land Portal does this</a></li>
<li>Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <code>&lt;meta&gt;</code> tags in their page header, and using &ldquo;dct:identifier&rdquo; property instead of &ldquo;dc:identifier&rdquo;</li>
<li>I re-created my local DSpace databse container using <a href="https://github.com/containers/libpod">podman</a> instead of Docker:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre>
<ul>
<li>I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository</li>
<li>I can pull the <code>docker.bintray.io/jfrog/artifactory-oss:latest</code> image, but not start it</li>
<li>I decided to use a Sonatype Nexus repository instead:</li>
</ul>
<pre><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
</code></pre>
<ul>
<li>With a few changes to my local Maven <code>settings.xml</code> it is working well</li>
<li>Generate a list of the top 10,000 authors for Peter Ballantyne to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
</code></pre>
2018-10-11 13:25:13 +02:00
<ul>
<li>CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections</li>
<li>I decided to constrain the max height of these to 200px using CSS (<a href="https://github.com/ilri/DSpace/pull/392">#392</a>)</li>
</ul>
2018-10-13 20:17:20 +02:00
<h2 id="2018-10-13">2018-10-13</h2>
<ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Look through Peter&rsquo;s list of 746 author corrections in OpenRefine</li>
<li>I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/)),
isNotNull(value.match(/.*\u2019.*/)),
isNotNull(value.match(/.*\u00b4.*/))
)
</code></pre>
<ul>
<li>Then I exported and applied them on my local test server:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
</code></pre>
<ul>
<li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay&rsquo;s author controlled vocabulary</li>
</ul>
2018-10-14 08:48:26 +02:00
<h2 id="2018-10-14">2018-10-14</h2>
<ul>
<li>Merge the authors controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/393">#393</a>), usage rights (<a href="https://github.com/ilri/DSpace/pull/394">#394</a>), and the upstream DSpace 5.x cherry-picks (<a href="https://github.com/ilri/DSpace/pull/395">#394</a>) into our <code>5_x-prod</code> branch</li>
<li>Switch to new CGIAR LDAP server on CGSpace, as it&rsquo;s been running (at least for authentication) on DSpace Test for the last few weeks, and I think they old one will be deprecated soon (today?)</li>
<li>Apply Peter&rsquo;s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>Run all system updates on CGSpace (linode19) and reboot the server</li>
2018-10-14 13:56:58 +02:00
<li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li>
<li>Restarting the service with systemd works for a few seconds, then the java process quits</li>
<li>I suspect that the systemd service type needs to be <code>forking</code> rather than <code>simple</code>, because the service calls the default DSpace <code>start-handle-server</code> shell script, which uses <code>nohup</code> and <code>&amp;</code> to background the java process</li>
<li>It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting</li>
<li>Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page&rsquo;s header, rather than the HTML properties they are using in their body</li>
<li>Peter pointed out that some thumbnails were still not getting generated
<ul>
<li>When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week&hellip; WTF?</li>
<li>Looks like I can use <code>/usr/share/ghostscript/current</code> instead of <code>/usr/share/ghostscript/9.25</code>&hellip;</li>
</ul></li>
<li>I limited the tall thumbnails even further to 170px because Peter said CTA&rsquo;s were still too tall at 200px (<a href="https://github.com/ilri/DSpace/pull/396">#396</a>)</li>
2018-10-14 08:48:26 +02:00
</ul>
2018-10-15 07:14:22 +02:00
<h2 id="2018-10-15">2018-10-15</h2>
<ul>
<li>Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications</li>
<li>I don&rsquo;t see anything in the Catalina logs or <code>dmesg</code>, and the Tomcat manager shows XMLUI, REST, OAI, etc all &ldquo;Running: false&rdquo;</li>
<li>Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing</li>
<li>I fixed it so that I could reindex, but I guess the rest of DSpace actually didn&rsquo;t start up&hellip;</li>
2018-10-15 15:03:02 +02:00
<li>Create an account on DSpace Test for Felix from Earlham so he can test COPO submission
<ul>
<li>I created a new collection and added him as the administrator so he can test submission</li>
2018-10-15 16:26:03 +02:00
<li>He said he actually wants to test creation of communities, collections, etc, so I had to make him a super admin for now</li>
<li>I told him we need to think about the workflow more seriously in the future</li>
2018-10-15 15:03:02 +02:00
</ul></li>
2018-10-15 16:26:03 +02:00
<li>I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:</li>
2018-10-15 07:14:22 +02:00
</ul>
2018-10-15 16:26:03 +02:00
<pre><code>$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
</code></pre>
2018-10-16 16:26:18 +02:00
<h2 id="2018-10-16">2018-10-16</h2>
<ul>
<li>Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
</code></pre>
2018-10-16 23:33:01 +02:00
<ul>
<li>Talking to the CodeObia guys about the REST API I started to wonder why it&rsquo;s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it</li>
<li>Interestingly, the speed doesn&rsquo;t get better after you request the same thing multiple timesit&rsquo;s consistently bad on both CGSpace and DSpace Test!</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
0.27s user 0.06s system 1% cpu 27.858 total
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
0.23s user 0.04s system 1% cpu 16.460 total
0.24s user 0.04s system 1% cpu 21.043 total
0.22s user 0.04s system 1% cpu 17.132 total
</code></pre>
<ul>
<li>I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)</li>
<li>I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?</li>
<li>I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
0.24s user 0.02s system 1% cpu 22.496 total
0.22s user 0.03s system 1% cpu 22.720 total
0.23s user 0.03s system 1% cpu 22.632 total
</code></pre>
<ul>
<li>If I make a request without the expands it is ten time faster:</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
0.21s user 0.05s system 9% cpu 2.787 total
0.23s user 0.02s system 8% cpu 2.896 total
</code></pre>
<ul>
<li>I sent a mail to dspace-tech to ask how to profile this&hellip;</li>
</ul>
2018-10-17 13:57:11 +02:00
<h2 id="2018-10-17">2018-10-17</h2>
<ul>
<li>I decided to update most of the existing metadata values that we have in <code>dc.rights</code> on CGSpace to be machine readable in SPDX format (with Creative Commons version if it was included)</li>
<li>Most of the are from Bioversity, and I asked Maria for permission before updating them</li>
<li>I manually went through and looked at the existing values and updated them in several batches:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%/by/%' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-2.5' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE
'%/by-nc%' AND text_value LIKE '%2.5%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%/by-nc%' AND text_value LIKE '%4.0%';
UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%zero%';
UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value LIKE '%4.0%' AND text_value LIKE '%Attribution-NonCommercial-ShareAlike%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-NC-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution-NonCommercial %';
UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value NOT LIKE '%zero%' AND text_value LIKE '%Attribution %';
UPDATE metadatavalue SET text_value='CC-BY-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78184;
UPDATE metadatavalue SET text_value='CC-BY' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value NOT LIKE '%zero%' AND text_value NOT LIKE '%CC0%' AND text_value LIKE '%Attribution %' AND text_value NOT LIKE '%CC-%';
UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND resource_id=78564;
</code></pre>
<ul>
2018-10-17 18:56:31 +02:00
<li>I updated the fields on CGSpace and then started a re-index of Discovery</li>
2018-10-17 13:57:11 +02:00
<li>We also need to re-think the <code>dc.rights</code> field in the submission form: we should probably use a popup controlled vocabulary and list the Creative Commons values with version numbers and allow the user to enter their own (like the ORCID identifier field)</li>
<li>Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server</li>
<li>IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script, and regenerated the controlled vocabulary:</li>
</ul>
<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt;
2018-10-17-orcids.txt
$ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre>
<ul>
<li>I also decided to add the ORCID identifiers that MEL had sent us a few months ago&hellip;</li>
<li>One problem I had with the <code>resolve-orcids.py</code> script is that one user seems to have disabled their profile data since we last updated:</li>
</ul>
<pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
</code></pre>
<ul>
<li>So I need to handle that situation in the script for sure, but I&rsquo;m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?</li>
2018-10-17 18:56:31 +02:00
<li>I made a pull request and merged the ORCID updates into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/397">#397</a>)</li>
2018-10-17 22:40:35 +02:00
<li>Improve the logic of name checking in my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script</li>
2018-10-17 13:57:11 +02:00
</ul>
2018-10-18 22:57:22 +02:00
<h2 id="2018-10-18">2018-10-18</h2>
<ul>
<li>I granted MEL&rsquo;s deposit user admin access to IITA, CIP, Bioversity, and RTB communities on DSpace Test so they can start testing real depositing</li>
<li>After they do some tests and we check the values Enrico will send a formal email to Peter et al to ask that they start depositing officially</li>
<li>I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually <a href="https://wiki.postgresql.org/wiki/Using_pg_upgrade_on_Ubuntu/Debian">migrate from 9.5 to 9.6</a>:</li>
</ul>
<pre><code># su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
$ exit
# systemctl start postgresql
# dpkg -r postgresql-9.5 postgresql-client-9.5 postgresql-contrib-9.5
</code></pre>
2018-10-20 17:17:59 +02:00
<h2 id="2018-10-19">2018-10-19</h2>
<ul>
<li>Help Francesca from Bioversity generate a report about items they uploaded in 2015 through 2018</li>
<li>Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon</li>
<li>Looking at the nginx logs around that time I see the following IPs making the most requests:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;19/Oct/2018:(12|13|14|15)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
361 207.46.13.179
395 181.115.248.74
485 66.249.64.93
535 157.55.39.213
536 157.55.39.99
551 34.218.226.147
580 157.55.39.173
1516 35.237.175.180
1629 66.249.64.91
1758 5.9.6.51
</code></pre>
<ul>
<li>5.9.6.51 is MegaIndex, which I&rsquo;ve seen before&hellip;</li>
</ul>
<h2 id="2018-10-20">2018-10-20</h2>
<ul>
<li>I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace&rsquo;s Solr configuration is for 4.9</li>
<li>This means our existing Solr configuration doesn&rsquo;t run in Solr 5.5:</li>
</ul>
<pre><code>$ sudo docker pull solr:5
$ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
$ sudo docker logs my_solr
...
ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics] Caused by: solr.IntField
</code></pre>
<ul>
<li>Apparently a bunch of variable types were removed in <a href="https://issues.apache.org/jira/browse/SOLR-5936">Solr 5</a></li>
<li>So for now it&rsquo;s actually a huge pain in the ass to run the tests for my dspace-statistics-api</li>
2018-10-21 07:06:40 +02:00
<li>Linode sent a message that the CPU usage was high on CGSpace (linode18) last night</li>
<li>According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;20/Oct/2018:(14|15|16)&quot; | awk '{print $1}' | sort
| uniq -c | sort -n | tail -n 10
249 207.46.13.179
250 157.55.39.173
301 54.166.207.223
303 157.55.39.213
310 66.249.64.95
362 34.218.226.147
381 66.249.64.93
415 35.237.175.180
1205 66.249.64.91
1227 5.9.6.51
</code></pre>
<ul>
<li>This bot is only using the XMLUI and it does <em>not</em> seem to be re-using its sessions:</li>
</ul>
<pre><code># grep -c 5.9.6.51 /var/log/nginx/*.log
/var/log/nginx/access.log:9323
/var/log/nginx/error.log:0
/var/log/nginx/library-access.log:0
/var/log/nginx/oai.log:0
/var/log/nginx/rest.log:0
/var/log/nginx/statistics.log:0
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-10-20 | sort | uniq
8915
</code></pre>
<ul>
<li>Last month I added &ldquo;crawl&rdquo; to the Tomcat Crawler Session Manager Valve&rsquo;s regular expression matching, and it seems to be working for MegaIndex&rsquo;s user agent:</li>
2018-10-20 17:17:59 +02:00
</ul>
2018-10-21 07:06:40 +02:00
<pre><code>$ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'&quot;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&quot;'
</code></pre>
<ul>
<li>So I&rsquo;m not sure why this bot uses so many sessionsis it because it requests very slowly?</li>
</ul>
<h2 id="2018-10-21">2018-10-21</h2>
2018-10-21 11:22:33 +02:00
<ul>
<li>Discuss AfricaRice joining CGSpace</li>
</ul>
2018-10-22 11:33:49 +02:00
<h2 id="2018-10-22">2018-10-22</h2>
<ul>
<li>Post message to Yammer about usage rights (dc.rights)</li>
<li>Change <code>build.properties</code> to use HTTPS for Handles in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>We will still need to do a batch update of the <code>dc.identifier.uri</code> and other fields in the database:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
</code></pre>
<ul>
<li>While I was doing that I found two items using CGSpace URLs instead of handles in their <code>dc.identifier.uri</code> so I corrected those</li>
<li>I also found several items that had invalid characters or multiple Handles in some related URL field like <code>cg.link.reference</code> so I corrected those too</li>
</ul>
2018-10-01 21:33:15 +02:00
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-10/">October, 2018</a></li>
<li><a href="/cgspace-notes/2018-09/">September, 2018</a></li>
<li><a href="/cgspace-notes/2018-08/">August, 2018</a></li>
<li><a href="/cgspace-notes/2018-07/">July, 2018</a></li>
<li><a href="/cgspace-notes/2018-06/">June, 2018</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>