Add notes for 2023-07-03

This commit is contained in:
Alan Orth 2023-07-04 08:03:36 +03:00
parent 0fab2a0f28
commit 309ffad285
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
398 changed files with 109483 additions and 0 deletions


@@ -10,4 +10,22 @@ categories: ["Notes"]
- Export CGSpace to check for missing Initiative collection mappings
- Start harvesting on AReS
## 2023-07-02
- Minor edits to the `crossref_doi_lookup.py` script while running some checks on 22,000 CGSpace DOIs
## 2023-07-03
- I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
- I took the more accurate ones from Crossref and updated the items on CGSpace
- I also took a few hundred ISBNs from Crossref for items where we were missing them
- I also tagged ~4,700 items with missing licenses as "Copyrighted; all rights reserved" based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer
- Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it's usually copyrighted (could still be open access, but we can't tell via Crossref)
- I would be curious to write a script to check the Unpaywall API for open access status (a rough sketch of the idea is at the end of this list)
- In the past I found that their *license* status was not very accurate, but the open access status might be more reliable
- More minor work on the DSpace 7 item views
- I learned some new Angular template syntax
- I created a custom component to show Creative Commons licenses on the simple item page
- I also decided that I don't like the Impact Area icons as a component because they don't have any visual meaning
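- A rough sketch of that Unpaywall check (a hypothetical script, not something that exists yet; it assumes the `requests` library and uses placeholder values for the contact email and DOI):

```python
#!/usr/bin/env python3
# Rough sketch only: look up a DOI's open access status in Unpaywall.
# The contact email and the DOI below are placeholders.
import requests

EMAIL = "cgspace@example.org"  # Unpaywall requires a contact email parameter


def oa_status(doi: str) -> str:
    url = f"https://api.unpaywall.org/v2/{doi}"
    response = requests.get(url, params={"email": EMAIL}, timeout=30)
    response.raise_for_status()
    data = response.json()
    # is_oa is a boolean; oa_status is one of: gold, hybrid, bronze, green, closed
    return f"{doi}: is_oa={data['is_oa']}, oa_status={data['oa_status']}"


if __name__ == "__main__":
    print(oa_status("10.1234/example"))  # placeholder DOI
```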
<!-- vim: set sw=2 ts=2: -->

docs/.gitignore vendored Normal file

docs/2015-11/index.html Normal file

@@ -0,0 +1,296 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2015" />
<meta property="og:description" content="2015-11-22
CGSpace went down
Looks like DSpace exhausted its PostgreSQL connection pool
Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2015-11/" />
<meta property="article:published_time" content="2015-11-23T17:00:57+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2015"/>
<meta name="twitter:description" content="2015-11-22
CGSpace went down
Looks like DSpace exhausted its PostgreSQL connection pool
Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2015",
"url": "https://alanorth.github.io/cgspace-notes/2015-11/",
"wordCount": "798",
"datePublished": "2015-11-23T17:00:57+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2015-11/">
<title>November, 2015 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2015-11/">November, 2015</a></h2>
<p class="blog-post-meta">
<time datetime="2015-11-23T17:00:57+03:00">Mon Nov 23, 2015</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2015-11-22">2015-11-22</h2>
<ul>
<li>CGSpace went down</li>
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
</code></pre><ul>
<li>For now I have increased the limit from 60 to 90, run updates, and rebooted the server</li>
</ul>
<h2 id="2015-11-24">2015-11-24</h2>
<ul>
<li>CGSpace went down again</li>
<li>Getting emails from uptimeRobot and uptimeButler that it&rsquo;s down, and Google Webmaster Tools is sending emails that there is an increase in crawl errors</li>
<li>Looks like there are still a bunch of idle PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
96
</code></pre><ul>
<li>For some reason the number of idle connections is very high since we upgraded to DSpace 5</li>
</ul>
<h2 id="2015-11-25">2015-11-25</h2>
<ul>
<li>Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config</li>
<li>The OAI application requests stylesheets and javascript files with the path <code>/oai/static/css</code>, which gets matched here:</li>
</ul>
<pre tabindex="0"><code># static assets we can load from the file system directly with nginx
location ~ /(themes|static|aspects/ReportingSuite) {
try_files $uri @tomcat;
...
</code></pre><ul>
<li>The document root is relative to the xmlui app, so this gets a 404—I&rsquo;m not sure why it doesn&rsquo;t pass to <code>@tomcat</code></li>
<li>Anyways, I can&rsquo;t find any URIs with path <code>/static</code>, and the more important point is to handle all the static theme assets, so we can just remove <code>static</code> from the regex for now (who cares if we can&rsquo;t use nginx to send Etags for OAI CSS!)</li>
<li>Also, I noticed we aren&rsquo;t setting CSP headers on the static assets, because in nginx headers are inherited in child blocks, but if you use <code>add_header</code> in a child block it doesn&rsquo;t inherit the others</li>
<li>We simply need to add <code>include extra-security.conf;</code> to the above location block (but research and test first)</li>
<li>We should add WOFF assets to the list of things to set expires for:</li>
</ul>
<pre tabindex="0"><code>location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
</code></pre><ul>
<li>We should also add <code>aspects/Statistics</code> to the location block for static assets (minus <code>static</code> from above):</li>
</ul>
<pre tabindex="0"><code>location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
</code></pre><ul>
<li>Need to check <code>/about</code> on CGSpace, as it&rsquo;s blank on my local test server and we might need to add something there</li>
<li>CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
93
</code></pre><ul>
<li>I looked closer at the idle connections and saw that many have been idle for hours (current time on server is <code>2015-11-25T20:20:42+0000</code>):</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | less -S
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start |
-------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
20951 | cgspace | 10966 | 18205 | cgspace | | 127.0.0.1 | | 37731 | 2015-11-25 13:13:02.837624+00 | | 20
20951 | cgspace | 10967 | 18205 | cgspace | | 127.0.0.1 | | 37737 | 2015-11-25 13:13:03.069421+00 | | 20
...
</code></pre><ul>
<li>There is a relevant Jira issue about this: <a href="https://jira.duraspace.org/browse/DS-1458">https://jira.duraspace.org/browse/DS-1458</a></li>
<li>It seems there is some sense in changing DSpace&rsquo;s default <code>db.maxidle</code> from unlimited (-1) to something like 8 (Tomcat default) or 10 (Confluence default)</li>
<li>Change <code>db.maxidle</code> from -1 to 10, reduce <code>db.maxconnections</code> from 90 to 50, and restart postgres and tomcat7</li>
<li>Also redeploy DSpace Test with a clean sync of CGSpace and mirror these database settings there as well</li>
<li>Also deploy the nginx fixes for the <code>try_files</code> location block as well as the expires block</li>
</ul>
<h2 id="2015-11-26">2015-11-26</h2>
<ul>
<li>CGSpace behaving much better since changing <code>db.maxidle</code> yesterday, but still two up/down notices from monitoring this morning (better than 50!)</li>
<li>CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item</li>
<li>Not as bad for me, but still unsustainable if you have to get many:</li>
</ul>
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
8.415
</code></pre><ul>
<li>Monitoring e-mailed in the evening to say CGSpace was down</li>
<li>Idle connections in PostgreSQL again:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
66
</code></pre><ul>
<li>At the time, the current DSpace pool size was 50&hellip;</li>
<li>I reduced the pool back to the default of 30, and reduced the <code>db.maxidle</code> settings from 10 to 8</li>
</ul>
<h2 id="2015-11-29">2015-11-29</h2>
<ul>
<li>Still more alerts that CGSpace has been up and down all day</li>
<li>Current database settings for DSpace:</li>
</ul>
<pre tabindex="0"><code>db.maxconnections = 30
db.maxwait = 5000
db.maxidle = 8
db.statementpool = true
</code></pre><ul>
<li>And idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
49
</code></pre><ul>
<li>Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace&rsquo;s thirst can ever be quenched</li>
<li>On another note, SUNScholar&rsquo;s notes suggest adjusting some other postgres variables: <a href="http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database">http://wiki.lib.sun.ac.za/index.php/SUNScholar/Optimisations/Database</a></li>
<li>This might help with REST API speed (which I mentioned above and still need to do real tests)</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

docs/2015-12/index.html Normal file

@@ -0,0 +1,318 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="December, 2015" />
<meta property="og:description" content="2015-12-02
Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2015-12/" />
<meta property="article:published_time" content="2015-12-02T13:18:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2015"/>
<meta name="twitter:description" content="2015-12-02
Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
# cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "December, 2015",
"url": "https://alanorth.github.io/cgspace-notes/2015-12/",
"wordCount": "753",
"datePublished": "2015-12-02T13:18:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2015-12/">
<title>December, 2015 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2015-12/">December, 2015</a></h2>
<p class="blog-post-meta">
<time datetime="2015-12-02T13:18:00+03:00">Wed Dec 02, 2015</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2015-12-02">2015-12-02</h2>
<ul>
<li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
</ul>
<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
# ls -lh dspace.log.2015-11-18*
-rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
</code></pre><ul>
<li>I had used lrzip once, but it needs more memory and is harder to use as it requires the lrztar wrapper</li>
<li>Need to remember to go check if everything is ok in a few days and then change CGSpace</li>
<li>CGSpace went down again (due to PostgreSQL idle connections of course)</li>
<li>Current database settings for DSpace are <code>db.maxconnections = 30</code> and <code>db.maxidle = 8</code>, yet idle connections are exceeding this:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
39
</code></pre><ul>
<li>I restarted PostgreSQL and Tomcat and it&rsquo;s back</li>
<li>On a related note of why CGSpace is so slow, I decided to finally try the <code>pgtune</code> script to tune the postgres settings:</li>
</ul>
<pre tabindex="0"><code># apt-get install pgtune
# pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
# mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig
# mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
</code></pre><ul>
<li>It introduced the following new settings:</li>
</ul>
<pre tabindex="0"><code>default_statistics_target = 50
maintenance_work_mem = 480MB
constraint_exclusion = on
checkpoint_completion_target = 0.9
effective_cache_size = 5632MB
work_mem = 48MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 1920MB
max_connections = 80
</code></pre><ul>
<li>Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc</li>
<li>For what it&rsquo;s worth, now the REST API should be faster (because of these PostgreSQL tweaks):</li>
</ul>
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.474
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
2.141
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.685
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.995
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.786
</code></pre><ul>
<li>Last week it was an average of 8 seconds&hellip; now this is 1/4 of that</li>
<li>CCAFS noticed that one of their items displays only the Atmire statlets: <a href="https://cgspace.cgiar.org/handle/10568/42445">https://cgspace.cgiar.org/handle/10568/42445</a></li>
</ul>
<p><img src="/cgspace-notes/2015/12/ccafs-item-no-metadata.png" alt="CCAFS item"></p>
<ul>
<li>The authorizations for the item are all public READ, and I don&rsquo;t see any errors in dspace.log when browsing that item</li>
<li>I filed a ticket on Atmire&rsquo;s issue tracker</li>
<li>I also filed a ticket on Atmire&rsquo;s issue tracker for the PostgreSQL stuff</li>
</ul>
<h2 id="2015-12-03">2015-12-03</h2>
<ul>
<li>CGSpace very slow, and monitoring emailing me to say it&rsquo;s down, even though I can load the page (very slowly)</li>
<li>Idle postgres connections look like this (with no change in DSpace db settings lately):</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
29
</code></pre><ul>
<li>I restarted Tomcat and postgres&hellip;</li>
<li>Atmire commented that we should raise the JVM heap size by ~500M, so it is now <code>-Xms3584m -Xmx3584m</code></li>
<li>We weren&rsquo;t out of heap yet, but it&rsquo;s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it&rsquo;s ok</li>
<li>A possible side effect is that I see that the REST API is twice as fast for the request above now:</li>
</ul>
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.368
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.968
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
1.006
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.849
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.806
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.854
</code></pre><h2 id="2015-12-05">2015-12-05</h2>
<ul>
<li>CGSpace has been up and down all day and REST API is completely unresponsive</li>
<li>PostgreSQL idle connections are currently:</li>
</ul>
<pre tabindex="0"><code>postgres@linode01:~$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep cgspace | grep -c idle
28
</code></pre><ul>
<li>I have reverted all the pgtune tweaks from the other day, as they didn&rsquo;t fix the stability issues, so I&rsquo;d rather not have them introducing more variables into the equation</li>
<li>The PostgreSQL stats from Munin all point to something database-related with the DSpace 5 upgrade around mid-to-late November</li>
</ul>
<p><img src="/cgspace-notes/2015/12/postgres_bgwriter-year.png" alt="PostgreSQL bgwriter (year)">
<img src="/cgspace-notes/2015/12/postgres_cache_cgspace-year.png" alt="PostgreSQL cache (year)">
<img src="/cgspace-notes/2015/12/postgres_locks_cgspace-year.png" alt="PostgreSQL locks (year)">
<img src="/cgspace-notes/2015/12/postgres_scans_cgspace-year.png" alt="PostgreSQL scans (year)"></p>
<h2 id="2015-12-07">2015-12-07</h2>
<ul>
<li>Atmire sent <a href="https://github.com/ilri/DSpace/pull/161">some fixes</a> to DSpace&rsquo;s REST API code that was leaving contexts open (causing the slow performance and database issues)</li>
<li>After deploying the fix to CGSpace the REST API is consistently faster:</li>
</ul>
<pre tabindex="0"><code>$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.675
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.599
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.588
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.566
$ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
0.497
</code></pre><h2 id="2015-12-08">2015-12-08</h2>
<ul>
<li>Switch CGSpace log compression cron jobs from using lzop to xz—the compression isn&rsquo;t as good, but it&rsquo;s much faster and causes less IO/CPU load</li>
<li>Since we figured out (and fixed) the cause of the performance issue, I reverted Google Bot&rsquo;s crawl rate to the &ldquo;Let Google optimize&rdquo; setting</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

Binary image files added (6 images, not shown): 397 KiB, 20 KiB, 28 KiB, 25 KiB, 28 KiB, 22 KiB

docs/2016-01/index.html Normal file

@@ -0,0 +1,254 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2016" />
<meta property="og:description" content="2016-01-13
Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-01/" />
<meta property="article:published_time" content="2016-01-13T13:18:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2016"/>
<meta name="twitter:description" content="2016-01-13
Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_collections.sh script I wrote last year.
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-01/",
"wordCount": "466",
"datePublished": "2016-01-13T13:18:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-01/">
<title>January, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-01/">January, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-01-13T13:18:00+03:00">Wed Jan 13, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-01-13">2016-01-13</h2>
<ul>
<li>Move ILRI collection <code>10568/12503</code> from <code>10568/27869</code> to <code>10568/27629</code> using the <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">move_collections.sh</a> script I wrote last year.</li>
<li>I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.</li>
<li>Update GitHub wiki for documentation of <a href="https://github.com/ilri/DSpace/wiki/Maintenance-Tasks">maintenance tasks</a>.</li>
</ul>
<h2 id="2016-01-14">2016-01-14</h2>
<ul>
<li>Update CCAFS project identifiers in input-forms.xml</li>
<li>Run system updates and restart the server</li>
</ul>
<h2 id="2016-01-18">2016-01-18</h2>
<ul>
<li>Change &ldquo;Extension material&rdquo; to &ldquo;Extension Material&rdquo; in input-forms.xml (a mistake that fell through the cracks when we fixed the others in DSpace 4 era)</li>
</ul>
<h2 id="2016-01-19">2016-01-19</h2>
<ul>
<li>Work on tweaks and updates for the social sharing icons on item pages: add Delicious and Mendeley (from Academicons), make links open in new windows, and set the icon color to the theme&rsquo;s primary color (<a href="https://github.com/ilri/DSpace/issues/157">#157</a>)</li>
<li>Tweak date-based facets to show more values in drill-down ranges (<a href="https://github.com/ilri/DSpace/issues/162">#162</a>)</li>
<li>Need to remember to clear the Cocoon cache after deployment or else you don&rsquo;t see the new ranges immediately</li>
<li>Set up recipe on IFTTT to tweet new items from the CGSpace Atom feed to my twitter account</li>
<li>Altmetrics&rsquo; support for Handles is kinda weak, so they can&rsquo;t associate our items with DOIs until they are tweeted or blogged, etc first.</li>
</ul>
<h2 id="2016-01-21">2016-01-21</h2>
<ul>
<li>Still waiting for my IFTTT recipe to fire, two days later</li>
<li>It looks like the Atom feed on CGSpace hasn&rsquo;t changed in two days, but there have definitely been new items</li>
<li>The RSS feed is nearly as old, but has different old items there</li>
<li>On a hunch I cleared the Cocoon cache and now the feeds are fresh</li>
<li>Looks like there is configuration option related to this, <code>webui.feed.cache.age</code>, which defaults to 48 hours, though I&rsquo;m not sure what relation it has to the Cocoon cache</li>
<li>In any case, we should change this cache to be something more like 6 hours, as we publish new items several times per day.</li>
<li>Work around a CSS issue with long URLs in the item view (<a href="https://github.com/ilri/DSpace/issues/172">#172</a>)</li>
</ul>
<h2 id="2016-01-25">2016-01-25</h2>
<ul>
<li>Re-deploy CGSpace and DSpace Test with latest <code>5_x-prod</code> branch</li>
<li>This included the social icon fixes/updates, date-based facet tweaks, reducing the feed cache age, and fixing a layout issue in XMLUI item view when an item had long URLs</li>
</ul>
<h2 id="2016-01-26">2016-01-26</h2>
<ul>
<li>Run nginx updates on CGSpace and DSpace Test (<a href="http://mailman.nginx.org/pipermail/nginx/2016-January/049700.html">1.8.1 and 1.9.10, respectively</a>)</li>
<li>Run updates on DSpace Test and reboot for new Linode kernel <code>Linux 4.4.0-x86_64-linode63</code> (first update in months)</li>
</ul>
<h2 id="2016-01-28">2016-01-28</h2>
<ul>
<li>Start looking at importing some Bioversity data that had been prepared earlier this week</li>
<li>While checking the data I noticed something strange: there are 79 items but only 8 unique PDFs:</li>
</ul>
<pre tabindex="0"><code>$ ls SimpleArchiveForBio/ | wc -l
79
$ find SimpleArchiveForBio/ -iname &#34;*.pdf&#34; -exec basename {} \; | sort -u | wc -l
8
</code></pre>
<h2 id="2016-01-29">2016-01-29</h2>
<ul>
<li>Add five missing center-specific subjects to XMLUI item view (<a href="https://github.com/ilri/DSpace/issues/174">#174</a>)</li>
<li>This <a href="https://cgspace.cgiar.org/handle/10568/67062">CCAFS item</a>, before:</li>
</ul>
<p><img src="/cgspace-notes/2016/01/xmlui-subjects-before.png" alt="XMLUI subjects before"></p>
<ul>
<li>After:</li>
</ul>
<p><img src="/cgspace-notes/2016/01/xmlui-subjects-after.png" alt="XMLUI subjects after"></p>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

docs/2016-02/index.html Normal file

@@ -0,0 +1,432 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="February, 2016" />
<meta property="og:description" content="2016-02-05
Looking at some DAGRIS data for Abenet Yabowork
Lots of issues with spaces, newlines, etc causing the import to fail
I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-02/" />
<meta property="article:published_time" content="2016-02-05T13:18:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2016"/>
<meta name="twitter:description" content="2016-02-05
Looking at some DAGRIS data for Abenet Yabowork
Lots of issues with spaces, newlines, etc causing the import to fail
I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "February, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-02/",
"wordCount": "1657",
"datePublished": "2016-02-05T13:18:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-02/">
<title>February, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-02/">February, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-02-05T13:18:00+03:00">Fri Feb 05, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-02-05">2016-02-05</h2>
<ul>
<li>Looking at some DAGRIS data for Abenet Yabowork</li>
<li>Lots of issues with spaces, newlines, etc causing the import to fail</li>
<li>I noticed we have a very <em>interesting</em> list of countries on CGSpace:</li>
</ul>
<p><img src="/cgspace-notes/2016/02/cgspace-countries.png" alt="CGSpace country list"></p>
<ul>
<li>Not only are there 49,000 countries, we have some blanks (25)&hellip;</li>
<li>Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;</li>
</ul>
<h2 id="2016-02-06">2016-02-06</h2>
<ul>
<li>Found a way to get items with null/empty metadata values from SQL</li>
<li>First, find the <code>metadata_field_id</code> for the field you want from the <code>metadatafieldregistry</code> table:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select * from metadatafieldregistry;
</code></pre><ul>
<li>In this case our country field is 78</li>
<li>Now find all resources with type 2 (item) that have null/empty values for that field:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value=&#39;&#39; OR text_value IS NULL);
</code></pre><ul>
<li>Then you can find the handle that owns it from its <code>resource_id</code>:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = &#39;22678&#39;;
</code></pre><ul>
<li>It&rsquo;s 25 items so editing in the web UI is annoying, let&rsquo;s try SQL!</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value=&#39;&#39;;
DELETE 25
</code></pre><ul>
<li>After that perhaps a regular <code>dspace index-discovery</code> (no -b) <em>should</em> suffice&hellip;</li>
<li>Hmm, I indexed, cleared the Cocoon cache, and restarted Tomcat but the 25 &ldquo;|||&rdquo; countries are still there</li>
<li>Maybe I need to do a full re-index&hellip;</li>
<li>Yep! The full re-index seems to work.</li>
<li>Process the empty countries on CGSpace</li>
</ul>
<h2 id="2016-02-07">2016-02-07</h2>
<ul>
<li>Working on cleaning up Abenet&rsquo;s DAGRIS data with OpenRefine</li>
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape(&quot;javascript&quot;)</code> which shows whitespace characters like <code>\r\n</code>!</li>
<li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV</li>
<li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace(&quot;\.0&quot;, &quot;&quot;)</code></li>
<li>I need to start running DSpace in Mac OS X instead of a Linux VM</li>
<li>Install PostgreSQL from homebrew, then configure and import CGSpace database dump:</li>
</ul>
<pre tabindex="0"><code>$ postgres -D /opt/brew/var/postgres
$ createuser --superuser postgres
$ createuser --pwprompt dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql postgres
postgres=# alter user dspacetest createuser;
postgres=# \q
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-02-07.backup
$ psql postgres
postgres=# alter user dspacetest nocreateuser;
postgres=# \q
$ vacuumdb dspacetest
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
</code></pre><ul>
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat&rsquo;s webapps folder:</li>
</ul>
<pre tabindex="0"><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
$ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
$ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
$ ln -sfv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/oai
$ ln -sfv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/solr
$ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
</code></pre><ul>
<li>Add CATALINA_OPTS in <code>/opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh</code>, as this script is sourced by the <code>catalina</code> startup script</li>
<li>For example:</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&#34;
</code></pre><ul>
<li>After verifying that the site is working, start a full index:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace index-discovery -b
</code></pre><h2 id="2016-02-08">2016-02-08</h2>
<ul>
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme&rsquo;s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
</ul>
<p><img src="/cgspace-notes/2016/02/submit-button-ilri.png" alt="ILRI submission buttons">
<img src="/cgspace-notes/2016/02/submit-button-drylands.png" alt="Drylands submission buttons"></p>
<h2 id="2016-02-09">2016-02-09</h2>
<ul>
<li>Re-sync DSpace Test with CGSpace</li>
<li>Help Sisay with OpenRefine</li>
<li>Enable HTTPS on DSpace Test using Let&rsquo;s Encrypt:</li>
</ul>
<pre tabindex="0"><code>$ cd ~/src/git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt
$ sudo service nginx stop
# add port 443 to firewall rules
$ ./letsencrypt-auto certonly --standalone -d dspacetest.cgiar.org
$ sudo service nginx start
$ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-become-pass
</code></pre><ul>
<li>We should install it in /opt/letsencrypt and then write a renewal script, but first we have to wire up some variables and template stuff based on the script here: <a href="https://letsencrypt.org/howitworks/">https://letsencrypt.org/howitworks/</a></li>
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs&hellip;</li>
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li>
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
<li>Logs don&rsquo;t always show anything right when it fails, but eventually one of these appears:</li>
</ul>
<pre tabindex="0"><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>or</li>
</ul>
<pre tabindex="0"><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>Right now DSpace Test&rsquo;s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
</ul>
<pre tabindex="0"><code># free -m
total used free shared buffers cached
Mem: 3950 3902 48 9 37 1311
-/+ buffers/cache: 2552 1397
Swap: 255 57 198
</code></pre><ul>
<li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
</ul>
<h2 id="2016-02-11">2016-02-11</h2>
<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1]
</code></pre><ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them; a rough sketch of the approach follows, then an example run:</li>
</ul>
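<p>A minimal sketch of that download-and-thumbnail idea (not the actual gist; it assumes the <code>requests</code> library and ImageMagick&rsquo;s <code>convert</code> with its Ghostscript delegate, and it reads the <code>dc.identifier.url</code> and <code>filename</code> columns prepared above):</p>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Sketch only, not the real generate-thumbnails.py: download each PDF from
# its dc.identifier.url and render a thumbnail of the first page with
# ImageMagick&#39;s convert (Ghostscript does the actual PDF rasterization).
import csv
import subprocess

import requests


def process(csv_path):
    with open(csv_path, newline=&#34;&#34;) as f:
        for row in csv.DictReader(f):
            url = row[&#34;dc.identifier.url&#34;]
            filename = row[&#34;filename&#34;]
            print(f&#34;Processing {filename}&#34;)
            print(f&#34;&gt; Downloading {filename}&#34;)
            response = requests.get(url, timeout=60)
            with open(filename, &#34;wb&#34;) as pdf:
                pdf.write(response.content)
            print(f&#34;&gt; Creating thumbnail for {filename}&#34;)
            # &#34;[0]&#34; selects the first page of the PDF
            jpg = filename.replace(&#34;.pdf&#34;, &#34;.jpg&#34;)
            subprocess.run([&#34;convert&#34;, f&#34;{filename}[0]&#34;, &#34;-thumbnail&#34;, &#34;x300&#34;, jpg], check=True)


if __name__ == &#34;__main__&#34;:
    process(&#34;ciat-reports.csv&#34;)
</code></pre>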
<pre tabindex="0"><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre><h2 id="2016-02-12">2016-02-12</h2>
<ul>
<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the same exact PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li>
<li>A few items have no file at all</li>
<li>Also, if we import these items, should we remove the <code>dc.identifier.url</code> field from the records?</li>
</ul>
<h2 id="2016-02-12-1">2016-02-12</h2>
<ul>
<li>Looking at CIAT&rsquo;s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I&rsquo;m not sure if we can use those</li>
<li>265 items have dirty, URL-encoded filenames:</li>
</ul>
<pre tabindex="0"><code>$ ls | grep -c -E &#34;%&#34;
265
</code></pre><ul>
<li>I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames</li>
<li>This python2 snippet seems to work in the CLI, but not so well in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>$ python -c &#34;import urllib, sys; print urllib.unquote(sys.argv[1])&#34; CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
</code></pre><ul>
<li>Merge pull requests for submission form theming (<a href="https://github.com/ilri/DSpace/pull/178">#178</a>) and missing center subjects in XMLUI item views (<a href="https://github.com/ilri/DSpace/pull/176">#176</a>)</li>
<li>They will be deployed on CGSpace the next time I re-deploy</li>
</ul>
<h2 id="2016-02-16">2016-02-16</h2>
<ul>
<li>Turns out OpenRefine has an unescape function!</li>
</ul>
<pre tabindex="0"><code>value.unescape(&#34;url&#34;)
</code></pre><ul>
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
<li>Run web server and system updates on DSpace Test and reboot</li>
<li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn&rsquo;t have the brackets, like <code>dc.identifier.url2</code></li>
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with &ldquo;||&rdquo; in between</li>
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
<li>This also works for records that have multiple URLs (separated by &ldquo;||&rdquo;)</li>
</ul>
<h2 id="2016-02-17">2016-02-17</h2>
<ul>
<li>Re-deploy CGSpace, run all system updates, and reboot</li>
<li>More work on CIAT data, cleaning and doing a last metadata-only import into DSpace Test</li>
<li>SAFBuilder has a bug preventing it from processing filenames containing more than one underscore</li>
<li>Need to re-process the filename column to replace multiple underscores with one: <code>value.replace(/_{2,}/, &quot;_&quot;)</code></li>
</ul>
<h2 id="2016-02-20">2016-02-20</h2>
<ul>
<li>Turns out the &ldquo;bug&rdquo; in SAFBuilder isn&rsquo;t a bug, it&rsquo;s a feature that allows you to encode extra information like the destination bundle in the filename</li>
<li>Also, it seems DSpace&rsquo;s SAF import tool doesn&rsquo;t like importing filenames that have accents in them:</li>
</ul>
<pre tabindex="0"><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
</code></pre><ul>
<li>Need to rename files to have no accents or umlauts, etc&hellip;</li>
<li>Useful custom text facet for URLs ending with &ldquo;.pdf&rdquo;: <code>value.endsWith(&quot;.pdf&quot;)</code></li>
</ul>
<h2 id="2016-02-22">2016-02-22</h2>
<ul>
<li>To change Spanish accents to ASCII in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#39;ó&#39;,&#39;o&#39;).replace(&#39;í&#39;,&#39;i&#39;).replace(&#39;á&#39;,&#39;a&#39;).replace(&#39;é&#39;,&#39;e&#39;).replace(&#39;ñ&#39;,&#39;n&#39;)
</code></pre><ul>
<li>But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac</li>
<li>On closer inspection, I can import files with the following names on Linux (DSpace Test):</li>
</ul>
<pre tabindex="0"><code>Bitstream: tést.pdf
Bitstream: tést señora.pdf
Bitstream: tést señora alimentación.pdf
</code></pre><ul>
<li>Seems it could be something with the HFS+ filesystem actually, as it&rsquo;s not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it&rsquo;s something like UCS-2</a>)</li>
<li>HFS+ normalizes filenames, so accented characters get stored in decomposed form as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a>, whereas Linux&rsquo;s ext4 stores filenames as an opaque array of bytes (see the short illustration after this list)</li>
<li>Running the SAFBuilder on Mac OS X works if you&rsquo;re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem&rsquo;s encoding matches</li>
</ul>
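<p>A quick way to see the composed-versus-decomposed difference (a minimal Python 3 illustration, not part of the DSpace or SAFBuilder tooling; the filename is just an example from above):</p>
<pre tabindex="0"><code>#!/usr/bin/env python3
# The same visible filename in precomposed (NFC) and decomposed (NFD) Unicode
# forms is a different byte sequence, which is why SAF bundles built on HFS+
# may not match the filenames DSpace looks for on ext4.
import unicodedata

name = &#34;Medición_de_palatabilidad.pdf&#34;    # precomposed, as typed on Linux
nfd = unicodedata.normalize(&#34;NFD&#34;, name)  # decomposed, as HFS+ stores it

print(name == nfd)                                  # False
print(len(name), len(nfd))                          # 29 30: NFD has an extra combining accent
print(name.encode(&#34;utf-8&#34;) == nfd.encode(&#34;utf-8&#34;))  # False: different byte sequences
</code></pre>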
<h2 id="2016-02-29">2016-02-29</h2>
<ul>
<li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace&rsquo;s incorrect ordering of authors in Google Scholar metadata</li>
<li>Turns out there is a patch, and it was merged in DSpace 5.4: <a href="https://jira.duraspace.org/browse/DS-2679">https://jira.duraspace.org/browse/DS-2679</a></li>
<li>I&rsquo;ve merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li>
<li>We found a bug when a user searches from the homepage, sorts the results, and then tries to click &ldquo;View More&rdquo; in a sidebar facet</li>
<li>I am not sure what causes it yet, but I opened an issue for it: <a href="https://github.com/ilri/DSpace/issues/179">https://github.com/ilri/DSpace/issues/179</a></li>
<li>Have more problems with SAFBuilder on Mac OS X</li>
<li>Now it doesn&rsquo;t recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li>
<li>But on Linux it works fine</li>
<li>Trying to test Atmire&rsquo;s series of stats and CUA fixes from January and February, but their branch history is really messy and it&rsquo;s hard to see what&rsquo;s going on</li>
<li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I&rsquo;m going to tell them to fix their history and make a proper pull request</li>
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
<li>It&rsquo;s tricky to parse those things in some programming languages so I&rsquo;d rather just get rid of the weird stuff now in OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;&#39;).replace(&#39;_=_&#39;,&#39;_&#39;).replace(&#39;,&#39;,&#39;&#39;).replace(&#39;[&#39;,&#39;&#39;).replace(&#39;]&#39;,&#39;&#39;).replace(&#39;(&#39;,&#39;&#39;).replace(&#39;)&#39;,&#39;&#39;).replace(&#39;_.pdf&#39;,&#39;.pdf&#39;).replace(&#39;._&#39;,&#39;_&#39;)
</code></pre><ul>
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
<li>Re-deploy CGSpace with the Google Scholar fix, but I&rsquo;m waiting on the Atmire fixes for now, as the branch history is ugly</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

docs/2016-03/index.html Normal file

@@ -0,0 +1,370 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="March, 2016" />
<meta property="og:description" content="2016-03-02
Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-03/" />
<meta property="article:published_time" content="2016-03-02T16:50:00+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2016"/>
<meta name="twitter:description" content="2016-03-02
Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-03/",
"wordCount": "1581",
"datePublished": "2016-03-02T16:50:00+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-03/">
<title>March, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-03/">March, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-03-02T16:50:00+03:00">Wed Mar 02, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-03-02">2016-03-02</h2>
<ul>
<li>Looking at issues with author authorities on CGSpace</li>
<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module</li>
<li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server</li>
</ul>
<h2 id="2016-03-07">2016-03-07</h2>
<ul>
<li>Troubleshooting the issues with the slew of commits for Atmire modules in <a href="https://github.com/ilri/DSpace/pull/182">#182</a></li>
<li>Their changes on <code>5_x-dev</code> branch work, but it is messy as hell with merge commits and old branch base</li>
<li>When I rebase their branch on the latest <code>5_x-prod</code> I get blank white pages</li>
<li>I identified one commit that causes the issue and let them know</li>
<li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &#34;Lucene Merge Thread #19&#34; org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre><h2 id="2016-03-08">2016-03-08</h2>
<ul>
<li>Add a few new filters to Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li>
<li>We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were</li>
</ul>
<h2 id="2016-03-10">2016-03-10</h2>
<ul>
<li>Disable the lucene cron job on CGSpace as it shouldn&rsquo;t be needed anymore</li>
<li>Discuss ORCiD and duplicate authors on Yammer</li>
<li>Request new documentation for Atmire CUA and L&amp;R modules, as ours are from 2013</li>
<li>Walk Sisay through some data cleaning workflows in OpenRefine</li>
<li>Start cleaning up the configuration for Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/issues/185">#184</a>)</li>
<li>It is very messed up because some labels are incorrect, fields are missing, etc</li>
</ul>
<p><img src="/cgspace-notes/2016/03/cua-label-mixup.png" alt="Mixed up label in Atmire CUA"></p>
<ul>
<li>Update documentation for Atmire modules</li>
</ul>
<h2 id="2016-03-11">2016-03-11</h2>
<ul>
<li>As I was looking at the CUA config I realized our Discovery config is all messed up and confusing</li>
<li>I&rsquo;ve opened an issue to track some of that work (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>I did some major cleanup work on Discovery and XMLUI stuff related to the <code>dc.type</code> indexes (<a href="https://github.com/ilri/DSpace/pull/187">#187</a>)</li>
<li>We had been confusing <code>dc.type</code> (a Dublin Core value) with <code>dc.type.output</code> (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.</li>
<li>There is still some more work to be done to remove references to old <code>outputtype</code> and <code>output</code></li>
</ul>
<h2 id="2016-03-14">2016-03-14</h2>
<ul>
<li>Fix some items that had invalid dates (I noticed them in the log during a re-indexing)</li>
<li>Reset <code>search.index.*</code> to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): <a href="https://github.com/ilri/DSpace/pull/188">#188</a></li>
<li>Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>Also four or so center-specific subject strings were missing for Discovery</li>
</ul>
<p><img src="/cgspace-notes/2016/03/missing-xmlui-string.png" alt="Missing XMLUI string"></p>
<h2 id="2016-03-15">2016-03-15</h2>
<ul>
<li>Create simple theme for new AVCD community just for a unique Google Tracking ID (<a href="https://github.com/ilri/DSpace/pull/191">#191</a>)</li>
</ul>
<h2 id="2016-03-16">2016-03-16</h2>
<ul>
<li>Still having problems deploying Atmire&rsquo;s CUA updates and fixes from January!</li>
<li>More discussion on the GitHub issue here: <a href="https://github.com/ilri/DSpace/pull/182">https://github.com/ilri/DSpace/pull/182</a></li>
<li>Clean up Atmire CUA config (<a href="https://github.com/ilri/DSpace/pull/193">#193</a>)</li>
<li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li>
<li>I noticed that we have some weird values in <code>dc.language</code>:</li>
</ul>
<pre tabindex="0"><code># select * from metadatavalue where metadata_field_id=37;
metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
1942571 | 35342 | 37 | hi | | 1 | | -1 | 2
1942468 | 35345 | 37 | hi | | 1 | | -1 | 2
1942479 | 35337 | 37 | hi | | 1 | | -1 | 2
1942505 | 35336 | 37 | hi | | 1 | | -1 | 2
1942519 | 35338 | 37 | hi | | 1 | | -1 | 2
1942535 | 35340 | 37 | hi | | 1 | | -1 | 2
1942555 | 35341 | 37 | hi | | 1 | | -1 | 2
1942588 | 35343 | 37 | hi | | 1 | | -1 | 2
1942610 | 35346 | 37 | hi | | 1 | | -1 | 2
1942624 | 35347 | 37 | hi | | 1 | | -1 | 2
1942639 | 35339 | 37 | hi | | 1 | | -1 | 2
</code></pre><ul>
<li>It seems this <code>dc.language</code> field isn&rsquo;t really used, but we should delete these values</li>
<li>Also, <code>dc.language.iso</code> has some weird values, like &ldquo;En&rdquo; and &ldquo;English&rdquo;</li>
</ul>
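<ul>
<li>A minimal sketch of that cleanup, assuming we simply delete the stray <code>dc.language</code> values (field ID 37 from the query above) rather than moving them:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspacetest -c &#39;delete from metadatavalue where resource_type_id=2 and metadata_field_id=37;&#39;
</code></pre>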
<h2 id="2016-03-17">2016-03-17</h2>
<ul>
<li>It turns out <code>hi</code> is the ISO 639 language code for Hindi, but these should be in <code>dc.language.iso</code> instead of <code>dc.language</code></li>
<li>I fixed the eleven items with <code>hi</code> as well as some using the incorrect <code>vn</code> for Vietnamese</li>
<li>Start discussing CG core with Abenet and Sisay</li>
<li>Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches</li>
<li>The patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing</li>
</ul>
<h2 id="2016-03-18">2016-03-18</h2>
<ul>
<li>Merge Atmire fixes into <code>5_x-prod</code></li>
<li>Discuss thumbnails with Francesca from Bioversity</li>
<li>Some of their items end up with thumbnails that have a big white border around them:</li>
</ul>
<p><img src="/cgspace-notes/2016/03/bioversity-thumbnail-bad.jpg" alt="Excessive whitespace in thumbnail"></p>
<ul>
<li>Turns out we can add <code>-trim</code> to the GraphicsMagick options to trim the whitespace</li>
</ul>
<p><img src="/cgspace-notes/2016/03/bioversity-thumbnail-good.jpg" alt="Trimmed thumbnail"></p>
<ul>
<li>Command used:</li>
</ul>
<pre tabindex="0"><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
</code></pre><ul>
<li>Also, it looks like adding <code>-sharpen 0x1.0</code> really improves the quality of the image for only a few KB</li>
</ul>
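<ul>
<li>For example, a sketch of the same command with sharpening added (the input filename here is just a placeholder):</li>
</ul>
<pre tabindex="0"><code>$ gm convert -trim -sharpen 0x1.0 -quality 82 -thumbnail x300 -flatten input.pdf\[0\] cover.jpg
</code></pre>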
<h2 id="2016-03-21">2016-03-21</h2>
<ul>
<li>Fix 66 site errors in Google&rsquo;s webmaster tools</li>
<li>I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed</li>
<li>We also have 1,300 &ldquo;soft 404&rdquo; errors for URLs like: <a href="https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity">https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity</a></li>
<li>I&rsquo;ve marked them as fixed as well since the ones I tested were working fine</li>
<li>This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem&hellip;</li>
<li>Results pages like this give items that Google already knows from the sitemap: <a href="https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A">https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A</a>.</li>
<li>There are some access denied errors on JSPUI links (of course! we forbid them!), but I&rsquo;m not sure why Google is trying to index them&hellip;</li>
<li>For example:
<ul>
<li>This: <a href="https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf">https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf</a></li>
<li>Linked from: <a href="https://cgspace.cgiar.org/jspui/handle/10568/809">https://cgspace.cgiar.org/jspui/handle/10568/809</a></li>
</ul>
</li>
<li>I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!</li>
<li>Google says the first time it saw this particular error was September 29, 2015&hellip; so maybe it accidentally saw it somehow&hellip;</li>
<li>On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content</li>
</ul>
<p><img src="/cgspace-notes/2016/03/google-index.png" alt="CGSpace pages in Google index"></p>
<ul>
<li>Turns out this is a problem with DSpace&rsquo;s <code>robots.txt</code>, and there&rsquo;s a Jira ticket since December, 2015: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>I am not sure if I want to apply it yet</li>
<li>For now I&rsquo;ve just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools</li>
</ul>
<p><img src="/cgspace-notes/2016/03/url-parameters.png" alt="URL parameters cause millions of dynamic pages">
<img src="/cgspace-notes/2016/03/url-parameters2.png" alt="Setting pages with the filter_0 param not to show in search results"></p>
<ul>
<li>Move AVCD collection to new community and update <code>move_collection.sh</code> script: <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">https://gist.github.com/alanorth/392c4660e8b022d99dfa</a></li>
<li>It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs</li>
<li>Re-deploy CGSpace with latest <code>5_x-prod</code> branch</li>
<li>Run updates on CGSpace and reboot server (new kernel, <code>4.5.0</code>)</li>
<li>Deploy Let&rsquo;s Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks</li>
</ul>
<h2 id="2016-03-22">2016-03-22</h2>
<ul>
<li>Merge robots.txt patch and disallow indexing of browse pages as our sitemap is consumed correctly (<a href="https://github.com/ilri/DSpace/issues/198">#198</a>)</li>
</ul>
<h2 id="2016-03-23">2016-03-23</h2>
<ul>
<li>Abenet is having problems saving group memberships, and she gets this error: <a href="https://gist.github.com/alanorth/87281c061c2de57b773e">https://gist.github.com/alanorth/87281c061c2de57b773e</a></li>
</ul>
<pre tabindex="0"><code>Can&#39;t find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
</code></pre><ul>
<li>I can reproduce the same error on DSpace Test and on my Mac</li>
<li>Looks to be an issue with the Atmire modules, I&rsquo;ve submitted a ticket to their tracker.</li>
</ul>
<h2 id="2016-03-24">2016-03-24</h2>
<ul>
<li>Atmire sent a patch for the group saving issue: <a href="https://github.com/ilri/DSpace/pull/201">https://github.com/ilri/DSpace/pull/201</a></li>
<li>I tested it locally and it works, so I merged it to <code>5_x-prod</code> and will deploy on CGSpace this week</li>
</ul>
<h2 id="2016-03-25">2016-03-25</h2>
<ul>
<li>Having problems with Listings and Reports, seems to be caused by a rogue reference to <code>dc.type.output</code></li>
<li>This is the error we get when we proceed to the second page of Listings and Reports: <a href="https://gist.github.com/alanorth/b2d7fb5b82f94898caaf">https://gist.github.com/alanorth/b2d7fb5b82f94898caaf</a></li>
<li>Commenting out the line works, but I haven&rsquo;t figured out the proper syntax for referring to <code>dc.type.*</code></li>
</ul>
<h2 id="2016-03-28">2016-03-28</h2>
<ul>
<li>Look into enabling the embargo during item submission, see: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess">https://wiki.lyrasis.org/display/DSDOC5x/Embargo#Embargo-SubmissionProcess</a></li>
<li>Seems we only want <code>AccessStep</code> because <code>UploadWithEmbargoStep</code> disables the ability to edit embargos at the item level</li>
<li>This pull request enables the ability to set an item-level embargo during submission: <a href="https://github.com/ilri/DSpace/pull/203">https://github.com/ilri/DSpace/pull/203</a></li>
<li>I figured out that the problem with Listings and Reports was because I disabled the <code>search.index.*</code> last week, and they are still used by JSPUI apparently</li>
<li>This pull request re-enables them: <a href="https://github.com/ilri/DSpace/pull/202">https://github.com/ilri/DSpace/pull/202</a></li>
<li>Re-deploy DSpace Test, run all system updates, and restart the server</li>
<li>Looks like the Listings and Reports fix was NOT due to the search indexes (which are actually not used), and rather due to the filter configuration in the Listings and Reports config</li>
<li>This pull request simply updates the config for the <code>dc.type.output</code> → <code>dc.type</code> change that was made last week: <a href="https://github.com/ilri/DSpace/pull/204">https://github.com/ilri/DSpace/pull/204</a></li>
<li>Deploy robots.txt fix, embargo for item submissions, and listings and reports fix on CGSpace</li>
</ul>
<h2 id="2016-03-29">2016-03-29</h2>
<ul>
<li>Skype meeting with Peter and Addis team to discuss metadata changes for Dublin Core, CGcore, and CGSpace-specific fields</li>
<li>We decided to proceed with some deletes first, then identify CGSpace-specific fields to clean/move to <code>cg.*</code>, and then worry about broader changes to DC</li>
<li>Before we move or rename any fields we need to circulate a list of fields we intend to change to CCAFS, CPWF, etc who might be harvesting the fields</li>
<li>After all of this we need to start implementing controlled vocabularies for fields, either with the Javascript lookup or like existing ILRI subjects</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

549
docs/2016-04/index.html Normal file
View File

@ -0,0 +1,549 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2016" />
<meta property="og:description" content="2016-04-04
Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, let alone one from last year!
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-04/" />
<meta property="article:published_time" content="2016-04-04T11:06:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2016"/>
<meta name="twitter:description" content="2016-04-04
Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit
We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc
After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, let alone one from last year!
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-04/",
"wordCount": "2006",
"datePublished": "2016-04-04T11:06:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-04/">
<title>April, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-04/">April, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-04-04T11:06:00+03:00">Mon Apr 04, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-04-04">2016-04-04</h2>
<ul>
<li>Looking at log file use on CGSpace and notice that we need to work on our cron setup a bit</li>
<li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li>
<li>After running DSpace for over five years I&rsquo;ve never needed to look in any other log file than dspace.log, let alone one from last year!</li>
<li>This will save us a few gigs of backup space we&rsquo;re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
<pre tabindex="0"><code>Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:146)
at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171)
at edu.sdsc.grid.io.GeneralFileInputStream.&lt;init&gt;(GeneralFileInputStream.java:145)
at edu.sdsc.grid.io.local.LocalFileInputStream.&lt;init&gt;(LocalFileInputStream.java:139)
at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630)
at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525)
at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60)
at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303)
at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171)
at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120)
at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
******************************************************
</code></pre><ul>
<li>So this would be the <code>tomcat7</code> Unix user, who seems to have a default limit of 1024 files in its shell</li>
<li>For what it&rsquo;s worth, we have been setting the actual Tomcat 7 process&rsquo; limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there (sketched below)</li>
<li>Submit pull request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li>
</ul>
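<ul>
<li>A sketch of what that might look like, assuming a drop-in file such as <code>/etc/security/limits.d/tomcat7.conf</code> (the exact filename is arbitrary), re-using the 16384 limit we already give the Tomcat process:</li>
</ul>
<pre tabindex="0"><code># cat /etc/security/limits.d/tomcat7.conf
tomcat7    soft    nofile    16384
tomcat7    hard    nofile    16384
</code></pre>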
<h2 id="2016-04-05">2016-04-05</h2>
<ul>
<li>Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don&rsquo;t need!</li>
</ul>
<pre tabindex="0"><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk &#39;{print $4}&#39; | xargs s3cmd del
</code></pre><ul>
<li>Also, adjust the cron jobs for backups so they only backup <code>dspace.log</code> and some stats files (.dat)</li>
<li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li>
</ul>
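<ul>
<li>For the backup cron adjustment above, a sketch of an s3cmd invocation that only uploads the DSpace logs and stats files (the local log path is an assumption):</li>
</ul>
<pre tabindex="0"><code># s3cmd put /home/cgspace.cgiar.org/log/dspace.log.* /home/cgspace.cgiar.org/log/*.dat s3://cgspace.cgiar.org/log/
</code></pre>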
<h2 id="2016-04-06">2016-04-06</h2>
<ul>
<li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
</code></pre><ul>
<li>After that an <code>index-discovery -bf</code> is required</li>
<li>Start working on metadata migrations, add 25 or so new metadata fields to CGSpace</li>
</ul>
<h2 id="2016-04-07">2016-04-07</h2>
<ul>
<li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li>
<li>Testing with a few fields it seems to work well:</li>
</ul>
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
UPDATE 21420
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258
</code></pre><h2 id="2016-04-08">2016-04-08</h2>
<ul>
<li>Discuss metadata renaming with Abenet, we decided it&rsquo;s better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF</li>
<li>I&rsquo;ve e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change</li>
</ul>
<h2 id="2016-04-10">2016-04-10</h2>
<ul>
<li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li>
<li>It seems the <code>dx.doi.org</code> URLs are much more proper in our repository!</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://dx.doi.org%&#39;;
count
-------
5638
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like &#39;http://doi.org%&#39;;
count
-------
3
</code></pre><ul>
<li>I will manually edit the <code>dc.identifier.doi</code> in <a href="https://cgspace.cgiar.org/handle/10568/72509?show=full">10568/72509</a> and tweet the link, then check back in a week to see if the donut gets updated</li>
</ul>
<h2 id="2016-04-11">2016-04-11</h2>
<ul>
<li>The donut is already updated and shows the correct number now</li>
<li>CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we&rsquo;d do it tentatively on Monday the 18th.</li>
</ul>
<h2 id="2016-04-12">2016-04-12</h2>
<ul>
<li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) in SQL:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
</code></pre><ul>
<li>Listings and Reports is still not returning reliable data for <code>dc.type</code></li>
<li>I think we need to ask Atmire, as their documentation isn&rsquo;t too clear on the format of the filter configs</li>
<li>Alternatively, I want to see if I move all the data from <code>dc.type.output</code> to <code>dc.type</code> and then re-index, if it behaves better</li>
<li>Looking at our <code>input-forms.xml</code> I see we have two sets of ILRI subjects, but one has a few extra subjects</li>
<li>Remove one set of ILRI subjects and remove duplicate <code>VALUE CHAINS</code> from existing list (<a href="https://github.com/ilri/DSpace/pull/216">#216</a>)</li>
<li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as those terms appear to have been requested, and it might be the newer list</li>
<li>I found 226 blank metadatavalues:</li>
</ul>
<pre tabindex="0"><code>dspacetest# select * from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I think we should delete them and do a full re-index:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 226
</code></pre><ul>
<li>I deleted them on CGSpace but I&rsquo;ll wait to do the re-index as we&rsquo;re going to be doing one in a few days for the metadata changes anyways</li>
<li>In other news, moving the <code>dc.type.output</code> to <code>dc.type</code> and re-indexing seems to have fixed the Listings and Reports issue from above</li>
<li>Unfortunately this isn&rsquo;t a very good solution, because Listings and Reports config should allow us to filter on <code>dc.type.*</code> but the documentation isn&rsquo;t very clear and I couldn&rsquo;t reach Atmire today</li>
<li>We want to do the <code>dc.type.output</code> move on CGSpace anyways, but we should wait as it might affect other external people!</li>
</ul>
<h2 id="2016-04-14">2016-04-14</h2>
<ul>
<li>Communicate with Macaroni Bros again about <code>dc.type</code></li>
<li>Help Sisay with some rsync and Linux stuff</li>
<li>Notify CIAT people of metadata changes (I had forgotten them last week)</li>
</ul>
<h2 id="2016-04-15">2016-04-15</h2>
<ul>
<li>DSpace Test had crashed, so I ran all system updates, rebooted, and re-deployed DSpace code</li>
</ul>
<h2 id="2016-04-18">2016-04-18</h2>
<ul>
<li>Talk to CIAT people about their portal again</li>
<li>Start looking more at the fields we want to delete</li>
<li>The following metadata fields have 0 items using them, so we can just remove them from the registry and any references in XMLUI, input forms, etc:
<ul>
<li>dc.description.abstractother</li>
<li>dc.whatwasknown</li>
<li>dc.whatisnew</li>
<li>dc.description.nationalpartners</li>
<li>dc.peerreviewprocess</li>
<li>cg.species.animal</li>
</ul>
</li>
<li>Deleted!</li>
<li>The following fields have some items using them and I have to decide what to do with them (delete or move):
<ul>
<li>dc.icsubject.icrafsubject: 6 items, mostly in CPWF collections</li>
<li>dc.type.journal: 11 items, mostly in ILRI collections</li>
<li>dc.publicationcategory: 1 item, in CPWF</li>
<li>dc.GRP: 2 items, CPWF</li>
<li>dc.Species.animal: 6 items, in ILRI and AnGR</li>
<li>cg.livestock.agegroup: 9 items, in ILRI collections</li>
<li>cg.livestock.function: 20 items, mostly in EADD</li>
</ul>
</li>
<li>Test metadata migration on local instance again:</li>
</ul>
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40885
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51330
UPDATE metadatavalue SET metadata_field_id=208 WHERE metadata_field_id=82
UPDATE 5986
UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88
UPDATE 2456
UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
UPDATE 3872
UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
UPDATE 46075
$ JAVA_OPTS=&#34;-Xms512m -Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace index-discovery -bf
</code></pre><ul>
<li>CGSpace was down but I&rsquo;m not sure why, this was in <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at org.dspace.rest.Resource.processFinally(Resource.java:163)
at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
...
</code></pre><ul>
<li>Everything else in the system looked normal (50GB disk space available, nothing weird in dmesg, etc)</li>
<li>After restarting Tomcat a few more of these errors were logged but the application was up</li>
</ul>
<h2 id="2016-04-19">2016-04-19</h2>
<ul>
<li>Get handles for items that are using a given metadata field, ie <code>dc.Species.animal</code> (105):</li>
</ul>
<pre tabindex="0"><code># select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
handle
-------------
10568/10298
10568/16413
10568/16774
10568/34487
</code></pre><ul>
<li>Delete metadata values for <code>dc.GRP</code> and <code>dc.icsubject.icrafsubject</code>:</li>
</ul>
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
</code></pre><ul>
<li>They are old ICRAF fields and we haven&rsquo;t used them since 2011 or so</li>
<li>Also delete them from the metadata registry</li>
<li>CGSpace went down again, <code>dspace.log</code> had this:</li>
</ul>
<pre tabindex="0"><code>2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>I restarted Tomcat and PostgreSQL and now it&rsquo;s back up</li>
<li>I bet this is the same crash as yesterday, but I only saw the errors in <code>catalina.out</code></li>
<li>Looks to be related to this, from <code>dspace.log</code>:</li>
</ul>
<pre tabindex="0"><code>2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
</code></pre><ul>
<li>We have 18,000 of these errors right now&hellip;</li>
<li>Delete a few more old metadata values: <code>dc.Species.animal</code>, <code>dc.type.journal</code>, and <code>dc.publicationcategory</code>:</li>
</ul>
<pre tabindex="0"><code># delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;
</code></pre><ul>
<li>And then remove them from the metadata registry</li>
</ul>
<h2 id="2016-04-20">2016-04-20</h2>
<ul>
<li>Re-deploy DSpace Test with the new subject and type fields, run all system updates, and reboot the server</li>
<li>Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server</li>
<li>Field migration went well:</li>
</ul>
<pre tabindex="0"><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40909
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51419
UPDATE metadatavalue SET metadata_field_id=208 WHERE metadata_field_id=82
UPDATE 5986
UPDATE metadatavalue SET metadata_field_id=210 WHERE metadata_field_id=88
UPDATE 2458
UPDATE metadatavalue SET metadata_field_id=215 WHERE metadata_field_id=106
UPDATE 3872
UPDATE metadatavalue SET metadata_field_id=217 WHERE metadata_field_id=108
UPDATE 46075
</code></pre><ul>
<li>Also, I migrated CGSpace to using the PGDG PostgreSQL repo as the infrastructure playbooks had been using it for a while and it seemed to be working well</li>
<li>Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)</li>
<li>Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-20
21252
</code></pre><ul>
<li>I found a recent discussion on the DSpace mailing list and I&rsquo;ve asked for advice there</li>
<li>Looks like this issue was noted and fixed in DSpace 5.5 (we&rsquo;re on 5.1): <a href="https://jira.duraspace.org/browse/DS-2936">https://jira.duraspace.org/browse/DS-2936</a></li>
<li>I&rsquo;ve sent a message to Atmire asking about compatibility with DSpace 5.5</li>
</ul>
<h2 id="2016-04-21">2016-04-21</h2>
<ul>
<li>Fix a bunch of metadata consistency issues with IITA Journal Articles (Peer review, Formally published, messed up DOIs, etc)</li>
<li>Atmire responded with DSpace 5.5 compatible versions for their modules, so I&rsquo;ll start testing those in a few weeks</li>
</ul>
<h2 id="2016-04-22">2016-04-22</h2>
<ul>
<li>Import 95 records into <a href="https://cgspace.cgiar.org/handle/10568/42219">CTA&rsquo;s Agrodok collection</a></li>
</ul>
<h2 id="2016-04-26">2016-04-26</h2>
<ul>
<li>Test embargo during item upload</li>
<li>Seems to be working but the help text is misleading as to the date format</li>
<li>It turns out the <code>robots.txt</code> issue we thought we solved last month isn&rsquo;t solved because you can&rsquo;t use wildcards in URL patterns: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>Write some nginx rules to add <code>X-Robots-Tag</code> HTTP headers to the dynamic requests from <code>robots.txt</code> instead (a quick check is sketched below)</li>
<li>A few URLs to test with:
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity">https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/913/discover">https://dspacetest.cgiar.org/handle/10568/913/discover</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country">https://dspacetest.cgiar.org/handle/10568/1/search-filter?filtertype_0=country&amp;filter_0=VIETNAM&amp;filter_relational_operator_0=equals&amp;field=country</a></li>
</ul>
</li>
</ul>
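<ul>
<li>Once those nginx rules are in place, a quick way to confirm the header is being sent for one of the test URLs above (just a sketch using curl):</li>
</ul>
<pre tabindex="0"><code>$ curl -s -I &#39;https://dspacetest.cgiar.org/handle/10568/440/browse?type=bioversity&#39; | grep -i x-robots-tag
</code></pre>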
<h2 id="2016-04-27">2016-04-27</h2>
<ul>
<li>I woke up to ten or fifteen &ldquo;up&rdquo; and &ldquo;down&rdquo; emails from the monitoring website</li>
<li>Looks like the last one was &ldquo;down&rdquo; from about four hours ago</li>
<li>I think there must be something with this REST stuff:</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;Aborting context in finally statement&#34; dspace.log.2016-04-*
dspace.log.2016-04-01:0
dspace.log.2016-04-02:0
dspace.log.2016-04-03:0
dspace.log.2016-04-04:0
dspace.log.2016-04-05:0
dspace.log.2016-04-06:0
dspace.log.2016-04-07:0
dspace.log.2016-04-08:0
dspace.log.2016-04-09:0
dspace.log.2016-04-10:0
dspace.log.2016-04-11:0
dspace.log.2016-04-12:235
dspace.log.2016-04-13:44
dspace.log.2016-04-14:0
dspace.log.2016-04-15:35
dspace.log.2016-04-16:0
dspace.log.2016-04-17:0
dspace.log.2016-04-18:11942
dspace.log.2016-04-19:28496
dspace.log.2016-04-20:28474
dspace.log.2016-04-21:28654
dspace.log.2016-04-22:28763
dspace.log.2016-04-23:28773
dspace.log.2016-04-24:28775
dspace.log.2016-04-25:28626
dspace.log.2016-04-26:28655
dspace.log.2016-04-27:7271
</code></pre><ul>
<li>I restarted tomcat and it is back up</li>
<li>Add Spanish XMLUI strings so those users see &ldquo;CGSpace&rdquo; instead of &ldquo;DSpace&rdquo; in the user interface (<a href="https://github.com/ilri/DSpace/pull/222">#222</a>)</li>
<li>Submit patch to upstream DSpace for the misleading help text in the embargo step of the item submission: <a href="https://jira.duraspace.org/browse/DS-3172">https://jira.duraspace.org/browse/DS-3172</a></li>
<li>Update infrastructure playbooks for nginx 1.10.x (stable) release: <a href="https://github.com/ilri/rmg-ansible-public/issues/32">https://github.com/ilri/rmg-ansible-public/issues/32</a></li>
<li>Currently running on DSpace Test, we&rsquo;ll give it a few days before we adjust CGSpace</li>
<li>CGSpace down, restarted tomcat and it&rsquo;s back up</li>
</ul>
<h2 id="2016-04-28">2016-04-28</h2>
<ul>
<li>Problems with stability again. I&rsquo;ve blocked access to <code>/rest</code> for now to see if the number of errors in the log files drops</li>
<li>Later we could maybe start logging access to <code>/rest</code> and perhaps whitelist some IPs&hellip;</li>
</ul>
<h2 id="2016-04-30">2016-04-30</h2>
<ul>
<li>Logs for today and yesterday have zero references to this REST error, so I&rsquo;m going to open back up the REST API but log all requests</li>
</ul>
<pre tabindex="0"><code>location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
}
</code></pre><ul>
<li>I will check the logs again in a few days to look for patterns, see who is accessing it, etc</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

425
docs/2016-05/index.html Normal file
View File

@ -0,0 +1,425 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2016" />
<meta property="og:description" content="2016-05-01
Since yesterday there have been 10,000 REST errors and the site has been unstable again
I have blocked access to the API now
There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-05/" />
<meta property="article:published_time" content="2016-05-01T23:06:00+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2016"/>
<meta name="twitter:description" content="2016-05-01
Since yesterday there have been 10,000 REST errors and the site has been unstable again
I have blocked access to the API now
There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-05/",
"wordCount": "1349",
"datePublished": "2016-05-01T23:06:00+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-05/">
<title>May, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-05/">May, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-05-01T23:06:00+03:00">Sun May 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-05-01">2016-05-01</h2>
<ul>
<li>Since yesterday there have been 10,000 REST errors and the site has been unstable again</li>
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre><ul>
<li>The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29</li>
<li>100% of the requests coming from Ethiopia are like this and result in an HTTP 500:</li>
</ul>
<pre tabindex="0"><code>GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
</code></pre><ul>
<li>For now I&rsquo;ll block just the Ethiopian IP</li>
<li>The owner of that application has said that the <code>NaN</code> (not a number) is an error in his code and he&rsquo;ll fix it</li>
</ul>
<h2 id="2016-05-03">2016-05-03</h2>
<ul>
<li>Update nginx to 1.10.x branch on CGSpace</li>
<li>Fix a reference to <code>dc.type.output</code> in Discovery that I had missed when we migrated to <code>dc.type</code> last month (<a href="https://github.com/ilri/DSpace/pull/223">#223</a>)</li>
</ul>
<p><img src="/cgspace-notes/2016/05/discovery-types.png" alt="Item type in Discovery results"></p>
<h2 id="2016-05-06">2016-05-06</h2>
<ul>
<li>DSpace Test is down, <code>catalina.out</code> has lots of messages about heap space from some time yesterday (!)</li>
<li>It looks like Sisay was doing some batch imports</li>
<li>Hmm, also disk space is full</li>
<li>I decided to blow away the solr indexes, since they are 50GB and we don&rsquo;t really need all the Atmire stuff there right now</li>
<li>I will re-generate the Discovery indexes after re-deploying</li>
<li>Testing <code>renew-letsencrypt.sh</code> script for nginx</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env bash
readonly SERVICE_BIN=/usr/sbin/service
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
# stop nginx so LE can listen on port 443
$SERVICE_BIN nginx stop
$LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 &gt; /var/log/letsencrypt/renew.log 2&gt;&amp;1
LE_RESULT=$?
$SERVICE_BIN nginx start
if [[ &#34;$LE_RESULT&#34; != 0 ]]; then
echo &#39;Automated renewal failed:&#39;
cat /var/log/letsencrypt/renew.log
exit 1
fi
</code></pre><ul>
<li>Seems to work well</li>
</ul>
<h2 id="2016-05-10">2016-05-10</h2>
<ul>
<li>Start looking at more metadata migrations</li>
<li>There are lots of fields in <code>dcterms</code> namespace that look interesting, like:
<ul>
<li>dcterms.type</li>
<li>dcterms.spatial</li>
</ul>
</li>
<li>Not sure what <code>dcterms</code> is&hellip;</li>
<li>Looks like these were <a href="https://wiki.lyrasis.org/display/DSDOC5x/Metadata+and+Bitstream+Format+Registries#MetadataandBitstreamFormatRegistries-DublinCoreTermsRegistry(DCTERMS)">added in DSpace 4</a> to allow for future work to make DSpace more flexible</li>
<li>CGSpace&rsquo;s <code>dc</code> registry has 96 items, and the default DSpace one has 73.</li>
</ul>
<h2 id="2016-05-11">2016-05-11</h2>
<ul>
<li>
<p>Identify and propose the next phase of CGSpace fields to migrate:</p>
<ul>
<li>dc.title.jtitle → cg.title.journal</li>
<li>dc.identifier.status → cg.identifier.status</li>
<li>dc.river.basin → cg.river.basin</li>
<li>dc.Species → cg.species</li>
<li>dc.targetaudience → cg.targetaudience</li>
<li>dc.fulltextstatus → cg.fulltextstatus</li>
<li>dc.editon → cg.edition</li>
<li>dc.isijournal → cg.isijournal</li>
</ul>
</li>
<li>
<p>Start a test rebase of the <code>5_x-prod</code> branch on top of the <code>dspace-5.5</code> tag</p>
</li>
<li>
<p>There were a handful of conflicts that I didn&rsquo;t understand</p>
</li>
<li>
<p>After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:</p>
</li>
</ul>
<pre tabindex="0"><code>[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -&gt; [Help 1]
</code></pre><ul>
<li>I&rsquo;ve sent them a question about it</li>
<li>A user mentioned having problems with uploading a 33 MB PDF</li>
<li>I told her I would increase the limit temporarily tomorrow morning</li>
<li>Turns out she was able to decrease the size of the PDF so we didn&rsquo;t have to do anything</li>
</ul>
<h2 id="2016-05-12">2016-05-12</h2>
<ul>
<li>Looks like the issue that Abenet was having a few days ago with &ldquo;Connection Reset&rdquo; in Firefox might be due to a Firefox 46 issue: <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1268775">https://bugzilla.mozilla.org/show_bug.cgi?id=1268775</a></li>
<li>I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration:
<ul>
<li>dc.rplace.region → cg.coverage.region</li>
<li>dc.cplace.country → cg.coverage.country</li>
</ul>
</li>
<li>Questions for CG people:
<ul>
<li>Our <code>dc.place</code> and <code>dc.srplace.subregion</code> could both map to <code>cg.coverage.admin-unit</code>?</li>
<li>Should we use <code>dc.contributor.crp</code> or <code>cg.contributor.crp</code> for the CRP (ours is <code>dc.crsubject.crpsubject</code>)?</li>
<li>Our <code>dc.contributor.affiliation</code> and <code>dc.contributor.corporate</code> could both map to <code>dc.contributor</code> and possibly <code>dc.contributor.center</code> depending on if it&rsquo;s a CG center or not</li>
<li><code>dc.title.jtitle</code> could either map to <code>dc.publisher</code> or <code>dc.source</code> depending on how you read things</li>
</ul>
</li>
<li>Found ~200 messed up CIAT values in <code>dc.publisher</code>:</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to &#34;% %&#34;;
</code></pre><h2 id="2016-05-13">2016-05-13</h2>
<ul>
<li>More theorizing about CGcore</li>
<li>Add two new fields:
<ul>
<li>dc.srplace.subregion → cg.coverage.admin-unit</li>
<li>dc.place → cg.place</li>
</ul>
</li>
<li><code>dc.place</code> is our own field, so it&rsquo;s easy to move</li>
<li>I&rsquo;ve removed <code>dc.title.jtitle</code> from the list for now because there&rsquo;s no use moving it out of DC until we know where it will go (see discussion yesterday)</li>
</ul>
<h2 id="2016-05-18">2016-05-18</h2>
<ul>
<li>Work on 707 CCAFS records</li>
<li>They have thumbnails on Flickr and elsewhere</li>
<li>In OpenRefine I created a new <code>filename</code> column based on the <code>thumbnail</code> column with the following GREL:</li>
</ul>
<pre tabindex="0"><code>if(cells[&#39;thumbnails&#39;].value.contains(&#39;hqdefault&#39;), cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-2] + &#39;.jpg&#39;, cells[&#39;thumbnails&#39;].value.split(&#39;/&#39;)[-1])
</code></pre><ul>
<li>Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL</li>
<li>So for the <code>hqdefault.jpg</code> ones I just take the UUID (-2) and use it as the filename</li>
<li>Before importing with SAFBuilder I tested adding &ldquo;__bundle:THUMBNAIL&rdquo; to the <code>filename</code> column and it works fine</li>
</ul>
<h2 id="2016-05-19">2016-05-19</h2>
<ul>
<li>More quality control on <code>filename</code> field of CCAFS records to make processing in shell and SAFBuilder more reliable:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#39;_&#39;,&#39;&#39;).replace(&#39;-&#39;,&#39;&#39;)
</code></pre><ul>
<li>We need to hold off on moving <code>dc.Species</code> to <code>cg.species</code> because it is only used for plants, and might be better to move it to something like <code>cg.species.plant</code></li>
<li>And <code>dc.identifier.fund</code> is MOSTLY used for CPWF project identifier but has some other sponsorship things
<ul>
<li>We should move PN*, SG*, CBA, IA, and PHASE* values to <code>cg.identifier.cpwfproject</code></li>
<li>The rest, like BMGF and USAID etc, might have to go to either <code>dc.description.sponsorship</code> or <code>cg.identifier.fund</code> (not sure yet)</li>
<li>There are also some mistakes in CPWF&rsquo;s things, like &ldquo;PN 47&rdquo;</li>
<li>This ought to catch all the CPWF values (there don&rsquo;t appear to be any SG* values):</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
</code></pre><h2 id="2016-05-20">2016-05-20</h2>
<ul>
<li>More work on CCAFS Video and Images records</li>
<li>For SAFBuilder we need to modify filename column to have the thumbnail bundle:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__bundle:THUMBNAIL&#34;
</code></pre><ul>
<li>Also, I fixed some weird characters using OpenRefine&rsquo;s transform with the following GREL:</li>
</ul>
<pre tabindex="0"><code>value.replace(/\u0081/,&#39;&#39;)
</code></pre><ul>
<li>Write shell script to resize thumbnails with height larger than 400: <a href="https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256">https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256</a> (the general idea is sketched below)</li>
<li>Upload 707 CCAFS records to DSpace Test</li>
<li>A few miscellaneous fixes for XMLUI display niggles (spaces in item lists and link target <code>_black</code>): <a href="https://github.com/ilri/DSpace/pull/224">#224</a></li>
<li>Work on configuration changes for Phase 2 metadata migrations</li>
</ul>
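<ul>
<li>The general idea of that resize script, as a rough sketch (the real version is in the gist above; filenames here are hypothetical):</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env bash
# sketch: shrink any JPEG thumbnails taller than 400 pixels
for thumbnail in *.jpg; do
    height=$(gm identify -format &#39;%h&#39; &#34;$thumbnail&#34;)
    if [ &#34;$height&#34; -gt 400 ]; then
        gm convert &#34;$thumbnail&#34; -resize x400 &#34;$thumbnail&#34;
    fi
done
</code></pre>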
<h2 id="2016-05-23">2016-05-23</h2>
<ul>
<li>Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine</li>
<li>LibreOffice excludes empty cells when it exports, so all the fields shift over to the left and URLs end up in Subjects, etc.</li>
<li>Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don&rsquo;t match!</li>
<li>I will have to try later</li>
</ul>
<h2 id="2016-05-30">2016-05-30</h2>
<ul>
<li>Export CCAFS video and image records from DSpace Test using the migrate option (<code>-m</code>):</li>
</ul>
<pre tabindex="0"><code>$ mkdir ~/ccafs-images
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
</code></pre><ul>
<li>And then import to CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &amp;&gt; /tmp/ccafs-images-may30.log
</code></pre><ul>
<li>But now we have double authors for &ldquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rdquo; in the authority</li>
<li>I&rsquo;m trying to do a Discovery index before messing with the authority index</li>
<li>Looks like we are missing the <code>index-authority</code> cron job, so who knows what&rsquo;s up with our authority index</li>
<li>Run system updates on DSpace Test, re-deploy code, and reboot the server</li>
<li>Clean up and import ~200 CTA records to CGSpace via CSV like:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &amp;&gt; ~/CTA-May30/CTA-42229.log
</code></pre><ul>
<li>Discovery indexing took a few hours for some reason, and after that I started the <code>index-authority</code> script</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace index-authority
</code></pre><h2 id="2016-05-31">2016-05-31</h2>
<ul>
<li>The <code>index-authority</code> script ran over night and was finished in the morning</li>
<li>Hopefully this was because we haven&rsquo;t been running it regularly and it will speed up next time</li>
<li>I am running it again with a timer to see:</li>
</ul>
<pre tabindex="0"><code>$ time /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Cleaning the old index
Writing new data
All done !
real 37m26.538s
user 2m24.627s
sys 0m20.540s
</code></pre><ul>
<li>Update <code>tomcat7</code> crontab on CGSpace and DSpace Test to have the <code>index-authority</code> script that we were missing (a sample crontab entry is sketched after this list)</li>
<li>Add new ILRI subject and CCAFS project tags to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/226">#226</a>, <a href="https://github.com/ilri/DSpace/pull/225">#225</a>)</li>
<li>Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.</li>
<li>Re-sync DSpace Test data with CGSpace</li>
<li>Clean up and import ~65 more CTA items into CGSpace</li>
</ul>
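<ul>
<li>For reference, the crontab entry only needs to run the script nightly, something like this (the schedule here is arbitrary; the path is the usual CGSpace one):</li>
</ul>
<pre tabindex="0"><code># m h dom mon dow command
0 3 * * * /home/cgspace.cgiar.org/bin/dspace index-authority &gt; /dev/null 2&gt;&amp;1
</code></pre>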
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

463
docs/2016-06/index.html Normal file
View File

@ -0,0 +1,463 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2016" />
<meta property="og:description" content="2016-06-01
Experimenting with IFPRI OAI (we want to harvest their publications)
After reading the ContentDM documentation I found IFPRI&rsquo;s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-06/" />
<meta property="article:published_time" content="2016-06-01T10:53:00+03:00" />
<meta property="article:modified_time" content="2020-11-30T12:10:20+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2016"/>
<meta name="twitter:description" content="2016-06-01
Experimenting with IFPRI OAI (we want to harvest their publications)
After reading the ContentDM documentation I found IFPRI&rsquo;s OAI endpoint: http://ebrary.ifpri.org/oai/oai.php
After reading the OAI documentation and testing with an OAI validator I found out how to get their publications
This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-06/",
"wordCount": "1551",
"datePublished": "2016-06-01T10:53:00+03:00",
"dateModified": "2020-11-30T12:10:20+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-06/">
<title>June, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-06/">June, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-06-01T10:53:00+03:00">Wed Jun 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-06-01">2016-06-01</h2>
<ul>
<li>Experimenting with IFPRI OAI (we want to harvest their publications)</li>
<li>After reading the <a href="https://www.oclc.org/support/services/contentdm/help/server-admin-help/oai-support.en.html">ContentDM documentation</a> I found IFPRI&rsquo;s OAI endpoint: <a href="http://ebrary.ifpri.org/oai/oai.php">http://ebrary.ifpri.org/oai/oai.php</a></li>
<li>After reading the <a href="https://www.openarchives.org/OAI/openarchivesprotocol.html">OAI documentation</a> and testing with an <a href="http://validator.oaipmh.com/">OAI validator</a> I found out how to get their publications</li>
<li>This is their publications set: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc">http://ebrary.ifpri.org/oai/oai.php?verb=ListRecords&amp;from=2016-01-01&amp;set=p15738coll2&amp;metadataPrefix=oai_dc</a></li>
<li>You can see the others by using the OAI <code>ListSets</code> verb: <a href="http://ebrary.ifpri.org/oai/oai.php?verb=ListSets">http://ebrary.ifpri.org/oai/oai.php?verb=ListSets</a></li>
<li>Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in <code>dc.identifier.fund</code> to <code>cg.identifier.cpwfproject</code> and then the rest to <code>dc.description.sponsorship</code></li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like &#39;PN%&#39; or text_value like &#39;PHASE%&#39; or text_value = &#39;CBA&#39; or text_value = &#39;IA&#39;);
UPDATE 497
dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
UPDATE 14
</code></pre><ul>
<li>Fix a few minor miscellaneous issues in <code>dspace.cfg</code> (<a href="https://github.com/ilri/DSpace/pull/227">#227</a>)</li>
</ul>
<h2 id="2016-06-02">2016-06-02</h2>
<ul>
<li>Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with <code>cg.coverage.admin-unit</code></li>
<li>Seems that the Browse configuration in <code>dspace.cfg</code> can&rsquo;t handle the &lsquo;-&rsquo; in the field name:</li>
</ul>
<pre tabindex="0"><code>webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
</code></pre><ul>
<li>But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error</li>
<li>I&rsquo;ve sent a message to the DSpace mailing list to ask about the Browse index definition</li>
<li>A user was having problems with submission and from the stacktrace it looks like a Sherpa/Romeo issue</li>
<li>I found a thread on the mailing list talking about it and there is bug report and a patch: <a href="https://jira.duraspace.org/browse/DS-2740">https://jira.duraspace.org/browse/DS-2740</a></li>
<li>The patch applies successfully on DSpace 5.1 so I will try it later</li>
</ul>
<h2 id="2016-06-03">2016-06-03</h2>
<ul>
<li>Investigating the CCAFS authority issue, I exported the metadata for the Videos collection</li>
<li>The top two authors are:</li>
</ul>
<pre tabindex="0"><code>CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
</code></pre><ul>
<li>So the only difference is the &ldquo;confidence&rdquo;</li>
<li>Ok, well THAT is interesting:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
Orth, Alan | | -1
Orth, Alan | | -1
Orth, Alan | | -1
Orth, Alan | | -1
Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1
Orth, A. | 05c2c622-d252-4efb-b9ed-95a07d3adf11 | -1
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
Orth, A. | ab606e3a-2b04-4c7d-9423-14beccf54257 | -1
Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600
Orth, Alan | ad281dbf-ef81-4007-96c3-a7f5d2eaa6d9 | 600
(13 rows)
</code></pre><ul>
<li>And now an actually relevant example:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence = 500;
count
-------
707
(1 row)
dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39; and confidence != 500;
count
-------
253
(1 row)
</code></pre><ul>
<li>Trying something experimental:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>And then re-indexing authority and Discovery&hellip;?</li>
<li>After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet</li>
<li>The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:</li>
</ul>
<pre tabindex="0"><code>webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
</code></pre><ul>
<li>That would only be for the &ldquo;Browse by&rdquo; function&hellip; so we&rsquo;ll have to see what effect that has later</li>
</ul>
<h2 id="2016-06-04">2016-06-04</h2>
<ul>
<li>Re-sync DSpace Test with CGSpace (roughly sketched after this list) and perform test of metadata migration again</li>
<li>Run phase two of metadata migrations on CGSpace (see the <a href="https://gist.github.com/alanorth/1a730bec5ac9457a8fb0e3e72c98d09c">migration notes</a>)</li>
<li>Run all system updates and reboot CGSpace server</li>
</ul>
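<ul>
<li>The re-sync itself presumably boils down to a database dump/restore plus copying the assetstore over; a very rough sketch (the user, paths, and hostnames here are assumptions):</li>
</ul>
<pre tabindex="0"><code># on CGSpace: dump the production database (sketch only)
$ pg_dump -U postgres -F c -f /tmp/cgspace.dump cgspace
# on DSpace Test: restore it and pull the assetstore across
$ pg_restore -U postgres -d dspacetest --clean /tmp/cgspace.dump
$ rsync -av cgspace.cgiar.org:/home/cgspace.cgiar.org/assetstore/ /home/dspacetest.cgiar.org/assetstore/
</code></pre>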
<h2 id="2016-06-07">2016-06-07</h2>
<ul>
<li>Figured out how to export a list of the unique values from a metadata field ordered by count:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>
<p>Identified the next round of fields to migrate:</p>
<ul>
<li>dc.title.jtitle → dc.source</li>
<li>dc.crsubject.crpsubject → cg.contributor.crp</li>
<li>dc.contributor.affiliation → cg.contributor.affiliation</li>
<li>dc.Species → cg.species</li>
<li>dc.contributor.corporate → dc.contributor</li>
<li>dc.identifier.url → cg.identifier.url</li>
<li>dc.identifier.doi → cg.identifier.doi</li>
<li>dc.identifier.googleurl → cg.identifier.googleurl</li>
<li>dc.identifier.dataurl → cg.identifier.dataurl</li>
</ul>
</li>
<li>
<p>Discuss pulling data from IFPRI&rsquo;s ContentDM with Ryan Miller</p>
</li>
<li>
<p>Looks like OAI is kinda obtuse for this, and if we use ContentDM&rsquo;s API we&rsquo;ll be able to access their internal field names (rather than trying to figure out how they stuffed them into various, repeated Dublin Core fields)</p>
</li>
</ul>
<h2 id="2016-06-08">2016-06-08</h2>
<ul>
<li>Discuss controlled vocabularies for ~28 fields</li>
<li>Looks like this is all we need: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies">https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ConfiguringControlledVocabularies</a></li>
<li>I wrote an XPath expression to extract the ILRI subjects from <code>input-forms.xml</code> (from the xmlstarlet package):</li>
</ul>
<pre tabindex="0"><code>$ xml sel -t -m &#39;//value-pairs[@value-pairs-name=&#34;ilrisubject&#34;]/pair/displayed-value/text()&#39; -c &#39;.&#39; -n dspace/config/input-forms.xml
</code></pre><ul>
<li>Write to Atmire about the use of <code>atmire.orcid.id</code> to see if we can change it</li>
<li>Seems to be a virtual field that is queried from the authority cache&hellip; hmm</li>
<li>In other news, I found out that the About page that we haven&rsquo;t been using lives in <code>dspace/config/about.xml</code>, so now we can update the text</li>
<li>File bug about <code>closed=&quot;true&quot;</code> attribute of controlled vocabularies not working: <a href="https://jira.duraspace.org/browse/DS-3238">https://jira.duraspace.org/browse/DS-3238</a></li>
</ul>
<h2 id="2016-06-09">2016-06-09</h2>
<ul>
<li>Atmire explained that the <code>atmire.orcid.id</code> field doesn&rsquo;t exist in the schema, as it actually comes from the authority cache during XMLUI run time</li>
<li>This means we don&rsquo;t see it when harvesting via OAI or REST, for example</li>
<li>They opened a feature ticket on the DSpace tracker to ask for support of this: <a href="https://jira.duraspace.org/browse/DS-3239">https://jira.duraspace.org/browse/DS-3239</a></li>
</ul>
<h2 id="2016-06-10">2016-06-10</h2>
<ul>
<li>Investigating authority confidences</li>
<li>It looks like the values are documented in <code>Choices.java</code></li>
<li>Experiment with setting all 960 CCAFS author values to be 500:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = &#39;CGIAR Research Program on Climate Change, Agriculture and Food Security&#39;;
UPDATE 960
</code></pre><ul>
<li>After the database edit, I did a full Discovery re-index</li>
<li>And now there are exactly 960 items in the authors facet for &lsquo;CGIAR Research Program on Climate Change, Agriculture and Food Security&rsquo;</li>
<li>Now I ran the same on CGSpace</li>
<li>Merge controlled vocabulary functionality for animal breeds to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/236">#236</a>)</li>
<li>Write python script to update metadata values in batch via PostgreSQL: <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li>
<li>We need to use this to correct some pretty ugly values in fields like <code>dc.description.sponsorship</code> (the kind of per-row update it runs is sketched after this list)</li>
<li>Merge item display tweaks from earlier this week (<a href="https://github.com/ilri/DSpace/pull/231">#231</a>)</li>
<li>Merge controlled vocabulary functionality for subregions (<a href="https://github.com/ilri/DSpace/pull/238">#238</a>)</li>
</ul>
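<ul>
<li>Conceptually, for each row of the corrections CSV given to <code>fix-metadata-values.py</code> the script presumably runs an update along these lines (the values are only illustrative; 29 is the <code>dc.description.sponsorship</code> field ID):</li>
</ul>
<pre tabindex="0"><code>$ psql -U cgspace -d cgspace -c &#34;UPDATE metadatavalue SET text_value=&#39;Bill &amp; Melinda Gates Foundation&#39; WHERE resource_type_id=2 AND metadata_field_id=29 AND text_value=&#39;BMGF&#39;;&#34;
</code></pre>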
<h2 id="2016-06-11">2016-06-11</h2>
<ul>
<li>Merge controlled vocabulary for sponsorship field (<a href="https://github.com/ilri/DSpace/pull/239">#239</a>)</li>
<li>Fix character encoding issues for animal breed lookup that I merged yesterday</li>
</ul>
<h2 id="2016-06-17">2016-06-17</h2>
<ul>
<li>Linode has free RAM upgrades for their 13th birthday so I migrated DSpace Test (4→8GB of RAM)</li>
</ul>
<h2 id="2016-06-18">2016-06-18</h2>
<ul>
<li>
<p>Clean up titles and hints in <code>input-forms.xml</code> to use title/sentence case and a few more consistency things (<a href="https://github.com/ilri/DSpace/pull/241">#241</a>)</p>
</li>
<li>
<p>The final list of fields to migrate in the third phase of metadata migrations is:</p>
<ul>
<li>dc.title.jtitle → dc.source</li>
<li>dc.crsubject.crpsubject → cg.contributor.crp</li>
<li>dc.contributor.affiliation → cg.contributor.affiliation</li>
<li>dc.srplace.subregion → cg.coverage.subregion</li>
<li>dc.Species → cg.species</li>
<li>dc.contributor.corporate → dc.contributor</li>
<li>dc.identifier.url → cg.identifier.url</li>
<li>dc.identifier.doi → cg.identifier.doi</li>
<li>dc.identifier.googleurl → cg.identifier.googleurl</li>
<li>dc.identifier.dataurl → cg.identifier.dataurl</li>
</ul>
</li>
<li>
<p>Interesting &ldquo;Sunburst&rdquo; visualization on a Digital Commons page: <a href="http://www.repository.law.indiana.edu/sunburst.html">http://www.repository.law.indiana.edu/sunburst.html</a></p>
</li>
<li>
<p>Final testing on metadata fix/delete for <code>dc.description.sponsorship</code> cleanup</p>
</li>
<li>
<p>Need to run <code>fix-metadata-values.py</code> and then <code>delete-metadata-values.py</code></p>
</li>
</ul>
<h2 id="2016-06-20">2016-06-20</h2>
<ul>
<li>CGSpace&rsquo;s HTTPS certificate expired last night and I didn&rsquo;t notice, had to renew:</li>
</ul>
<pre tabindex="0"><code># /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &#34;/usr/bin/service nginx stop&#34; --post-hook &#34;/usr/bin/service nginx start&#34;
</code></pre><ul>
<li>I really need to fix that cron job&hellip;</li>
</ul>
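<ul>
<li>The fix is probably just putting that same renew command into a cron entry, e.g. something like this (the schedule is arbitrary):</li>
</ul>
<pre tabindex="0"><code># /etc/cron.d/letsencrypt (sketch)
0 4 * * 0 root /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook &#34;/usr/bin/service nginx stop&#34; --post-hook &#34;/usr/bin/service nginx start&#34; &gt; /dev/null 2&gt;&amp;1
</code></pre>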
<h2 id="2016-06-24">2016-06-24</h2>
<ul>
<li>Run the replacements/deletes for <code>dc.description.sponsorship</code> (investors) on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t &#39;correct investor&#39; -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
$ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p &#39;fuuu&#39; -u cgspace
</code></pre><ul>
<li>The scripts for this are here:
<ul>
<li><a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a></li>
<li><a href="https://gist.github.com/alanorth/bd7d58c947f686401a2b1fadc78736be">delete-metadata-values.py</a></li>
</ul>
</li>
<li>Add new sponsors to controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/244">#244</a>)</li>
<li>Refine submission form labels and hints</li>
</ul>
<h2 id="2016-06-28">2016-06-28</h2>
<ul>
<li>Testing the cleanup of <code>dc.contributor.corporate</code> with 13 deletions and 121 replacements</li>
<li>There are still ~97 values that weren&rsquo;t marked for any action</li>
<li>After the above deletions and replacements I regenerated a CSV and sent it to Peter <em>et al</em> to have a look</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
</code></pre><ul>
<li>Re-evaluate <code>dc.contributor.corporate</code> and it seems we will move it to <code>dc.contributor.author</code> as this is more in line with how editors are actually using it</li>
</ul>
<h2 id="2016-06-29">2016-06-29</h2>
<ul>
<li>Test run of <code>migrate-fields.sh</code> with the following re-mappings:</li>
</ul>
<pre tabindex="0"><code>72 55 #dc.source
86 230 #cg.contributor.crp
91 211 #cg.contributor.affiliation
94 212 #cg.species
107 231 #cg.coverage.subregion
126 3 #dc.contributor.author
73 219 #cg.identifier.url
74 220 #cg.identifier.doi
79 222 #cg.identifier.googleurl
89 223 #cg.identifier.dataurl
</code></pre><ul>
<li>Run all cleanups and deletions of <code>dc.contributor.corporate</code> on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t &#39;Correct style&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t &#39;should be&#39; -m 126 -d cgspace -u cgspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Re-deploy CGSpace and DSpace Test with latest June changes</li>
<li>Now the sharing and Altmetric bits are more prominent:</li>
</ul>
<p><img src="/cgspace-notes/2016/06/xmlui-altmetric-sharing.png" alt="DSpace 5.1 XMLUI With Altmetric Badge"></p>
<ul>
<li>Run all system updates on the servers and reboot</li>
<li>Start working on config changes for phase three of the metadata migrations</li>
</ul>
<h2 id="2016-06-30">2016-06-30</h2>
<ul>
<li>Wow, there are 95 authors in the database who have &lsquo;,&rsquo; at the end of their name:</li>
</ul>
<pre tabindex="0"><code># select text_value from metadatavalue where metadata_field_id=3 and text_value like &#39;%,&#39;;
</code></pre><ul>
<li>We need to use something like this to fix them, need to write a proper regex later:</li>
</ul>
<pre tabindex="0"><code># update metadatavalue set text_value = regexp_replace(text_value, &#39;(Poole, J),&#39;, &#39;\1&#39;) where metadata_field_id=3 and text_value = &#39;Poole, J,&#39;;
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

379
docs/2016-07/index.html Normal file
View File

@ -0,0 +1,379 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="July, 2016" />
<meta property="og:description" content="2016-07-01
Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:
dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.&#43;?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.&#43;?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.&#43;?,$&#39;;
text_value
------------
(0 rows)
In this case the select query was showing 95 results before the update
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-07/" />
<meta property="article:published_time" content="2016-07-01T10:53:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2016"/>
<meta name="twitter:description" content="2016-07-01
Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:
dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.&#43;?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.&#43;?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.&#43;?,$&#39;;
text_value
------------
(0 rows)
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "July, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-07/",
"wordCount": "866",
"datePublished": "2016-07-01T10:53:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-07/">
<title>July, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-07/">July, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-07-01T10:53:00+03:00">Fri Jul 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-07-01">2016-07-01</h2>
<ul>
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
------------
(0 rows)
</code></pre><ul>
<li>In this case the select query was showing 95 results before the update</li>
</ul>
<h2 id="2016-07-02">2016-07-02</h2>
<ul>
<li>Comment on DSpace Jira ticket about author lookup search text (<a href="https://jira.duraspace.org/browse/DS-2329">DS-2329</a>)</li>
</ul>
<h2 id="2016-07-04">2016-07-04</h2>
<ul>
<li>Seems the database&rsquo;s author authority values mean nothing without the <code>authority</code> Solr core from the host where they were created!</li>
</ul>
<h2 id="2016-07-05">2016-07-05</h2>
<ul>
<li>Amend <code>backup-solr.sh</code> script so it backs up the entire Solr folder</li>
<li>We <em>really</em> only need <code>statistics</code> and <code>authority</code> but meh</li>
<li>Fix metadata for species on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p &#39;fuuu&#39;
</code></pre><ul>
<li>Will run later on CGSpace</li>
<li>A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is &ldquo;ungraded&rdquo;</li>
<li>I tested the <a href="https://jira.duraspace.org/browse/DS-2740">patch for DS-2740</a> that I had found last month and it seems to work</li>
<li>I will merge it to <code>5_x-prod</code></li>
</ul>
<h2 id="2016-07-06">2016-07-06</h2>
<ul>
<li>Delete 23 blank metadata values from CGSpace:</li>
</ul>
<pre tabindex="0"><code>cgspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 23
</code></pre><ul>
<li>Complete phase three of metadata migration, for the following fields:
<ul>
<li>dc.title.jtitle → dc.source</li>
<li>dc.crsubject.crpsubject → cg.contributor.crp</li>
<li>dc.contributor.affiliation → cg.contributor.affiliation</li>
<li>dc.Species → cg.species</li>
<li>dc.srplace.subregion → cg.coverage.subregion</li>
<li>dc.contributor.corporate → dc.contributor.author</li>
<li>dc.identifier.url → cg.identifier.url</li>
<li>dc.identifier.doi → cg.identifier.doi</li>
<li>dc.identifier.googleurl → cg.identifier.googleurl</li>
<li>dc.identifier.dataurl → cg.identifier.dataurl</li>
</ul>
</li>
<li>Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p &#39;fuuu&#39;
$ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>I then ran all server updates and rebooted the server</li>
</ul>
<h2 id="2016-07-11">2016-07-11</h2>
<ul>
<li>Doing some author cleanups from Peter and Abenet:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu
</code></pre><h2 id="2016-07-13">2016-07-13</h2>
<ul>
<li>Run the author cleanups on CGSpace and start a full Discovery re-index</li>
</ul>
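<ul>
<li>The full Discovery re-index is the standard DSpace command, with a bigger heap since it takes a while on CGSpace; something like:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre>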
<h2 id="2016-07-14">2016-07-14</h2>
<ul>
<li>Test LDAP settings for new root LDAP</li>
<li>Seems to work when binding as a top-level user</li>
</ul>
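<ul>
<li>A quick way to sanity-check that kind of bind outside of DSpace is <code>ldapsearch</code> (the host, bind DN, and search base below are placeholders):</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://ldap.example.org:636 -D &#34;binduser@example.org&#34; -W -b &#34;dc=example,dc=org&#34; &#34;(sAMAccountName=someuser)&#34;
</code></pre>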
<h2 id="2016-07-18">2016-07-18</h2>
<ul>
<li>Adjust identifiers in XMLUI item display to be more prominent</li>
<li>Add species and breed to the XMLUI item display</li>
<li>CGSpace crashed late at night and the DSpace logs were showing:</li>
</ul>
<pre tabindex="0"><code>2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
<li>I suspect it&rsquo;s someone hitting REST too much:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
710 66.249.78.38
1781 181.118.144.29
24904 70.32.99.142
</code></pre><ul>
<li>I just blocked access to <code>/rest</code> for that last IP for now:</li>
</ul>
<pre tabindex="0"><code> # log rest requests
location /rest {
access_log /var/log/nginx/rest.log;
proxy_pass http://127.0.0.1:8443;
deny 70.32.99.142;
}
</code></pre><h2 id="2016-07-21">2016-07-21</h2>
<ul>
<li>Mitigate the <a href="https://httpoxy.org">HTTPoxy</a> vulnerability for Tomcat etc in nginx: <a href="https://github.com/ilri/rmg-ansible-public/pull/38">https://github.com/ilri/rmg-ansible-public/pull/38</a></li>
<li>Unblock 70.32.99.142 from <code>/rest</code> as it has been blocked for a few days</li>
</ul>
<h2 id="2016-07-22">2016-07-22</h2>
<ul>
<li>Help Paola from CCAFS with thumbnails for batch uploads</li>
<li>She has been struggling to get the dimensions right, and manually enlarging smaller thumbnails, renaming PNGs to JPG, etc</li>
<li>Altmetric reports having an issue with some of our authors being doubled&hellip;</li>
<li>This is related to authority and confidence!</li>
<li>We might need to use <code>index.authority.ignore-prefered=true</code> to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.</li>
<li>Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:</li>
</ul>
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants.dc.contributor.author=false
</code></pre><ul>
<li>After reindexing I don&rsquo;t see any change in Discovery&rsquo;s display of authors, and still have entries like:</li>
</ul>
<pre tabindex="0"><code>Grace, D. (464)
Grace, D. (62)
</code></pre><ul>
<li>I asked for clarification of the following options on the DSpace mailing list:</li>
</ul>
<pre tabindex="0"><code>index.authority.ignore
index.authority.ignore-prefered
index.authority.ignore-variants
</code></pre><ul>
<li>In the mean time, I will try these on DSpace Test (plus a reindex):</li>
</ul>
<pre tabindex="0"><code>index.authority.ignore=true
index.authority.ignore-prefered=true
index.authority.ignore-variants=true
</code></pre><ul>
<li>Enabled usage of <code>X-Forwarded-For</code> in DSpace admin control panel (<a href="https://github.com/ilri/DSpace/pull/255">#255</a>)</li>
<li>It was misconfigured and disabled, but already working for some reason <em>sigh</em></li>
<li>&hellip; no luck. Trying with just:</li>
</ul>
<pre tabindex="0"><code>index.authority.ignore=true
</code></pre><ul>
<li>After re-indexing and clearing the XMLUI cache nothing has changed</li>
</ul>
<h2 id="2016-07-25">2016-07-25</h2>
<ul>
<li>Trying a few more settings (plus reindex) for Discovery on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>index.authority.ignore-prefered.dc.contributor.author=true
index.authority.ignore-variants=true
</code></pre><ul>
<li>Run all OS updates and reboot DSpace Test server</li>
<li>No changes to Discovery after reindexing&hellip; hmm.</li>
<li>Integrate and massively clean up About page (<a href="https://github.com/ilri/DSpace/pull/256">#256</a>)</li>
</ul>
<p><img src="/cgspace-notes/2016/07/cgspace-about-page.png" alt="About page"></p>
<ul>
<li>The DSpace source code mentions the configuration key <code>discovery.index.authority.ignore-prefered.*</code> (with prefix of discovery, despite the docs saying otherwise), so I&rsquo;m trying the following on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>discovery.index.authority.ignore-prefered.dc.contributor.author=true
discovery.index.authority.ignore-variants=true
</code></pre><ul>
<li>Still no change!</li>
<li>Deploy species, breed, and identifier changes to CGSpace, as well as About page</li>
<li>Run Linode RAM upgrade (8→12GB)</li>
<li>Re-sync DSpace Test with CGSpace</li>
<li>I noticed that our backup scripts don&rsquo;t send Solr cores to S3 so I amended the script</li>
</ul>
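<ul>
<li>The amendment only needs to be one more sync step that pushes the Solr core dumps to S3, something like this (the bucket name and backup path are hypothetical):</li>
</ul>
<pre tabindex="0"><code>$ s3cmd sync /home/backup/solr/ s3://cgspace-backups/solr/
</code></pre>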
<h2 id="2016-07-31">2016-07-31</h2>
<ul>
<li>Work on removing Dryland Systems and Humidtropics subjects from Discovery sidebar and Browse by</li>
<li>Also change &ldquo;Subjects&rdquo; to &ldquo;AGROVOC keywords&rdquo; in Discovery sidebar/search and Browse by (<a href="https://github.com/ilri/DSpace/issues/257">#257</a>)</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

443
docs/2016-08/index.html Normal file
View File

@ -0,0 +1,443 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2016" />
<meta property="og:description" content="2016-08-01
Add updated distribution license from Sisay (#259)
Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
bower stuff is a dead end, waste of time, too many issues
Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
Start working on DSpace 5.1→5.5 port:
$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-08/" />
<meta property="article:published_time" content="2016-08-01T15:53:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2016"/>
<meta name="twitter:description" content="2016-08-01
Add updated distribution license from Sisay (#259)
Play with upgrading Mirage 2 dependencies in bower.json because most are several versions out of date
Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more
bower stuff is a dead end, waste of time, too many issues
Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
Start working on DSpace 5.1→5.5 port:
$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-08/",
"wordCount": "1514",
"datePublished": "2016-08-01T15:53:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-08/">
<title>August, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-08/">August, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-08-01T15:53:00+03:00">Mon Aug 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-08-01">2016-08-01</h2>
<ul>
<li>Add updated distribution license from Sisay (<a href="https://github.com/ilri/DSpace/issues/259">#259</a>)</li>
<li>Play with upgrading Mirage 2 dependencies in <code>bower.json</code> because most are several versions out of date</li>
<li>Bootstrap is at 3.3.0 but upstream is at 3.3.7, and upgrading to anything beyond 3.3.1 breaks glyphicons and probably more</li>
<li>bower stuff is a dead end, waste of time, too many issues</li>
<li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
<li>Start working on DSpace 5.1→5.5 port:</li>
</ul>
<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
<li>Lots of conflicts that don&rsquo;t make sense (ie, shouldn&rsquo;t conflict!)</li>
<li>This file in particular conflicts almost 10 times: <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/CGIAR/styles/_style.scss</code></li>
<li>Checking out a clean branch at 5.5 and cherry-picking our commits works where that file would normally have a conflict</li>
<li>Seems to be related to merge commits</li>
<li><code>git rebase --preserve-merges</code> doesn&rsquo;t seem to help</li>
<li>Eventually I just turned on git rerere (see the snippet after this list), solved the conflicts, and completed the 403-commit rebase</li>
<li>The 5.5 code now builds but doesn&rsquo;t run (white page in Tomcat)</li>
</ul>
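<ul>
<li>Enabling rerere so git remembers each conflict resolution is just:</li>
</ul>
<pre tabindex="0"><code>$ git config rerere.enabled true
$ git rebase -i dspace-5.5
</code></pre>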
<h2 id="2016-08-02">2016-08-02</h2>
<ul>
<li>Ask Atmire for help with DSpace 5.5 issue</li>
<li>Vanilla DSpace 5.5 deploys and runs fine</li>
<li>Playing with DSpace in Ubuntu 16.04 and Tomcat 7</li>
<li>Everything is still fucked up, even vanilla DSpace 5.5</li>
</ul>
<h2 id="2016-08-04">2016-08-04</h2>
<ul>
<li>Ask on DSpace mailing list about duplicate authors, Discovery and author text values</li>
<li>Atmire responded with some new DSpace 5.5 ready versions to try for their modules</li>
</ul>
<h2 id="2016-08-05">2016-08-05</h2>
<ul>
<li>Fix item display incorrectly displaying Species when Breeds were present (<a href="https://github.com/ilri/DSpace/pull/260">#260</a>)</li>
<li>Experiment with fixing more authors, like Delia Grace:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where metadata_field_id=3 and text_value=&#39;Grace, D.&#39;;
</code></pre><h2 id="2016-08-06">2016-08-06</h2>
<ul>
<li>Finally figured out how to remove &ldquo;View/Open&rdquo; and &ldquo;Bitstreams&rdquo; from the item view</li>
</ul>
<h2 id="2016-08-07">2016-08-07</h2>
<ul>
<li>Start working on Ubuntu 16.04 Ansible playbook for Tomcat 8, PostgreSQL 9.5, Oracle Java 8, etc</li>
</ul>
<h2 id="2016-08-08">2016-08-08</h2>
<ul>
<li>Still troubleshooting Atmire modules on DSpace 5.5</li>
<li>Vanilla DSpace 5.5 works on Tomcat 7&hellip;</li>
<li>Ooh, and vanilla DSpace 5.5 works on Tomcat 8 with Java 8!</li>
<li>Some notes about setting up Tomcat 8, since it&rsquo;s new on this machine&hellip;</li>
<li>Install latest Oracle Java 8 JDK</li>
<li>Create <code>setenv.sh</code> in Tomcat 8 <code>libexec/bin</code> directory:</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&#34;-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8&#34;
CATALINA_OPTS=&#34;$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib&#34;
JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
</code></pre><ul>
<li>Edit Tomcat 8 <code>server.xml</code> to add regular HTTP listener for solr</li>
<li>Symlink webapps:</li>
</ul>
<pre tabindex="0"><code>$ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
$ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai
$ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui
$ ln -sv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/rest
$ ln -sv ~/dspace/webapps/solr /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/solr
</code></pre><h2 id="2016-08-09">2016-08-09</h2>
<ul>
<li>More tests of Atmire&rsquo;s 5.5 modules on a clean, working instance of <code>5_x-prod</code></li>
<li>Still fails, though perhaps differently than before (Flyway): <a href="https://gist.github.com/alanorth/5d49c45a16efd7c6bc1e6642e66118b2">https://gist.github.com/alanorth/5d49c45a16efd7c6bc1e6642e66118b2</a></li>
<li>More work on Tomcat 8 and Java 8 stuff for Ansible playbooks</li>
</ul>
<h2 id="2016-08-10">2016-08-10</h2>
<ul>
<li>Turns out DSpace 5.x isn&rsquo;t ready for Tomcat 8: <a href="https://jira.duraspace.org/browse/DS-3092">https://jira.duraspace.org/browse/DS-3092</a></li>
<li>So we&rsquo;ll need to use Tomcat 7 + Java 8 on Ubuntu 16.04</li>
<li>More work on the Ansible stuff for this, allowing Tomcat 7 to use Java 8</li>
<li>Merge pull request for fixing the type Discovery index to use <code>dc.type</code> (<a href="https://github.com/ilri/DSpace/pull/262">#262</a>)</li>
<li>Merge pull request for removing &ldquo;Bitstream&rdquo; text from item display, as it confuses users and isn&rsquo;t necessary (<a href="https://github.com/ilri/DSpace/pull/263">#263</a>)</li>
</ul>
<h2 id="2016-08-11">2016-08-11</h2>
<ul>
<li>Finally got DSpace (5.5) running on Ubuntu 16.04, Tomcat 7, Java 8, PostgreSQL 9.5 via the updated Ansible stuff</li>
</ul>
<p><img src="/cgspace-notes/2016/08/dspace55-ubuntu16.04.png" alt="DSpace 5.5 on Ubuntu 16.04, Tomcat 7, Java 8, PostgreSQL 9.5"></p>
<h2 id="2016-08-14">2016-08-14</h2>
<ul>
<li>Update Mirage 2 build notes for Ubuntu 16.04: <a href="https://gist.github.com/alanorth/2cf9c15834dc68a514262fcb04004cb0">https://gist.github.com/alanorth/2cf9c15834dc68a514262fcb04004cb0</a></li>
</ul>
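<ul>
<li>For context, building DSpace with the Mirage 2 theme enabled is the usual Maven flag, e.g.:</li>
</ul>
<pre tabindex="0"><code>$ mvn -U clean package -Dmirage2.on=true
</code></pre>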
<h2 id="2016-08-15">2016-08-15</h2>
<ul>
<li>Notes on NodeJS + nginx + systemd: <a href="https://gist.github.com/alanorth/51acd476891c67dfe27725848cf5ace1">https://gist.github.com/alanorth/51acd476891c67dfe27725848cf5ace1</a></li>
</ul>
<p><img src="/cgspace-notes/2016/08/nodejs-nginx.png" alt="ExpressJS running behind nginx"></p>
<h2 id="2016-08-16">2016-08-16</h2>
<ul>
<li>Troubleshoot Paramiko connection issues with Ansible on ILRI servers: <a href="https://github.com/ilri/rmg-ansible-public/issues/37">#37</a></li>
<li>Turns out we need to add some MACs to our <code>sshd_config</code>: hmac-sha2-512,hmac-sha2-256</li>
<li>Update DSpace Test&rsquo;s Java to version 8 to start testing this configuration (<a href="https://wiki.apache.org/solr/ShawnHeisey">seeing as Solr recommends it</a>)</li>
</ul>
<h2 id="2016-08-17">2016-08-17</h2>
<ul>
<li>More work on Let&rsquo;s Encrypt stuff for Ansible roles</li>
<li>Yesterday Atmire responded about DSpace 5.5 issues and asked me to try the <code>dspace database repair</code> command to fix Flyway issues</li>
<li>The <code>dspace database</code> command doesn&rsquo;t even run: <a href="https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5">https://gist.github.com/alanorth/c43c8d89e8df346d32c0ee938be90cd5</a></li>
<li>Oops, it looks like the missing classes causing <code>dspace database</code> to fail were coming from the old <code>~/dspace/config/spring</code> folder</li>
<li>After removing the spring folder and running ant install again, <code>dspace database</code> works</li>
<li>I see there are missing and pending Flyway migrations, but running <code>dspace database repair</code> and <code>dspace database migrate</code> does nothing: <a href="https://gist.github.com/alanorth/41ed5abf2ff32d8ac9eedd1c3d015d70">https://gist.github.com/alanorth/41ed5abf2ff32d8ac9eedd1c3d015d70</a></li>
</ul>
<h2 id="2016-08-18">2016-08-18</h2>
<ul>
<li>Fix &ldquo;CONGO,DR&rdquo; country name in <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/264">#264</a>)</li>
<li>Also need to fix existing records using the incorrect form in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;CONGO, DR&#39; where resource_type_id=2 and metadata_field_id=228 and text_value=&#39;CONGO,DR&#39;;
</code></pre><ul>
<li>I asked a question on the DSpace mailing list about updating &ldquo;preferred&rdquo; forms of author names from ORCID</li>
</ul>
<h2 id="2016-08-21">2016-08-21</h2>
<ul>
<li>A few days ago someone on the DSpace mailing list suggested I try <code>dspace dsrun org.dspace.authority.UpdateAuthorities</code> to update preferred author names from ORCID</li>
<li>If you set <code>auto-update-items=true</code> in <code>dspace/config/modules/solrauthority.cfg</code> it is supposed to update records it finds automatically</li>
<li>I updated my name format on ORCID and I&rsquo;ve been running that script a few times per day since then but nothing has changed</li>
<li>Still troubleshooting Atmire modules on DSpace 5.5</li>
<li>I sent them some new verbose logs: <a href="https://gist.github.com/alanorth/700748995649688148ceba89d760253e">https://gist.github.com/alanorth/700748995649688148ceba89d760253e</a></li>
</ul>
<h2 id="2016-08-22">2016-08-22</h2>
<ul>
<li>Database migrations are fine on DSpace 5.1:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace/bin/dspace database info
Database URL: jdbc:postgresql://localhost:5432/dspacetest
Database Schema: public
Database Software: PostgreSQL version 9.3.14
Database Driver: PostgreSQL Native Driver version PostgreSQL 9.1 JDBC4 (build 901)
+----------------+----------------------------+---------------------+---------+
| Version | Description | Installed on | State |
+----------------+----------------------------+---------------------+---------+
| 1.1 | Initial DSpace 1.1 databas | | PreInit |
| 1.2 | Upgrade to DSpace 1.2 sche | | PreInit |
| 1.3 | Upgrade to DSpace 1.3 sche | | PreInit |
| 1.3.9 | Drop constraint for DSpace | | PreInit |
| 1.4 | Upgrade to DSpace 1.4 sche | | PreInit |
| 1.5 | Upgrade to DSpace 1.5 sche | | PreInit |
| 1.5.9 | Drop constraint for DSpace | | PreInit |
| 1.6 | Upgrade to DSpace 1.6 sche | | PreInit |
| 1.7 | Upgrade to DSpace 1.7 sche | | PreInit |
| 1.8 | Upgrade to DSpace 1.8 sche | | PreInit |
| 3.0 | Upgrade to DSpace 3.x sche | | PreInit |
| 4.0 | Initializing from DSpace 4 | 2015-11-20 12:42:52 | Success |
| 5.0.2014.08.08 | DS-1945 Helpdesk Request a | 2015-11-20 12:42:53 | Success |
| 5.0.2014.09.25 | DS 1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2014.09.26 | DS-1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2015.01.27 | MigrateAtmireExtraMetadata | 2015-11-20 12:43:29 | Success |
| 5.1.2015.12.03 | Atmire CUA 4 migration | 2016-03-21 17:10:41 | Success |
| 5.1.2015.12.03 | Atmire MQM migration | 2016-03-21 17:10:42 | Success |
+----------------+----------------------------+---------------------+---------+
</code></pre><ul>
<li>So I&rsquo;m not sure why they have problems when we move to DSpace 5.5 (even the 5.1 migrations themselves show as &ldquo;Missing&rdquo;)</li>
</ul>
<h2 id="2016-08-23">2016-08-23</h2>
<ul>
<li>Help Paola from CCAFS with her thumbnails again</li>
<li>Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB</li>
<li>They said I should delete the Atmire migrations</li>
</ul>
<pre tabindex="0"><code>dspacetest=# delete from schema_version where description = &#39;Atmire CUA 4 migration&#39; and version=&#39;5.1.2015.12.03.2&#39;;
dspacetest=# delete from schema_version where description = &#39;Atmire MQM migration&#39; and version=&#39;5.1.2015.12.03.3&#39;;
</code></pre><ul>
<li>After that DSpace starts up, but XMLUI now has unrelated issues that I need to solve!</li>
</ul>
<pre tabindex="0"><code>org.apache.avalon.framework.configuration.ConfigurationException: Type &#39;ThemeResourceReader&#39; does not exist for &#39;map:read&#39; at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
</code></pre><ul>
<li>Looks like we&rsquo;re missing some stuff in the XMLUI module&rsquo;s <code>sitemap.xmap</code>, as well as in each of our XMLUI themes</li>
<li>Diff them with these to get the <code>ThemeResourceReader</code> changes (see the example after this list):
<ul>
<li><code>dspace-xmlui/src/main/webapp/sitemap.xmap</code></li>
<li><code>dspace-xmlui-mirage2/src/main/webapp/sitemap.xmap</code></li>
</ul>
</li>
<li>Then we had some NullPointerException from the SolrLogger class, which is apparently part of Atmire&rsquo;s CUA module</li>
<li>I tried with a small version bump to CUA but it didn&rsquo;t work (version <code>5.5-4.1.1-0</code>)</li>
<li>Also, I started looking into huge pages to prepare for PostgreSQL 9.5, but it seems Linode&rsquo;s kernels don&rsquo;t enable them</li>
</ul>
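<ul>
<li>A quick way to see the <code>ThemeResourceReader</code> differences mentioned above (the overlay paths here are assumed from our usual <code>dspace/modules</code> layout):</li>
</ul>
<pre tabindex="0"><code>$ diff dspace-xmlui/src/main/webapp/sitemap.xmap dspace/modules/xmlui/src/main/webapp/sitemap.xmap
$ diff dspace-xmlui-mirage2/src/main/webapp/sitemap.xmap dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/sitemap.xmap
</code></pre>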
<h2 id="2016-08-24">2016-08-24</h2>
<ul>
<li>Clean up and import 48 CCAFS records into DSpace Test</li>
<li>SQL to get all journal titles from dc.source (55), since it&rsquo;s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ &#39;.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*&#39;;
</code></pre><h2 id="2016-08-25">2016-08-25</h2>
<ul>
<li>Atmire suggested adding a missing bean to <code>dspace/config/spring/api/atmire-cua.xml</code> but it doesn&rsquo;t help:</li>
</ul>
<pre tabindex="0"><code>...
Error creating bean with name &#39;MetadataStorageInfoService&#39;
...
</code></pre><ul>
<li>Atmire sent an updated version of <code>dspace/config/spring/api/atmire-cua.xml</code> and now XMLUI starts but gives a null pointer exception:</li>
</ul>
<pre tabindex="0"><code>Java stacktrace: java.lang.NullPointerException
at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy103.startElement(Unknown Source)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.environment.internal.EnvironmentChanger.startElement(EnvironmentStack.java:140)
at org.apache.cocoon.xml.AbstractXMLPipe.startElement(AbstractXMLPipe.java:94)
...
</code></pre><ul>
<li>Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:</li>
</ul>
<pre tabindex="0"><code>$ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
</code></pre><ul>
<li>Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs</li>
</ul>
<h2 id="2016-08-26">2016-08-26</h2>
<ul>
<li>CGSpace had issues tonight, not entirely crashing, but becoming unresponsive</li>
<li>The dspace log had this:</li>
</ul>
<pre tabindex="0"><code>2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Related to /rest no doubt</li>
</ul>
<h2 id="2016-08-27">2016-08-27</h2>
<ul>
<li>Run corrections for Delia Grace and <code>CONGO, DR</code>, and deploy August changes to CGSpace</li>
<li>Run all system updates and reboot the server</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

660
docs/2016-09/index.html Normal file
View File

@ -0,0 +1,660 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2016" />
<meta property="og:description" content="2016-09-01
Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-09/" />
<meta property="article:published_time" content="2016-09-01T15:53:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2016"/>
<meta name="twitter:description" content="2016-09-01
Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors
Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace
We had been using DC=ILRI to determine whether a user was ILRI or not
It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-09/",
"wordCount": "3298",
"datePublished": "2016-09-01T15:53:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-09/">
<title>September, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-09/">September, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-09-01T15:53:00+03:00">Thu Sep 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-09-01">2016-09-01</h2>
<ul>
<li>Discuss helping CCAFS with some batch tagging of ORCID IDs for their authors</li>
<li>Discuss how the migration of CGIAR&rsquo;s Active Directory to a flat structure will break our LDAP groups in DSpace</li>
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
</code></pre><ul>
<li>User who has been migrated to the root vs user still in the hierarchical structure:</li>
</ul>
<pre tabindex="0"><code>distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
</code></pre><ul>
<li>Changing the DSpace LDAP config to use <code>OU=ILRIHUB</code> seems to work:</li>
</ul>
<p><img src="/cgspace-notes/2016/09/ilri-ldap-users.png" alt="DSpace groups based on LDAP DN"></p>
<ul>
<li>Notes for local PostgreSQL database recreation from production snapshot:</li>
</ul>
<pre tabindex="0"><code>$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c &#39;alter user dspacetest createuser;&#39;
$ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
$ psql dspacetest -c &#39;alter user dspacetest nocreateuser;&#39;
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
$ vacuumdb dspacetest
</code></pre><ul>
<li>Some names that I thought I fixed in July seem not to be:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
text_value | authority | confidence
-----------------------+--------------------------------------+------------
Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb | 600
Poole, Elizabeth Jane | 41628f42-fc38-4b38-b473-93aec9196326 | 600
Poole, Elizabeth Jane | 83b82da0-f652-4ebc-babc-591af1697919 | 600
Poole, Elizabeth Jane | c3a22456-8d6a-41f9-bba0-de51ef564d45 | 600
Poole, E.J. | c3a22456-8d6a-41f9-bba0-de51ef564d45 | 600
Poole, E.J. | 0fbd91b9-1b71-4504-8828-e26885bf8b84 | 600
(6 rows)
</code></pre><ul>
<li>At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;c3a22456-8d6a-41f9-bba0-de51ef564d45&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Poole, %&#39;;
UPDATE 69
</code></pre><ul>
<li>And for Peter Ballantyne:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
text_value | authority | confidence
-------------------+--------------------------------------+------------
Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 | 600
Ballantyne, Peter | 4f04ca06-9a76-4206-bd9c-917ca75d278e | 600
Ballantyne, P.G. | 4f04ca06-9a76-4206-bd9c-917ca75d278e | 600
Ballantyne, Peter | ba5f205b-b78b-43e5-8e80-0c9a1e1ad2ca | 600
Ballantyne, Peter | 20f21160-414c-4ecf-89ca-5f2cb64e75c1 | 600
(5 rows)
</code></pre><ul>
<li>Again, a few have the correct ORCID, but there should only be one authority&hellip;</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;4f04ca06-9a76-4206-bd9c-917ca75d278e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Ballantyne, %&#39;;
UPDATE 58
</code></pre><ul>
<li>And for me:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, A%&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 4884def0-4d7e-4256-9dd4-018cd60a5871 | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
(3 rows)
dspacetest=# update metadatavalue set authority=&#39;1a1943a0-3f87-402f-9afe-e52fb46a513e&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Orth, %&#39;;
UPDATE 11
</code></pre><ul>
<li>And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;0e414b4c-4671-4a23-b570-6077aca647d8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
UPDATE 166
dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Campbell, B%&#39;;
text_value | authority | confidence
------------------------+--------------------------------------+------------
Campbell, Bruce | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
Campbell, Bruce Morgan | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
Campbell, B. | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
Campbell, B.M. | 0e414b4c-4671-4a23-b570-6077aca647d8 | 600
(4 rows)
</code></pre><ul>
<li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li>
<li>Run authority updates on CGSpace</li>
</ul>
<h2 id="2016-09-05">2016-09-05</h2>
<ul>
<li>After one week of logging TLS connections on CGSpace:</li>
</ul>
<pre tabindex="0"><code># zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &#34;DES-CBC3&#34; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk &#39;{print $6}&#39; | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
</code></pre><ul>
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
</ul>
<h2 id="2016-09-06">2016-09-06</h2>
<ul>
<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
<ul>
<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
<li>Imports fine on DSpace running on Mac OS X</li>
<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
</ul>
</li>
<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
<ul>
<li>Success on both Mac OS X and Linux&hellip;</li>
</ul>
</li>
<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li>
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
</ul>
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#34;&#34;).replace(&#34;,&#34;,&#34;&#34;).replace(&#39;&#34;&#39;,&#39;&#39;)
</code></pre><ul>
<li>I need to write a Python script to match that for renaming files in the file system</li>
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
<li>In the end I was able to import the files after unzipping them ONLY on Linux
<ul>
<li>The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to the decomposed Unicode form like I saw above</li>
</ul>
</li>
<li>Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the <code>tomcat7</code> user, and deleting the bundle, for each collection&rsquo;s items:</li>
</ul>
<pre tabindex="0"><code>$ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
$ JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34; /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
$ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
</code></pre><h2 id="2016-09-07">2016-09-07</h2>
<ul>
<li>Erase and rebuild DSpace Test based on latest Ubuntu 16.04, PostgreSQL 9.5, and Java 8 stuff</li>
<li>Reading about PostgreSQL maintenance and it seems manual vacuuming is only for certain workloads, such as heavy update/write loads</li>
<li>I suggest we disable our nightly manual vacuum task, as we&rsquo;re a mostly read workload, and I&rsquo;d rather stick as close to the documentation as possible since we haven&rsquo;t done any testing/observation of PostgreSQL</li>
<li>See: <a href="https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html">https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html</a></li>
<li>CGSpace went down and the error seems to be the same as always (lately):</li>
</ul>
<pre tabindex="0"><code>2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
...
</code></pre><ul>
<li>Since CGSpace had crashed I quickly deployed the new LDAP settings before restarting Tomcat</li>
</ul>
<h2 id="2016-09-13">2016-09-13</h2>
<ul>
<li>CGSpace crashed twice today, errors from <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I enabled logging of requests to <code>/rest</code> again</li>
</ul>
<h2 id="2016-09-14">2016-09-14</h2>
<ul>
<li>CGSpace crashed again, errors from <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
</code></pre><ul>
<li>I restarted Tomcat and it was ok again</li>
<li>CGSpace crashed a few hours later, errors from <code>catalina.out</code>:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>We haven&rsquo;t seen that in quite a while&hellip;</li>
<li>Indeed, in a month of logs it only occurs 15 times:</li>
</ul>
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
15
</code></pre><ul>
<li>I also see a bunch of errors from dspace.log:</li>
</ul>
<pre tabindex="0"><code>2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at REST requests, it seems there is one IP hitting us nonstop:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | sort -n | uniq -c | sort -h | tail -n 3
820 50.87.54.15
12872 70.32.99.142
25744 70.32.83.92
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 3
7966 181.118.144.29
54706 70.32.99.142
109412 70.32.83.92
</code></pre><ul>
<li>Those are the same IPs that were hitting us heavily in July, 2016 as well&hellip;</li>
<li>I think the stability issues are definitely from REST</li>
<li>Crashed AGAIN, errors from dspace.log:</li>
</ul>
<pre tabindex="0"><code>2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>And more heap space errors:</li>
</ul>
<pre tabindex="0"><code># grep -rsI &#34;OutOfMemoryError&#34; /var/log/tomcat7/catalina.* | wc -l
19
</code></pre><ul>
<li>There are no more rest requests since the last crash, so maybe there are other things causing this.</li>
<li>Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)</li>
<li>They seem to be coming from Baidu, and so far during today alone account for 1/6 of every connection:</li>
</ul>
<pre tabindex="0"><code># grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
29084
# grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
5192
</code></pre><ul>
<li>Other recent days are the same&hellip; hmmm.</li>
<li>From the activity control panel I can see 58 unique IPs hitting the site <em>concurrently</em>, which has GOT to hurt our stability</li>
<li>A list of all 2000 unique IPs from CGSpace logs today:</li>
</ul>
<pre tabindex="0"><code># grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: &#39;{print $5}&#39; | sort -n | uniq -c | sort -h | tail -n 100
</code></pre><ul>
<li>Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc&hellip; do we have any real users?</li>
<li>Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
</code></pre><ul>
<li>Looking into the Catalina logs again around the time of the first crash, I see:</li>
</ul>
<pre tabindex="0"><code>Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-193&#34; java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<li>And after that I see a bunch of &ldquo;pool error Timeout waiting for idle object&rdquo;</li>
<li>Later, near the time of the next crash I see:</li>
</ul>
<pre tabindex="0"><code>dn:CN=Haman\, Magdalena (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2
Wed Sep 14 11:30:20 UTC 2016 | Updating : 6/6 docs.
Commit
Commit done
Sep 14, 2016 11:32:22 AM com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator buildModelAndSchemas
SEVERE: Failed to generate the schema for the JAX-B elements
com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions
java.util.Map is an interface, and JAXB can&#39;t handle interfaces.
this problem is related to the following location:
at java.util.Map
at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
at com.atmire.dspace.rest.common.Statlet
java.util.Map does not have a no-arg default constructor.
this problem is related to the following location:
at java.util.Map
at public java.util.Map com.atmire.dspace.rest.common.Statlet.getRender()
at com.atmire.dspace.rest.common.Statlet
</code></pre><ul>
<li>Then 20 minutes later another outOfMemoryError:</li>
</ul>
<pre tabindex="0"><code>Exception in thread &#34;http-bio-127.0.0.1-8081-exec-25&#34; java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:215)
</code></pre><ul>
<li>Perhaps these particular issues <em>are</em> memory issues, the munin graphs definitely show some weird purging/allocating behavior starting this week</li>
</ul>
<p><img src="/cgspace-notes/2016/09/tomcat_jvm-day.png" alt="Tomcat JVM usage day">
<img src="/cgspace-notes/2016/09/tomcat_jvm-week.png" alt="Tomcat JVM usage week">
<img src="/cgspace-notes/2016/09/tomcat_jvm-month.png" alt="Tomcat JVM usage month"></p>
<ul>
<li>And really, we did reduce the memory of CGSpace in late 2015, so maybe we should just increase it again, now that our usage is higher and we are having memory errors in the logs</li>
<li>Oh great, the configuration on the actual server is different than in configuration management!</li>
<li>Seems we added a bunch of settings to the <code>/etc/default/tomcat7</code> in December, 2015 and never updated our ansible repository:</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&#34;-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>So I&rsquo;m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)</li>
<li>Increased JVM heap to 4096m on CGSpace (linode01)</li>
</ul>
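<ul>
<li>After trimming the experimental flags the new <code>JAVA_OPTS</code> should end up looking roughly like this (an approximation, not the exact final value):</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&#34;-Djava.awt.headless=true -Xms4096m -Xmx4096m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8&#34;
</code></pre>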
<h2 id="2016-09-15">2016-09-15</h2>
<ul>
<li>Looking at Google Webmaster Tools again, it seems the work I did on URL query parameters and blocking via the <code>X-Robots-Tag</code> HTTP header in March, 2016 seem to have had a positive effect on Google&rsquo;s index for CGSpace</li>
</ul>
<p><img src="/cgspace-notes/2016/09/google-webmaster-tools-index.png" alt="Google Webmaster Tools for CGSpace"></p>
<h2 id="2016-09-16">2016-09-16</h2>
<ul>
<li>CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren&rsquo;t on those lines so I&rsquo;m not sure if they were yesterday:</li>
</ul>
<pre tabindex="0"><code>dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
Thu Sep 15 18:45:26 UTC 2016 | Updating : 100/218 docs.
Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
Thu Sep 15 18:45:27 UTC 2016 | Updating : 218/218 docs.
Commit
Commit done
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-247&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-241&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-243&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-258&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-268&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-263&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;http-bio-127.0.0.1-8081-exec-280&#34; java.lang.OutOfMemoryError: Java heap space
Exception in thread &#34;Thread-54216&#34; org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id 7feaa95d-8e1f-4f45-80bb
-e14ef82ee224 to the index; possible analysis error.
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.statistics.SolrLogThread.run(SourceFile:25)
</code></pre><ul>
<li>I bumped the heap space from 4096m to 5120m to see if this is <em>really</em> about heap space or not.</li>
<li>Looking into some of these errors that I&rsquo;ve seen this week but haven&rsquo;t noticed before:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/tomcat7/catalina.* | grep -c &#39;Failed to generate the schema for the JAX-B elements&#39;
113
</code></pre><ul>
<li>I&rsquo;ve sent a message to Atmire about the Solr error to see if it&rsquo;s related to their batch update module</li>
</ul>
<h2 id="2016-09-19">2016-09-19</h2>
<ul>
<li>Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>After that we need to take the top ~300 and make a controlled vocabulary for it</li>
<li>I dumped a list of the top 300 affiliations from the database, sorted it alphabetically in OpenRefine, and created a controlled vocabulary for it (<a href="https://github.com/ilri/DSpace/pull/267">#267</a>)</li>
</ul>
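<ul>
<li>The dump of the top 300 was basically the same query as the earlier affiliations dump, just with a limit (sketch):</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc limit 300) to /tmp/affiliations-top300.csv with csv;
</code></pre>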
<h2 id="2016-09-20">2016-09-20</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot the server</li>
<li>Merge changes for sponsorship and affiliation controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/267">#267</a>, <a href="https://github.com/ilri/DSpace/pull/268">#268</a>)</li>
<li>Merge minor changes to <code>messages.xml</code> to reconcile it with the stock DSpace 5.1 one (<a href="https://github.com/ilri/DSpace/pull/269">#269</a>)</li>
<li>Peter asked about adding title search to Discovery</li>
<li>The index was already defined, so I just added it to the search filters</li>
<li>It works but CGSpace apparently uses <code>OR</code> for search terms, which makes the search results basically useless</li>
<li>I need to read the docs and ask on the mailing list to see if we can tweak that</li>
<li>Generate a new list of sponsors from the database for Peter Ballantyne so we can clean them up and update the controlled vocabulary</li>
</ul>
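<ul>
<li>The sponsors dump follows the same pattern, using the <code>dc.description.sponsorship</code> field ID (29, the same one the fix scripts use); a sketch:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
</code></pre>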
<h2 id="2016-09-21">2016-09-21</h2>
<ul>
<li>Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: <a href="https://jira.duraspace.org/browse/DS-2809">https://jira.duraspace.org/browse/DS-2809</a></li>
<li>We just need to set this in <code>dspace/solr/search/conf/schema.xml</code>:</li>
</ul>
<pre tabindex="0"><code>&lt;solrQueryParser defaultOperator=&#34;AND&#34;/&gt;
</code></pre><ul>
<li>It actually works really well, and search results return far fewer hits now (before, after):</li>
</ul>
<p><img src="/cgspace-notes/2016/09/cgspace-search.png" alt="CGSpace search with &amp;ldquo;OR&amp;rdquo; boolean logic">
<img src="/cgspace-notes/2016/09/dspacetest-search.png" alt="DSpace Test search with &amp;ldquo;AND&amp;rdquo; boolean logic"></p>
<ul>
<li>Found a way to improve the configuration of Atmire&rsquo;s Content and Usage Analysis (CUA) module for date fields</li>
</ul>
<pre tabindex="0"><code>-content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
+content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
</code></pre><ul>
<li>This allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently</li>
<li>Add <code>dc.date.accessioned</code> to XMLUI Discovery search filters</li>
<li>Major CGSpace crash because ILRI forgot to pay the Linode bill</li>
<li>45 minutes of downtime!</li>
<li>Start processing the fixes to <code>dc.description.sponsorship</code> from Peter Ballantyne:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>I need to run these and the others from a few days ago on CGSpace the next time we run updates</li>
<li>Also, I need to update the controlled vocab for sponsors based on these</li>
</ul>
<h2 id="2016-09-22">2016-09-22</h2>
<ul>
<li>Update controlled vocabulary for sponsorship based on the latest corrected values from the database</li>
</ul>
<h2 id="2016-09-25">2016-09-25</h2>
<ul>
<li>Merge accession date improvements for CUA module (<a href="https://github.com/ilri/DSpace/pull/275">#275</a>)</li>
<li>Merge addition of accession date to Discovery search filters (<a href="https://github.com/ilri/DSpace/pull/276">#276</a>)</li>
<li>Merge updates to sponsorship controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/277">#277</a>)</li>
<li>I&rsquo;ve been trying to add a search filter for <code>dc.description</code> so the IITA people can search for some tags they use there, but for some reason the filter never shows up in Atmire&rsquo;s CUA</li>
<li>Not sure if it&rsquo;s something like we already have too many filters there (30), or the filter name is reserved, etc&hellip;</li>
<li>Generate a list of ILRI subjects for Peter and Abenet to look through/fix:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
</code></pre><ul>
<li>Regenerate Discovery indexes a few times after playing with <code>discovery.xml</code> index definitions (syntax, parameters, etc).</li>
<li>Merge changes to boolean logic in Solr search (<a href="https://github.com/ilri/DSpace/pull/274">#274</a>)</li>
<li>Run all sponsorship and affiliation fixes on CGSpace, deploy latest <code>5_x-prod</code> branch, and re-index Discovery on CGSpace</li>
<li>Tested OCSP stapling on DSpace Test&rsquo;s nginx and it works:</li>
</ul>
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response:
======================================
OCSP Response Data:
...
Cert Status: good
</code></pre><ul>
<li>I&rsquo;ve been monitoring this for almost two years in this GitHub issue: <a href="https://github.com/ilri/DSpace/issues/38">https://github.com/ilri/DSpace/issues/38</a></li>
</ul>
<h2 id="2016-09-27">2016-09-27</h2>
<ul>
<li>Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman</li>
<li>This author has a few variations:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeu
len, S%&#39;;
</code></pre><ul>
<li>And it looks like <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code> is the authority with the correct ORCID linked</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;fe4b719f-6cc4-4d65-8504-7a83130b9f83w&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Hmm, now her name is missing from the authors facet and only shows the authority ID</li>
<li>On the production server there is an item with her ORCID but it is using a different authority: f01f7b7b-be3f-4df7-a61d-b73c067de88d</li>
<li>Maybe I used the wrong one&hellip; I need to look again at the production database</li>
<li>On a clean snapshot of the database I see the correct authority should be <code>f01f7b7b-be3f-4df7-a61d-b73c067de88d</code>, not <code>fe4b719f-6cc4-4d65-8504-7a83130b9f83</code></li>
<li>Updating her authorities again and reindexing:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;f01f7b7b-be3f-4df7-a61d-b73c067de88d&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
UPDATE 101
</code></pre><ul>
<li>Use GitHub icon from Font Awesome instead of a PNG to save one extra network request</li>
<li>We can also replace the RSS and mail icons in community text!</li>
<li>Fix reference to <code>dc.type.*</code> in Atmire CUA module, as we now only index <code>dc.type</code> for &ldquo;Output type&rdquo;</li>
</ul>
<h2 id="2016-09-28">2016-09-28</h2>
<ul>
<li>Make a placeholder pull request for <code>discovery.xml</code> changes (<a href="https://github.com/ilri/DSpace/pull/278">#278</a>), as I still need to test their effect on Atmire content analysis module</li>
<li>Make a placeholder pull request for Font Awesome changes (<a href="https://github.com/ilri/DSpace/pull/279">#279</a>), which replaces the GitHub image in the footer with an icon, and add style for RSS and @ icons that I will start replacing in community/collection HTML intros</li>
<li>Had some issues with local test server after messing with Solr too much, had to blow everything away and re-install from CGSpace</li>
<li>Going to try to update Sonja Vermeulen&rsquo;s authority to 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0, as that seems to be one of her authorities that has an ORCID</li>
<li>Merge Font Awesome changes (<a href="https://github.com/ilri/DSpace/pull/279">#279</a>)</li>
<li>Minor fix to a string in Atmire&rsquo;s CUA module (<a href="https://github.com/ilri/DSpace/pull/280">#280</a>)</li>
<li>This seems to be what I&rsquo;ll need to do for Sonja Vermeulen (but with <code>2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0</code> instead on the live site):</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen, S%&#39;;
dspacetest=# update metadatavalue set authority=&#39;09e4da69-33a3-45ca-b110-7d3f82d2d6d2&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Vermeulen SJ%&#39;;
</code></pre><ul>
<li>And then update Discovery and Authority indexes</li>
<li>Minor fix for &ldquo;Subject&rdquo; string in Discovery search and Atmire modules (<a href="https://github.com/ilri/DSpace/pull/281">#281</a>)</li>
<li>Start testing batch fixes for ILRI subject from Peter:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2016-09-29">2016-09-29</h2>
<ul>
<li>Add <code>cg.identifier.ciatproject</code> to metadata registry in preparation for CIAT project tag</li>
<li>Merge changes for CIAT project tag (<a href="https://github.com/ilri/DSpace/pull/282">#282</a>)</li>
<li>DSpace Test (linode02) became unresponsive for some reason; I had to hard reboot it from the Linode console</li>
<li>People on DSpace mailing list gave me a query to get authors from certain collections:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/5472&#39;, &#39;10568/5473&#39;)));
</code></pre><h2 id="2016-09-30">2016-09-30</h2>
<ul>
<li>Deny access to REST API&rsquo;s <code>find-by-metadata-field</code> endpoint to protect against an upstream security issue (DS-3250)</li>
<li>There is a patch but it is only for 5.5 and doesn&rsquo;t apply cleanly to 5.1</li>
</ul>
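<ul>
<li>Since the patch doesn&rsquo;t apply cleanly, the simplest mitigation is to block the endpoint in front of Tomcat; a minimal nginx sketch (assuming the endpoint lives at <code>/rest/items/find-by-metadata-field</code>):</li>
</ul>
<pre tabindex="0"><code>location /rest/items/find-by-metadata-field {
    return 403;
}
</code></pre>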
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

426
docs/2016-10/index.html Normal file
View File

@ -0,0 +1,426 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2016" />
<meta property="og:description" content="2016-10-03
Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
Need to test the following scenarios to see how author order is affected:
ORCIDs only
ORCIDs plus normal authors
I exported a random item&rsquo;s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-10/" />
<meta property="article:published_time" content="2016-10-03T15:53:00+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2016"/>
<meta name="twitter:description" content="2016-10-03
Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
Need to test the following scenarios to see how author order is affected:
ORCIDs only
ORCIDs plus normal authors
I exported a random item&rsquo;s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-10/",
"wordCount": "1828",
"datePublished": "2016-10-03T15:53:00+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-10/">
<title>October, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-10/">October, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-10-03T15:53:00+03:00">Mon Oct 03, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-10-03">2016-10-03</h2>
<ul>
<li>Testing adding <a href="https://wiki.lyrasis.org/display/DSDOC5x/ORCID+Integration#ORCIDIntegration-EditingexistingitemsusingBatchCSVEditing">ORCIDs to a CSV</a> file for a single item to see if the author orders get messed up</li>
<li>Need to test the following scenarios to see how author order is affected:
<ul>
<li>ORCIDs only</li>
<li>ORCIDs plus normal authors</li>
</ul>
</li>
<li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
</ul>
<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
</code></pre><ul>
<li>Hmm, with the <code>dc.contributor.author</code> column removed, DSpace doesn&rsquo;t detect any changes</li>
<li>With a blank <code>dc.contributor.author</code> column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors</li>
<li>I added the <a href="https://github.com/ilri/DSpace/issues/234">disclaimer text</a> to the About page, then added a footer link to the disclaimer&rsquo;s ID, but there is a Bootstrap issue that causes the page content to disappear when using in-page anchors: <a href="https://github.com/twbs/bootstrap/issues/1768">https://github.com/twbs/bootstrap/issues/1768</a></li>
</ul>
<p><img src="/cgspace-notes/2016/10/bootstrap-issue.png" alt="Bootstrap issue with in-page anchors"></p>
<ul>
<li>Looks like we&rsquo;ll just have to add the text to the About page (without a link) or add a separate page</li>
</ul>
<h2 id="2016-10-04">2016-10-04</h2>
<ul>
<li>Start testing cleanups of authors that Peter sent last week</li>
<li>Out of 40,000+ rows, Peter had indicated corrections for ~3,200 of them—too many to look through carefully, so I did some basic quality checking:
<ul>
<li>Trim leading/trailing whitespace</li>
<li>Find invalid characters</li>
<li>Cluster values to merge obvious authors</li>
</ul>
</li>
<li>That left us with 3,180 valid corrections and 3 deletions:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
</code></pre><ul>
<li>Remove old about page (<a href="https://github.com/ilri/DSpace/pull/284">#284</a>)</li>
<li>CGSpace crashed a few times today</li>
<li>Generate list of unique authors in CCAFS collections:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/32729&#39;, &#39;10568/5472&#39;, &#39;10568/5473&#39;, &#39;10568/10288&#39;, &#39;10568/70974&#39;, &#39;10568/3547&#39;, &#39;10568/3549&#39;, &#39;10568/3531&#39;,&#39;10568/16890&#39;,&#39;10568/5470&#39;,&#39;10568/3546&#39;, &#39;10568/36024&#39;, &#39;10568/66581&#39;, &#39;10568/21789&#39;, &#39;10568/5469&#39;, &#39;10568/5468&#39;, &#39;10568/3548&#39;, &#39;10568/71053&#39;, &#39;10568/25167&#39;))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
</code></pre><h2 id="2016-10-05">2016-10-05</h2>
<ul>
<li>Work on more infrastructure cleanups for Ansible DSpace role</li>
<li>Clean up Let&rsquo;s Encrypt plumbing and submit pull request for rmg-ansible-public (<a href="https://github.com/ilri/rmg-ansible-public/pull/60">#60</a>)</li>
</ul>
<h2 id="2016-10-06">2016-10-06</h2>
<ul>
<li>Nice! DSpace Test (linode02) is now having <code>java.lang.OutOfMemoryError: Java heap space</code> errors&hellip;</li>
<li>Heap space is 2048m, and we have 5GB of RAM being used for OS cache (Solr!) so let&rsquo;s just bump the memory to 3072m</li>
<li>Magdalena from CCAFS asked why the colors in the thumbnails for these <a href="https://cgspace.cgiar.org/handle/10568/71249">two</a> <a href="https://cgspace.cgiar.org/handle/10568/71259">items</a> look different, even though they are the same in the PDF itself</li>
</ul>
<p><img src="/cgspace-notes/2016/10/cmyk-vs-srgb.jpg" alt="CMYK vs sRGB colors"></p>
<ul>
<li>Turns out the first PDF was exported from InDesign using CMYK and the second one was using sRGB</li>
<li>Run all system updates on DSpace Test and reboot it</li>
</ul>
<h2 id="2016-10-08">2016-10-08</h2>
<ul>
<li>Re-deploy CGSpace with latest changes from late September and early October</li>
<li>Run fixes for ILRI subjects and delete blank metadata values:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 11
</code></pre><ul>
<li>Run all system updates and reboot CGSpace</li>
<li>Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):</li>
</ul>
<pre tabindex="0"><code>root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
47
</code></pre><ul>
<li>Delete 2GB <code>cron-filter-media.log</code> file, as it is just a log from a cron job and it doesn&rsquo;t get rotated like normal log files (almost a year now maybe)</li>
</ul>
<h2 id="2016-10-14">2016-10-14</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot server</li>
<li>Looking into some issues with Discovery filters in Atmire&rsquo;s content and usage analysis module after adjusting the filter class</li>
<li>Looks like changing the filters from <code>configuration.DiscoverySearchFilterFacet</code> to <code>configuration.DiscoverySearchFilter</code> breaks them in Atmire CUA module</li>
</ul>
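<ul>
<li>The change that triggers the breakage is essentially this, in <code>dspace/config/spring/api/discovery.xml</code> (the bean ID here is just illustrative):</li>
</ul>
<pre tabindex="0"><code>-&lt;bean id=&#34;searchFilterSubject&#34; class=&#34;org.dspace.discovery.configuration.DiscoverySearchFilterFacet&#34;&gt;
+&lt;bean id=&#34;searchFilterSubject&#34; class=&#34;org.dspace.discovery.configuration.DiscoverySearchFilter&#34;&gt;
</code></pre>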
<h2 id="2016-10-17">2016-10-17</h2>
<ul>
<li>A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)</li>
</ul>
<h2 id="2016-10-18">2016-10-18</h2>
<ul>
<li>Start working on DSpace 5.5 porting work again:</li>
</ul>
<pre tabindex="0"><code>$ git checkout -b 5_x-55 5_x-prod
$ git rebase -i dspace-5.5
</code></pre><ul>
<li>Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme</li>
<li>Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in favor of upstream&rsquo;s 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it&rsquo;s better to use the upstream one</li>
<li>I notice this rebase gets rid of GitHub merge commits&hellip; which actually might be fine because merges are fucking annoying to deal with when remote people merge without pulling and rebasing their branch first</li>
<li>Finished up applying the 5.5 sitemap changes to all themes</li>
<li>Merge the <code>discovery.xml</code> cleanups (<a href="https://github.com/ilri/DSpace/pull/278">#278</a>)</li>
<li>Merge some minor edits to the distribution license (<a href="https://github.com/ilri/DSpace/pull/285">#285</a>)</li>
</ul>
<h2 id="2016-10-19">2016-10-19</h2>
<ul>
<li>When we move to DSpace 5.5 we should also cherry pick some patches from 5.6 branch:
<ul>
<li><a href="https://jira.duraspace.org/browse/DS-3246">memory cleanup</a>: 9f0f5940e7921765c6a22e85337331656b18a403</li>
<li>sql injection: c6fda557f731dbc200d7d58b8b61563f86fe6d06</li>
<li>pdfbox security issue: b5330b78153b2052ed3dc2fd65917ccdbfcc0439</li>
</ul>
</li>
</ul>
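<ul>
<li>Cherry picking those onto our 5.5 branch would be something like:</li>
</ul>
<pre tabindex="0"><code>$ git checkout 5_x-55
$ git cherry-pick 9f0f5940e7921765c6a22e85337331656b18a403 c6fda557f731dbc200d7d58b8b61563f86fe6d06 b5330b78153b2052ed3dc2fd65917ccdbfcc0439
</code></pre>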
<h2 id="2016-10-20">2016-10-20</h2>
<ul>
<li>Run CCAFS author corrections on CGSpace</li>
<li>Discovery reindexing took forever and kinda caused CGSpace to crash, so I ran all system updates and rebooted the server</li>
</ul>
<h2 id="2016-10-25">2016-10-25</h2>
<ul>
<li>Move the LIVES community from the top level to the ILRI projects community</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
</code></pre><ul>
<li>Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA</li>
<li>Start looking at batch fixing of &ldquo;old&rdquo; ILRI website links without www or https, for example:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ilri.org%&#39;;
</code></pre><ul>
<li>Also CCAFS has HTTPS and their links should use it where possible:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and text_value like &#39;http://ccafs.cgiar.org%&#39;;
</code></pre><ul>
<li>And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like &#39;%Iconrss2.png%&#39;;
</code></pre><ul>
<li>Turns out there are shit tons of varieties of this, like with http, https, www, separate <code>&lt;/img&gt;</code> tags, alignments, etc</li>
<li>Had to find all variations and replace them individually:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;,&#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;&gt;&lt;/img&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img align=&#34;left&#34; src=&#34;https://ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;https://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-rss fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/Iconrss2.png&#34;/&gt;%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;&#39;, &#39;&lt;span class=&#34;fa fa-at fa-2x&#34; aria-hidden=&#34;true&#34;&gt;&lt;/span&gt;&#39;) where resource_type_id in (3,4) and text_value like &#39;%&lt;img valign=&#34;center&#34; align=&#34;left&#34; src=&#34;http://www.ilri.org/images/email.jpg&#34;/&gt;%&#39;;
</code></pre><ul>
<li>Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)</li>
<li>And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc</li>
<li>I should look to see if any of those domains is sending an HTTP 301 or setting HSTS headers to their HTTPS domains, then just replace them (a quick curl check is sketched below)</li>
</ul>
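<ul>
<li>A quick way to check, sketched here but not run yet: a plain HTTP request should show a 301 redirect to HTTPS, and an HTTPS request should show the <code>Strict-Transport-Security</code> header if HSTS is enabled (the domains are just examples from the list above):</li>
</ul>
<pre tabindex="0"><code>$ curl -sI http://twitter.com | grep -E -i &#39;^HTTP/|^location|strict-transport-security&#39;
$ curl -sI https://dx.doi.org | grep -i strict-transport-security
</code></pre>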
<h2 id="2016-10-27">2016-10-27</h2>
<ul>
<li>Run Font Awesome fixes on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>dspace=# \i /tmp/font-awesome-text-replace.sql
UPDATE 17
UPDATE 17
UPDATE 3
UPDATE 3
UPDATE 30
UPDATE 30
UPDATE 1
UPDATE 1
UPDATE 7
UPDATE 7
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 0
</code></pre><ul>
<li>Looks much better now:</li>
</ul>
<p><img src="/cgspace-notes/2016/10/cgspace-icons.png" alt="CGSpace with old icons">
<img src="/cgspace-notes/2016/10/dspacetest-fontawesome-icons.png" alt="DSpace Test with Font Awesome icons"></p>
<ul>
<li>Run the same replacements on CGSpace</li>
</ul>
<h2 id="2016-10-30">2016-10-30</h2>
<ul>
<li>Fix some messed up authors on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;799da1d8-22f3-43f5-8233-3d2ef5ebf8a8&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Charleston, B.%&#39;;
UPDATE 10
dspace=# update metadatavalue set authority=&#39;e936f5c5-343d-4c46-aa91-7a1fff6277ed&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Knight-Jones%&#39;;
UPDATE 36
</code></pre><ul>
<li>I updated the authority index but nothing seemed to change, so I&rsquo;ll wait and do it again after I update Discovery below</li>
<li>Skype chat with Tsega about the <a href="https://github.com/ilri/ckm-cgspace-contentdm-bridge">IFPRI contentdm bridge</a></li>
<li>We tested harvesting OAI in an example collection to see how it works</li>
<li>Talk to Carlos Quiros about CG Core metadata in CGSpace</li>
<li>Get a list of countries from CGSpace so I can do some batch corrections:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
</code></pre><ul>
<li>Fix a bunch of countries in Open Refine and run the corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t &#39;correct&#39; -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Run a shit ton of author fixes from Peter Ballantyne that we&rsquo;ve been cleaning up for two months:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
</code></pre><ul>
<li>Run a few URL corrections for ilri.org and doi.org, etc:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://www.ilri.org&#39;,&#39;https://www.ilri.org&#39;) where resource_type_id=2 and text_value like &#39;%http://www.ilri.org%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://mahider.ilri.org&#39;, &#39;https://cgspace.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://mahider.%.org%&#39; and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://dx.doi.org%&#39; and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and text_value like &#39;%http://doi.org%&#39; and metadata_field_id not in (18,26,28,111);
</code></pre><ul>
<li>I skipped metadata fields like citation and description</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

602
docs/2016-11/index.html Normal file

File diff suppressed because one or more lines are too long

838
docs/2016-12/index.html Normal file
View File

@ -0,0 +1,838 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="December, 2016" />
<meta property="og:description" content="2016-12-02
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-12/" />
<meta property="article:published_time" content="2016-12-02T10:43:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2016"/>
<meta name="twitter:description" content="2016-12-02
CGSpace was down for five hours in the morning while I was sleeping
While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "December, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-12/",
"wordCount": "4078",
"datePublished": "2016-12-02T10:43:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-12/">
<title>December, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-12/">December, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-12-02T10:43:00+03:00">Fri Dec 02, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-12-02">2016-12-02</h2>
<ul>
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
</code></pre><ul>
<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
<li>Another worrying error from dspace.log is:</li>
</ul>
<pre tabindex="0"><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:111)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:274)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
at com.googlecode.psiprobe.Tomcat70AgentValve.invoke(Tomcat70AgentValve.java:44)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at com.atmire.statistics.generator.TopNDSODatasetGenerator.toDatasetQuery(SourceFile:39)
at com.atmire.statistics.display.StatisticsDataVisitsMultidata.createDataset(SourceFile:108)
at org.dspace.statistics.content.StatisticsDisplay.createDataset(SourceFile:384)
at org.dspace.statistics.content.StatisticsDisplay.getDataset(SourceFile:404)
at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generateJsonData(SourceFile:170)
at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246)
at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145)
at sun.reflect.GeneratedMethodAccessor296.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.cocoon.core.container.spring.avalon.PoolableProxyHandler.invoke(PoolableProxyHandler.java:71)
at com.sun.proxy.$Proxy96.process(Unknown Source)
at org.apache.cocoon.components.treeprocessor.sitemap.ReadNode.invoke(ReadNode.java:94)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.components.treeprocessor.sitemap.MountNode.invoke(MountNode.java:117)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:55)
at org.apache.cocoon.components.treeprocessor.sitemap.MatchNode.invoke(MatchNode.java:87)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelineNode.invoke(PipelineNode.java:143)
at org.apache.cocoon.components.treeprocessor.AbstractParentProcessingNode.invokeNodes(AbstractParentProcessingNode.java:78)
at org.apache.cocoon.components.treeprocessor.sitemap.PipelinesNode.invoke(PipelinesNode.java:81)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:239)
at org.apache.cocoon.components.treeprocessor.ConcreteTreeProcessor.process(ConcreteTreeProcessor.java:171)
at org.apache.cocoon.components.treeprocessor.TreeProcessor.process(TreeProcessor.java:247)
at org.apache.cocoon.servlet.RequestProcessor.process(RequestProcessor.java:351)
at org.apache.cocoon.servlet.RequestProcessor.service(RequestProcessor.java:169)
at org.apache.cocoon.sitemap.SitemapServlet.service(SitemapServlet.java:84)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:468)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
at com.sun.proxy.$Proxy89.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950)
... 35 more
</code></pre><ul>
<li>The first error I see in dspace.log this morning is:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;&#34;b0b541c1-ec15-48bf-9209-6dbe8e338cdc&#34;
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
</code></pre><ul>
<li>Looking through DSpace&rsquo;s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries</li>
<li>The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:42,606 INFO org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&amp;shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&amp;fq=-isInternal:true&amp;fq=-(author_mtdt:&#34;CGIAR\+Institutional\+Learning\+and\+Change\+Initiative&#34;++AND+subject_mtdt:&#34;PARTNERSHIPS&#34;+AND+subject_mtdt:&#34;RESEARCH&#34;+AND+subject_mtdt:&#34;AGRICULTURE&#34;+AND+subject_mtdt:&#34;DEVELOPMENT&#34;++AND+iso_mtdt:&#34;en&#34;+)&amp;rows=0&amp;wt=javabin&amp;version=2} hits=0 status=0 QTime=19
2016-12-02 08:28:23,908 INFO org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
</code></pre><ul>
<li>DSpace&rsquo;s own Solr logs don&rsquo;t give IP addresses, so I will have to enable Nginx&rsquo;s logging of <code>/solr</code> so I can see where this request came from (a quick log check is sketched after this list)</li>
<li>I enabled logging of <code>/rest/</code> and I think I&rsquo;ll leave it on for good</li>
<li>Also, the disk is nearly full because of log file issues, so I&rsquo;m running some compression on DSpace logs</li>
<li>Normally these stay uncompressed for a month just in case we need to look at them, so now I&rsquo;ve just compressed anything older than 2 weeks so we can get some disk space back</li>
</ul>
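<ul>
<li>Once that logging is in place, something like this should show which IPs are hitting Solr the hardest (a sketch; it assumes the default combined log format and the default access log path):</li>
</ul>
<pre tabindex="0"><code># grep &#39;GET /solr/&#39; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort | uniq -c | sort -rn | head
</code></pre>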
<h2 id="2016-12-04">2016-12-04</h2>
<ul>
<li>I got a weird report from the CGSpace checksum checker this morning</li>
<li>It says 732 bitstreams have potential issues, for example:</li>
</ul>
<pre tabindex="0"><code>------------------------------------------------
Bitstream Id = 6
Process Start Date = Dec 4, 2016
Process End Date = Dec 4, 2016
Checksum Expected = a1d9eef5e2d85f50f67ce04d0329e96a
Checksum Calculated = a1d9eef5e2d85f50f67ce04d0329e96a
Result = Bitstream marked deleted in bitstream table
-----------------------------------------------
...
------------------------------------------------
Bitstream Id = 77581
Process Start Date = Dec 4, 2016
Process End Date = Dec 4, 2016
Checksum Expected = 9959301aa4ca808d00957dff88214e38
Checksum Calculated =
Result = The bitstream could not be found
-----------------------------------------------
</code></pre><ul>
<li>The first one seems ok, but I don&rsquo;t know what to make of the second one&hellip;</li>
<li>I had a look and there is indeed no file with the second checksum in the assetstore (ie, looking in <code>[dspace-dir]/assetstore/99/59/30/...</code>)</li>
<li>For what it&rsquo;s worth, there is no item on DSpace Test or S3 backups with that checksum either&hellip;</li>
<li>In other news, I&rsquo;m looking at JVM settings from the Solr 4.10.2 release, from <code>bin/solr.in.sh</code>:</li>
</ul>
<pre tabindex="0"><code># These GC settings have shown to work well for a number of common Solr workloads
GC_TUNE=&#34;-XX:-UseSuperWord \
-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:CMSFullGCsBeforeCompaction=1 \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSTriggerPermRatio=80 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled \
-XX:+AggressiveOpts&#34;
</code></pre><ul>
<li>I need to try these because they are recommended by the Solr project itself (a sketch of applying them to Tomcat&rsquo;s JVM options follows this list)</li>
<li>Also, as always, I need to read <a href="https://wiki.apache.org/solr/ShawnHeisey">Shawn Heisey&rsquo;s wiki page on Solr</a></li>
</ul>
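<ul>
<li>Our Solr runs inside Tomcat rather than standalone, so trying these would mean adding them to Tomcat&rsquo;s JVM options, roughly like this (a sketch with only a subset of the flags; the defaults file path is an assumption for the tomcat7 package):</li>
</ul>
<pre tabindex="0"><code># /etc/default/tomcat7 (path is an assumption)
JAVA_OPTS=&#34;$JAVA_OPTS -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 \
    -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled&#34;
</code></pre>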
<h2 id="2016-12-05">2016-12-05</h2>
<ul>
<li>I did some basic benchmarking on a local DSpace before and after the JVM settings above, but there wasn&rsquo;t anything amazingly obvious</li>
<li>I want to make the changes on DSpace Test and monitor the JVM heap graphs for a few days to see if they change the JVM GC patterns or anything (munin graphs)</li>
<li>Spin up new CGSpace server on Linode</li>
<li>I did a few traceroutes from Jordan and Kenya and it seems that Linode&rsquo;s Frankfurt datacenter is a few hops closer and perhaps has less packet loss than the London one, so I put the new server in Frankfurt</li>
<li>Do initial provisioning</li>
<li>Atmire responded about the MQM warnings in the DSpace logs</li>
<li>Apparently we need to change the batch edit consumers in <code>dspace/config/dspace.cfg</code>:</li>
</ul>
<pre tabindex="0"><code>event.consumer.batchedit.filters = Community|Collection+Create
</code></pre><ul>
<li>I haven&rsquo;t tested it yet, but I created a pull request: <a href="https://github.com/ilri/DSpace/pull/289">#289</a></li>
</ul>
<h2 id="2016-12-06">2016-12-06</h2>
<ul>
<li>Some author authority corrections and name standardizations for Peter:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
UPDATE 11
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
UPDATE 36
dspace=# update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
UPDATE 14
dspace=# update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
UPDATE 42
dspace=# update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
UPDATE 360
dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Pay attention to the regex to prevent false positives in tricky cases with Dutch names!</li>
<li>I will run these updates on DSpace Test and then force a Discovery reindex, and then run them on CGSpace next week</li>
<li>More work on the KM4Dev Journal article</li>
<li>In other news, it seems the batch edit patch is working, there are no more WARN errors in the logs and the batch edit seems to work</li>
<li>I need to check the CGSpace logs to see if there are still errors there, and then deploy/monitor it there</li>
<li>Paola from CCAFS mentioned she also has the &ldquo;take task&rdquo; bug on CGSpace</li>
<li>Reading about <a href="https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html"><code>shared_buffers</code> in PostgreSQL configuration</a> (default is 128MB)</li>
<li>Looks like we have ~5GB of memory used by caches on the test server (after OS and JVM heap!), so we might as well bump up the buffers for Postgres</li>
<li>The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn&rsquo;t dedicated (also runs Solr, which can benefit from OS cache) so let&rsquo;s try 1024MB</li>
<li>In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):</li>
</ul>
<pre tabindex="0"><code>$ time JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
java.lang.NullPointerException
at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
real 8m39.913s
user 1m54.190s
sys 0m22.647s
</code></pre><h2 id="2016-12-07">2016-12-07</h2>
<ul>
<li>For what it&rsquo;s worth, after running the same SQL updates on my local test server, <code>index-authority</code> runs and completes just fine</li>
<li>I will have to test more</li>
<li>Anyways, I noticed that some of the authority values I set actually have versions of author names we don&rsquo;t want, ie &ldquo;Grace, D.&rdquo;</li>
<li>For example, do a Solr query for &ldquo;first_name:Grace&rdquo; and look at the results</li>
<li>Querying that ID shows the fields that need to be changed:</li>
</ul>
<pre tabindex="0"><code>{
&#34;responseHeader&#34;: {
&#34;status&#34;: 0,
&#34;QTime&#34;: 1,
&#34;params&#34;: {
&#34;q&#34;: &#34;id:0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;indent&#34;: &#34;true&#34;,
&#34;wt&#34;: &#34;json&#34;,
&#34;_&#34;: &#34;1481102189244&#34;
}
},
&#34;response&#34;: {
&#34;numFound&#34;: 1,
&#34;start&#34;: 0,
&#34;docs&#34;: [
{
&#34;id&#34;: &#34;0b4fcbc1-d930-4319-9b4d-ea1553cca70b&#34;,
&#34;field&#34;: &#34;dc_contributor_author&#34;,
&#34;value&#34;: &#34;Grace, D.&#34;,
&#34;deleted&#34;: false,
&#34;creation_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;last_modified_date&#34;: &#34;2016-11-10T15:13:40.318Z&#34;,
&#34;authority_type&#34;: &#34;person&#34;,
&#34;first_name&#34;: &#34;D.&#34;,
&#34;last_name&#34;: &#34;Grace&#34;
}
]
}
}
</code></pre><ul>
<li>I think I can just update the <code>value</code>, <code>first_name</code>, and <code>last_name</code> fields&hellip;</li>
<li>The update syntax should be something like this, but I&rsquo;m getting errors from Solr:</li>
</ul>
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/update?commit=true&amp;wt=json&amp;indent=true&#39; -H &#39;Content-type:application/json&#39; -d &#39;[{&#34;id&#34;:&#34;1&#34;,&#34;price&#34;:{&#34;set&#34;:100}}]&#39;
{
&#34;responseHeader&#34;:{
&#34;status&#34;:400,
&#34;QTime&#34;:0},
&#34;error&#34;:{
&#34;msg&#34;:&#34;Unexpected character &#39;[&#39; (code 91) in prolog; expected &#39;&lt;&#39;\n at [row,col {unknown-source}]: [1,1]&#34;,
&#34;code&#34;:400}}
</code></pre><ul>
<li>When I try using the XML format I get an error that the <code>updateLog</code> needs to be configured for that core</li>
<li>Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 561
</code></pre><ul>
<li>Then I&rsquo;ll reindex discovery and authority and see how the authority Solr core looks</li>
<li>After this, now there are authorities for some of the &ldquo;Grace, D.&rdquo; and &ldquo;Grace, Delia&rdquo; text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):</li>
</ul>
<pre tabindex="0"><code>$ curl &#39;localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&amp;wt=json&amp;indent=true&#39;
{
&#34;responseHeader&#34;:{
&#34;status&#34;:0,
&#34;QTime&#34;:0,
&#34;params&#34;:{
&#34;q&#34;:&#34;id:18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;indent&#34;:&#34;true&#34;,
&#34;wt&#34;:&#34;json&#34;}},
&#34;response&#34;:{&#34;numFound&#34;:1,&#34;start&#34;:0,&#34;docs&#34;:[
{
&#34;id&#34;:&#34;18ea1525-2513-430a-8817-a834cd733fbc&#34;,
&#34;field&#34;:&#34;dc_contributor_author&#34;,
&#34;value&#34;:&#34;Grace, Delia&#34;,
&#34;deleted&#34;:false,
&#34;creation_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;last_modified_date&#34;:&#34;2016-12-07T10:54:34.356Z&#34;,
&#34;authority_type&#34;:&#34;person&#34;,
&#34;first_name&#34;:&#34;Delia&#34;,
&#34;last_name&#34;:&#34;Grace&#34;}]
}}
</code></pre><ul>
<li>So now I could set them all to this ID and the name would be ok, but there has to be a better way!</li>
<li>In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!</li>
<li>Better to use:</li>
</ul>
<pre tabindex="0"><code>dspace#= update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><ul>
<li>This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!</li>
<li>Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID</li>
<li>Deploy MQM WARN fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/289">#289</a>)</li>
<li>Deploy &ldquo;take task&rdquo; hack/fix on CGSpace (<a href="https://github.com/ilri/DSpace/pull/290">#290</a>)</li>
<li>I ran the following author corrections and then reindexed discovery:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set authority=&#39;b041f2f4-19e7-4113-b774-0439baabd197&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Mora Benard%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Hoek, R%&#39;;
update metadatavalue set text_value = &#39;Hoek, Rein van der&#39;, authority=&#39;4d6cbce2-6fd5-4b43-9363-58d18e7952c9&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%an der Hoek%&#39; and text_value !~ &#39;^.*W\.?$&#39;;
update metadatavalue set authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne, P%&#39;;
update metadatavalue set authority=&#39;0d8369bb-57f7-4b2f-92aa-af820b183aca&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thornton, P%&#39;;
update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
</code></pre><h2 id="2016-12-08">2016-12-08</h2>
<ul>
<li>Something weird happened and Peter Thorne&rsquo;s names all ended up as &ldquo;Thorne&rdquo;, I guess because the original authority had that as its name value:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Thorne%&#39;;
text_value | authority | confidence
------------------+--------------------------------------+------------
Thorne, P.J. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
Thorne | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
Thorne-Lyman, A. | 0781e13a-1dc8-4e3f-82e8-5c422b44a344 | -1
Thorne, M. D. | 54c52649-cefd-438d-893f-3bcef3702f07 | -1
Thorne, P.J | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
Thorne, P. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
(6 rows)
</code></pre><ul>
<li>I generated a new UUID using <code>uuidgen | tr [A-Z] [a-z]</code> and set it along with correct name variation for all records:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;b2f7603d-2fb5-4018-923a-c4ec8d85b3bb&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39;;
UPDATE 43
</code></pre><ul>
<li>Apparently we also need to normalize Phil Thornton&rsquo;s names to <code>Thornton, Philip K.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
text_value | authority | confidence
---------------------+--------------------------------------+------------
Thornton, P | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton. P.K. | 3e1e6639-d4fb-449e-9fce-ce06b5b0f702 | -1
Thornton, P K . | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P.K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P.K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, Philip K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, Philip K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P. K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
(10 rows)
</code></pre><ul>
<li>Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
UPDATE 362
</code></pre><ul>
<li>It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core); the two commands are sketched at the end of this list</li>
<li>Everything looks ok after authority and discovery reindex</li>
<li>In other news, I think we should really be using more RAM for PostgreSQL&rsquo;s <code>shared_buffers</code></li>
<li>The <a href="https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html">PostgreSQL documentation</a> recommends using 25% of the system&rsquo;s RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache</li>
</ul>
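<ul>
<li>In command form the order would be roughly this (a sketch; <code>-b</code> forces a full Discovery rebuild):</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace index-authority
$ /home/cgspace.cgiar.org/bin/dspace index-discovery -b
</code></pre>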
<h2 id="2016-12-09">2016-12-09</h2>
<ul>
<li>More work on finishing rough draft of KM4Dev article</li>
<li>Set PostgreSQL&rsquo;s <code>shared_buffers</code> on CGSpace to 10% of system RAM (1200MB)</li>
<li>Run the following author corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;34df639a-42d8-4867-a3f2-1892075fcb3f&#39;, text_value=&#39;Thorne, P.J.&#39; where resource_type_id=2 and metadata_field_id=3 and authority=&#39;18349f29-61b1-44d7-ac60-89e55546e812&#39; or authority=&#39;021cd183-946b-42bb-964e-522ebff02993&#39;;
dspace=# update metadatavalue set authority=&#39;2df8136e-d8f4-4142-b58c-562337cab764&#39;, text_value=&#39;Thornton, Philip K.&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^Thornton[,\.]? P.*&#39;;
</code></pre><ul>
<li>The authority IDs were different from when I was looking a few days ago, so I had to adjust them here</li>
</ul>
<h2 id="2016-12-11">2016-12-11</h2>
<ul>
<li>After enabling a sizable <code>shared_buffers</code> for CGSpace&rsquo;s PostgreSQL configuration the number of connections to the database dropped significantly</li>
</ul>
<p><img src="/cgspace-notes/2016/12/postgres_bgwriter-week.png" alt="postgres_bgwriter-week">
<img src="/cgspace-notes/2016/12/postgres_connections_ALL-week.png" alt="postgres_connections_ALL-week"></p>
<ul>
<li>Looking at CIAT records from last week again, they have a lot of double authors like:</li>
</ul>
<pre tabindex="0"><code>International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
</code></pre><ul>
<li>Some in the same <code>dc.contributor.author</code> field, and some in others like <code>dc.contributor.author[en_US]</code> etc</li>
<li>Removing the duplicates in OpenRefine and uploading a CSV to DSpace says &ldquo;no changes detected&rdquo;</li>
<li>Seems like the only way to sortof clean these up would be to start in SQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Center for Tropical Agriculture&#39;;
text_value | authority | confidence
-----------------------------------------------+--------------------------------------+------------
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | -1
International Center for Tropical Agriculture | | 600
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 500
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | 600
International Center for Tropical Agriculture | | -1
International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 | 500
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 600
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | -1
International Center for Tropical Agriculture | 3026b1de-9302-4f3e-85ab-ef48da024eb2 | 0
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
UPDATE 1693
dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, text_value=&#39;International Center for Tropical Agriculture&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%CIAT%&#39;;
UPDATE 35
</code></pre><ul>
<li>Work on article for KM4Dev journal</li>
</ul>
<h2 id="2016-12-13">2016-12-13</h2>
<ul>
<li>Checking in on CGSpace postgres stats again, looks like the <code>shared_buffers</code> change from a few days ago really made a big impact:</li>
</ul>
<p><img src="/cgspace-notes/2016/12/postgres_bgwriter-week-2016-12-13.png" alt="postgres_bgwriter-week">
<img src="/cgspace-notes/2016/12/postgres_connections_ALL-week-2016-12-13.png" alt="postgres_connections_ALL-week"></p>
<ul>
<li>Looking at logs, it seems we need to evaluate which logs we keep and for how long</li>
<li>Basically the only ones we <em>need</em> are <code>dspace.log</code> because those are used for legacy statistics (need to keep for 1 month)</li>
<li>Other logs will be an issue because they don&rsquo;t have date stamps</li>
<li>I will add date stamps to the logs we&rsquo;re storing from the tomcat7 user&rsquo;s cron jobs at least, using: <code>$(date --iso-8601)</code></li>
<li>Would probably be better to make custom logrotate files for them in the future</li>
<li>Clean up some unneeded log files from 2014 (they weren&rsquo;t large, just don&rsquo;t need them)</li>
<li>So basically, new cron jobs for logs should look something like this:</li>
<li>Find any file named <code>*.log*</code> that isn&rsquo;t <code>dspace.log*</code>, isn&rsquo;t already zipped, and is older than one day, and zip it:</li>
</ul>
<pre tabindex="0"><code># find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex &#34;.*\.log.*&#34; ! -iregex &#34;.*dspace\.log.*&#34; ! -iregex &#34;.*\.(gz|lrz|lzo|xz)&#34; ! -newermt &#34;Yesterday&#34; -exec schedtool -B -e ionice -c2 -n7 xz {} \;
</code></pre><ul>
<li>Since there are <code>xzgrep</code> and <code>xzless</code> we can actually just zip them after one day, why not?!</li>
<li>We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that</li>
<li>I use <code>schedtool -B</code> and <code>ionice -c2 -n7</code> to set the CPU scheduling to <code>SCHED_BATCH</code> and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less</li>
<li>When the tasks are running you can see that the policies do apply:</li>
</ul>
<pre tabindex="0"><code>$ schedtool $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;) &amp;&amp; ionice -p $(ps aux | grep &#34;xz /home&#34; | grep -v grep | awk &#39;{print $2}&#39;)
PID 17049: PRIO 0, POLICY B: SCHED_BATCH , NICE 0, AFFINITY 0xf
best-effort: prio 7
</code></pre><ul>
<li>All in all this should free up a few gigs (we were at 9.3GB free when I started)</li>
<li>Next thing to look at is whether we need Tomcat&rsquo;s access logs</li>
<li>I just looked and it seems that we saved 10GB by zipping these logs</li>
<li>Some users pointed out issues with the &ldquo;most popular&rdquo; stats on a community or collection</li>
<li>This error appears in the logs when you try to view them:</li>
</ul>
<pre tabindex="0"><code>2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:650)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:111)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:274)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:505)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:956)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:436)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1078)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:625)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:316)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at com.atmire.statistics.generator.TopNDSODatasetGenerator.toDatasetQuery(SourceFile:39)
at com.atmire.statistics.display.StatisticsDataVisitsMultidata.createDataset(SourceFile:108)
at org.dspace.statistics.content.StatisticsDisplay.createDataset(SourceFile:384)
at org.dspace.statistics.content.StatisticsDisplay.getDataset(SourceFile:404)
at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generateJsonData(SourceFile:170)
at com.atmire.statistics.mostpopular.JSONStatsMostPopularGenerator.generate(SourceFile:246)
at com.atmire.app.xmlui.aspect.statistics.JSONStatsMostPopular.generate(JSONStatsMostPopular.java:145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</code></pre><ul>
<li>It happens on development and production, so I will have to ask Atmire</li>
<li>Most likely an issue with installation/configuration</li>
</ul>
<h2 id="2016-12-14">2016-12-14</h2>
<ul>
<li>Atmire sent a quick fix for the <code>last-update.txt</code> file not found error</li>
<li>After applying pull request <a href="https://github.com/ilri/DSpace/pull/291">#291</a> on DSpace Test I no longer see the error in the logs after the <code>UpdateSolrStorageReports</code> task runs</li>
<li>Also, I&rsquo;m toying with the idea of moving the <code>tomcat7</code> user&rsquo;s cron jobs to <code>/etc/cron.d</code> so we can manage them in Ansible</li>
<li>Made a pull request with a template for the cron jobs (<a href="https://github.com/ilri/rmg-ansible-public/pull/75">#75</a>)</li>
<li>Testing SMTP from the new CGSpace server and it&rsquo;s not working, I&rsquo;ll have to tell James</li>
</ul>
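<ul>
<li>For the <code>/etc/cron.d</code> idea, the main difference from a user crontab is the extra user field in each entry; a sketch (the file name, schedule, and job here are made up):</li>
</ul>
<pre tabindex="0"><code># /etc/cron.d/dspace-maintenance
0 2 * * * tomcat7 /home/cgspace.cgiar.org/bin/dspace filter-media &gt;&gt; /home/cgspace.cgiar.org/log/filter-media.log 2&gt;&amp;1
</code></pre><ul>
<li>For the SMTP issue, DSpace has a built-in test that uses the mail settings from <code>dspace.cfg</code>:</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace test-email
</code></pre>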
<h2 id="2016-12-15">2016-12-15</h2>
<ul>
<li>Start planning for server migration this weekend, letting users know</li>
<li>I am trying to figure out what the process is to <a href="http://handle.net/hnr_support.html">update the server&rsquo;s IP in the Handle system</a>, and emailing the hdladmin account bounces(!)</li>
<li>I will contact Jane Euler directly as I know I&rsquo;ve corresponded with her in the past</li>
<li>She said that I should indeed just re-run the <code>[dspace]/bin/dspace make-handle-config</code> command and submit the new <code>sitebndl.zip</code> file to the CNRI website</li>
<li>Also I was troubleshooting some workflow issues from Bizuwork</li>
<li>I re-created the same scenario by adding a non-admin account and submitting an item, but I was able to successfully approve and commit it</li>
<li>So it turns out it&rsquo;s not a bug, it&rsquo;s just that Peter was added as a reviewer/admin AFTER the items were submitted</li>
<li>This is how DSpace works, and I need to ask if there is a way to override someone&rsquo;s submission, as the other reviewer seems to not be paying attention, or has perhaps taken the item from the task pool?</li>
<li>Run a batch edit to add &ldquo;RANGELANDS&rdquo; ILRI subject to all items containing the word &ldquo;RANGELANDS&rdquo; in their metadata for Peter Ballantyne</li>
</ul>
<p><img src="/cgspace-notes/2016/12/batch-edit1.png" alt="Select all items with &amp;ldquo;rangelands&amp;rdquo; in metadata">
<img src="/cgspace-notes/2016/12/batch-edit2.png" alt="Add RANGELANDS ILRI subject"></p>
<h2 id="2016-12-18">2016-12-18</h2>
<ul>
<li>Add four new CRP subjects for 2017 and sort the input forms alphabetically (<a href="https://github.com/ilri/DSpace/pull/294">#294</a>)</li>
<li>Test the SMTP on the new server and it&rsquo;s working</li>
<li>Last week, when we asked CGNET to update the DNS records this weekend, they misunderstood and did it immediately</li>
<li>We quickly told them to undo it, but I just realized they didn&rsquo;t undo the IPv6 AAAA record!</li>
<li>None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then</li>
<li>Update some names and authorities in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;5ff35043-942e-4d0a-b377-4daed6e3c1a3&#39;, confidence=600, text_value=&#39;Duncan, Alan&#39; where resource_type_id=2 and metadata_field_id=3 and text_value ~ &#39;^.*Duncan,? A.*&#39;;
UPDATE 204
dspace=# update metadatavalue set authority=&#39;46804b53-ea30-4a85-9ccf-b79a35816fa9&#39;, confidence=600, text_value=&#39;Mekonnen, Kindu&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Mekonnen, K%&#39;;
UPDATE 89
dspace=# update metadatavalue set authority=&#39;f840da02-26e7-4a74-b7ba-3e2b723f3684&#39;, confidence=600, text_value=&#39;Lukuyu, Ben A.&#39; where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Lukuyu, B%&#39;;
UPDATE 140
</code></pre><ul>
<li>Generated a new UUID for Ben using <code>uuidgen | tr [A-Z] [a-z]</code> as the one in Solr had his ORCID but the name format was incorrect</li>
<li>In theory DSpace should be able to check names from ORCID and update the records in the database, but I find that this doesn&rsquo;t work (see Jira bug <a href="https://jira.duraspace.org/browse/DS-3302">DS-3302</a>)</li>
<li>I need to run these updates along with the other one for CIAT that I found last week</li>
<li>Enable OCSP stapling for hosts &gt;= Ubuntu 16.04 in our Ansible playbooks (<a href="https://github.com/ilri/rmg-ansible-public/pull/76">#76</a>)</li>
<li>Working for DSpace Test on the second response:</li>
</ul>
<pre tabindex="0"><code>$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP response: no response sent
$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP Response Data:
...
Cert Status: good
</code></pre><ul>
<li>Migrate CGSpace to new server, roughly following these steps:</li>
<li>On old server:</li>
</ul>
<pre tabindex="0"><code># service tomcat7 stop
# /home/backup/scripts/postgres_backup.sh
</code></pre><ul>
<li>On new server:</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
# rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/
# rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/solr/ /home/cgspace.cgiar.org/solr
# su - postgres
$ dropdb cgspace
$ createdb -O cgspace --encoding=UNICODE cgspace
$ psql cgspace -c &#39;alter user cgspace createuser;&#39;
$ pg_restore -O -U cgspace -d cgspace -W -h localhost /home/backup/postgres/cgspace_2016-12-18.backup
$ psql cgspace -c &#39;alter user cgspace nocreateuser;&#39;
$ psql -U cgspace -f ~tomcat7/src/git/DSpace/dspace/etc/postgres/update-sequences.sql cgspace -h localhost
$ vacuumdb cgspace
$ psql cgspace
postgres=# \i /tmp/author-authority-updates-2016-12-11.sql
postgres=# \q
$ exit
# chown -R tomcat7:tomcat7 /home/cgspace.cgiar.org
# rsync -4 -av 178.79.187.182:/home/cgspace.cgiar.org/log/*.dat /home/cgspace.cgiar.org/log/
# rsync -4 -av 178.79.187.182:/home/cgspace.cgiar.org/log/dspace.log.2016-1[12]* /home/cgspace.cgiar.org/log/
# su - tomcat7
$ cd src/git/DSpace/dspace/target/dspace-installer
$ ant update clean_backups
$ exit
# systemctl start tomcat7
</code></pre><ul>
<li>It took about twenty minutes and afterwards I had to check a few things, like:
<ul>
<li>check and enable systemd timer for let&rsquo;s encrypt</li>
<li>enable root cron jobs</li>
<li>disable root cron jobs on old server after!</li>
<li>enable tomcat7 cron jobs</li>
<li>disable tomcat7 cron jobs on old server after!</li>
<li>regenerate <code>sitebndl.zip</code> with new IP for handle server and submit it to Handle.net</li>
</ul>
</li>
</ul>
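<ul>
<li>For the last step, the <code>sitebndl.zip</code> is regenerated with the <code>make-handle-config</code> launcher command, something like this (assuming the default handle-server directory):</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace make-handle-config /home/cgspace.cgiar.org/handle-server
</code></pre>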
<h2 id="2016-12-22">2016-12-22</h2>
<ul>
<li>Abenet wanted a CSV of the IITA community, but the web export doesn&rsquo;t include the <code>dc.date.accessioned</code> field</li>
<li>I had to export it from the command line using the <code>-a</code> flag:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
</code></pre><h2 id="2016-12-28">2016-12-28</h2>
<ul>
<li>We&rsquo;ve been getting two alerts per day about CPU usage on the new server from Linode</li>
<li>These are caused by the batch jobs for Solr etc that run in the early morning hours</li>
<li>The Linode default is to alert at 90% CPU usage for two hours, but I see the old server was at 150%, so maybe we just need to adjust it</li>
<li>Speaking of the old server (linode01), I think we can decommission it now</li>
<li>I checked the S3 logs on the new server (linode18) to make sure the backups have been running and everything looks good</li>
<li>In other news, I was looking at the Munin graphs for PostgreSQL on the new server and it looks slightly worrying:</li>
</ul>
<p><img src="/cgspace-notes/2016/12/postgres_size_ALL-week.png" alt="munin postgres stats"></p>
<ul>
<li>I will have to check later why the size keeps increasing</li>
</ul>
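<ul>
<li>A couple of quick queries for when I do check, to see whether it is the whole database or one table in particular that is growing (as the postgres user):</li>
</ul>
<pre tabindex="0"><code>$ psql cgspace -c &#34;SELECT pg_size_pretty(pg_database_size(&#39;cgspace&#39;));&#34;
$ psql cgspace -c &#34;SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = &#39;r&#39; ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;&#34;
</code></pre>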
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>


423
docs/2017-01/index.html Normal file
View File

@ -0,0 +1,423 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2017" />
<meta property="og:description" content="2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-01/" />
<meta property="article:published_time" content="2017-01-02T10:43:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2017"/>
<meta name="twitter:description" content="2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-01/",
"wordCount": "1594",
"datePublished": "2017-01-02T10:43:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-01/">
<title>January, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-01/">January, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-01-02T10:43:00+03:00">Mon Jan 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-01-02">2017-01-02</h2>
<ul>
<li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li>
<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li>
<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li>
</ul>
<h2 id="2017-01-04">2017-01-04</h2>
<ul>
<li>I tried to shard my local dev instance and it fails the same way:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291)
at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
... 14 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
... 16 more
</code></pre><ul>
<li>And the DSpace log shows:</li>
</ul>
<pre tabindex="0"><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-&gt;http://localhost:8081: Broken pipe (Write failed)
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
</code></pre><ul>
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
<li>The Tomcat access logs show more:</li>
</ul>
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin
&amp;version=2 HTTP/1.1&#34; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&#34; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update HTTP/1.1&#34; 200 40
</code></pre><ul>
<li>Very interesting&hellip; it creates the core and then fails somehow</li>
</ul>
<h2 id="2017-01-08">2017-01-08</h2>
<ul>
<li>Put Sisay&rsquo;s <code>item-view.xsl</code> code to show mapped collections on CGSpace (<a href="https://github.com/ilri/DSpace/pull/295">#295</a>)</li>
</ul>
<h2 id="2017-01-09">2017-01-09</h2>
<ul>
<li>A user wrote to tell me that the new display of an item&rsquo;s mappings had a crazy bug for at least one item: <a href="https://cgspace.cgiar.org/handle/10568/78596">https://cgspace.cgiar.org/handle/10568/78596</a></li>
<li>She said she only mapped it once, but it appears to be mapped 184 times</li>
</ul>
<p><img src="/cgspace-notes/2017/01/mapping-crazy-duplicate.png" alt="Crazy item mapping"></p>
<h2 id="2017-01-10">2017-01-10</h2>
<ul>
<li>I tried to clean up the duplicate mappings by exporting the item&rsquo;s metadata to CSV, editing, and re-importing, but DSpace said &ldquo;no changes were detected&rdquo;</li>
<li>I&rsquo;ve asked on the dspace-tech mailing list to see if anyone can help</li>
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80596&#39;;
</code></pre><ul>
<li>Then I deleted the others:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
</code></pre><ul>
<li>And in the item view it now shows the correct mappings</li>
<li>I will have to ask the DSpace people if this is a valid approach</li>
<li>Finish looking at the Journal Title corrections of the top 500 Journal Titles so we can make a controlled vocabulary from it</li>
</ul>
<h2 id="2017-01-11">2017-01-11</h2>
<ul>
<li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li>
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li>
</ul>
<pre tabindex="0"><code>Traceback (most recent call last):
File &#34;./fix-metadata-values.py&#34;, line 80, in &lt;module&gt;
print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0]))
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\xe4&#39; in position 15: ordinal not in range(128)
</code></pre><ul>
<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li>
</ul>
<pre tabindex="0"><code>print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0].encode(&#39;utf-8&#39;)))
</code></pre><ul>
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li>
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now get the top 500 journal titles:</li>
</ul>
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li>
<li>I will have to go through these and fix some more before making the controlled vocabulary</li>
<li>Added 30 more corrections or so, now there are 49 total and I&rsquo;ll have to get the top 500 after applying them</li>
</ul>
<h2 id="2017-01-13">2017-01-13</h2>
<ul>
<li>Add <code>FOOD SYSTEMS</code> to CIAT subjects, waiting to merge: <a href="https://github.com/ilri/DSpace/pull/296">https://github.com/ilri/DSpace/pull/296</a></li>
</ul>
<h2 id="2017-01-16">2017-01-16</h2>
<ul>
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
</ul>
<pre tabindex="0"><code>/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = &#39;91082&#39;;
</code></pre><h2 id="2017-01-17">2017-01-17</h2>
<ul>
<li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li>
<li>There are about 30 files with <code>%20</code> (space) and Spanish accents in the file name</li>
<li>At first I thought we should fix these, but actually it is <a href="https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1">prescribed by the W3 working group to convert these to UTF8 and URL encode them</a>!</li>
<li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;%27&#39;)
</code></pre><ul>
<li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><ul>
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
<li>These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without compromising the quality too much:</li>
</ul>
<pre tabindex="0"><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
</code></pre><ul>
<li>Somewhere on the Internet suggested using a DPI of 144</li>
</ul>
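<ul>
<li>A rough way to test a sample is to loop over a few PDFs and compare sizes before and after, for example with Ghostscript and GNU stat (sketch):</li>
</ul>
<pre tabindex="0"><code>$ for f in *.pdf; do gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -sOutputFile=/tmp/out.pdf &#34;$f&#34;; echo &#34;$f: $(stat -c%s &#34;$f&#34;) -&gt; $(stat -c%s /tmp/out.pdf) bytes&#34;; done
</code></pre>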
<h2 id="2017-01-19">2017-01-19</h2>
<ul>
<li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size so we will just import them as they are</li>
<li>Import 232 CIAT records into CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
<ul>
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace and carriage return characters from Excel&rsquo;s CSV exporter)</li>
<li>There were also some issues with an invalid dc.date.issued field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like ?show=full</li>
</ul>
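<ul>
<li>The URL cleanup at least is easy to script over the CSV before importing (hypothetical file name):</li>
</ul>
<pre tabindex="0"><code>$ sed -i &#39;s/?show=full//g&#39; records.csv
</code></pre>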
<h2 id="2017-01-23">2017-01-23</h2>
<ul>
<li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li>
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
</ul>
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&#34;$community&#34; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
</ul>
<pre tabindex="0"><code>10568/42161 10568/171 10568/79341
10568/41914 10568/171 10568/79340
</code></pre><h2 id="2017-01-24">2017-01-24</h2>
<ul>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Run fixes for Journal titles on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;password&#39;
</code></pre><ul>
<li>Create a new list of the top 500 journal titles from the database:</li>
</ul>
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/69">#69</a>)</li>
</ul>
<h2 id="2017-01-25">2017-01-25</h2>
<ul>
<li>Atmire says the <code>com.atmire.statistics.util.UpdateSolrStorageReports</code> and <code>com.atmire.utils.ReportSender</code> are no longer necessary because they are using a Spring scheduler for these tasks now</li>
<li>Pull request to remove them from the Ansible templates: <a href="https://github.com/ilri/rmg-ansible-public/pull/80">https://github.com/ilri/rmg-ansible-public/pull/80</a></li>
<li>Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed:
<ul>
<li>XLS Export from Content statistics</li>
<li>Most popular items</li>
<li>Show statistics on collection pages</li>
</ul>
</li>
<li>But now we have a new issue with the &ldquo;Types&rdquo; in Content statistics not being respected—we only get the defaults, despite having custom settings in <code>dspace/config/modules/atmire-cua.cfg</code></li>
</ul>
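<ul>
<li>A first sanity check will be to confirm that the deployed copy of the config (not just the one in the source tree) actually contains the custom types, for example:</li>
</ul>
<pre tabindex="0"><code>$ grep -i type /home/dspacetest.cgiar.org/config/modules/atmire-cua.cfg
</code></pre>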
<h2 id="2017-01-27">2017-01-27</h2>
<ul>
<li>Magdalena pointed out that somehow the Anonymous group had been added to the Administrators group on CGSpace (!)</li>
<li>Discuss plans to update CCAFS metadata and communities for their new flagships and phase II project identifiers</li>
<li>The flagships are in <code>cg.subject.ccafs</code>, and we need to probably make a new field for the phase II project identifiers</li>
</ul>
<h2 id="2017-01-28">2017-01-28</h2>
<ul>
<li>Merge controlled vocabulary for journal titles (<code>dc.source</code>) into CGSpace (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>Merge new CIAT subject into CGSpace (<a href="https://github.com/ilri/DSpace/pull/296">#296</a>)</li>
</ul>
<h2 id="2017-01-29">2017-01-29</h2>
<ul>
<li>Run all system updates on DSpace Test, redeploy DSpace code, and reboot the server</li>
<li>Run all system updates on CGSpace, redeploy DSpace code, and reboot the server</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

477
docs/2017-02/index.html Normal file
View File

@ -0,0 +1,477 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="February, 2017" />
<meta property="og:description" content="2017-02-07
An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------&#43;---------------&#43;---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-02/" />
<meta property="article:published_time" content="2017-02-07T07:04:52-08:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2017"/>
<meta name="twitter:description" content="2017-02-07
An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------&#43;---------------&#43;---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "February, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-02/",
"wordCount": "2028",
"datePublished": "2017-02-07T07:04:52-08:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-02/">
<title>February, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-02/">February, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-02-07T07:04:52-08:00">Tue Feb 07, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-02-07">2017-02-07</h2>
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
92550 | 313 | 80278
90774 | 1051 | 80278
(3 rows)
dspace=# delete from collection2item where id = 92551 and item_id = 80278;
DELETE 1
</code></pre><ul>
<li>Create issue on GitHub to track the addition of CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/issues/301">#301</a>)</li>
<li>Looks like we&rsquo;ll be using <code>cg.identifier.ccafsprojectpii</code> as the field name</li>
</ul>
<h2 id="2017-02-08">2017-02-08</h2>
<ul>
<li>We also need to rename some of the CCAFS Phase I flagships:
<ul>
<li>CLIMATE-SMART AGRICULTURAL PRACTICES → CLIMATE-SMART TECHNOLOGIES AND PRACTICES</li>
<li>CLIMATE RISK MANAGEMENT → CLIMATE SERVICES AND SAFETY NETS</li>
<li>LOW EMISSIONS AGRICULTURE → LOW EMISSIONS DEVELOPMENT</li>
<li>POLICIES AND INSTITUTIONS → PRIORITIES AND POLICIES FOR CSA</li>
</ul>
</li>
<li>The climate risk management one doesn&rsquo;t exist, so I will have to ask Magdalena if they want me to add it to the input forms</li>
<li>Start testing some nearly 500 author corrections that CCAFS sent me:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t &#39;correct name&#39; -m 3 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-09">2017-02-09</h2>
<ul>
<li>More work on CCAFS Phase II stuff</li>
<li>Looks like simply adding a new metadata field to <code>dspace/config/registries/cgiar-types.xml</code> and restarting DSpace causes the field to get added to the registry</li>
<li>It requires a restart but at least it allows you to manage the registry programmatically</li>
<li>It&rsquo;s not a very good way to manage the registry, though, as removing a field from the XML doesn&rsquo;t cause it to be removed from the registry, and since we always restore from database backups there would never be a scenario where we needed these fields to be re-created</li>
<li>Testing some corrections on CCAFS Phase II flagships (<code>cg.subject.ccafs</code>):</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><h2 id="2017-02-10">2017-02-10</h2>
<ul>
<li>CCAFS said they want to wait on the flagship updates (<code>cg.subject.ccafs</code>) on CGSpace, perhaps for a month or so</li>
<li>Help Marianne Gadeberg (WLE) with some user permissions as it seems she had previously been using a personal email account, and is now on a CGIAR one</li>
<li>I manually added her new account to ~25 authorizations that her old user was on</li>
</ul>
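<ul>
<li>For reference, a query like this (with a hypothetical email address and database name) lists the resource policies attached to the old account, which is what I went through manually:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -c &#34;SELECT resource_type_id, resource_id, action_id FROM resourcepolicy WHERE eperson_id = (SELECT eperson_id FROM eperson WHERE email = &#39;old-account@example.com&#39;)&#34;
</code></pre>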
<h2 id="2017-02-14">2017-02-14</h2>
<ul>
<li>Add <code>SCALING</code> to ILRI subjects (<a href="https://github.com/ilri/DSpace/pull/304">#304</a>), as Sisay&rsquo;s attempts were all sloppy</li>
<li>Cherry pick some patches from the DSpace 5.7 branch:
<ul>
<li>DS-3363 CSV import error says &ldquo;row&rdquo;, means &ldquo;column&rdquo;: f7b6c83e991db099003ee4e28ca33d3c7bab48c0</li>
<li>DS-3479 avoid adding empty metadata values during import: 329f3b48a6de7fad074d825fd12118f7e181e151</li>
<li>[DS-3456] 5x Clarify command line options for statisics import/export tools (#1623): 567ec083c8a94eb2bcc1189816eb4f767745b278</li>
<li>[DS-3458]5x Allow Shard Process to Append to an existing repo: 3c8ecb5d1fd69a1dcfee01feed259e80abbb7749</li>
</ul>
</li>
<li>I still need to test these, especially as the last two which change some stuff with Solr maintenance</li>
</ul>
<h2 id="2017-02-15">2017-02-15</h2>
<ul>
<li>Update rvm on DSpace Test and CGSpace as there was a <a href="https://github.com/justinsteven/advisories/blob/master/2017_rvm_cd_command_execution.md">security disclosure about versions less than 1.28.0</a></li>
</ul>
<h2 id="2017-02-16">2017-02-16</h2>
<ul>
<li>Looking at memory info from munin on CGSpace:</li>
</ul>
<p><img src="/cgspace-notes/2017/02/meminfo_phisical-week.png" alt="CGSpace meminfo"></p>
<ul>
<li>We are using only ~8GB of RAM for applications, and 16GB for caches!</li>
<li>The Linode machine we&rsquo;re on has 24GB of RAM but only because that&rsquo;s the only instance that had enough disk space for us (384GB)&hellip;</li>
<li>We should probably look into Google Compute Engine or Digital Ocean where we can get more storage without having to follow a linear increase in instance pricing for CPU/memory as well</li>
<li>Especially because we only use 2 out of 8 CPUs basically:</li>
</ul>
<p><img src="/cgspace-notes/2017/02/cpu-week.png" alt="CGSpace CPU"></p>
<ul>
<li>Fix issue with a duplicate declaration in the atmire-dspace-xmlui <code>pom.xml</code> (causing non-fatal warnings during the maven build)</li>
<li>Experiment with making DSpace generate HTTPS handle links, first a change in dspace.cfg or the site&rsquo;s properties file:</li>
</ul>
<pre tabindex="0"><code>handle.canonical.prefix = https://hdl.handle.net/
</code></pre><ul>
<li>And then a SQL command to update existing records:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;uri&#39;);
UPDATE 58193
</code></pre><ul>
<li>Seems to work fine!</li>
<li>I noticed a few items that have incorrect DOI links (<code>dc.identifier.doi</code>), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value not like &#39;http%://%&#39;;
</code></pre><ul>
<li>This will replace any that begin with <code>10.</code> and change them to <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;10.%&#39;;
</code></pre><ul>
<li>This will get any that begin with <code>doi:10.</code> and change them to <code>https://dx.doi.org/10.x</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^doi:(10\..+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;doi:10%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org/10.</code> to be <code>https://dx.doi.org/10.</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org/%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>http//</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;^http//(dx.doi.org/.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http//%&#39;;
</code></pre><ul>
<li>Fix DOIs like <code>dx.doi.org./</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^dx.doi.org\./.+$)&#39;, &#39;https://dx.doi.org/\1&#39;) where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;dx.doi.org./%&#39;
</code></pre><ul>
<li>Delete some invalid DOIs:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value in (&#39;DOI&#39;,&#39;CPWF Mekong&#39;,&#39;Bulawayo, Zimbabwe&#39;,&#39;bb&#39;);
</code></pre><ul>
<li>Fix some other random outliers:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.5337/2016.200&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;doi: https://dx.doi.org/10.5337/2016.200&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/doi:10.1371/journal.pone.0062898&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Http://dx.doi.org/doi:10.1371/journal.pone.0062898&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1016/j.cosust.2013.11.012&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;http:dx.doi.10.1016/j.cosust.2013.11.012&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.1080/03632415.2014.883570&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;org/10.1080/03632415.2014.883570&#39;;
dspace=# update metadatavalue set text_value = &#39;https://dx.doi.org/10.15446/agron.colomb.v32n3.46052&#39; where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value = &#39;Doi: 10.15446/agron.colomb.v32n3.46052&#39;;
</code></pre><ul>
<li>And do another round of <code>http://</code> → <code>https://</code> cleanups:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://dx.doi.org&#39;, &#39;https://dx.doi.org&#39;) where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;identifier&#39; and qualifier = &#39;doi&#39;) and text_value like &#39;http://dx.doi.org%&#39;;
</code></pre><ul>
<li>Run all DOI corrections on CGSpace</li>
<li>Something to think about here is to write a <a href="https://wiki.lyrasis.org/display/DSDOC5x/Curation+System#CurationSystem-ScriptedTasks">Curation Task</a> in Java to do these sanity checks / corrections every night</li>
<li>Then we could add a cron job for them and run them from the command line like:</li>
</ul>
<pre tabindex="0"><code>[dspace]/bin/dspace curate -t noop -i 10568/79891
</code></pre><h2 id="2017-02-20">2017-02-20</h2>
<ul>
<li>Run all system updates on DSpace Test and reboot the server</li>
<li>Run CCAFS author corrections on DSpace Test and CGSpace and force a full discovery reindex</li>
<li>Fix label of CCAFS subjects in Atmire Listings and Reports module</li>
<li>Help Sisay with SQL commands</li>
<li>Help Paola from CCAFS with the Atmire Listings and Reports module</li>
<li>Testing the <code>fix-metadata-values.py</code> script on macOS and it seems like we don&rsquo;t need to use <code>.encode('utf-8')</code> anymore when printing strings to the screen</li>
<li>It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string &ldquo;Entwicklung &amp; Ländlicher Raum&rdquo; without the <code>encode()</code> call, but print a bytes object when it <em>is</em> used:</li>
</ul>
<pre tabindex="0"><code>$ python
Python 3.6.0 (default, Dec 25 2016, 17:30:53)
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;)
Entwicklung &amp; Ländlicher Raum
&gt;&gt;&gt; print(&#39;Entwicklung &amp; Ländlicher Raum&#39;.encode())
b&#39;Entwicklung &amp; L\xc3\xa4ndlicher Raum&#39;
</code></pre><ul>
<li>So for now I will remove the encode call from the script (though it was never used on the versions on the Linux hosts), leading me to believe it really <em>was</em> a temporary problem, perhaps due to macOS or the Python build I was using.</li>
</ul>
<h2 id="2017-02-21">2017-02-21</h2>
<ul>
<li>Testing regenerating PDF thumbnails, like I started in 2016-11</li>
<li>It seems there is a bug in <code>filter-media</code> that causes it to process formats that aren&rsquo;t part of its configuration:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p &#34;ImageMagick PDF Thumbnail&#34;
File: earlywinproposal_esa_postharvest.pdf.jpg
FILTERED: bitstream 13787 (item: 10568/16881) and created &#39;earlywinproposal_esa_postharvest.pdf.jpg&#39;
File: postHarvest.jpg.jpg
FILTERED: bitstream 16524 (item: 10568/24655) and created &#39;postHarvest.jpg.jpg&#39;
</code></pre><ul>
<li>According to <code>dspace.cfg</code> the ImageMagick PDF Thumbnail plugin should only process PDFs:</li>
</ul>
<pre tabindex="0"><code>filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
</code></pre><ul>
<li>I&rsquo;ve sent a message to the mailing list and might file a Jira issue</li>
<li>Ask Atmire about the failed interpolation of the <code>dspace.internalUrl</code> variable in <code>atmire-cua.cfg</code></li>
</ul>
<h2 id="2017-02-22">2017-02-22</h2>
<ul>
<li>Atmire said I can add <code>dspace.internalUrl</code> to my build properties and the error will go away</li>
<li>It should be the local URL for accessing Tomcat from the server&rsquo;s own perspective, ie: http://localhost:8080</li>
</ul>
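<ul>
<li>So the fix on our side should just be a one-line addition to our build properties file, presumably something like:</li>
</ul>
<pre tabindex="0"><code>dspace.internalUrl = http://localhost:8080
</code></pre>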
<h2 id="2017-02-26">2017-02-26</h2>
<ul>
<li>Find all fields with &ldquo;<a href="http://hdl.handle.net">http://hdl.handle.net</a>&rdquo; values (most are in <code>dc.identifier.uri</code>, but some are in other URL-related fields like <code>cg.link.reference</code>, <code>cg.identifier.dataurl</code>, and <code>cg.identifier.url</code>):</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like &#39;http://hdl.handle.net%&#39;;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://hdl.handle.net&#39;, &#39;https://hdl.handle.net&#39;) where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like &#39;http://hdl.handle.net%&#39;;
UPDATE 58633
</code></pre><ul>
<li>This works, but I&rsquo;m thinking I&rsquo;ll wait on the replacement as there are perhaps some other places that rely on <code>http://hdl.handle.net</code> (grep the code, it&rsquo;s scary how many things are hard coded; see the example below)</li>
<li>Send message to dspace-tech mailing list with concerns about this</li>
</ul>
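<ul>
<li>For example, a rough count of the source files that hard-code the old prefix would be something like this (directory names are illustrative, run from the DSpace source root):</li>
</ul>
<pre tabindex="0"><code>$ grep -rl &#39;http://hdl.handle.net&#39; dspace/config dspace-api dspace-xmlui | wc -l
</code></pre>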
<h2 id="2017-02-27">2017-02-27</h2>
<ul>
<li>LDAP users cannot log in today, looks to be an issue with CGIAR&rsquo;s LDAP server:</li>
</ul>
<pre tabindex="0"><code>$ openssl s_client -connect svcgroot2.cgiarad.org:3269
CONNECTED(00000003)
depth=0 CN = SVCGROOT2.CGIARAD.ORG
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = SVCGROOT2.CGIARAD.ORG
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:/CN=SVCGROOT2.CGIARAD.ORG
i:/CN=CGIARAD-RDWA-CA
---
</code></pre><ul>
<li>For some reason it is now signed by a private certificate authority</li>
<li>This error seems to have started on 2017-02-25:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;unable to find valid certification path&#34; [dspace]/log/dspace.log.2017-02-*
[dspace]/log/dspace.log.2017-02-01:0
[dspace]/log/dspace.log.2017-02-02:0
[dspace]/log/dspace.log.2017-02-03:0
[dspace]/log/dspace.log.2017-02-04:0
[dspace]/log/dspace.log.2017-02-05:0
[dspace]/log/dspace.log.2017-02-06:0
[dspace]/log/dspace.log.2017-02-07:0
[dspace]/log/dspace.log.2017-02-08:0
[dspace]/log/dspace.log.2017-02-09:0
[dspace]/log/dspace.log.2017-02-10:0
[dspace]/log/dspace.log.2017-02-11:0
[dspace]/log/dspace.log.2017-02-12:0
[dspace]/log/dspace.log.2017-02-13:0
[dspace]/log/dspace.log.2017-02-14:0
[dspace]/log/dspace.log.2017-02-15:0
[dspace]/log/dspace.log.2017-02-16:0
[dspace]/log/dspace.log.2017-02-17:0
[dspace]/log/dspace.log.2017-02-18:0
[dspace]/log/dspace.log.2017-02-19:0
[dspace]/log/dspace.log.2017-02-20:0
[dspace]/log/dspace.log.2017-02-21:0
[dspace]/log/dspace.log.2017-02-22:0
[dspace]/log/dspace.log.2017-02-23:0
[dspace]/log/dspace.log.2017-02-24:0
[dspace]/log/dspace.log.2017-02-25:7
[dspace]/log/dspace.log.2017-02-26:8
[dspace]/log/dspace.log.2017-02-27:90
</code></pre><ul>
<li>Also, it seems that we need to use a different user for LDAP binds, as we&rsquo;re still using the temporary one from the root migration, so maybe we can go back to the previous user we were using</li>
<li>So it looks like the certificate is invalid AND the bind users we had been using were deleted</li>
<li>Biruk Debebe recreated the bind user and now we are just waiting for CGNET to update their certificates</li>
<li>Regarding the <code>filter-media</code> issue I found earlier, it seems that the ImageMagick PDF plugin will also process JPGs if they are in the &ldquo;Content Files&rdquo; (aka <code>ORIGINAL</code>) bundle</li>
<li>The problem likely lies in the logic of <code>ImageMagickThumbnailFilter.java</code>, as <code>ImageMagickPdfThumbnailFilter.java</code> extends it</li>
<li>Run CIAT corrections on CGSpace</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;3026b1de-9302-4f3e-85ab-ef48da024eb2&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = &#39;International Center for Tropical Agriculture&#39;;
</code></pre><ul>
<li>CGNET has fixed the certificate chain on their LDAP server</li>
<li>Redeploy CGSpace and DSpace Test to on latest <code>5_x-prod</code> branch with fixes for LDAP bind user</li>
<li>Run all system updates on CGSpace server and reboot</li>
</ul>
<h2 id="2017-02-28">2017-02-28</h2>
<ul>
<li>After running the CIAT corrections and updating the Discovery and authority indexes, there is still no change in the number of items listed for CIAT in Discovery</li>
<li>Ah, this is probably because some items have the <code>International Center for Tropical Agriculture</code> author twice, which I first noticed in 2016-12 but couldn&rsquo;t figure out how to fix</li>
<li>I think I can do it by first exporting all metadatavalues that have the author <code>International Center for Tropical Agriculture</code></li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;International Center for Tropical Agriculture&#39;) to /tmp/ciat.csv with csv;
COPY 1968
</code></pre><ul>
<li>And then use awk to print the duplicate lines to a separate file:</li>
</ul>
<pre tabindex="0"><code>$ awk -F&#39;,&#39; &#39;seen[$1]++&#39; /tmp/ciat.csv &gt; /tmp/ciat-dupes.csv
</code></pre><ul>
<li>From that file I can create a list of 279 deletes and put them in a batch script like:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
</code></pre>
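<ul>
<li>Something like this would turn the dupes CSV into that batch script and run it (a sketch, not the exact commands I used; the second CSV column is the <code>metadata_value_id</code>):</li>
</ul>
<pre tabindex="0"><code>$ awk -F&#39;,&#39; &#39;{print &#34;delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=&#34; $2 &#34;;&#34;}&#39; /tmp/ciat-dupes.csv &gt; /tmp/ciat-deletes.sql
$ psql dspace &lt; /tmp/ciat-deletes.sql
</code></pre>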
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

409
docs/2017-03/index.html Normal file
View File

@ -0,0 +1,409 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="March, 2017" />
<meta property="og:description" content="2017-03-01
Run the 279 CIAT author corrections on CGSpace
2017-03-02
Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
They might come in at the top level in one &ldquo;CGIAR System&rdquo; community, or with several communities
I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?
Need to send Peter and Michael some notes about this in a few days
Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-03/" />
<meta property="article:published_time" content="2017-03-01T17:08:52+02:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2017"/>
<meta name="twitter:description" content="2017-03-01
Run the 279 CIAT author corrections on CGSpace
2017-03-02
Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
They might come in at the top level in one &ldquo;CGIAR System&rdquo; community, or with several communities
I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?
Need to send Peter and Michael some notes about this in a few days
Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
Filed an issue on DSpace issue tracker for the filter-media bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-03/",
"wordCount": "1538",
"datePublished": "2017-03-01T17:08:52+02:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-03/">
<title>March, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-03/">March, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-03-01T17:08:52+02:00">Wed Mar 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-03-01">2017-03-01</h2>
<ul>
<li>Run the 279 CIAT author corrections on CGSpace</li>
</ul>
<h2 id="2017-03-02">2017-03-02</h2>
<ul>
<li>Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace</li>
<li>CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles</li>
<li>They might come in at the top level in one &ldquo;CGIAR System&rdquo; community, or with several communities</li>
<li>I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?</li>
<li>Need to send Peter and Michael some notes about this in a few days</li>
<li>Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI</li>
<li>Filed an issue on DSpace issue tracker for the <code>filter-media</code> bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: <a href="https://jira.duraspace.org/browse/DS-3516">DS-3516</a></li>
<li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
<li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
</ul>
<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
</code></pre><ul>
<li>This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:</li>
</ul>
<p><img src="/cgspace-notes/2017/03/thumbnail-srgb.jpg" alt="Thumbnail in sRGB colorspace"></p>
<p><img src="/cgspace-notes/2017/03/thumbnail-cmyk.jpg" alt="Thumbnial in CMYK colorspace"></p>
<ul>
<li>I filed an issue for the color space thing: <a href="https://jira.duraspace.org/browse/DS-3517">DS-3517</a></li>
</ul>
<h2 id="2017-03-03">2017-03-03</h2>
<ul>
<li>I created a patch for DS-3517 and made a pull request against upstream <code>dspace-5_x</code>: <a href="https://github.com/DSpace/DSpace/pull/1669">https://github.com/DSpace/DSpace/pull/1669</a></li>
<li>Looks like <code>-colorspace sRGB</code> alone isn&rsquo;t enough, we need to use profiles:</li>
</ul>
<pre tabindex="0"><code>$ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
</code></pre><ul>
<li>This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file</li>
<li>Note that you should set the first profile immediately after the input file</li>
<li>Also, it is better to use profiles than setting <code>-colorspace</code></li>
<li>This is a great resource describing the color stuff: <a href="http://www.imagemagick.org/Usage/formats/#profiles">http://www.imagemagick.org/Usage/formats/#profiles</a></li>
<li>Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)</li>
<li>This is trivial with <code>identify</code> (even by the <a href="http://im4java.sourceforge.net/api/org/im4java/core/IMOps.html#identify">Java ImageMagick API</a>):</li>
</ul>
<pre tabindex="0"><code>$ identify -format &#39;%r\n&#39; alc_contrastes_desafios.pdf\[0\]
DirectClass CMYK
$ identify -format &#39;%r\n&#39; Africa\ group\ of\ negotiators.pdf\[0\]
DirectClass sRGB Alpha
</code></pre><h2 id="2017-03-04">2017-03-04</h2>
<ul>
<li>Spent more time looking at the ImageMagick CMYK issue</li>
<li>The <code>default_cmyk.icc</code> and <code>default_rgb.icc</code> files are both part of the Ghostscript GPL distribution, but according to DSpace&rsquo;s <code>LICENSES_THIRD_PARTY</code> file, DSpace doesn&rsquo;t allow distribution of dependencies that are licensed solely under the GPL</li>
<li>So this issue is kinda pointless now, as the ICC profiles are absolutely necessary to make a meaningful CMYK→sRGB conversion</li>
</ul>
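<ul>
<li>Still, for reference, a rough shell sketch of how the colorspace detection and the profile-based conversion fit together (the profile paths are illustrative, not what the DSpace filter uses):</li>
</ul>
<pre tabindex="0"><code>$ if identify -format &#39;%r\n&#39; input.pdf\[0\] | grep -q CMYK; then
    convert input.pdf\[0\] -profile /path/to/default_cmyk.icc -thumbnail 300x300 -flatten -profile /path/to/default_rgb.icc input.pdf.jpg
  else
    convert input.pdf\[0\] -thumbnail 300x300 -flatten input.pdf.jpg
  fi
</code></pre>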
<h2 id="2017-03-05">2017-03-05</h2>
<ul>
<li>Look into helping developers from landportal.info with a query for items related to LAND on the REST API</li>
<li>They want something like the items that are returned by the general &ldquo;LAND&rdquo; query in the search interface, but we cannot do that</li>
<li>We can only return specific results for metadata fields, like:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;https://dspacetest.cgiar.org/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;LAND REFORM&#34;, &#34;language&#34;: null}&#39; | json_pp
</code></pre><ul>
<li>But there are hundreds of combinations of fields and values (like <code>dc.subject</code> and all the center subjects), and we can&rsquo;t use wildcards in REST!</li>
<li>Reading about enabling multiple handle prefixes in DSpace</li>
<li>There is a mailing list thread from 2011 about it: <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html</a></li>
<li>And a comment from Atmire&rsquo;s Bram about it on the DSpace wiki: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li>
<li>Bram mentions an undocumented configuration option <code>handle.plugin.checknameauthority</code>, but I noticed another one in <code>dspace.cfg</code>:</li>
</ul>
<pre tabindex="0"><code># List any additional prefixes that need to be managed by this handle server
# (as for examle handle prefix coming from old dspace repository merged in
# that repository)
# handle.additional.prefixes = prefix1[, prefix2]
</code></pre><ul>
<li>Because of this I noticed that our Handle server&rsquo;s <code>config.dct</code> was potentially misconfigured!</li>
<li>We had some default values still present:</li>
</ul>
<pre tabindex="0"><code>&#34;300:0.NA/YOUR_NAMING_AUTHORITY&#34;
</code></pre><ul>
<li>I&rsquo;ve changed them to the following and restarted the handle server:</li>
</ul>
<pre tabindex="0"><code>&#34;300:0.NA/10568&#34;
</code></pre><ul>
<li>In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk</li>
<li>From <code>dspace/config/crosswalks/google-metadata.properties</code>:</li>
</ul>
<pre tabindex="0"><code>google.citation_doi = cg.identifier.doi
</code></pre><ul>
<li>This works, and makes DSpace output the following metadata on the item view page:</li>
</ul>
<pre tabindex="0"><code>&lt;meta content=&#34;https://dx.doi.org/10.1186/s13059-017-1153-y&#34; name=&#34;citation_doi&#34;&gt;
</code></pre><ul>
<li>Submitted and merged pull request for this: <a href="https://github.com/ilri/DSpace/pull/305">https://github.com/ilri/DSpace/pull/305</a></li>
<li>Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of &ldquo;,&rdquo;: <a href="https://github.com/ilri/DSpace/pull/306">https://github.com/ilri/DSpace/pull/306</a></li>
<li>I want to show it briefly to Abenet and Peter to get feedback</li>
</ul>
<h2 id="2017-03-06">2017-03-06</h2>
<ul>
<li>Someone on the mailing list said that <code>handle.plugin.checknameauthority</code> should be false if we&rsquo;re using multiple handle prefixes</li>
</ul>
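<ul>
<li>If we go that route, the relevant <code>dspace.cfg</code> settings would presumably look something like this (using the CGIAR Library&rsquo;s 10947 prefix as an example of an additional prefix):</li>
</ul>
<pre tabindex="0"><code>handle.plugin.checknameauthority = false
handle.additional.prefixes = 10947
</code></pre>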
<h2 id="2017-03-07">2017-03-07</h2>
<ul>
<li>I set up a top-level community as a test for the CGIAR Library and imported one item with the 10947 handle prefix</li>
<li>When testing the Handle resolver locally it shows the item to be on the local repository</li>
<li>So this seems to work, with the following caveats:
<ul>
<li>New items will have the default handle</li>
<li>Communities and collections will have the default handle</li>
<li>Only items imported manually can have the other handles</li>
</ul>
</li>
<li>I need to talk to Michael and Peter to share the news, and discuss the structure of their community(s) and try some actual test data</li>
<li>We&rsquo;ll need to do some data cleaning to make sure they are using the same fields we are, like <code>dc.type</code> and <code>cg.identifier.status</code></li>
<li>Another thing is that the import process creates new <code>dc.date.accessioned</code> and <code>dc.date.available</code> fields, so we end up with duplicates (is it important to preserve the originals for these?)</li>
<li>Report DS-3520 issue to Atmire</li>
</ul>
<h2 id="2017-03-08">2017-03-08</h2>
<ul>
<li>Merge the author separator changes to <code>5_x-prod</code>, as everyone has responded positively about it, and it&rsquo;s the default in Mirage2 after all!</li>
<li>Cherry pick the <code>commons-collections</code> patch from DSpace&rsquo;s <code>dspace-5_x</code> branch to address DS-3520: <a href="https://jira.duraspace.org/browse/DS-3520">https://jira.duraspace.org/browse/DS-3520</a></li>
</ul>
<h2 id="2017-03-09">2017-03-09</h2>
<ul>
<li>Export list of sponsors so Peter can clean it up:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;) group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
COPY 285
</code></pre><h2 id="2017-03-12">2017-03-12</h2>
<ul>
<li>Test the sponsorship fixes and deletes from Peter:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
$ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Generate a new list of unique sponsors so we can update the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;sponsorship&#39;)) to /tmp/sponsorship.csv with csv;
</code></pre><ul>
<li>Pull request for controlled vocabulary if Peter approves: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li>
<li>Review Sisay&rsquo;s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: <a href="https://github.com/ilri/DSpace/pull/307">https://github.com/ilri/DSpace/pull/307</a></li>
<li>Created an issue to track the progress on the Livestock CRP theme: <a href="https://github.com/ilri/DSpace/issues/309">https://github.com/ilri/DSpace/issues/309</a></li>
<li>Created a basic theme for the Livestock CRP community</li>
</ul>
<p><img src="/cgspace-notes/2017/03/livestock-theme.png" alt="Livestock CRP theme"></p>
<h2 id="2017-03-15">2017-03-15</h2>
<ul>
<li>Merge pull request for controlled vocabulary updates for sponsor: <a href="https://github.com/ilri/DSpace/pull/308">https://github.com/ilri/DSpace/pull/308</a></li>
<li>Merge pull request for Livestock CRP theme: <a href="https://github.com/ilri/DSpace/issues/309">https://github.com/ilri/DSpace/issues/309</a></li>
<li>Create pull request for PABRA subjects (waiting for confirmation from Abenet before merging): <a href="https://github.com/ilri/DSpace/pull/310">https://github.com/ilri/DSpace/pull/310</a></li>
<li>Create pull request for CCAFS Phase II migrations (waiting for confirmation from CCAFS people): <a href="https://github.com/ilri/DSpace/pull/311">https://github.com/ilri/DSpace/pull/311</a></li>
<li>I also need to ask if either of these new fields need to be added to Discovery facets, search, and Atmire modules</li>
<li>Run all system updates on DSpace Test and re-deploy CGSpace</li>
</ul>
<h2 id="2017-03-16">2017-03-16</h2>
<ul>
<li>Merge pull request for PABRA subjects: <a href="https://github.com/ilri/DSpace/pull/310">https://github.com/ilri/DSpace/pull/310</a></li>
<li>Abenet and Peter say we can add them to Discovery, Atmire modules, etc, but I might not have time to do it now</li>
<li>Help Sisay with RTB theme again</li>
<li>Remove ICARDA subject from Discovery sidebar facets: <a href="https://github.com/ilri/DSpace/pull/312">https://github.com/ilri/DSpace/pull/312</a></li>
<li>Remove ICARDA subject from Browse and item submission form: <a href="https://github.com/ilri/DSpace/pull/313">https://github.com/ilri/DSpace/pull/313</a></li>
<li>Merge the CCAFS Phase II changes but hold off on doing the flagship metadata updates until Macaroni Bros gets their importer updated</li>
<li>Deploy latest changes and investor fixes/deletions on CGSpace</li>
<li>Run system updates on CGSpace and reboot server</li>
</ul>
<h2 id="2017-03-20">2017-03-20</h2>
<ul>
<li>Create basic XMLUI theme for PABRA community: <a href="https://github.com/ilri/DSpace/pull/315">https://github.com/ilri/DSpace/pull/315</a></li>
</ul>
<h2 id="2017-03-24">2017-03-24</h2>
<ul>
<li>Still helping Sisay try to figure out how to create a theme for the RTB community</li>
</ul>
<h2 id="2017-03-28">2017-03-28</h2>
<ul>
<li>CCAFS said they are ready for the flagship updates for Phase II to be run (<code>cg.subject.ccafs</code>), so I ran them on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>We&rsquo;ve been waiting since February to run these</li>
<li>Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
</code></pre><ul>
<li>I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc</li>
<li>Test, squash, and merge Sisay&rsquo;s RTB theme into <code>5_x-prod</code>: <a href="https://github.com/ilri/DSpace/pull/316">https://github.com/ilri/DSpace/pull/316</a></li>
</ul>
<h2 id="2017-03-29">2017-03-29</h2>
<ul>
<li>Dump a list of fields in the DC and CG schemas to compare with CG Core:</li>
</ul>
<pre tabindex="0"><code>dspace=# select case when metadata_schema_id=1 then &#39;dc&#39; else &#39;cg&#39; end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><ul>
<li>Ooh, a better one!</li>
</ul>
<pre tabindex="0"><code>dspace=# select coalesce(case when metadata_schema_id=1 then &#39;dc.&#39; else &#39;cg.&#39; end) || concat_ws(&#39;.&#39;, element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
</code></pre><h2 id="2017-03-30">2017-03-30</h2>
<ul>
<li>Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as the nightly Solr indexing generally causes usage around 150–190%, so this should make the alerts less frequent</li>
<li>Adjust the threshold for DSpace Test from 90 to 100%</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

639
docs/2017-04/index.html Normal file
View File

@ -0,0 +1,639 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2017" />
<meta property="og:description" content="2017-04-02
Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): https://github.com/ilri/DSpace/pull/317
Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-04/" />
<meta property="article:published_time" content="2017-04-02T17:08:52+02:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2017"/>
<meta name="twitter:description" content="2017-04-02
Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): https://github.com/ilri/DSpace/pull/317
Quick proof-of-concept hack to add dc.rights to the input form, including some inline instructions/hints:
Remove redundant/duplicate text in the DSpace submission license
Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-04/",
"wordCount": "2917",
"datePublished": "2017-04-02T17:08:52+02:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-04/">
<title>April, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-04/">April, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-04-02T17:08:52+02:00">Sun Apr 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-04-02">2017-04-02</h2>
<ul>
<li>Merge one change to CCAFS flagships that I had forgotten to remove last month (&ldquo;MANAGING CLIMATE RISK&rdquo;): <a href="https://github.com/ilri/DSpace/pull/317">https://github.com/ilri/DSpace/pull/317</a></li>
<li>Quick proof-of-concept hack to add <code>dc.rights</code> to the input form, including some inline instructions/hints:</li>
</ul>
<p><img src="/cgspace-notes/2017/04/dc-rights.png" alt="dc.rights in the submission form"></p>
<ul>
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
<ul>
<li>Continue testing the CMYK patch on more communities:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&gt; /tmp/filter-media-cmyk.txt 2&gt;&amp;1
</code></pre><ul>
<li>So far there are almost 500:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
484
</code></pre><ul>
<li>Looking at the CG Core document again, I&rsquo;ll send some feedback to Peter and Abenet:
<ul>
<li>We use cg.contributor.crp to indicate the CRP(s) affiliated with the item</li>
<li>DSpace has dc.date.available, but this field isn&rsquo;t particularly meaningful other than as an automatic timestamp at the time of item accession (and is identical to dc.date.accessioned)</li>
<li>dc.relation exists in CGSpace but isn&rsquo;t used; rather we use dc.relation.ispartofseries, which appears ~5,000 times holding the series name and number within that series</li>
</ul>
</li>
<li>Also, I&rsquo;m noticing some weird outliers in <code>cg.coverage.region</code>, need to remember to go correct these later:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
</code></pre><h2 id="2017-04-04">2017-04-04</h2>
<ul>
<li>The <code>filter-media</code> script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
1584
</code></pre><ul>
<li>Trying to find a way to get the number of items submitted by a certain user in 2016</li>
<li>It&rsquo;s not possible in the DSpace search / module interfaces, but might be able to be derived from <code>dc.description.provenance</code>, as that field contains the name and email of the submitter/approver, ie:</li>
</ul>
<pre tabindex="0"><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
No. of bitstreams: 1^M
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
</code></pre><ul>
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a &ldquo;checksum&rdquo; (ie, there was a bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>Then this one does the same, but for fields that don&rsquo;t contain checksums (ie, there was no bitstream in the submission):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^(Submitted|Approved).*giampieri.*2016-.*&#39; and text_value !~ &#39;^(Submitted|Approved).*giampieri.*2016-.*checksum.*&#39;;
</code></pre><ul>
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled&hellip;</li>
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ &#39;^Submitted.*giampieri.*2016-.*&#39;;
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
<ul>
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
</ul>
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
2505
</code></pre><h2 id="2017-04-06">2017-04-06</h2>
<ul>
<li>After reading the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">notes for DCAT April 2017</a> I am testing some new settings for PostgreSQL on DSpace Test:
<ul>
<li><code>db.maxconnections</code> 30→70 (the default PostgreSQL config allows 100 connections, so DSpace&rsquo;s default of 30 is quite low)</li>
<li><code>db.maxwait</code> 5000→10000</li>
<li><code>db.maxidle</code> 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)</li>
</ul>
</li>
<li>I need to look at the Munin graphs after a few days to see if the load has changed</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Discussing harvesting CIFOR&rsquo;s DSpace via OAI</li>
<li>Sisay added their OAI as a source to a new collection, but using the Simple Dublin Core method, so many fields are unqualified and duplicated</li>
<li>Looking at the <a href="https://wiki.lyrasis.org/display/DSDOC5x/XMLUI+Configuration+and+Customization">documentation</a> it seems that we probably want to be using DSpace Intermediate Metadata</li>
</ul>
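<ul>
<li>For reference, the changed connection pool settings look roughly like this in <code>dspace.cfg</code> (only the values I touched are shown):</li>
</ul>
<pre tabindex="0"><code>db.maxconnections = 70
db.maxwait = 10000
db.maxidle = 20
</code></pre>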
<h2 id="2017-04-10">2017-04-10</h2>
<ul>
<li>Adjust Linode CPU usage alerts on DSpace servers
<ul>
<li>CGSpace from 200 to 250%</li>
<li>DSpace Test from 100 to 150%</li>
</ul>
</li>
<li>Remove James from Linode access</li>
<li>Look into having CIFOR use a sub prefix of 10568 like 10568.01</li>
<li>Handle.net calls this <a href="https://www.handle.net/faq.html#4">&ldquo;derived prefixes&rdquo;</a> and it seems this would work with DSpace if we wanted to go that route</li>
<li>CIFOR is starting to test aligning their metadata more with CGSpace/CG core</li>
<li>They shared a <a href="https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full">test item</a> which is using <code>cg.coverage.country</code>, <code>cg.subject.cifor</code>, <code>dc.subject</code>, and <code>dc.date.issued</code></li>
<li>Looking at their OAI I&rsquo;m not sure it has updated as I don&rsquo;t see the new fields: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=oai_dc///col_11463_6/900</a></li>
<li>Maybe they need to make sure they are running the OAI cache refresh cron job, or maybe OAI doesn&rsquo;t export these?</li>
<li>I added <code>cg.subject.cifor</code> to the metadata registry and I&rsquo;m waiting for the harvester to re-harvest to see if it picks up more data now</li>
<li>Another possibility is that we could use a crosswalk&hellip; but I&rsquo;ve never done one.</li>
</ul>
<h2 id="2017-04-11">2017-04-11</h2>
<ul>
<li>Looking at the item from CIFOR it hasn&rsquo;t been updated yet, maybe they aren&rsquo;t running the cron job</li>
<li>I emailed Usman from CIFOR to ask if he&rsquo;s running the cron job</li>
</ul>
<h2 id="2017-04-12">2017-04-12</h2>
<ul>
<li>CIFOR says they have cleaned their OAI cache and that the cron job for OAI import is enabled</li>
<li>Now I see updated fields, like <code>dc.date.issued</code> but none from the CG or CIFOR namespaces</li>
<li>Also, DSpace Test hasn&rsquo;t re-harvested this item yet, so I will wait one more day before forcing a re-harvest</li>
<li>Looking at CIFOR&rsquo;s OAI using different metadata formats, like qualified Dublin Core and DSpace Intermediate Metadata:
<ul>
<li>QDC: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=qdc///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=qdc///col_11463_6/900</a></li>
<li>DIM: <a href="https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=dim///col_11463_6/900">https://data.cifor.org/dspace/oai/request?verb=ListRecords&amp;resumptionToken=dim///col_11463_6/900</a></li>
</ul>
</li>
<li>Looking at one of CGSpace&rsquo;s items in OAI it doesn&rsquo;t seem that metadata fields other than those in the DC schema are exported:
<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/33346?show=full">https://cgspace.cgiar.org/handle/10568/33346?show=full</a></li>
<li><a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=dim&amp;set=col_10568_68619">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=dim&amp;set=col_10568_68619</a></li>
</ul>
</li>
<li>Side note: WTF, I just saw an item on CGSpace&rsquo;s OAI that is using <code>dc.cplace.country</code> and <code>dc.rplace.region</code>, which we stopped using in 2016 after the metadata migrations:</li>
</ul>
<p><img src="/cgspace-notes/2017/04/cplace.png" alt="stale metadata in OAI"></p>
<ul>
<li>The particular item is <a href="http://hdl.handle.net/10568/6">10568/6</a> and, for what it&rsquo;s worth, the stale metadata only appears in the OAI view:
<ul>
<li>XMLUI: <a href="https://cgspace.cgiar.org/handle/10568/6?show=full">https://cgspace.cgiar.org/handle/10568/6?show=full</a></li>
<li>OAI: <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6">https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6</a></li>
</ul>
</li>
<li>I don&rsquo;t see these fields anywhere in our source code or the database&rsquo;s metadata registry, so maybe it&rsquo;s just a cache issue</li>
<li>I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace</li>
<li>Running <code>dspace oai import</code> and <code>dspace oai clean-cache</code> have zero effect, but this seems to rebuild the cache from scratch:</li>
</ul>
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
...
63900 items imported so far...
64000 items imported so far...
Total: 64056 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 829 seconds.
</code></pre><ul>
<li>After reading some threads on the DSpace mailing list, I see that <code>clean-cache</code> actually only clears cached <em>responses</em>, ie responses to client requests in the OAI web application</li>
<li>These are stored in <code>[dspace]/var/oai/requests/</code></li>
<li>The import command should theoretically catch situations like this where an item&rsquo;s metadata was updated, but in this case we changed the metadata schema and it doesn&rsquo;t seem to catch it (could be a bug!)</li>
<li>Attempting a full rebuild of OAI on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
...
58700 items imported so far...
Total: 58789 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 1032 seconds.
real 17m20.156s
user 4m35.293s
sys 1m29.310s
</code></pre><ul>
<li>Now the data for 10568/6 is correct in OAI: <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6">https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:cgspace.cgiar.org:10568/6</a></li>
<li>Perhaps I need to file a bug for this, or at least ask on the dspace-tech mailing list?</li>
<li>I wonder if we could use a crosswalk to convert to a format that CG Core wants, like <code>&lt;date Type=&quot;Available&quot;&gt;</code></li>
</ul>
<h2 id="2017-04-13">2017-04-13</h2>
<ul>
<li>Checking the <a href="https://dspacetest.cgiar.org/handle/11463/947?show=full">CIFOR item on DSpace Test</a>, it still doesn&rsquo;t have the new metadata</li>
<li>The collection status shows this message from the harvester:</li>
</ul>
<blockquote>
<p>Last Harvest Result: OAI server did not contain any updates on 2017-04-13 02:19:47.964</p>
</blockquote>
<ul>
<li>I don&rsquo;t know why there were no updates detected, so I will reset and reimport the collection</li>
<li>Usman has set up a custom crosswalk called <code>dimcg</code> that now shows CG and CIFOR metadata namespaces, but we can&rsquo;t use it because DSpace can only harvest DIM by default (from the harvesting user interface)</li>
<li>Also worth noting that the REST interface exposes all fields in the item, including CG and CIFOR fields: <a href="https://data.cifor.org/dspace/rest/items/944?expand=metadata">https://data.cifor.org/dspace/rest/items/944?expand=metadata</a></li>
<li>After re-importing the CIFOR collection it looks <em>very</em> good!</li>
<li>It seems like they have done a full metadata migration with <code>dc.date.issued</code> and <code>cg.coverage.country</code> etc</li>
<li>Submit pull request to upstream DSpace for the PDF thumbnail bug (DS-3516): <a href="https://github.com/DSpace/DSpace/pull/1709">https://github.com/DSpace/DSpace/pull/1709</a></li>
</ul>
<h2 id="2017-04-14">2017-04-14</h2>
<ul>
<li>DSpace committers reviewed my patch for DS-3516 and proposed a simpler idea involving incorrect use of <code>SelfRegisteredInputFormats</code></li>
<li>I tested the idea and it works, so I made a new patch: <a href="https://github.com/DSpace/DSpace/pull/1709">https://github.com/DSpace/DSpace/pull/1709</a></li>
<li>I discovered that we can override metadata formats in OAI by creating a new &ldquo;context&rdquo;: <a href="https://wiki.lyrasis.org/display/DSDOC5x/OAI+2.0+Server">https://wiki.lyrasis.org/display/DSDOC5x/OAI+2.0+Server</a></li>
<li>This allows us to have, say, a default &ldquo;request&rdquo; context and a &ldquo;cgiar&rdquo; context, both of which implement the DSpace Intermediate Metadata format, but have the latter use an overridden version that exposes CG metadata</li>
<li>Compare the following results:
<ul>
<li><a href="https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6">https://dspacetest.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6</a></li>
<li><a href="https://dspacetest.cgiar.org/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6">https://dspacetest.cgiar.org/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:dspacetest.cgiar.org:10568/6</a></li>
</ul>
</li>
<li>Reboot DSpace Test server to get new Linode kernel</li>
</ul>
<h2 id="2017-04-17">2017-04-17</h2>
<ul>
<li>CIFOR has now implemented a new &ldquo;cgiar&rdquo; context in their OAI that exposes CG fields, so I am re-harvesting that to see how it looks in the Discovery sidebars and searches</li>
<li>See: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(435) is still referenced from table &#34;bundle&#34;.
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
<ul>
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
<li>Setup and run with:</li>
</ul>
<pre tabindex="0"><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
$ cd ckm-cgspace-rest-api/app
$ gem install bundler
$ bundle
$ cd ..
$ rails s
</code></pre><ul>
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
</ul>
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a &#39;db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
</code></pre><ul>
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
</ul>
<pre tabindex="0"><code>$ bundle binstubs puma --path ./sbin
</code></pre><h2 id="2017-04-19">2017-04-19</h2>
<ul>
<li>Usman sent another link to their OAI interface, where the country names are now capitalized: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&amp;metadataPrefix=dim&amp;identifier=oai:data.cifor.org:11463/947</a></li>
<li>Looking at the same item in XMLUI, the countries are not capitalized: <a href="https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full">https://data.cifor.org/dspace/xmlui/handle/11463/947?show=full</a></li>
<li>So it seems he did it in the crosswalk!</li>
<li>Keep working on Ansible stuff for deploying the CKM REST API</li>
<li>We can use systemd&rsquo;s <code>Environment</code> stuff to pass the database parameters to Rails</li>
<li>Abenet noticed that the &ldquo;Workflow Statistics&rdquo; option is missing now, but we have screenshots from a presentation in 2016 when it was there</li>
<li>I filed a ticket with Atmire</li>
<li>Looking at 933 CIAT records from Sisay, he&rsquo;s having problems creating a SAF bundle to import to DSpace Test</li>
<li>I started by looking at his CSV in OpenRefine, and I see there a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#34; ||&#34;,&#34;||&#34;).replace(&#34;|| &#34;,&#34;||&#34;).replace(&#34; || &#34;,&#34;||&#34;)
</code></pre><ul>
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
</ul>
<pre tabindex="0"><code>unescape(value,&#34;url&#34;)
</code></pre><ul>
<li>Then create the filename column using the following transform from URL:</li>
</ul>
<pre tabindex="0"><code>value.split(&#39;/&#39;)[-1].replace(/#.*$/,&#34;&#34;)
</code></pre><ul>
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don&rsquo;t want on the filename</li>
<li>Also, we need to only use the PDF on the item corresponding to page 1, so we don&rsquo;t end up with literally hundreds of duplicate PDFs</li>
<li>Alternatively, I could export each page range to a standalone PDF (see the sketch after this list)&hellip;</li>
</ul>
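<ul>
<li>For reference, extracting a single page range to a standalone PDF with GhostScript would look something like this (an untested sketch with example filenames and page numbers):</li>
</ul>
<pre tabindex="0"><code>$ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=chapter.pdf -f original.pdf
</code></pre>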
<h2 id="2017-04-20">2017-04-20</h2>
<ul>
<li>Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful</li>
<li>I re-enabled it with a hidden config key <code>workflow.stats.enabled = true</code> on DSpace Test and will evaluate adding it on CGSpace</li>
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre tabindex="0"><code>value.replace(/\|\|$/,&#34;&#34;)
</code></pre><ul>
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item that uses a particular PDF, as many items point to the same PDF and including them all in the SAF bundle would add hundreds of duplicate bitstreams</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
</ul>
<p><img src="/cgspace-notes/2017/04/openrefine-flagging-duplicates.png" alt="Flagging and filtering duplicates in OpenRefine"></p>
<ul>
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>
<pre tabindex="0"><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre><ul>
<li>Add a description to the file names using:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test import of 933 records:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre><ul>
<li>Run system updates on CGSpace and reboot server</li>
<li>This includes switching nginx to using upstream with keepalive instead of direct <code>proxy_pass</code></li>
<li>Re-deploy CGSpace to latest <code>5_x-prod</code>, including the PABRA and RTB XMLUI themes, as well as the PDF processing and CMYK changes</li>
<li>More work on Ansible infrastructure stuff for Tsega&rsquo;s CKM DSpace REST API</li>
<li>I&rsquo;m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
<ul>
<li>Someone on the dspace-tech mailing list responded with a suggestion about the foreign key violation in the <code>cleanup</code> task</li>
<li>The solution is to remove the ID (ie set to NULL) from the <code>primary_bitstream_id</code> column in the <code>bundle</code> table</li>
<li>After doing that and running the <code>cleanup</code> task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
</code></pre><h2 id="2017-04-24">2017-04-24</h2>
<ul>
<li>Two users mentioned some items they recently approved not showing up in the search / XMLUI</li>
<li>I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:</li>
</ul>
<pre tabindex="0"><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
2017-04-24 00:00:15,586 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
at org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:331)
at org.dspace.discovery.SolrServiceImpl.unIndexContent(SolrServiceImpl.java:315)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:803)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
</code></pre><ul>
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
</ul>
<pre tabindex="0"><code># grep -c &#39;IndexWriter is closed&#39; [dspace]/log/dspace.log.2017-04-*
[dspace]/log/dspace.log.2017-04-01:0
[dspace]/log/dspace.log.2017-04-02:0
[dspace]/log/dspace.log.2017-04-03:0
[dspace]/log/dspace.log.2017-04-04:0
[dspace]/log/dspace.log.2017-04-05:0
[dspace]/log/dspace.log.2017-04-06:0
[dspace]/log/dspace.log.2017-04-07:0
[dspace]/log/dspace.log.2017-04-08:0
[dspace]/log/dspace.log.2017-04-09:0
[dspace]/log/dspace.log.2017-04-10:0
[dspace]/log/dspace.log.2017-04-11:0
[dspace]/log/dspace.log.2017-04-12:0
[dspace]/log/dspace.log.2017-04-13:0
[dspace]/log/dspace.log.2017-04-14:0
[dspace]/log/dspace.log.2017-04-15:0
[dspace]/log/dspace.log.2017-04-16:0
[dspace]/log/dspace.log.2017-04-17:0
[dspace]/log/dspace.log.2017-04-18:0
[dspace]/log/dspace.log.2017-04-19:0
[dspace]/log/dspace.log.2017-04-20:2293
[dspace]/log/dspace.log.2017-04-21:5992
[dspace]/log/dspace.log.2017-04-22:13278
[dspace]/log/dspace.log.2017-04-23:22720
[dspace]/log/dspace.log.2017-04-24:21422
</code></pre><ul>
<li>I restarted Tomcat and re-ran the discovery process manually:</li>
</ul>
<pre tabindex="0"><code>[dspace]/bin/dspace index-discovery
</code></pre><ul>
<li>Now everything is ok</li>
<li>Finally finished manually running the cleanup task over and over and null&rsquo;ing the conflicting IDs:</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
</code></pre><ul>
<li>Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it&rsquo;s likely we haven&rsquo;t had a cleanup task complete successfully in years&hellip;</li>
</ul>
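<ul>
<li>In hindsight, rather than discovering the conflicting IDs batch by batch, it should be possible to null them all in one pass, assuming the DSpace 5.x schema where the <code>bitstream</code> table has a <code>deleted</code> column (untested sketch):</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (select bitstream_id from bitstream where deleted is true);
</code></pre>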
<h2 id="2017-04-25">2017-04-25</h2>
<ul>
<li>Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751</li>
<li>Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:</li>
</ul>
<pre tabindex="0"><code># find [dspace]/assetstore/ -type f | wc -l
113104
</code></pre><ul>
<li>Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning; after finishing at 100% it throws this error:</li>
</ul>
<pre tabindex="0"><code>[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
[=================================================&gt; ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
[==================================================&gt;]100% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.dspace.statistics.content.SpecifiedDSODatasetGenerator
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:254)
at org.dspace.statistics.content.StatisticsDisplay.&lt;init&gt;(SourceFile:203)
at com.atmire.statistics.display.StatisticsGraph.&lt;init&gt;(SourceFile:116)
at com.atmire.statistics.display.StatisticsGraphFactory.getStatisticsDisplay(SourceFile:25)
at com.atmire.statistics.display.StatisticsDisplayFactory.parseStatisticsDisplay(SourceFile:67)
at com.atmire.statistics.display.StatisticsDisplayFactory.getStatisticsDisplays(SourceFile:49)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:178)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:111)
at com.atmire.utils.ReportSender$ReportRunnable.run(SourceFile:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.SpecifiedDSODatasetGenerator
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1858)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1701)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)
... 13 more
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpaceObjectDatasetGenerator
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:254)
at org.dspace.statistics.content.StatisticsDisplay.&lt;init&gt;(SourceFile:203)
at com.atmire.statistics.display.StatisticsGraph.&lt;init&gt;(SourceFile:116)
at com.atmire.statistics.display.StatisticsGraphFactory.getStatisticsDisplay(SourceFile:25)
at com.atmire.statistics.display.StatisticsDisplayFactory.parseStatisticsDisplay(SourceFile:67)
at com.atmire.statistics.display.StatisticsDisplayFactory.getStatisticsDisplays(SourceFile:49)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:178)
at com.atmire.statistics.statlet.XmlParser.getStatisticsDisplays(SourceFile:111)
at com.atmire.utils.ReportSender$ReportRunnable.run(SourceFile:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpaceObjectDatasetGenerator
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1858)
at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1701)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.atmire.statistics.statlet.XmlParser.parsedatasetGenerator(SourceFile:299)
at com.atmire.statistics.display.StatisticsGraph.parseDatasetGenerators(SourceFile:250)
</code></pre><ul>
<li>Run system updates on DSpace Test and reboot the server (new Java 8 131)</li>
<li>Run the SQL cleanups on the bundle table on CGSpace and run the <code>[dspace]/bin/dspace cleanup</code> task</li>
<li>I will be interested to see the file count in the assetstore as well as the database size after the next backup (last backup size is 111M)</li>
<li>Final file count after the cleanup task finished: 77843</li>
<li>So that is about 35,000 files freed, and about 7GB</li>
<li>Add logging to the cleanup cron task</li>
</ul>
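<ul>
<li>Something like this in the crontab should capture the output (the schedule and log path are just examples):</li>
</ul>
<pre tabindex="0"><code>0 4 * * * [dspace]/bin/dspace cleanup -v &gt;&gt; /var/log/dspace/cleanup.log 2&gt;&amp;1
</code></pre>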
<h2 id="2017-04-26">2017-04-26</h2>
<ul>
<li>The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though</li>
<li>Update RVM&rsquo;s Ruby from 2.3.0 to 2.4.0 on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
... reload shell to get new Ruby
$ gem install sass -v 3.3.14
$ gem install compass -v 1.0.3
</code></pre><ul>
<li>Help Tsega re-deploy the ckm-cgspace-rest-api on DSpace Test</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

445
docs/2017-05/index.html Normal file
View File

@ -0,0 +1,445 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2017" />
<meta property="og:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-05/" />
<meta property="article:published_time" content="2017-05-01T16:21:52+02:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-05/",
"wordCount": "2398",
"datePublished": "2017-05-01T16:21:52+02:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-05/">
<title>May, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-05/">May, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-05-01T16:21:52+02:00">Mon May 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-05-01">2017-05-01</h2>
<ul>
<li>ICARDA apparently started working on CG Core on their MEL repository</li>
<li>They have done a few <code>cg.*</code> fields, but not very consistent and even copy some of CGSpace items:
<ul>
<li><a href="https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full">https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full</a></li>
<li><a href="https://cgspace.cgiar.org/handle/10568/73683">https://cgspace.cgiar.org/handle/10568/73683</a></li>
</ul>
</li>
</ul>
<h2 id="2017-05-02">2017-05-02</h2>
<ul>
<li>Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request</li>
</ul>
<h2 id="2017-05-04">2017-05-04</h2>
<ul>
<li>Sync DSpace Test with database and assetstore from CGSpace</li>
<li>Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server</li>
<li>Now I can see the workflow statistics and am able to select users, but everything returns 0 items</li>
<li>Megan says there are still some mapped items that are not appearing since last week, so I forced a full <code>index-discovery -b</code></li>
<li>Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test) tomorrow: <a href="https://cgspace.cgiar.org/handle/10568/80731">https://cgspace.cgiar.org/handle/10568/80731</a></li>
</ul>
<h2 id="2017-05-05">2017-05-05</h2>
<ul>
<li>Discovered that CGSpace has ~700 items that are missing the <code>cg.identifier.status</code> field</li>
<li>Need to perhaps try using the &ldquo;required metadata&rdquo; curation task to find items missing these fields:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - &gt; /tmp/curation.out
</code></pre><ul>
<li>It seems the curation task dies when it finds an item which has missing metadata</li>
</ul>
<h2 id="2017-05-06">2017-05-06</h2>
<ul>
<li>Add &ldquo;Blog Post&rdquo; to <code>dc.type</code></li>
<li>Create ticket on Atmire tracker to ask about commissioning them to develop the feature to expose ORCID via REST/OAI: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510</a></li>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/Curation+System">DSpace curation docs</a> the fact that the <code>requiredmetadata</code> curation task stops when it finds a missing metadata field is by design</li>
</ul>
<h2 id="2017-05-07">2017-05-07</h2>
<ul>
<li>Testing one replacement for CCAFS Flagships (<code>cg.subject.ccafs</code>), first changed in the submission forms, and then in the database:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
</code></pre><ul>
<li>Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones</li>
<li>Waiting for feedback from CCAFS, then I can merge <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
</ul>
<h2 id="2017-05-08">2017-05-08</h2>
<ul>
<li>Start working on CGIAR Library migration</li>
<li>We decided to use AIP export to preserve the hierarchies and handles of communities and collections</li>
<li>When ingesting some collections I was getting <code>java.lang.OutOfMemoryError: GC overhead limit exceeded</code>, which can be solved by disabling the GC timeout with <code>-XX:-UseGCOverheadLimit</code></li>
<li>Other times I was getting an error about heap space, so I kept bumping the RAM allocation by 512MB each time it crashed (up to 4096m!)</li>
<li>This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using <code>dspace cleanup -v</code>, or else you&rsquo;ll run out of disk space</li>
<li>In the end I realized it&rsquo;s better to use submission mode (<code>-s</code>) to ingest the community object as a single AIP without its children, followed by each of the collections:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
</code></pre><ul>
<li>Note that in submission mode DSpace ignores the handle specified in <code>mets.xml</code> in the zip file, so you need to turn that off with <code>-o ignoreHandle=false</code></li>
<li>The <code>-u</code> option suppresses prompts, to allow the process to run without user input</li>
<li>Give feedback to CIFOR about their data quality:
<ul>
<li>Suggestion: uppercase dc.subject, cg.coverage.region, and cg.coverage.subregion in your crosswalk so they match CGSpace and therefore can be faceted / reported on easier</li>
<li>Suggestion: use CGSpace&rsquo;s CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml</li>
<li>Suggestion: clean up duplicates and errors in funders, perhaps use a controlled vocabulary like ours, see: dspace/config/controlled-vocabularies/dc-description-sponsorship.xml</li>
<li>Suggestion: use dc.type &ldquo;Blog Post&rdquo; instead of &ldquo;Blog&rdquo; for your blog post items (we are also adding a &ldquo;Blog Post&rdquo; type to CGSpace soon)</li>
<li>Question: many of your items use dc.document.uri AND cg.identifier.url with the same text value?</li>
</ul>
</li>
<li>Help Marianne from WLE with an Open Search query to show the latest WLE CRP outputs: <a href="https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&amp;sort_by=2&amp;order=DESC">https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&amp;sort_by=2&amp;order=DESC</a></li>
<li>This uses the webui&rsquo;s item list sort options, see <code>webui.itemlist.sort-option</code> in <code>dspace.cfg</code></li>
<li>The equivalent Discovery search would be: <a href="https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&amp;filter_relational_operator_1=equals&amp;filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&amp;submit_apply_filter=&amp;query=&amp;rpp=10&amp;sort_by=dc.date.issued_dt&amp;order=desc">https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&amp;filter_relational_operator_1=equals&amp;filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&amp;submit_apply_filter=&amp;query=&amp;rpp=10&amp;sort_by=dc.date.issued_dt&amp;order=desc</a></li>
</ul>
<h2 id="2017-05-09">2017-05-09</h2>
<ul>
<li>The CGIAR Library metadata has some blank metadata values, which leads to <code>|||</code> in the Discovery facets</li>
<li>Clean these up in the database using:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><ul>
<li>I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up</li>
<li>Hours into the re-ingestion I ran into more errors, and had to erase everything and start over <em>again</em>!</li>
<li>Now, no matter what I do I keep getting foreign key errors&hellip;</li>
</ul>
<pre tabindex="0"><code>Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34;
Detail: Key (handle_id)=(80928) already exists.
</code></pre><ul>
<li>I think those errors actually come from me running the <code>update-sequences.sql</code> script while Tomcat/DSpace are running</li>
<li>Apparently you need to stop Tomcat!</li>
</ul>
<h2 id="2017-05-10">2017-05-10</h2>
<ul>
<li>Atmire says they are willing to extend the ORCID implementation, and I&rsquo;ve asked them to provide a quote</li>
<li>I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields</li>
<li>Finally finished importing all the CGIAR Library content, final method was:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit&#34;
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
$ [dspace]/bin/dspace packager -s -t AIP -o ignoreHandle=false -e some@user.com -p 10568/80923 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
</code></pre><ul>
<li>Basically, import the smaller communities using recursive AIP import (with <code>skipIfParentMissing</code>)</li>
<li>Then, for the larger collection, create the community, collections, and items separately, ingesting the items one by one</li>
<li>The <code>-XX:-UseGCOverheadLimit</code> JVM option helps with some issues in large imports</li>
<li>After this I ran the <code>update-sequences.sql</code> script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
</code></pre><h2 id="2017-05-13">2017-05-13</h2>
<ul>
<li>After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually <a href="https://en.wikipedia.org/wiki/Null_character">NUL</a> characters in the <code>dc.description.abstract</code> field (at least) on the lines where CSV importing was failing</li>
<li>I tried to find a way to remove the characters in vim or Open Refine, but decided it was quicker to just remove the column temporarily and import it (a quick shell approach is sketched below)</li>
<li>The import was successful and detected 2022 changes, which should likely be the rest that were failing to import before</li>
</ul>
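<ul>
<li>For next time, the NUL characters could probably be stripped from the CSV before importing with something like this (the file names are just examples):</li>
</ul>
<pre tabindex="0"><code>$ tr -d &#39;\000&#39; &lt; metadata.csv &gt; metadata-nonul.csv
</code></pre>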
<h2 id="2017-05-15">2017-05-15</h2>
<ul>
<li>To delete the blank lines that cause issues during import we need to use a regex in vim <code>g/^$/d</code></li>
<li>After that I started looking in the <code>dc.subject</code> field to try to pull countries and regions out, but there are too many values in there</li>
<li>Bump the Academicons dependency of the Mirage 2 themes from 1.6.0 to 1.8.0 because the upstream deleted the old tag and now the build is failing: <a href="https://github.com/ilri/DSpace/pull/321">#321</a></li>
<li>Merge changes to CCAFS project identifiers and flagships: <a href="https://github.com/ilri/DSpace/pull/320">#320</a></li>
<li>Run updates for CCAFS flagships on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>
<p>These include:</p>
<ul>
<li>GENDER AND SOCIAL DIFFERENTIATION→GENDER AND SOCIAL INCLUSION</li>
<li>MANAGING CLIMATE RISK→CLIMATE SERVICES AND SAFETY NETS</li>
</ul>
</li>
<li>
<p>Re-deploy CGSpace and DSpace Test and run system updates</p>
</li>
<li>
<p>Reboot DSpace Test</p>
</li>
<li>
<p>Fix cron jobs for log management on DSpace Test, as they weren&rsquo;t catching <code>dspace.log.*</code> files correctly; we had over six months of them taking up many gigabytes of disk space</p>
</li>
</ul>
<h2 id="2017-05-16">2017-05-16</h2>
<ul>
<li>Discuss updates to WLE themes for their Phase II</li>
<li>Make an issue to track the changes to <code>cg.subject.wle</code>: <a href="https://github.com/ilri/DSpace/issues/322">#322</a></li>
</ul>
<h2 id="2017-05-17">2017-05-17</h2>
<ul>
<li>Looking into the error I get when trying to create a new collection on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>ERROR: duplicate key value violates unique constraint &#34;handle_pkey&#34; Detail: Key (handle_id)=(84834) already exists.
</code></pre><ul>
<li>I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn&rsquo;t helped</li>
<li>It appears item with <code>handle_id</code> 84834 is one of the imported CGIAR Library items:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=84834;
handle_id | handle | resource_type_id | resource_id
-----------+------------+------------------+-------------
84834 | 10947/1332 | 2 | 87113
</code></pre><ul>
<li>Looks like the max <code>handle_id</code> is actually much higher:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
handle_id | handle | resource_type_id | resource_id
-----------+----------+------------------+-------------
86873 | 10947/99 | 2 | 89153
(1 row)
</code></pre><ul>
<li>I&rsquo;ve posted on the dspace-tech mailing list to see if I can just manually set the <code>handle_seq</code> to that value</li>
<li>Actually, it seems I can manually set the handle sequence using:</li>
</ul>
<pre tabindex="0"><code>dspace=# select setval(&#39;handle_seq&#39;,86873);
</code></pre><ul>
<li>After that I can create collections just fine, though I&rsquo;m not sure if it has other side effects</li>
</ul>
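<ul>
<li>For future reference, the lookup and <code>setval</code> steps could probably be combined so the sequence always picks up the current maximum:</li>
</ul>
<pre tabindex="0"><code>dspace=# select setval(&#39;handle_seq&#39;, (select max(handle_id) from handle));
</code></pre>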
<h2 id="2017-05-21">2017-05-21</h2>
<ul>
<li>Start creating a basic theme for the CGIAR System Organization&rsquo;s community on CGSpace</li>
<li>Using colors from the <a href="http://library.cgiar.org/handle/10947/2699">CGIAR Branding guidelines (2014)</a></li>
<li>Make a GitHub issue to track this work: <a href="https://github.com/ilri/DSpace/issues/324">#324</a></li>
</ul>
<h2 id="2017-05-22">2017-05-22</h2>
<ul>
<li>Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested</li>
<li>Peter wanted a list of authors in here, so I generated a list of collections using the &ldquo;View Source&rdquo; on each community and this hacky awk:</li>
</ul>
<pre tabindex="0"><code>$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ &#39;{print $3&#34;/&#34;$4}&#39; | awk -F\&#34; &#39;{print $1}&#39; | vim -
</code></pre><ul>
<li>Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/1
0&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;109
47/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947
/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947
/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;,
&#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;109
47/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2
531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;
, &#39;10947/2537&#39;, &#39;10568/93761&#39;)));
</code></pre><ul>
<li>To get a CSV (with counts) from that:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*)
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;)
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10947/2&#39;, &#39;10947/3&#39;, &#39;10947/10&#39;, &#39;10947/4&#39;, &#39;10947/5&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;, &#39;10947/11&#39;, &#39;10947/25&#39;, &#39;10947/12&#39;, &#39;10947/26&#39;, &#39;10947/27&#39;, &#39;10947/28&#39;, &#39;10947/29&#39;, &#39;10947/30&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/38&#39;, &#39;10947/19&#39;, &#39;10947/39&#39;, &#39;10947/40&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/2512&#39;, &#39;10947/44&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/45&#39;, &#39;10947/46&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/2518&#39;, &#39;10947/2776&#39;, &#39;10947/2790&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2782&#39;, &#39;10947/2525&#39;, &#39;10947/2836&#39;, &#39;10947/2524&#39;, &#39;10947/2878&#39;, &#39;10947/2520&#39;, &#39;10947/2523&#39;, &#39;10947/2786&#39;, &#39;10947/2631&#39;, &#39;10947/2589&#39;, &#39;10947/2519&#39;, &#39;10947/2708&#39;, &#39;10947/2526&#39;, &#39;10947/2871&#39;, &#39;10947/2527&#39;, &#39;10947/4467&#39;, &#39;10947/3457&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2533&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2538&#39;, &#39;10947/2534&#39;, &#39;10947/2540&#39;, &#39;10947/2900&#39;, &#39;10947/2539&#39;, &#39;10947/2784&#39;, &#39;10947/2536&#39;, &#39;10947/2805&#39;, &#39;10947/2541&#39;, &#39;10947/2535&#39;, &#39;10947/2537&#39;, &#39;10568/93761&#39;))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;
</code></pre><h2 id="2017-05-23">2017-05-23</h2>
<ul>
<li>Add Affiliation to filters on Listing and Reports module (<a href="https://github.com/ilri/DSpace/pull/325">#325</a>)</li>
<li>Start looking at WLE&rsquo;s Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!</li>
<li>For now I&rsquo;ve suggested that they just change the collection names and that we fix their metadata manually afterwards</li>
<li>Also, they have a lot of messed up values in their <code>cg.subject.wle</code> field so I will clean up some of those first:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
COPY 111
</code></pre><ul>
<li>Respond to Atmire message about ORCIDs, saying that right now we&rsquo;d prefer to just have them available via REST API like any other metadata field, and that I&rsquo;m available for a Skype</li>
</ul>
<h2 id="2017-05-26">2017-05-26</h2>
<ul>
<li>Increase max file size in nginx so that CIP can upload some larger PDFs</li>
<li>Agree to talk with Atmire after the June DSpace developers meeting where they will be discussing exposing ORCIDs via REST/OAI</li>
</ul>
<h2 id="2017-05-28">2017-05-28</h2>
<ul>
<li>File an issue on GitHub to explore/track migration to proper country/region codes (ISO 2/3 and UN M.49): <a href="https://github.com/ilri/DSpace/issues/326">#326</a></li>
<li>Ask Peter how the Landportal.info people should acknowledge us as the source of data on their website</li>
<li>Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Find all of Amos Omore&rsquo;s author name variations so I can link them to his authority entry that has an ORCID:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Omore, A%&#39;;
</code></pre><ul>
<li>Set the authority for all variations to one containing an ORCID:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;4428ee88-90ef-4107-b837-3c0ec988520b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Omore, A%&#39;;
UPDATE 187
</code></pre><ul>
<li>Next I need to do Edgar Twine:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>But it doesn&rsquo;t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via &ldquo;Edit this Item&rdquo; and looked up his ORCID and linked it there</li>
<li>Now I should be able to set his name variations to the new authority:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set authority=&#39;f70d0a01-d562-45b8-bca3-9cf7f249bc8b&#39;, confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like &#39;Twine, E%&#39;;
</code></pre><ul>
<li>Run the corrections on CGSpace and then update discovery / authority</li>
<li>I notice that there are a handful of <code>java.lang.OutOfMemoryError: Java heap space</code> errors in the Catalina logs on CGSpace, so I should go look into that&hellip;</li>
</ul>
<h2 id="2017-05-29">2017-05-29</h2>
<ul>
<li>Discuss WLE themes and subjects with Mia and Macaroni Bros</li>
<li>We decided we need to create metadata fields for Phase I and II themes</li>
<li>I&rsquo;ve updated the existing GitHub issue for Phase II (<a href="https://github.com/ilri/DSpace/issues/322">#322</a>) and created a new one to track the changes for Phase I themes (<a href="https://github.com/ilri/DSpace/issues/327">#327</a>)</li>
<li>After Macaroni Bros update the WLE website importer we will rename the WLE collections to reflect Phase II</li>
<li>Also, we need to have Mia and Udana look through the existing metadata in <code>cg.subject.wle</code> as it is quite a mess</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

324
docs/2017-06/index.html Normal file
View File

@ -0,0 +1,324 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="June, 2017" />
<meta property="og:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-06/" />
<meta property="article:published_time" content="2017-06-01T10:14:52+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "June, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-06/",
"wordCount": "1261",
"datePublished": "2017-06-01T10:14:52+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-06/">
<title>June, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-06/">June, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-06-01T10:14:52+03:00">Thu Jun 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-06-01">2017-06-01</h2>
<ul>
<li>After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes</li>
<li>The <code>cg.identifier.wletheme</code> field will be used for both Phase I and Phase II Research Themes</li>
<li>Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there</li>
<li>The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo;</li>
<li>Tagged all items in the current Phase I collections with their appropriate themes</li>
<li>Create pull request to add Phase II research themes to the submission form: <a href="https://github.com/ilri/DSpace/pull/328">#328</a></li>
<li>Add <code>cg.subject.system</code> to CGSpace metadata registry, for subject from the upcoming CGIAR Library migration</li>
</ul>
<h2 id="2017-06-04">2017-06-04</h2>
<ul>
<li>After adding <code>cg.identifier.wletheme</code> to 1106 WLE items I can see the field on XMLUI but not in REST!</li>
<li>Strangely it happens on DSpace Test AND on CGSpace!</li>
<li>I tried to re-index Discovery but it didn&rsquo;t fix it</li>
<li>Run all system updates on DSpace Test and reboot the server</li>
<li>After rebooting the server (and therefore restarting Tomcat) the new metadata field is available</li>
<li>I&rsquo;ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket</li>
</ul>
<h2 id="2016-06-05">2016-06-05</h2>
<ul>
<li>Rename WLE&rsquo;s &ldquo;Research Themes&rdquo; sub-community to &ldquo;WLE Phase I Research Themes&rdquo; on DSpace Test so Macaroni Bros can continue their testing</li>
<li>Macaroni Bros tested it and said it&rsquo;s fine, so I renamed it on CGSpace as well</li>
<li>Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract the page from/to values from cg.identifier.url and dc.format.extent, respectively:
<ul>
<li>cg.identifier.url: <code>value.split(&quot;page=&quot;, &quot;&quot;)[1]</code></li>
<li>dc.format.extent: <code>value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[1].toNumber() - value.replace(&quot;p. &quot;, &quot;&quot;).split(&quot;-&quot;)[0].toNumber()</code></li>
</ul>
</li>
<li>Finally, after some filtering to see which small outliers there were (based on dc.format.extent using &ldquo;p. 1-14&rdquo; vs &ldquo;29 p.&rdquo;), create a new column with last page number:
<ul>
<li><code>cells[&quot;dc.page.from&quot;].value.toNumber() + cells[&quot;dc.format.pages&quot;].value.toNumber()</code></li>
</ul>
</li>
<li>Then create a new, unique file name to be used in the output, based on a SHA1 of the dc.title and with a description:
<ul>
<li>dc.page.to: <code>value.split(&quot; &quot;)[0].replace(&quot;,&quot;,&quot;&quot;).toLowercase() + &quot;-&quot; + sha1(value).get(1,9) + &quot;.pdf__description:&quot; + cells[&quot;dc.type&quot;].value</code></li>
</ul>
</li>
<li>Start processing 769 records after filtering the following (there are another 159 records that have some other format, or for example their own PDF, which I will process later), using a modified <code>generate-thumbnails.py</code> script to read certain fields and then pass them to GhostScript (a rough shell equivalent is sketched at the end of this section):
<ul>
<li>cg.identifier.url: <code>value.contains(&quot;page=&quot;)</code></li>
<li>dc.format.extent: <code>or(value.contains(&quot;p. &quot;),value.contains(&quot; p.&quot;))</code></li>
<li>Command like: <code>$ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=beans.pdf -f 12605-1.pdf</code></li>
</ul>
</li>
<li>17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF</li>
<li>I&rsquo;ve flagged them and proceeded without them (752 total) on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
</code></pre><ul>
<li>I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)</li>
<li>Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT</li>
<li>Restart Tomcat on CGSpace so that the <code>cg.identifier.wletheme</code> field is available on REST API for Macaroni Bros</li>
</ul>
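<ul>
<li>Going back to the CIAT book chapters, the batch extraction step boils down to a loop like this (a rough sketch, not the actual modified <code>generate-thumbnails.py</code>; it assumes a CSV with source file, first page, last page, and output name columns):</li>
</ul>
<pre tabindex="0"><code>$ while IFS=, read -r source first last output; do
    gs -dNOPAUSE -dBATCH -dFirstPage=&#34;$first&#34; -dLastPage=&#34;$last&#34; -sDEVICE=pdfwrite -sOutputFile=&#34;$output&#34; -f &#34;$source&#34;
  done &lt; pages.csv
</code></pre>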
<h2 id="2017-06-07">2017-06-07</h2>
<ul>
<li>Testing <a href="https://github.com/ilri/DSpace/pull/319">Atmire&rsquo;s patch for the CUA Workflow Statistics again</a></li>
<li>It still doesn&rsquo;t seem to give the results I&rsquo;d expect: there are no results for Maria Garruccio, or for the ILRI community!</li>
<li>Then I&rsquo;ll file an update to the issue on Atmire&rsquo;s tracker</li>
<li>Created a new branch with just the relevant changes, so I can send it to them</li>
<li>One thing I noticed is that there is a failed database migration related to CUA:</li>
</ul>
<pre tabindex="0"><code>+----------------+----------------------------+---------------------+---------+
| Version | Description | Installed on | State |
+----------------+----------------------------+---------------------+---------+
| 1.1 | Initial DSpace 1.1 databas | | PreInit |
| 1.2 | Upgrade to DSpace 1.2 sche | | PreInit |
| 1.3 | Upgrade to DSpace 1.3 sche | | PreInit |
| 1.3.9 | Drop constraint for DSpace | | PreInit |
| 1.4 | Upgrade to DSpace 1.4 sche | | PreInit |
| 1.5 | Upgrade to DSpace 1.5 sche | | PreInit |
| 1.5.9 | Drop constraint for DSpace | | PreInit |
| 1.6 | Upgrade to DSpace 1.6 sche | | PreInit |
| 1.7 | Upgrade to DSpace 1.7 sche | | PreInit |
| 1.8 | Upgrade to DSpace 1.8 sche | | PreInit |
| 3.0 | Upgrade to DSpace 3.x sche | | PreInit |
| 4.0 | Initializing from DSpace 4 | 2015-11-20 12:42:52 | Success |
| 5.0.2014.08.08 | DS-1945 Helpdesk Request a | 2015-11-20 12:42:53 | Success |
| 5.0.2014.09.25 | DS 1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2014.09.26 | DS-1582 Metadata For All O | 2015-11-20 12:42:55 | Success |
| 5.0.2015.01.27 | MigrateAtmireExtraMetadata | 2015-11-20 12:43:29 | Success |
| 5.0.2017.04.28 | CUA eperson metadata migra | 2017-06-07 11:07:28 | OutOrde |
| 5.5.2015.12.03 | Atmire CUA 4 migration | 2016-11-27 06:39:05 | OutOrde |
| 5.5.2015.12.03 | Atmire MQM migration | 2016-11-27 06:39:06 | OutOrde |
| 5.6.2016.08.08 | CUA emailreport migration | 2017-01-29 11:18:56 | OutOrde |
+----------------+----------------------------+---------------------+---------+
</code></pre><ul>
<li>Merge the pull request for <a href="https://github.com/ilri/DSpace/pull/328">WLE Phase II themes</a></li>
</ul>
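<ul>
<li>Regarding the failed CUA migration above, the Flyway migration state can be inspected (and, if needed, repaired) with DSpace&rsquo;s database commands; I would test this on DSpace Test first:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace database info
$ [dspace]/bin/dspace database repair
</code></pre>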
<h2 id="2017-06-18">2017-06-18</h2>
<ul>
<li>Redeploy CGSpace with latest changes from <code>5_x-prod</code>, run system updates, and reboot the server</li>
<li>Continue working on ansible infrastructure changes for CGIAR Library</li>
</ul>
<h2 id="2017-06-20">2017-06-20</h2>
<ul>
<li>Import Abenet and Peter&rsquo;s changes to the CGIAR Library CRP community</li>
<li>Due to them using Windows and renaming some columns there were formatting, encoding, and duplicate metadata value issues</li>
<li>I had to remove some fields from the CSV and rename some back to, e.g., <code>dc.subject[en_US]</code>, just so DSpace would detect changes properly</li>
<li>Now it looks much better: <a href="https://dspacetest.cgiar.org/handle/10947/2517">https://dspacetest.cgiar.org/handle/10947/2517</a></li>
<li>Removing the HTML tags and HTML/XML entities using the following GREL:
<ul>
<li><code>replace(value,/&lt;\/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/,'')</code></li>
<li><code>value.unescape(&quot;html&quot;).unescape(&quot;xml&quot;)</code></li>
</ul>
</li>
<li>Finally import 914 CIAT Book Chapters to CGSpace in two batches:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &amp;&gt; /tmp/ciat-books.log
$ JAVA_OPTS=&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34; [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &amp;&gt; /tmp/ciat-books2.log
</code></pre><h2 id="2017-06-25">2017-06-25</h2>
<ul>
<li>WLE has said that one of their Phase II research themes is being renamed from <code>Regenerating Degraded Landscapes</code> to <code>Restoring Degraded Landscapes</code></li>
<li>Pull request with the changes to <code>input-forms.xml</code>: <a href="https://github.com/ilri/DSpace/pull/329">#329</a></li>
<li>As of now it doesn&rsquo;t look like there are any items using this research theme so we don&rsquo;t need to do any updates (otherwise, a hypothetical update is sketched after this list):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like &#39;Regenerating Degraded Landscapes%&#39;;
text_value
------------
(0 rows)
</code></pre><ul>
<li>Marianne from WLE asked if they can have both Phase I and II research themes together in the item submission form</li>
<li>Perhaps we can add them together in the same question for <code>cg.identifier.wletheme</code></li>
</ul>
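<ul>
<li>Had the query above returned any rows, the rename could have been applied with an update along these lines (hypothetical, since there were zero matches):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Restoring Degraded Landscapes&#39; where resource_type_id=2 and metadata_field_id=237 and text_value=&#39;Regenerating Degraded Landscapes&#39;;
</code></pre>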
<h2 id="2017-06-30">2017-06-30</h2>
<ul>
<li>CGSpace went down briefly, I see lots of these errors in the dspace logs:</li>
</ul>
<pre tabindex="0"><code>Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
</code></pre><ul>
<li>After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load</li>
<li>Might be a good time to adjust DSpace&rsquo;s database connection settings, like I first mentioned in April, 2017 after reading the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">2017-04 DCAT comments</a></li>
<li>I&rsquo;ve adjusted the following in CGSpace&rsquo;s config (sketched as <code>dspace.cfg</code> lines after the screenshots below):
<ul>
<li><code>db.maxconnections</code> 30→70 (the default PostgreSQL config allows 100 connections, so DSpace&rsquo;s default of 30 is quite low)</li>
<li><code>db.maxwait</code> 5000→10000</li>
<li><code>db.maxidle</code> 8→20 (DSpace default is -1, unlimited, but we had set it to 8 earlier)</li>
</ul>
</li>
<li>We will need to adjust this again (as well as the <code>pg_hba.conf</code> settings) when we deploy Tsega&rsquo;s REST API</li>
<li>Whip up a test for Marianne of WLE to be able to show both their Phase I and II research themes in the CGSpace item submission form:</li>
</ul>
<p><img src="/cgspace-notes/2017/06/wle-theme-test-a.png" alt="Test A for displaying the Phase I and II research themes">
<img src="/cgspace-notes/2017/06/wle-theme-test-b.png" alt="Test B for displaying the Phase I and II research themes"></p>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

329
docs/2017-07/index.html Normal file
View File

@ -0,0 +1,329 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="July, 2017" />
<meta property="og:description" content="2017-07-01
Run system updates and reboot DSpace Test
2017-07-04
Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-07/" />
<meta property="article:published_time" content="2017-07-01T18:03:52+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2017"/>
<meta name="twitter:description" content="2017-07-01
Run system updates and reboot DSpace Test
2017-07-04
Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "July, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-07/",
"wordCount": "1151",
"datePublished": "2017-07-01T18:03:52+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-07/">
<title>July, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-07/">July, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-07-01T18:03:52+03:00">Sat Jul 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-07-01">2017-07-01</h2>
<ul>
<li>Run system updates and reboot DSpace Test</li>
</ul>
<h2 id="2017-07-04">2017-07-04</h2>
<ul>
<li>Merge changes for WLE Phase II theme rename (<a href="https://github.com/ilri/DSpace/pull/329">#329</a>)</li>
<li>Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace</li>
<li>We can use PostgreSQL&rsquo;s extended output format (<code>-x</code>) plus <code>sed</code> to format the output into quasi XML:</li>
</ul>
<pre tabindex="0"><code>$ psql dspacenew -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39;
</code></pre><ul>
<li>The <code>sed</code> script is from a post on the <a href="https://www.postgresql.org/message-id/437E44A5.508%40ultimeth.com">PostgreSQL mailing list</a></li>
<li>Abenet says the ILRI board wants to be able to have &ldquo;lead author&rdquo; for every item, so I&rsquo;ve whipped up a WIP test in the <code>5_x-lead-author</code> branch</li>
<li>It works but is still very rough and we haven&rsquo;t thought out the whole lifecycle yet</li>
</ul>
<p><img src="/cgspace-notes/2017/07/lead-author-test.png" alt="Testing lead author in submission form"></p>
<ul>
<li>I assume that &ldquo;lead author&rdquo; would actually be the first question on the item submission form</li>
<li>We also need to check to see which ORCID authority core this uses, because it seems to be using an entirely new one rather than the one for <code>dc.contributor.author</code> (which makes sense of course, but fuck, all the author problems aren&rsquo;t bad enough?!)</li>
<li>Also would need to edit XMLUI item displays to incorporate this into authors list</li>
<li>And fuck, then anyone consuming our data via REST / OAI will not notice that we have an author outside of <code>dc.contributor.author</code>&hellip; ugh</li>
<li>What if we modify the item submission form to use <a href="https://wiki.lyrasis.org/display/DSDOC5x/Submission+User+Interface#SubmissionUserInterface-ItemtypeBasedMetadataCollection"><code>type-bind</code> fields to show/hide certain fields depending on the type</a>?</li>
</ul>
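<ul>
<li>A rough sketch of what a type-bound field could look like in <code>input-forms.xml</code>, going by the DSpace 5 documentation (the field and type values here are illustrative, and the element order should be checked against the DTD):</li>
</ul>
<pre tabindex="0"><code>&lt;field&gt;
  &lt;dc-schema&gt;dc&lt;/dc-schema&gt;
  &lt;dc-element&gt;identifier&lt;/dc-element&gt;
  &lt;dc-qualifier&gt;isbn&lt;/dc-qualifier&gt;
  &lt;repeatable&gt;true&lt;/repeatable&gt;
  &lt;label&gt;ISBN&lt;/label&gt;
  &lt;input-type&gt;onebox&lt;/input-type&gt;
  &lt;hint&gt;Enter the ISBN of the book.&lt;/hint&gt;
  &lt;required&gt;&lt;/required&gt;
  &lt;type-bind&gt;Book, Book Chapter&lt;/type-bind&gt;
&lt;/field&gt;
</code></pre>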
<h2 id="2017-07-05">2017-07-05</h2>
<ul>
<li>Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (<a href="https://github.com/ilri/DSpace/pull/330">#330</a>)</li>
<li>Generate list of fields in the current CGSpace <code>cg</code> scheme so we can record them properly in the metadata registry:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace -x -c &#39;select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;&#39; | sed -r &#39;s:^-\[ RECORD (.*) \]-+$:&lt;/dc-type&gt;\n&lt;dc-type&gt;\n&lt;schema&gt;cg&lt;/schema&gt;:;s:([^ ]*) +\| (.*): &lt;\1&gt;\2&lt;/\1&gt;:;s:^$:&lt;/dc-type&gt;:;1s:&lt;/dc-type&gt;\n::&#39; &gt; cg-types.xml
</code></pre><ul>
<li>CGSpace was unavailable briefly, and I saw this error in the DSpace log file:</li>
</ul>
<pre tabindex="0"><code>2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
</code></pre><ul>
<li>Looking at the <code>pg_stat_activity</code> table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense</li>
<li>Tsega restarted Tomcat and it&rsquo;s working now</li>
<li>Abenet said she was generating a report with Atmire&rsquo;s CUA module, so it could be due to that?</li>
<li>Looking in the logs I see this random error again that I should report to DSpace:</li>
</ul>
<pre tabindex="0"><code>2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
</code></pre><ul>
<li>Seems to come from <code>dspace-api/src/main/java/org/dspace/statistics/SolrLogger.java</code></li>
</ul>
<h2 id="2017-07-06">2017-07-06</h2>
<ul>
<li>Sisay tried to help by making a <a href="https://github.com/ilri/DSpace/pull/331">pull request for the RTB flagships</a> but there are formatting errors, unrelated changes, and the flagship names are not in the style I requested</li>
<li>Abenet talked to CIP and they said they are actually ok with using collection names rather than adding a new metadata field</li>
</ul>
<h2 id="2017-07-13">2017-07-13</h2>
<ul>
<li>Remove <code>UKaid</code> from the controlled vocabulary for <code>dc.description.sponsorship</code>, as <code>Department for International Development, United Kingdom</code> is the correct form and it is already present (<a href="https://github.com/ilri/DSpace/pull/334">#334</a>)</li>
</ul>
<h2 id="2017-07-14">2017-07-14</h2>
<ul>
<li>Sisay sent me a patch to add &ldquo;Photo Report&rdquo; to <code>dc.type</code> so I&rsquo;ve added it to the <code>5_x-prod</code> branch</li>
</ul>
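<ul>
<li>For reference, adding a value to the <code>dc.type</code> controlled list in <code>input-forms.xml</code> is just another <code>pair</code> in the relevant value-pairs block (a sketch; the actual <code>value-pairs-name</code> in our form may differ):</li>
</ul>
<pre tabindex="0"><code>&lt;value-pairs value-pairs-name=&#34;types&#34; dc-term=&#34;type&#34;&gt;
  &lt;pair&gt;
    &lt;displayed-value&gt;Photo Report&lt;/displayed-value&gt;
    &lt;stored-value&gt;Photo Report&lt;/stored-value&gt;
  &lt;/pair&gt;
&lt;/value-pairs&gt;
</code></pre>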
<h2 id="2017-07-17">2017-07-17</h2>
<ul>
<li>Linode shut down our seventeen (17) VMs due to nonpayment of the July 1st invoice</li>
<li>It took me a few hours to find the ICT/Finance contacts to pay the bill and boot all the servers back up</li>
<li>Since the server was down anyways, I decided to run all system updates and re-deploy CGSpace so that the latest changes to <code>input-forms.xml</code> and the sponsors controlled vocabulary would go live</li>
</ul>
<h2 id="2017-07-20">2017-07-20</h2>
<ul>
<li>Skype chat with Addis team about the status of the CGIAR Library migration</li>
<li>Need to add the CGIAR System Organization subjects to Discovery Facets (test first)</li>
<li>Tentative list of dates for the migration:
<ul>
<li>August 4: aim to finish data cleanup and then give Peter a list of authors</li>
<li>August 18: ready to show System Office</li>
<li>September 4: all feedback and decisions (including workflows) from System Office</li>
<li>September 10/11: go live?</li>
</ul>
</li>
<li>Talk to Tsega and Danny about exporting/ingesting the blog posts from Drupal into DSpace?</li>
<li>Followup meeting on August 8/9?</li>
<li>Sent Abenet the 2415 records from CGIAR Library&rsquo;s Historical Archive (10947/1) after cleaning up the author authorities and HTML entities in <code>dc.contributor.author</code> and <code>dc.description.abstract</code> using OpenRefine:
<ul>
<li>Authors: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,&quot;&quot;)</code></li>
<li>Abstracts: <code>replace(value,/&lt;\/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;&gt;\s]+))?)+\s*|\s*)\/?&gt;/,'')</code></li>
</ul>
</li>
</ul>
<h2 id="2017-07-24">2017-07-24</h2>
<ul>
<li>Move two top-level communities to be sub-communities of ILRI Projects</li>
</ul>
<pre tabindex="0"><code>$ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Discuss CGIAR Library data cleanup with Sisay and Abenet</li>
</ul>
<h2 id="2017-07-27">2017-07-27</h2>
<ul>
<li>Help Sisay with some transforms to add descriptions to the <code>filename</code> column of some CIAT Presentations he&rsquo;s working on in OpenRefine</li>
<li>Marianne emailed a few days ago to ask why &ldquo;Integrating Ecosystem Solutions&rdquo; was not in the list of WLE Phase I Research Themes on the input form</li>
<li>I told her that I only added the themes that I saw in the <a href="https://cgspace.cgiar.org/handle/10568/34508">WLE Phase I Research Themes</a> community</li>
<li>Then Mia from WLE also emailed to ask where some WLE focal regions went, and I said I didn&rsquo;t understand what she was talking about, as all we did in our previous work was rename the old &ldquo;Research Themes&rdquo; subcommunity to &ldquo;WLE Phase I Research Themes&rdquo; and add a new subcommunity for &ldquo;WLE Phase II Research Themes&rdquo;.</li>
<li>Discuss some modifications to the CCAFS project tags in CGSpace submission form and in the database</li>
</ul>
<h2 id="2017-07-28">2017-07-28</h2>
<ul>
<li>Discuss updates to the Phase II CCAFS project tags with Andrea from Macaroni Bros</li>
<li>I will do the renaming and untagging of items in CGSpace database, and he will update his webservice with the latest project tags and I will get the XML from here for our <code>input-forms.xml</code>: <a href="https://ccafs.cgiar.org/export/ccafsproject">https://ccafs.cgiar.org/export/ccafsproject</a></li>
</ul>
<h2 id="2017-07-29">2017-07-29</h2>
<ul>
<li>Move some WLE items into appropriate Phase I Research Themes communities and delete some empty collections in WLE Regions community</li>
</ul>
<h2 id="2017-07-30">2017-07-30</h2>
<ul>
<li>Start working on CCAFS project tag cleanup</li>
<li>More questions about inconsistencies and spelling mistakes in their tags, so I&rsquo;ve sent some questions for followup</li>
</ul>
<h2 id="2017-07-31">2017-07-31</h2>
<ul>
<li>Looks like the final list of metadata corrections for CCAFS project tags will be:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
</code></pre><ul>
<li>Now just waiting to run them on CGSpace, and then apply the modified input forms after Macaroni Bros give me an updated list</li>
<li>Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations</li>
<li>Looking at the CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grepped it)!</li>
</ul>
<pre tabindex="0"><code>$ grep 180.76. /tmp/status | awk &#39;{print $5}&#39; | sort | uniq | wc -l
52
</code></pre><ul>
<li>From looking at the <code>dspace.log</code> I see they are all using the same session, which means our Crawler Session Manager Valve is working</li>
</ul>
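<ul>
<li>A rough way to confirm that from the log (assuming the usual <code>ip_addr=</code> and <code>session_id=</code> fields in <code>dspace.log</code>); if the Crawler Session Manager Valve is working the count should be very low:</li>
</ul>
<pre tabindex="0"><code>$ grep ip_addr=180.76. dspace.log.2017-07-31 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -u | wc -l
</code></pre>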
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

571
docs/2017-08/index.html Normal file
View File

@ -0,0 +1,571 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2017" />
<meta property="og:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
We might actually have to block these requests with HTTP 403 depending on the user agent
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-08/" />
<meta property="article:published_time" content="2017-08-01T11:51:52+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2017"/>
<meta name="twitter:description" content="2017-08-01
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
We might actually have to block these requests with HTTP 403 depending on the user agent
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
"wordCount": "3542",
"datePublished": "2017-08-01T11:51:52+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-08/">
<title>August, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-08/">August, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-08-01T11:51:52+03:00">Tue Aug 01, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-08-01">2017-08-01</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li>
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li>
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
<li>This means our Tomcat Crawler Session Valve is working</li>
<li>But many of the bots are browsing dynamic URLs like:
<ul>
<li>/handle/10568/3353/discover</li>
<li>/handle/10568/16510/browse</li>
</ul>
</li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
</ul>
<h2 id="2017-08-02">2017-08-02</h2>
<ul>
<li>Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)</li>
<li>I think Atmire&rsquo;s Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can&rsquo;t figure it out</li>
<li>I had a look at the module configuration and couldn&rsquo;t figure out a way to do this, so I <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">opened a ticket on the Atmire tracker</a></li>
<li>Atmire responded about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500">missing workflow statistics issue</a> a few weeks ago but I didn&rsquo;t see it for some reason</li>
<li>They said they added a publication and saw the workflow stat for the user, so I should try again and let them know</li>
</ul>
<h2 id="2017-08-05">2017-08-05</h2>
<ul>
<li>Usman from CIFOR emailed to ask about the status of our OAI tests for harvesting their DSpace repository</li>
<li>I told him that the OAI appears to not be harvesting properly after the first sync, and that the control panel shows an &ldquo;Internal error&rdquo; for that collection:</li>
</ul>
<p><img src="/cgspace-notes/2017/08/cifor-oai-harvesting.png" alt="CIFOR OAI harvesting"></p>
<ul>
<li>I don&rsquo;t see anything related in our logs, so I asked him to check for our server&rsquo;s IP in their logs</li>
<li>Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn&rsquo;t reset the collection, just the harvester status!)</li>
</ul>
<h2 id="2017-08-07">2017-08-07</h2>
<ul>
<li>Apply Abenet&rsquo;s corrections for the CGIAR Library&rsquo;s Consortium subcommunity (697 records)</li>
<li>I had to fix a few small things, like moving the <code>dc.title</code> column away from the beginning of the row, deleting blank lines in the abstract in vim using <code>:g/^$/d</code>, and adding the <code>dc.subject[en_US]</code> column back, as she had deleted it and DSpace didn&rsquo;t detect the changes made there (we needed to blank the values instead)</li>
</ul>
<h2 id="2017-08-08">2017-08-08</h2>
<ul>
<li>Apply Abenet&rsquo;s corrections for the CGIAR Library&rsquo;s historic archive subcommunity (2415 records)</li>
<li>I had to add the <code>dc.subject[en_US]</code> column back with blank values so that DSpace could detect the changes</li>
<li>I applied the changes in 500 item batches</li>
</ul>
<h2 id="2017-08-09">2017-08-09</h2>
<ul>
<li>Run system updates on DSpace Test and reboot server</li>
<li>Help ICARDA upgrade their MELSpace to DSpace 5.7 using the <a href="https://github.com/alanorth/docker-dspace">docker-dspace</a> container
<ul>
<li>We had to import the PostgreSQL dump to the PostgreSQL container using: <code>pg_restore -U postgres -d dspace blah.dump</code></li>
<li>Otherwise, when using <code>-O</code> it messes up the permissions on the schema and DSpace can&rsquo;t read it</li>
</ul>
</li>
</ul>
<h2 id="2017-08-10">2017-08-10</h2>
<ul>
<li>Apply last updates to the CGIAR Library&rsquo;s Fund community (812 items)</li>
<li>Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.</li>
<li>Also I applied the HTML entities unescape transform on the abstract column in Open Refine</li>
<li>I need to get an author list from the database for only the CGIAR Library community to send to Peter</li>
<li>It turns out that I had already used this SQL query in <a href="/cgspace-notes/2017-05">May, 2017</a> to get the authors from CGIAR Library:</li>
</ul>
<pre tabindex="0"><code>dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
</code></pre><ul>
<li>Meeting with Peter and CGSpace team
<ul>
<li>Alan to follow up with ICARDA about depositing in CGSpace; we want ICARDA and Drylands legacy content but not duplicates</li>
<li>Alan to follow up on dc.rights, where are we?</li>
<li>Alan to follow up with Atmire about a dedicated field for ORCIDs, based on the discussion in the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Alan to ask about how to query external services like AGROVOC in the DSpace submission form</li>
</ul>
</li>
<li>Follow up with Atmire on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510">ticket about ORCID metadata in DSpace</a></li>
<li>Follow up with Lili and Andrea about the pending CCAFS metadata and flagship updates</li>
</ul>
<h2 id="2017-08-11">2017-08-11</h2>
<ul>
<li>CGSpace had load issues and was throwing errors related to PostgreSQL</li>
<li>I told Tsega to reduce the max connections from 70 to 40 because actually each web application gets that limit and so for xmlui, oai, jspui, rest, etc it could be 70 x 4 = 280 connections depending on the load, and the PostgreSQL config itself is only 100!</li>
<li>I learned this on a recent discussion on the DSpace wiki</li>
<li>I need to either look into setting up a database pool through JNDI or increase the PostgreSQL max connections</li>
<li>Also, I need to find out where the load is coming from (rest?) and possibly block bots from accessing dynamic pages like Browse and Discover instead of just sending an X-Robots-Tag HTTP header</li>
<li>I noticed that Google has bitstreams from the <code>rest</code> interface in the search index. I need to ask on the dspace-tech mailing list to see what other people are doing about this, and maybe start issuing an <code>X-Robots-Tag: none</code> there!</li>
</ul>
<h2 id="2017-08-12">2017-08-12</h2>
<ul>
<li>I sent a message to the mailing list about the duplicate content issue with <code>/rest</code> and <code>/bitstream</code> URLs</li>
<li>Looking at the logs for the REST API on <code>/rest</code>, it looks like there is someone hammering doing testing or something on it&hellip;</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
140 66.249.66.91
404 66.249.66.90
1479 50.116.102.77
9794 45.5.184.196
85736 70.32.83.92
</code></pre><ul>
<li>The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead</li>
<li>I&rsquo;ve enabled logging of <code>/oai</code> requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)</li>
</ul>
<pre tabindex="0"><code> # log oai requests
location /oai {
access_log /var/log/nginx/oai.log;
proxy_pass http://tomcat_http;
}
</code></pre><h2 id="2017-08-13">2017-08-13</h2>
<ul>
<li>Macaroni Bros say that CCAFS wants them to check once every hour for changes</li>
<li>I told them to check every four or six hours</li>
</ul>
<h2 id="2017-08-14">2017-08-14</h2>
<ul>
<li>Run author corrections on CGIAR Library community from Peter</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
</code></pre><ul>
<li>There were only three deletions so I just did them manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;C&#39;;
DELETE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value=&#39;WSSD&#39;;
</code></pre><ul>
<li>Generate a new list of authors from the CGIAR Library community for Peter to look through now that the initial corrections have been done</li>
<li>Thinking about resource limits for PostgreSQL again after last week&rsquo;s CGSpace crash and related to a recent discussion I had in the comments of the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">April, 2017 DCAT meeting notes</a></li>
<li>In that thread Chris Wilper suggests a new default of 35 max connections for <code>db.maxconnections</code> (from the current default of 30), knowing that <em>each DSpace web application</em> gets to use up to this many on its own</li>
<li>It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:</li>
</ul>
<pre tabindex="0"><code>$ grep -rsI SQLException dspace-jspui | wc -l
473
$ grep -rsI SQLException dspace-oai | wc -l
63
$ grep -rsI SQLException dspace-rest | wc -l
139
$ grep -rsI SQLException dspace-solr | wc -l
0
$ grep -rsI SQLException dspace-xmlui | wc -l
866
</code></pre><ul>
<li>Of those five applications we&rsquo;re running, only <code>solr</code> appears not to use the database directly</li>
<li>And JSPUI is only used internally (so it doesn&rsquo;t really count), leaving us with OAI, REST, and XMLUI</li>
<li>Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL&rsquo;s default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see <code>superuser_reserved_connections</code>)</li>
<li>So we should adjust PostgreSQL&rsquo;s max connections to be DSpace&rsquo;s <code>db.maxconnections</code> * 3 + 3</li>
<li>This would allow each application to use up to <code>db.maxconnections</code> and not to go over the system&rsquo;s PostgreSQL limit</li>
<li>Perhaps since CGSpace is a busy site with lots of resources we could actually use something like 40 for <code>db.maxconnections</code></li>
<li>Also worth looking into is to set up a database pool using JNDI, as apparently DSpace&rsquo;s <code>db.poolname</code> hasn&rsquo;t been used since around DSpace 1.7 (according to Chris Wilper&rsquo;s comments in the thread)</li>
<li>Need to go check the PostgreSQL connection stats in Munin on CGSpace from the past week to get an idea if 40 is appropriate</li>
<li>Looks like connections hover around 50:</li>
</ul>
<p><img src="/cgspace-notes/2017/08/postgresql-connections-cgspace.png" alt="PostgreSQL connections 2017-08"></p>
<ul>
<li>Unfortunately I don&rsquo;t have the breakdown of which DSpace apps are making those connections (I&rsquo;ll assume XMLUI)</li>
<li>So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system&rsquo;s PostgreSQL <code>max_connections</code> is too low</li>
<li>For now I think maybe setting DSpace&rsquo;s <code>db.maxconnections</code> to 40 and adjusting the system&rsquo;s <code>max_connections</code> might be a good starting point: 40 * 3 + 3 = 123 (sketched after this list)</li>
<li>Apply 223 more author corrections from Peter on CGIAR Library</li>
<li>Help Magdalena from CCAFS with some CUA statistics questions</li>
</ul>
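<ul>
<li>A sketch of the values implied by that arithmetic (three database-using web applications times <code>db.maxconnections</code>, plus the reserved superuser connections):</li>
</ul>
<pre tabindex="0"><code># dspace.cfg (applies per web application: XMLUI, OAI, REST)
db.maxconnections = 40
# postgresql.conf (system-wide): 40 * 3 + 3
max_connections = 123
</code></pre>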
<h2 id="2017-08-15">2017-08-15</h2>
<ul>
<li>Increase the nginx upload limit on CGSpace (linode18) so Sisay can upload 23 CIAT reports</li>
<li>Do some last minute cleanups and de-duplications of the CGIAR Library data, as I need to send it to Peter this week</li>
<li>Metadata fields like <code>dc.contributor.author</code>, <code>dc.publisher</code>, <code>dc.type</code>, and a few others had somehow been duplicated along the line</li>
<li>Also, a few dozen <code>dc.description.abstract</code> fields still had various HTML tags and entities in them</li>
<li>Also, a bunch of <code>dc.subject</code> fields that were not AGROVOC had not been moved properly to <code>cg.system.subject</code></li>
</ul>
<h2 id="2017-08-16">2017-08-16</h2>
<ul>
<li>I wanted to merge the various field variations like <code>cg.subject.system</code> and <code>cg.subject.system[en_US]</code> in OpenRefine but I realized it would be easier in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
</code></pre><ul>
<li>And actually, we can do it for other generic fields for items in those collections, for example <code>dc.description.abstract</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;description&#39; and qualifier = &#39;abstract&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)))
</code></pre><ul>
<li>And on others like <code>dc.language.iso</code>, <code>dc.relation.ispartofseries</code>, <code>dc.type</code>, <code>dc.title</code>, etc&hellip;</li>
<li>Also, to move fields from <code>dc.identifier.url</code> to <code>cg.identifier.url[en_US]</code> (because we don&rsquo;t use the Dublin Core one for some reason):</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = &#39;en_US&#39; where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
</code></pre><ul>
<li>Set the text_lang of all <code>dc.identifier.uri</code> (Handle) fields to be NULL, just like default DSpace does:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like &#39;http://hdl.handle.net/10947/%&#39;;
UPDATE 4248
</code></pre><ul>
<li>Also update the text_lang of <code>dc.contributor.author</code> fields for metadata in these collections:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in (&#39;10568/93761&#39;, &#39;10947/1&#39;, &#39;10947/10&#39;, &#39;10947/11&#39;, &#39;10947/12&#39;, &#39;10947/13&#39;, &#39;10947/14&#39;, &#39;10947/15&#39;, &#39;10947/16&#39;, &#39;10947/17&#39;, &#39;10947/18&#39;, &#39;10947/19&#39;, &#39;10947/2&#39;, &#39;10947/20&#39;, &#39;10947/21&#39;, &#39;10947/22&#39;, &#39;10947/23&#39;, &#39;10947/24&#39;, &#39;10947/25&#39;, &#39;10947/2512&#39;, &#39;10947/2515&#39;, &#39;10947/2516&#39;, &#39;10947/2517&#39;, &#39;10947/2518&#39;, &#39;10947/2519&#39;, &#39;10947/2520&#39;, &#39;10947/2521&#39;, &#39;10947/2522&#39;, &#39;10947/2523&#39;, &#39;10947/2524&#39;, &#39;10947/2525&#39;, &#39;10947/2526&#39;, &#39;10947/2527&#39;, &#39;10947/2528&#39;, &#39;10947/2529&#39;, &#39;10947/2530&#39;, &#39;10947/2531&#39;, &#39;10947/2532&#39;, &#39;10947/2533&#39;, &#39;10947/2534&#39;, &#39;10947/2535&#39;, &#39;10947/2536&#39;, &#39;10947/2537&#39;, &#39;10947/2538&#39;, &#39;10947/2539&#39;, &#39;10947/2540&#39;, &#39;10947/2541&#39;, &#39;10947/2589&#39;, &#39;10947/26&#39;, &#39;10947/2631&#39;, &#39;10947/27&#39;, &#39;10947/2708&#39;, &#39;10947/2776&#39;, &#39;10947/2782&#39;, &#39;10947/2784&#39;, &#39;10947/2786&#39;, &#39;10947/2790&#39;, &#39;10947/28&#39;, &#39;10947/2805&#39;, &#39;10947/2836&#39;, &#39;10947/2871&#39;, &#39;10947/2878&#39;, &#39;10947/29&#39;, &#39;10947/2900&#39;, &#39;10947/2919&#39;, &#39;10947/3&#39;, &#39;10947/30&#39;, &#39;10947/31&#39;, &#39;10947/32&#39;, &#39;10947/33&#39;, &#39;10947/34&#39;, &#39;10947/3457&#39;, &#39;10947/35&#39;, &#39;10947/36&#39;, &#39;10947/37&#39;, &#39;10947/38&#39;, &#39;10947/39&#39;, &#39;10947/4&#39;, &#39;10947/40&#39;, &#39;10947/4052&#39;, &#39;10947/4054&#39;, &#39;10947/4056&#39;, &#39;10947/4068&#39;, &#39;10947/41&#39;, &#39;10947/42&#39;, &#39;10947/43&#39;, &#39;10947/4368&#39;, &#39;10947/44&#39;, &#39;10947/4467&#39;, &#39;10947/45&#39;, &#39;10947/4508&#39;, &#39;10947/4509&#39;, &#39;10947/4510&#39;, &#39;10947/4573&#39;, &#39;10947/46&#39;, &#39;10947/4635&#39;, &#39;10947/4636&#39;, &#39;10947/4637&#39;, &#39;10947/4638&#39;, &#39;10947/4639&#39;, &#39;10947/4651&#39;, &#39;10947/4657&#39;, &#39;10947/47&#39;, &#39;10947/48&#39;, &#39;10947/49&#39;, &#39;10947/5&#39;, &#39;10947/50&#39;, &#39;10947/51&#39;, &#39;10947/5308&#39;, &#39;10947/5322&#39;, &#39;10947/5324&#39;, &#39;10947/5326&#39;, &#39;10947/6&#39;, &#39;10947/7&#39;, &#39;10947/8&#39;, &#39;10947/9&#39;)));
UPDATE 4899
</code></pre><ul>
<li>Wow, I just wrote this baller regex facet to find duplicate authors:</li>
</ul>
<pre tabindex="0"><code>isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
</code></pre><ul>
<li>This would be true if the authors were like <code>CGIAR System Management Office||CGIAR System Management Office</code>, which some of the CGIAR Library&rsquo;s were</li>
<li>Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn&rsquo;t detect any changes, so you have to edit them all manually via DSpace&rsquo;s &ldquo;Edit Item&rdquo;</li>
<li>Ooh! And an even more interesting regex would match <em>any</em> duplicated author:</li>
</ul>
<pre tabindex="0"><code>isNotNull(value.match(/(.+?)\|\|\1/))
</code></pre><ul>
<li>Which means it can also be used to find items with duplicate <code>dc.subject</code> fields&hellip;</li>
<li>Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it</li>
<li>Post a message to the dspace-tech mailing list to ask about querying the AGROVOC API from the submission form</li>
<li>Abenet was asking if there was some way to hide certain internal items from the &ldquo;ILRI Research Outputs&rdquo; RSS feed (which is the top-level ILRI community feed), because Shirley was complaining</li>
<li>I think we could use <code>harvest.includerestricted.rss = false</code> but the items might need to be 100% restricted, not just the metadata</li>
<li>Adjust Ansible postgres role to use <code>max_connections</code> from a template variable and deploy a new limit of 123 on CGSpace</li>
</ul>
<h2 id="2017-08-17">2017-08-17</h2>
<ul>
<li>Run Peter&rsquo;s edits to the CGIAR System Organization community on DSpace Test</li>
<li>Uptime Robot said CGSpace went down for 1 minute, not sure why</li>
<li>Looking in <code>dspace.log.2017-08-17</code> I see some weird errors that might be related?</li>
</ul>
<pre tabindex="0"><code>2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
java.io.StreamCorruptedException: invalid stream header: 00000000
</code></pre><ul>
<li>Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;ERROR net.sf.ehcache.store.DiskStore&#34; dspace.log.2017-08-*
dspace.log.2017-08-01:0
dspace.log.2017-08-02:0
dspace.log.2017-08-03:0
dspace.log.2017-08-04:0
dspace.log.2017-08-05:0
dspace.log.2017-08-06:0
dspace.log.2017-08-07:0
dspace.log.2017-08-08:0
dspace.log.2017-08-09:0
dspace.log.2017-08-10:0
dspace.log.2017-08-11:8806
dspace.log.2017-08-12:5496
dspace.log.2017-08-13:2925
dspace.log.2017-08-14:2135
dspace.log.2017-08-15:1506
dspace.log.2017-08-16:1935
dspace.log.2017-08-17:584
</code></pre><ul>
<li>There are none in 2017-07 either&hellip;</li>
<li>A few posts on the dspace-tech mailing list say this is related to the Cocoon cache somehow</li>
<li>I will clear the XMLUI cache for now and see if the errors continue (though perhaps shutting down Tomcat and removing the cache is more effective somehow?)</li>
<li>We tested the option for limiting restricted items from the RSS feeds on DSpace Test</li>
<li>I created four items, and only the two with public metadata showed up in the community&rsquo;s RSS feed:
<ul>
<li>Public metadata, public bitstream ✓</li>
<li>Public metadata, restricted bitstream ✓</li>
<li>Restricted metadata, restricted bitstream ✗</li>
<li>Private item ✗</li>
</ul>
</li>
<li>Peter responded and said that he doesn&rsquo;t want to limit items to be restricted just so we can change the RSS feeds</li>
</ul>
<h2 id="2017-08-18">2017-08-18</h2>
<ul>
<li>Someone on the dspace-tech mailing list responded with some tips about using the authority framework to do external queries from the submission form</li>
<li>He linked to some examples from DSpace-CRIS that use this functionality: <a href="https://github.com/4Science/DSpace/blob/dspace-5_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java">VIAFAuthority</a></li>
<li>I wired it up to the <code>dc.subject</code> field of the submission interface using the &ldquo;lookup&rdquo; type and it works!</li>
<li>I think we can use this example to get a working AGROVOC query</li>
<li>More information about authority framework: <a href="https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values">https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values</a></li>
<li>Wow, I&rsquo;m playing with the AGROVOC SPARQL endpoint using the <a href="https://github.com/tialaramex/sparql-query">sparql-query tool</a>:</li>
</ul>
<pre tabindex="0"><code>$ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
sparql$ PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
SELECT
?label
WHERE {
{ ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . }
FILTER regex(str(?label), &#34;^fish&#34;, &#34;i&#34;) .
} LIMIT 10;
┌───────────────────────┐
│ ?label │
├───────────────────────┤
│ fisheries legislation │
│ fishery legislation │
│ fishery law │
│ fish production │
│ fish farming │
│ fishing industry │
│ fisheries data │
│ fishing power │
│ fishing times │
│ fish passes │
└───────────────────────┘
</code></pre><ul>
<li>More examples about SPARQL syntax: <a href="https://github.com/rsinger/openlcsh/wiki/Sparql-Examples">https://github.com/rsinger/openlcsh/wiki/Sparql-Examples</a></li>
<li>I found this blog post about speeding up the Tomcat startup time: <a href="http://skybert.net/java/improve-tomcat-startup-time/">http://skybert.net/java/improve-tomcat-startup-time/</a></li>
<li>The startup time went from ~80s to 40s!</li>
</ul>
<h2 id="2017-08-19">2017-08-19</h2>
<ul>
<li>More examples of SPARQL queries: <a href="https://github.com/rsinger/openlcsh/wiki/Sparql-Examples">https://github.com/rsinger/openlcsh/wiki/Sparql-Examples</a></li>
<li>Specifically the explanation of the <code>FILTER</code> regex</li>
<li>Might want to <code>SELECT DISTINCT</code> or increase the <code>LIMIT</code> to get terms like &ldquo;wheat&rdquo; and &ldquo;fish&rdquo; to be visible</li>
<li>Test queries online on the AGROVOC SPARQL portal: http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc</li>
</ul>
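<ul>
<li>For example, the earlier query could be adjusted with <code>SELECT DISTINCT</code> and a larger <code>LIMIT</code> to surface more terms (a sketch against the same endpoint):</li>
</ul>
<pre tabindex="0"><code>sparql$ PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
SELECT DISTINCT
    ?label
WHERE {
    { ?concept skos:altLabel ?label . } UNION { ?concept skos:prefLabel ?label . }
    FILTER regex(str(?label), &#34;^wheat&#34;, &#34;i&#34;) .
} LIMIT 100;
</code></pre>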
<h2 id="2017-08-20">2017-08-20</h2>
<ul>
<li>Since I cleared the XMLUI cache on 2017-08-17 there haven&rsquo;t been any more <code>ERROR net.sf.ehcache.store.DiskStore</code> errors</li>
<li>Look at the CGIAR Library to see if I can find the items that have been submitted since May:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;;
metadata_value_id | item_id | metadata_field_id | text_value | text_lang | place | authority | confidence
-------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
123117 | 5872 | 11 | 2017-06-28T13:05:18Z | | 1 | | -1
123042 | 5869 | 11 | 2017-05-15T03:29:23Z | | 1 | | -1
123056 | 5870 | 11 | 2017-05-22T11:27:15Z | | 1 | | -1
123072 | 5871 | 11 | 2017-06-06T07:46:01Z | | 1 | | -1
123171 | 5874 | 11 | 2017-08-04T07:51:20Z | | 1 | | -1
(5 rows)
</code></pre><ul>
<li>According to <code>dc.date.accessioned</code> (metadata field id 11) there have only been five items submitted since May</li>
<li>These are their handles:</li>
</ul>
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) &gt; &#39;2017-05-01T00:00:00Z&#39;);
handle
------------
10947/4658
10947/4659
10947/4660
10947/4661
10947/4664
(5 rows)
</code></pre><h2 id="2017-08-23">2017-08-23</h2>
<ul>
<li>Start testing the nginx configs for the CGIAR Library migration as well as start making a checklist</li>
</ul>
<h2 id="2017-08-28">2017-08-28</h2>
<ul>
<li>Bram had written to me two weeks ago to set up a chat about ORCID stuff but the email apparently bounced and I only found out when he emailed me on another account</li>
<li>I told him I can chat in a few weeks when I&rsquo;m back</li>
</ul>
<h2 id="2017-08-31">2017-08-31</h2>
<ul>
<li>I notice that in many WLE collections Marianne Gadeberg is in the edit or approval steps, but she is also in the groups for those steps.</li>
<li>I think we need to have a process to go back and check / fix some of these scenarios—to remove her user from the step and instead add her to the group—because we have way too many authorizations and in late 2016 we had <a href="https://github.com/ilri/rmg-ansible-public/commit/358b5ea43f9e5820986f897c9d560937c702ac6e">performance issues with Solr</a> because of this</li>
<li>I asked Sisay about this and hinted that he should go back and fix these things, but let&rsquo;s see what he says</li>
<li>Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:</li>
</ul>
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<li>Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08</li>
<li>It seems that I changed the <code>db.maxconnections</code> setting from 70 to 40 around 2017-08-14, but Macaroni Bros also reduced their hourly hammering of the REST API then</li>
<li>Nevertheless, it seems like that connection limit is not enough and that I should increase it (as well as the system&rsquo;s PostgreSQL <code>max_connections</code>)</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

713
docs/2017-09/index.html Normal file
View File

@ -0,0 +1,713 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2017" />
<meta property="og:description" content="2017-09-06
Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
2017-09-07
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-09/" />
<meta property="article:published_time" content="2017-09-07T16:54:52+07:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2017"/>
<meta name="twitter:description" content="2017-09-06
Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
2017-09-07
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-09/",
"wordCount": "4199",
"datePublished": "2017-09-07T16:54:52+07:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-09/">
<title>September, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-09/">September, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-09-07T16:54:52+07:00">Thu Sep 07, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-09-06">2017-09-06</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours</li>
</ul>
<h2 id="2017-09-07">2017-09-07</h2>
<ul>
<li>Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group</li>
</ul>
<h2 id="2017-09-10">2017-09-10</h2>
<ul>
<li>Delete 58 blank metadata values from the CGSpace database:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and text_value=&#39;&#39;;
DELETE 58
</code></pre><ul>
<li>I also ran it on DSpace Test because we&rsquo;ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate</li>
<li>Run system updates and restart DSpace Test</li>
<li>We only have 7.7GB of free space on DSpace Test so I need to copy some data off of it before doing the CGIAR Library migration (requires lots of exporting and creating temp files)</li>
<li>I still have the original data from the CGIAR Library so I&rsquo;ve zipped it up and sent it off to linode18 for now</li>
<li>sha256sum of <code>original-cgiar-library-6.6GB.tar.gz</code> is: bcfabb52f51cbdf164b61b7e9b3a0e498479e4c1ed1d547d32d11f44c0d5eb8a</li>
<li>Start doing a test run of the CGIAR Library migration locally</li>
<li>Notes and todo checklist here for now: <a href="https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c">https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c</a></li>
<li>Create pull request for Phase I and II changes to CCAFS Project Tags: <a href="https://github.com/ilri/DSpace/pull/336">#336</a></li>
<li>We&rsquo;ve been discussing with Macaroni Bros and CCAFS for the past month or so and the list of tags was recently finalized</li>
<li>There will need to be some metadata updates (though if I recall correctly it is only about seven records) for that as well; I had made some notes about it in <a href="/cgspace-notes/2017-07">2017-07</a>, but I&rsquo;ve asked for more clarification from Lili just in case</li>
<li>Looking at the DSpace logs to see if we&rsquo;ve had a change in the &ldquo;Cannot get a connection&rdquo; errors since last month when we adjusted the <code>db.maxconnections</code> parameter on CGSpace:</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-09-*
dspace.log.2017-09-01:0
dspace.log.2017-09-02:0
dspace.log.2017-09-03:9
dspace.log.2017-09-04:17
dspace.log.2017-09-05:752
dspace.log.2017-09-06:0
dspace.log.2017-09-07:0
dspace.log.2017-09-08:10
dspace.log.2017-09-09:0
dspace.log.2017-09-10:0
</code></pre><ul>
<li>Also, since last month (2017-08) Macaroni Bros no longer runs their REST API scraper every hour, so I&rsquo;m sure that helped</li>
<li>There are still some errors, though, so maybe I should bump the connection limit up a bit</li>
<li>I remember seeing that Munin shows that the average number of connections is 50 (which is probably mostly from the XMLUI) and we&rsquo;re currently allowing 40 connections per app, so maybe it would be good to bump that value up to 50 or 60 along with the system&rsquo;s PostgreSQL <code>max_connections</code> (formula should be: webapps * 60 + 3, or 3 * 60 + 3 = 183 in our case)</li>
<li>I updated both CGSpace and DSpace Test to use these new settings (60 connections per web app and 183 for system PostgreSQL limit)</li>
<li>I&rsquo;m expecting to see 0 connection errors for the next few months</li>
</ul>
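<ul>
<li>For future reference, a quick way to sanity-check the new limits on the server itself (a sketch; the per-database counts will vary):</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SHOW max_connections;&#39;
$ psql -c &#39;SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;&#39;
</code></pre>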
<h2 id="2017-09-11">2017-09-11</h2>
<ul>
<li>Lots of work testing the CGIAR Library migration</li>
<li>Many technical notes and TODOs here: <a href="https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c">https://gist.github.com/alanorth/3579b74e116ab13418d187ed379abd9c</a></li>
</ul>
<h2 id="2017-09-12">2017-09-12</h2>
<ul>
<li>I was testing the <a href="https://wiki.lyrasis.org/display/DSDOC5x/AIP+Backup+and+Restore#AIPBackupandRestore-AIPConfigurationsToImproveIngestionSpeedwhileValidating">METS XSD caching during AIP ingest</a> but it doesn&rsquo;t seem to help actually</li>
<li>The import process takes the same amount of time with and without the caching</li>
<li>Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):</li>
</ul>
<pre tabindex="0"><code>$ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and &#39;tcp[32:4] = 0x47455420&#39;
</code></pre><ul>
<li>Great TCP dump guide here: <a href="https://danielmiessler.com/study/tcpdump">https://danielmiessler.com/study/tcpdump</a></li>
<li>The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation</li>
<li>I sent a message to the mailing list to see if anyone knows more about this</li>
<li>In looking at the tcpdump results I notice that there is an update check to the ehcache server on <em>every</em> iteration of the ingest loop, for example:</li>
</ul>
<pre tabindex="0"><code>09:39:36.008956 IP 192.168.8.124.50515 &gt; 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&amp;pageID=update.properties&amp;id=2130706433&amp;os-name=Mac+OS+X&amp;jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&amp;jvm-version=1.8.0_144&amp;platform=x86_64&amp;tc-version=UNKNOWN&amp;tc-product=Ehcache+Core+1.7.2&amp;source=Ehcache+Core&amp;uptime-secs=0&amp;patch=UNKNOWN HTTP/1.1
</code></pre><ul>
<li>Turns out this is a known issue and Ehcache has refused to make it opt-in: <a href="https://jira.terracotta.org/jira/browse/EHC-461">https://jira.terracotta.org/jira/browse/EHC-461</a></li>
<li>But we can disable it by adding an <code>updateCheck=&quot;false&quot;</code> attribute to the main <code>&lt;ehcache &gt;</code> tag in <code>dspace-services/src/main/resources/caching/ehcache-config.xml</code></li>
<li>After re-compiling and re-deploying DSpace I no longer see those update checks during item submission</li>
<li>I had a Skype call with Bram Luyten from Atmire to discuss various issues related to ORCID in DSpace
<ul>
<li>First, ORCID is deprecating their version 1 API (which DSpace uses) and in version 2 API they have removed the ability to search for users by name</li>
<li>The logic is that searching by name actually isn&rsquo;t very useful because ORCID is essentially a global phonebook and there are tons of legitimately duplicate and ambiguous names</li>
<li>Atmire&rsquo;s proposed integration would work by having users lookup and add authors to the authority core directly using their ORCID ID itself (this would happen during the item submission process or perhaps as a standalone / batch process, for example to populate the authority core with a list of known ORCIDs)</li>
<li>Once the association between name and ORCID is made in the authority then it can be autocompleted in the lookup field</li>
<li>Ideally there could also be a user interface for cleanup and merging of authorities</li>
<li>He will prepare a quote for us with keeping in mind that this could be useful to contribute back to the community for a 5.x release</li>
<li>As far as exposing ORCIDs as flat metadata along side all other metadata, he says this should be possible and will work on a quote for us</li>
</ul>
</li>
</ul>
<h2 id="2017-09-13">2017-09-13</h2>
<ul>
<li>Last night Linode sent an alert about CGSpace (linode18) that it has exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours</li>
<li>I wonder what was going on, and looking into the nginx logs I think maybe it&rsquo;s OAI&hellip;</li>
<li>Here is yesterday&rsquo;s top ten IP addresses making requests to <code>/oai</code>:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
3 68.180.229.31
4 35.187.22.255
13745 54.70.175.86
15814 34.211.17.113
15825 35.161.215.53
16704 54.70.51.7
</code></pre><ul>
<li>Compared to the previous day&rsquo;s logs it looks VERY high:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
4 216.244.66.194
14 66.249.66.90
</code></pre><ul>
<li>The user agents for those top IPs are:
<ul>
<li>54.70.175.86: API scraper</li>
<li>34.211.17.113: API scraper</li>
<li>35.161.215.53: API scraper</li>
<li>54.70.51.7: API scraper</li>
</ul>
</li>
<li>And this user agent has never been seen before today (or at least recently!):</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;API scraper&#34; /var/log/nginx/oai.log
62088
# zgrep -c &#34;API scraper&#34; /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
/var/log/nginx/oai.log.13.gz:0
/var/log/nginx/oai.log.14.gz:0
/var/log/nginx/oai.log.15.gz:0
/var/log/nginx/oai.log.16.gz:0
/var/log/nginx/oai.log.17.gz:0
/var/log/nginx/oai.log.18.gz:0
/var/log/nginx/oai.log.19.gz:0
/var/log/nginx/oai.log.20.gz:0
/var/log/nginx/oai.log.21.gz:0
/var/log/nginx/oai.log.22.gz:0
/var/log/nginx/oai.log.23.gz:0
/var/log/nginx/oai.log.24.gz:0
/var/log/nginx/oai.log.25.gz:0
/var/log/nginx/oai.log.26.gz:0
/var/log/nginx/oai.log.27.gz:0
/var/log/nginx/oai.log.28.gz:0
/var/log/nginx/oai.log.29.gz:0
/var/log/nginx/oai.log.2.gz:0
/var/log/nginx/oai.log.30.gz:0
/var/log/nginx/oai.log.3.gz:0
/var/log/nginx/oai.log.4.gz:0
/var/log/nginx/oai.log.5.gz:0
/var/log/nginx/oai.log.6.gz:0
/var/log/nginx/oai.log.7.gz:0
/var/log/nginx/oai.log.8.gz:0
/var/log/nginx/oai.log.9.gz:0
</code></pre><ul>
<li>Some of these heavy users are also using XMLUI, and their user agent isn&rsquo;t matched by the <a href="https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158">Tomcat Session Crawler valve</a>, so each request uses a different session</li>
<li>Yesterday alone the IP addresses using the <code>API scraper</code> user agent were responsible for 16,000 sessions in XMLUI:</li>
</ul>
<pre tabindex="0"><code># grep -a -E &#34;(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)&#34; /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
15924
</code></pre><ul>
<li>If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex</li>
<li>A search for &ldquo;API scraper&rdquo; user agent on Google returns a <code>robots.txt</code> with a comment that this is the Yewno bot: <a href="http://www.escholarship.org/robots.txt">http://www.escholarship.org/robots.txt</a></li>
<li>Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:</li>
</ul>
<pre tabindex="0"><code>WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
</code></pre><ul>
<li>Looking at the spreadsheet with deletions and corrections that CCAFS sent last week</li>
<li>It appears they want to delete a lot of metadata, which I&rsquo;m not sure they realize the implications of:</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;) group by text_value;
text_value | count
--------------------------+-------
FP4_ClimateModels | 6
FP1_CSAEvidence | 7
SEA_UpscalingInnovation | 7
FP4_Baseline | 69
WA_Partnership | 1
WA_SciencePolicyExchange | 6
SA_GHGMeasurement | 2
SA_CSV | 7
EA_PAR | 18
FP4_Livestock | 7
FP4_GenderPolicy | 4
FP2_CRMWestAfrica | 12
FP4_ClimateData | 24
FP4_CCPAG | 2
SEA_mitigationSAMPLES | 2
SA_Biodiversity | 1
FP4_PolicyEngagement | 20
FP3_Gender | 9
FP4_GenderToolbox | 3
(19 rows)
</code></pre><ul>
<li>I sent CCAFS people an email to ask if they really want to remove these 200+ tags</li>
<li>She responded yes, so I&rsquo;ll at least need to do these deletes in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
DELETE 207
</code></pre><ul>
<li>When we discussed this in late July there were some other renames they had requested, but I don&rsquo;t see them in the current spreadsheet so I will have to follow that up</li>
<li>I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, as their spreadsheet had evolved organically rather than systematically!</li>
<li>The final list of corrections and deletes should therefore be:</li>
</ul>
<pre tabindex="0"><code>delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-FP4_CRMWestAfrica&#39;;
update metadatavalue set text_value=&#39;FP3_VietnamLED&#39; where resource_type_id=2 and metadata_field_id=134 and text_value=&#39;FP3_VeitnamLED&#39;;
update metadatavalue set text_value=&#39;PII-FP1_PIRCCA&#39; where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-SEA_PIRCCA&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value=&#39;PII-WA_IntegratedInterventions&#39;;
delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in (&#39;EA_PAR&#39;,&#39;FP1_CSAEvidence&#39;,&#39;FP2_CRMWestAfrica&#39;,&#39;FP3_Gender&#39;,&#39;FP4_Baseline&#39;,&#39;FP4_CCPAG&#39;,&#39;FP4_CCPG&#39;,&#39;FP4_CIATLAM IMPACT&#39;,&#39;FP4_ClimateData&#39;,&#39;FP4_ClimateModels&#39;,&#39;FP4_GenderPolicy&#39;,&#39;FP4_GenderToolbox&#39;,&#39;FP4_Livestock&#39;,&#39;FP4_PolicyEngagement&#39;,&#39;FP_GII&#39;,&#39;SA_Biodiversity&#39;,&#39;SA_CSV&#39;,&#39;SA_GHGMeasurement&#39;,&#39;SEA_mitigationSAMPLES&#39;,&#39;SEA_UpscalingInnovation&#39;,&#39;WA_Partnership&#39;,&#39;WA_SciencePolicyExchange&#39;,&#39;FP_GII&#39;);
</code></pre><ul>
<li>Create and merge pull request to shut up the Ehcache update check (<a href="https://github.com/ilri/DSpace/pull/337">#337</a>)</li>
<li>Although it looks like there was a previous attempt to disable these update checks that was merged in DSpace 4.0 (although it only affects XMLUI): <a href="https://jira.duraspace.org/browse/DS-1492">https://jira.duraspace.org/browse/DS-1492</a></li>
<li>I commented there suggesting that we disable it globally</li>
<li>I merged the changes to the CCAFS project tags (<a href="https://github.com/ilri/DSpace/pull/336">#336</a>) but still need to finalize the metadata deletions/renames</li>
<li>I merged the CGIAR Library theme changes (<a href="https://github.com/ilri/DSpace/pull/338">#338</a>) to the <code>5_x-prod</code> branch in preparation for next week&rsquo;s migration</li>
<li>I emailed the Handle administrators (<a href="mailto:hdladmin@cnri.reston.va.us">hdladmin@cnri.reston.va.us</a>) to ask them what the process is for changing their prefix to be resolved by our resolver</li>
<li>They responded and said that they need email confirmation from the contact of record of the other prefix, so I should have the CGIAR System Organization people email them before I send the new <code>sitebndl.zip</code></li>
<li>Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database</li>
<li>Here are all my distinct authority combinations in the database before:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | -1
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 0
Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 | -1
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(8 rows)
</code></pre><ul>
<li>And then after adding a new item and selecting an existing &ldquo;Orth, Alan&rdquo; with an ORCID in the author lookup:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | -1
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 0
Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde | 600
Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 | -1
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(9 rows)
</code></pre><ul>
<li>It created a new authority&hellip; let&rsquo;s try to add another item and select the same existing author and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | -1
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 0
Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde | 600
Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 | -1
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(9 rows)
</code></pre><ul>
<li>No new one&hellip; so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f | 600
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | -1
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 0
Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde | 600
Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 | -1
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(10 rows)
</code></pre><ul>
<li>Shit, it created another authority! Let&rsquo;s try it again!</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;%Orth, %&#39;;
text_value | authority | confidence
------------+--------------------------------------+------------
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | -1
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | d85a8a5b-9b82-4aaf-8033-d7e0c7d9cb8f | 600
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 600
Orth, Alan | 9aed566a-a248-4878-9577-0caedada43db | 600
Orth, A. | 1a1943a0-3f87-402f-9afe-e52fb46a513e | 600
Orth, Alan | 1a1943a0-3f87-402f-9afe-e52fb46a513e | -1
Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad | 0
Orth, Alan | cb3aa5ae-906f-4902-97b1-2667cf148dde | 600
Orth, Alan | 0d575fa3-8ac4-4763-a90a-1248d4791793 | -1
Orth, Alan | 67a9588f-d86a-4155-81a2-af457e9d13f9 | 600
(11 rows)
</code></pre><ul>
<li>It added <em>another</em> authority&hellip; surely this is not the desired behavior, or maybe we are not using this as intended?</li>
</ul>
<h2 id="2017-09-14">2017-09-14</h2>
<ul>
<li>Communicate with Handle.net admins to try to get some guidance about the 10947 prefix</li>
<li>Michael Marus is the contact for their prefix but he has left CGIAR; as I actually have access to the CGIAR Library server I think I can just generate a new <code>sitebndl.zip</code> file from their server and send it to Handle.net</li>
<li>Also, Handle.net says their prefix is up for annual renewal next month so we might want to just pay for it and take it over</li>
<li>CGSpace was very slow and Uptime Robot even said it was down at one time</li>
<li>I didn&rsquo;t see any abnormally high usage in the REST or OAI logs, but looking at Munin I see the average JVM usage was at 4.9GB and the heap is only 5GB (5120M), so I think it&rsquo;s just normal growing pains</li>
<li>Every few months I generally try to increase the JVM heap to be 512M higher than the average usage reported by Munin, so now I adjusted it to 5632M</li>
</ul>
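<ul>
<li>For reference, the heap size itself is just the <code>-Xms</code>/<code>-Xmx</code> flags in Tomcat&rsquo;s <code>JAVA_OPTS</code> (ours is managed in the Ansible playbooks), so the change amounts to something like this sketch:</li>
</ul>
<pre tabindex="0"><code>JAVA_OPTS=&#34;$JAVA_OPTS -Xms5632m -Xmx5632m&#34;
</code></pre>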
<h2 id="2017-09-15">2017-09-15</h2>
<ul>
<li>Apply CCAFS project tag corrections on CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# \i /tmp/ccafs-projects.sql
DELETE 5
UPDATE 4
UPDATE 1
DELETE 1
DELETE 207
</code></pre><h2 id="2017-09-17">2017-09-17</h2>
<ul>
<li>Create pull request for CGSpace to be able to resolve multiple handles (<a href="https://github.com/ilri/DSpace/pull/339">#339</a>)</li>
<li>We still need to do the changes to <code>config.dct</code> and regenerate the <code>sitebndl.zip</code> to send to the Handle.net admins</li>
<li>According to this <a href="http://dspace.2283337.n4.nabble.com/Multiple-handle-prefixes-merged-DSpace-instances-td3427192.html">dspace-tech mailing list entry from 2011</a>, we need to add the extra handle prefixes to <code>config.dct</code> like this:</li>
</ul>
<pre tabindex="0"><code>&#34;server_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&#34;replication_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
&#34;backup_admins&#34; = (
&#34;300:0.NA/10568&#34;
&#34;300:0.NA/10947&#34;
)
</code></pre><ul>
<li>More work on the CGIAR Library migration test run locally, as I was having problems importing the last fourteen items from the CGIAR System Management Office community</li>
<li>The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection was using 10568</li>
<li>I ended up having to read the <a href="https://wiki.lyrasis.org/display/DSDOC5x/AIP+Backup+and+Restore#AIPBackupandRestore-ForceReplaceMode">AIP Backup and Restore</a> documentation closely a few times and then explicitly preserve handles and ignore parents:</li>
</ul>
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
</code></pre><ul>
<li>Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!</li>
<li>I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing <code>Timeout waiting for idle object</code> errors</li>
<li>I had to cancel the import, clean up a bunch of database entries, increase the PostgreSQL <code>max_connections</code> as a precaution, restart PostgreSQL and Tomcat, and then finally completed the import</li>
</ul>
<h2 id="2017-09-18">2017-09-18</h2>
<ul>
<li>I think we should force regeneration of all thumbnails in the CGIAR Library community, as their DSpace was version 1.7 while CGSpace is running DSpace 5.5, so the new thumbnails should look much better</li>
<li>One item for comparison:</li>
</ul>
<p><img src="/cgspace-notes/2017/09/10947-2919-before.jpg" alt="With original DSpace 1.7 thumbnail"></p>
<p><img src="/cgspace-notes/2017/09/10947-2919-after.jpg" alt="After DSpace 5.5"></p>
<ul>
<li>Moved the CGIAR Library Migration notes to a page (<a href="/cgspace-notes/cgiar-library-migration/">cgiar-library-migration</a>) as there seems to be a bug with post slugs defined in frontmatter when you have a permalink scheme defined in <code>config.toml</code> (happens currently in Hugo 0.27.1 at least)</li>
</ul>
<h2 id="2017-09-19">2017-09-19</h2>
<ul>
<li>Nightly Solr indexing is working again, and it appears to be pretty quick actually (about four minutes for 65,808 items):</li>
</ul>
<pre tabindex="0"><code>2017-09-19 00:00:14,953 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
...
2017-09-19 00:04:18,017 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (65807 of 65808): 83753
</code></pre><ul>
<li>Sisay asked if he could import 50 items for IITA that have already been checked by Bosede and Bizuwork</li>
<li>I had a look at the collection and noticed a bunch of issues with item types and donors, so I asked him to fix those and import it to DSpace Test again first</li>
<li>Abenet wants to be able to filter by ISI Journal in advanced search on queries like this: <a href="https://cgspace.cgiar.org/discover?filtertype_0=dateIssued&amp;filtertype_1=dateIssued&amp;filter_relational_operator_1=equals&amp;filter_relational_operator_0=equals&amp;filter_1=%5B2010+TO+2017%5D&amp;filter_0=2017&amp;filtertype=type&amp;filter_relational_operator=equals&amp;filter=Journal+Article">https://cgspace.cgiar.org/discover?filtertype_0=dateIssued&amp;filtertype_1=dateIssued&amp;filter_relational_operator_1=equals&amp;filter_relational_operator_0=equals&amp;filter_1=%5B2010+TO+2017%5D&amp;filter_0=2017&amp;filtertype=type&amp;filter_relational_operator=equals&amp;filter=Journal+Article</a></li>
<li>I opened an issue to track this (<a href="https://github.com/ilri/DSpace/issues/340">#340</a>) and will test it on DSpace Test soon</li>
<li>Marianne Gadeberg from WLE asked if I would add an account for Adam Hunt on CGSpace and give him permissions to approve all WLE publications</li>
<li>I told him to register first, as he&rsquo;s a CGIAR user and needs an account to be created before I can add him to the groups</li>
</ul>
<h2 id="2017-09-20">2017-09-20</h2>
<ul>
<li>Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite</li>
<li>Force thumbnail regeneration for the CGIAR System Organization&rsquo;s Historic Archive community (2000 items):</li>
</ul>
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>I&rsquo;m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org</li>
</ul>
<h2 id="2017-09-21">2017-09-21</h2>
<ul>
<li>Switch to OpenJDK 8 from Oracle JDK on DSpace Test</li>
<li>I want to test this for awhile to see if we can start using it instead</li>
<li>I need to look at the JVM graphs in Munin, test the Atmire modules, build the source, etc to get some impressions</li>
</ul>
<h2 id="2017-09-22">2017-09-22</h2>
<ul>
<li>Experimenting with setting up a global JNDI database resource that can be pooled among all the DSpace webapps (reference the <a href="https://wiki.lyrasis.org/display/cmtygp/DCAT+Meeting+April+2017">April, 2017 DCAT meeting</a> comments)</li>
<li>See: <a href="https://www.journaldev.com/2513/tomcat-datasource-jndi-example-java">https://www.journaldev.com/2513/tomcat-datasource-jndi-example-java</a></li>
<li>See: <a href="http://memorynotfound.com/configure-jndi-datasource-tomcat/">http://memorynotfound.com/configure-jndi-datasource-tomcat/</a></li>
</ul>
<h2 id="2017-09-24">2017-09-24</h2>
<ul>
<li>Start investigating other platforms for CGSpace due to linear instance pricing on Linode</li>
<li>We need to figure out how much memory is used by applications, caches, etc, and how much disk space the asset store needs</li>
<li>First, here&rsquo;s the last week of memory usage on CGSpace and DSpace Test:</li>
</ul>
<p><img src="/cgspace-notes/2017/09/cgspace-memory-week.png" alt="CGSpace memory week">
<img src="/cgspace-notes/2017/09/dspace-test-memory-week.png" alt="DSpace Test memory week"></p>
<ul>
<li>8GB of RAM seems to be good for DSpace Test for now, with Tomcat&rsquo;s JVM heap taking 3GB, caches and buffers taking 3-4GB, and then ~1GB unused</li>
<li>24GB of RAM is <em>way</em> too much for CGSpace, with Tomcat&rsquo;s JVM heap taking 5.5GB and caches and buffers happily using 14GB or so</li>
<li>As far as disk space, the CGSpace assetstore currently uses 51GB and Solr cores use 86GB (mostly in the statistics core)</li>
<li>DSpace Test currently doesn&rsquo;t even have enough space to store a full copy of CGSpace, as its Linode instance only has 96GB of disk space</li>
<li>I&rsquo;ve heard Google Cloud is nice (cheap and performant) but it&rsquo;s definitely more complicated than Linode and the instances aren&rsquo;t <em>that</em> much cheaper, so it&rsquo;s not really worth it</li>
<li>Here are some theoretical instances on Google Cloud:
<ul>
<li>DSpace Test, <code>n1-standard-2</code> with 2 vCPUs, 7.5GB RAM, 300GB persistent SSD: $99/month</li>
<li>CGSpace, <code>n1-standard-4</code> with 4 vCPUs, 15GB RAM, 300GB persistent SSD: $148/month</li>
</ul>
</li>
<li>Looking at <a href="https://www.linode.com/pricing#all">Linode&rsquo;s instance pricing</a>, for DSpace Test it seems we could use the same 8GB instance for $40/month, and then add <a href="https://www.linode.com/docs/platform/how-to-use-block-storage-with-your-linode">block storage</a> of ~300GB for $30 (block storage is currently in beta and priced at $0.10/GiB)</li>
<li>For CGSpace we could use the cheaper 12GB instance for $80 and then add block storage of 500GB for $50</li>
<li>I&rsquo;ve sent Peter a message about moving DSpace Test to the New Jersey data center so we can test the block storage beta</li>
<li>Create pull request for adding ISI Journal to search filters (<a href="https://github.com/ilri/DSpace/pull/341">#341</a>)</li>
<li>Peter asked if we could map all the items of type <code>Journal Article</code> in <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI Archive</a> to <a href="https://cgspace.cgiar.org/handle/10568/3">ILRI articles in journals and newsletters</a></li>
<li>It is easy to do via CSV using OpenRefine but I noticed that on CGSpace ~1,000 of the expected 2,500 are already mapped, while on DSpace Test they were not</li>
<li>I&rsquo;ve asked Peter if he knows what&rsquo;s going on (or who mapped them)</li>
<li>Turns out he had already mapped some, but requested that I finish the rest</li>
<li>With this GREL in OpenRefine I can find items that are mapped, ie they have <code>10568/3||</code> or <code>10568/3$</code> in their <code>collection</code> field:</li>
</ul>
<pre tabindex="0"><code>isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
</code></pre><ul>
<li>Peter also made a lot of changes to the data in the Archives collections while I was attempting to import the changes, so we were essentially competing for PostgreSQL and Solr connections</li>
<li>I ended up having to kill the import and wait until he was done</li>
<li>I exported a clean CSV and applied the changes from that one, which was a hundred or two less than I thought there should be (at least compared to the current state of DSpace Test, which is a few months old)</li>
</ul>
<h2 id="2017-09-25">2017-09-25</h2>
<ul>
<li>Email Rosemary Kande from ICT to ask about the administrative / finance procedure for moving DSpace Test from EU to US region on Linode</li>
<li>Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org</li>
<li>Peter wants me to clean up the text values for Delia Grace&rsquo;s metadata, as the authorities are all messed up again since we cleaned them up in <a href="/cgspace-notes/2016-12">2016-12</a>:</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
text_value | authority | confidence
--------------+--------------------------------------+------------
Grace, Delia | | 600
Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c | 600
Grace, Delia | bfa61d7c-7583-4175-991c-2e7315000f0c | -1
Grace, D. | 6a8ddca3-33c1-45f9-aa00-6fa9fc91e3fc | -1
</code></pre><ul>
<li>Strangely, none of her authority entries have ORCIDs anymore&hellip;</li>
<li>I&rsquo;ll just fix the text values and forget about it for now:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_value=&#39;Grace, Delia&#39;, authority=&#39;bfa61d7c-7583-4175-991c-2e7315000f0c&#39;, confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;Grace, D%&#39;;
UPDATE 610
</code></pre><ul>
<li>After this we have to reindex the Discovery and Authority cores (as <code>tomcat7</code> user):</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 83m56.895s
user 13m16.320s
sys 2m17.917s
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-authority -b
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Exception: null
java.lang.NullPointerException
at org.dspace.authority.AuthorityValueGenerator.generateRaw(AuthorityValueGenerator.java:82)
at org.dspace.authority.AuthorityValueGenerator.generate(AuthorityValueGenerator.java:39)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.prepareNextValue(DSpaceAuthorityIndexer.java:201)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:132)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:159)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.DSpaceAuthorityIndexer.hasMore(DSpaceAuthorityIndexer.java:144)
at org.dspace.authority.indexer.AuthorityIndexClient.main(AuthorityIndexClient.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
real 6m6.447s
user 1m34.010s
sys 0m12.113s
</code></pre><ul>
<li>The <code>index-authority</code> script always seems to fail, I think it&rsquo;s the same old bug</li>
<li>Something interesting for my notes about JNDI database pool—since I couldn&rsquo;t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:</li>
</ul>
<pre tabindex="0"><code>ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
...
INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
</code></pre><ul>
<li>So it&rsquo;s good to know that <em>something</em> gets printed when it fails because I didn&rsquo;t see <em>any</em> mention of JNDI before when I was testing!</li>
</ul>
<h2 id="2017-09-26">2017-09-26</h2>
<ul>
<li>Adam Hunt from WLE finally registered so I added him to the editor and approver groups</li>
<li>Then I noticed that Sisay never removed Marianne&rsquo;s user accounts from the approver steps in the workflow because she is already in the WLE groups, which are in those steps</li>
<li>For what it&rsquo;s worth, I had asked him to remove them on 2017-09-14</li>
<li>I also went and added the WLE approvers and editors groups to the appropriate steps of all the Phase I and Phase II research theme collections</li>
<li>A lot of CIAT&rsquo;s items have manually generated thumbnails which have an incorrect aspect ratio and an ugly black border</li>
<li>I communicated with Elizabeth from CIAT to tell her she should use DSpace&rsquo;s automatically generated thumbnails</li>
<li>Start discussing with ICT about the Linode server update for DSpace Test</li>
<li>Rosemary said I need to work with Robert Okal to destroy/create the server, and then let her and Lilian Masigah from finance know the updated Linode asset names for their records</li>
</ul>
<h2 id="2017-09-28">2017-09-28</h2>
<ul>
<li>Tunji from the System Organization finally sent the DNS request for library.cgiar.org to CGNET</li>
<li>Now the redirects work</li>
<li>I quickly registered a Let&rsquo;s Encrypt certificate for the domain:</li>
</ul>
<pre tabindex="0"><code># systemctl stop nginx
# /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
# systemctl start nginx
</code></pre><ul>
<li>I modified the nginx configuration of the ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working:</li>
</ul>
<pre tabindex="0"><code>$ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org -tls1_2 -tlsextdebug -status
...
OCSP Response Data:
...
Cert Status: good
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

497
docs/2017-10/index.html Normal file
View File

@ -0,0 +1,497 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2017" />
<meta property="og:description" content="2017-10-01
Peter emailed to point out that many items in the ILRI archive collection have multiple handles:
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-10/" />
<meta property="article:published_time" content="2017-10-01T08:07:54+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2017"/>
<meta name="twitter:description" content="2017-10-01
Peter emailed to point out that many items in the ILRI archive collection have multiple handles:
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-10/",
"wordCount": "2613",
"datePublished": "2017-10-01T08:07:54+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-10/">
<title>October, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-10/">October, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-10-01T08:07:54+03:00">Sun Oct 01, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2017-10-01">2017-10-01</h2>
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine (a rough SQL sketch is below)</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
</ul>
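<ul>
<li>For the SQL route, a rough sketch to find items with more than one Handle URI (assuming <code>dc.identifier.uri</code> turns out to be metadata field 25; check the field registry first):</li>
</ul>
<pre tabindex="0"><code>dspace=# select metadata_field_id from metadatafieldregistry where element=&#39;identifier&#39; and qualifier=&#39;uri&#39;;
dspace=# select resource_id, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=25 and text_value like &#39;%hdl.handle.net%&#39; group by resource_id having count(*) &gt; 1;
</code></pre>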
<h2 id="2017-10-02">2017-10-02</h2>
<ul>
<li>Peter Ballantyne said he was having problems logging into CGSpace with &ldquo;both&rdquo; of his accounts (CGIAR LDAP and personal, apparently)</li>
<li>I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a &ldquo;no DN found&rdquo; error:</li>
</ul>
<pre tabindex="0"><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
</code></pre><ul>
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;ldap_authentication:type=failed_auth&#34; dspace.log.2017-10-01
14
</code></pre><ul>
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
<li>Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks</li>
</ul>
<h2 id="2017-10-04">2017-10-04</h2>
<ul>
<li>Twice in the last twenty-four hours Linode has alerted about high CPU usage on CGSpace (linode2533629)</li>
<li>Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace</li>
<li>The first is a link to a browse page that should be handled better in nginx:</li>
</ul>
<pre tabindex="0"><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
</code></pre><ul>
<li>We&rsquo;ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn&rsquo;t exist in Discovery yet, but we&rsquo;ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
<li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
<li>Help Sisay proof sixty-two IITA records on DSpace Test</li>
<li>Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries</li>
<li>Merge the Discovery search changes for ISI Journal (<a href="https://github.com/ilri/DSpace/pull/341">#341</a>)</li>
</ul>
<h2 id="2017-10-05">2017-10-05</h2>
<ul>
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
181 66.249.66.95
211 66.249.66.91
312 66.249.66.94
384 66.249.66.90
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
8 41.84.227.85
8 66.249.66.92
17 66.249.66.65
24 66.249.66.91
38 66.249.66.95
69 66.249.66.90
148 66.249.66.94
</code></pre><ul>
<li>Working on the nginx redirects for CGIAR Library</li>
<li>We should start using 301 redirects and also allow for <code>/sitemap</code> to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way (see the quick check sketched below)</li>
<li>Remove eleven occurrences of <code>ACP</code> in IITA&rsquo;s <code>cg.coverage.region</code> using the Atmire batch edit module from Discovery</li>
<li>Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods</li>
<li>Run corrections on 143 ILRI Archive items that had two <code>dc.identifier.uri</code> values (Handle) that Peter had pointed out earlier this week</li>
<li>I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace</li>
<li>I manually checked a dozen of them and it appeared that the correct handle was always the second one, so I just deleted the first one</li>
</ul>
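<ul>
<li>Once the 301 redirects are deployed, a quick spot check from the shell should show the new status and <code>Location</code> header (a sketch, using the browse URL from above):</li>
</ul>
<pre tabindex="0"><code>$ curl -sI &#39;http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject&#39; | grep -E &#39;HTTP|Location&#39;
</code></pre>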
<h2 id="2017-10-06">2017-10-06</h2>
<ul>
<li>I saw a nice tweak to thumbnail presentation on the Cardiff Metropolitan University DSpace: <a href="https://repository.cardiffmet.ac.uk/handle/10369/8780">https://repository.cardiffmet.ac.uk/handle/10369/8780</a></li>
<li>It adds a subtle border and box shadow; here are the thumbnails before and after:</li>
</ul>
<p><img src="/cgspace-notes/2017/10/dspace-thumbnail-original.png" alt="Original flat thumbnails">
<img src="/cgspace-notes/2017/10/dspace-thumbnail-box-shadow.png" alt="Tweaked with border and box shadow"></p>
<ul>
<li>I&rsquo;ll post it to the Yammer group to see what people think</li>
<li>I figured out a way to do the HTML verification for Google Search Console for library.cgiar.org</li>
<li>We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install</li>
<li>Then we add an nginx alias for that URL in the library.cgiar.org vhost</li>
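<li>Something along these lines should work (a sketch; the token file name and the webapp path are placeholders until Tunji sends the real verification file):</li>
</ul>
<pre tabindex="0"><code># sketch: serve the Google verification file that gets copied into the XMLUI webapp
location = /googleXXXXXXXXXXXXXXXX.html {
    # both the file name and the path below are placeholders
    alias /path/to/dspace/webapps/xmlui/googleXXXXXXXXXXXXXXXX.html;
}
</code></pre><ul>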
<li>This method is kind of a hack but at least we can put all the pieces into git to be reproducible</li>
<li>I will tell Tunji to send me the verification file</li>
</ul>
<h2 id="2017-10-10">2017-10-10</h2>
<ul>
<li>Deploy logic to allow verification of the library.cgiar.org domain in the Google Search Console (<a href="https://github.com/ilri/DSpace/pull/343">#343</a>)</li>
<li>After verifying both the HTTP and HTTPS domains and submitting a sitemap it will be interesting to see how the stats in the console as well as the search results change (currently 28,500 results):</li>
</ul>
<p><img src="/cgspace-notes/2017/10/google-search-console.png" alt="Google Search Console">
<img src="/cgspace-notes/2017/10/google-search-console-2.png" alt="Google Search Console 2">
<img src="/cgspace-notes/2017/10/google-search-results.png" alt="Google Search results"></p>
<ul>
<li>I tried to submit a &ldquo;Change of Address&rdquo; request in the Google Search Console but I need to be an owner on CGSpace&rsquo;s console (currently I&rsquo;m just a user) in order to do that</li>
<li>Manually clean up some communities and collections that Peter had requested a few weeks ago</li>
<li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
<li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
</ul>
<pre tabindex="0"><code>10568/1637 10568/174 10568/27629
10568/1642 10568/174 10568/27629
10568/1614 10568/174 10568/27629
10568/75561 10568/150 10568/27629
10568/183 10568/230 10568/27629
</code></pre><ul>
<li>Delete community 10568/174 (Sustainable livestock futures)</li>
<li>Delete collections in 10568/27629 that have zero items (33 of them!)</li>
</ul>
<h2 id="2017-10-11">2017-10-11</h2>
<ul>
<li>Peter added me as an owner on the CGSpace property on Google Search Console and I tried to submit a &ldquo;Change of Address&rdquo; request for the CGIAR Library but got an error:</li>
</ul>
<p><img src="/cgspace-notes/2017/10/search-console-change-address-error.png" alt="Change of Address error"></p>
<ul>
<li>We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won&rsquo;t work—we&rsquo;ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects</li>
<li>Also the Google Search Console doesn&rsquo;t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the &ldquo;Change of Address&rdquo; tool to work!</li>
</ul>
<h2 id="2017-10-12">2017-10-12</h2>
<ul>
<li>Finally finish (I think) working on the myriad nginx redirects for all the CGIAR Library browse stuff—it ended up getting pretty complicated!</li>
<li>I still need to commit the DSpace changes (add browse index, XMLUI strings, Discovery index, etc), but I should be able to deploy that on CGSpace soon</li>
</ul>
<h2 id="2017-10-14">2017-10-14</h2>
<ul>
<li>Run system updates on DSpace Test and reboot server</li>
<li>Merge changes adding a search/browse index for CGIAR System subject to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/344">#344</a>)</li>
<li>I checked the top browse links in Google&rsquo;s search results for <code>site:library.cgiar.org inurl:browse</code> and they are all redirected appropriately by the nginx rewrites I worked on last week</li>
</ul>
<h2 id="2017-10-22">2017-10-22</h2>
<ul>
<li>Run system updates on DSpace Test and reboot server</li>
<li>Re-deploy CGSpace from latest <code>5_x-prod</code> (adds ISI Journal to search filters and adds Discovery index for CGIAR Library <code>systemsubject</code>)</li>
<li>Deploy nginx redirect fixes to catch CGIAR Library browse links (redirect to their community and translate subject→systemsubject)</li>
<li>Run migration of CGSpace server (linode18) for Linode security alert, which took 42 minutes of downtime</li>
</ul>
<h2 id="2017-10-26">2017-10-26</h2>
<ul>
<li>In the last 24 hours we&rsquo;ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace</li>
<li>Uptime Robot even noticed CGSpace go &ldquo;down&rdquo; for a few minutes</li>
<li>In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to the SQL connection pool being exhausted</li>
<li>Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up</li>
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
<li>I still have no idea what was causing the load to go up today</li>
<li>I finally investigated Magdalena&rsquo;s issue with the item download stats and now I can&rsquo;t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the &ldquo;Most Popular Items&rdquo; page, and in Usage Stats</li>
<li>I think it might have been an issue with the statistics not being fresh</li>
<li>I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten</li>
<li>Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data</li>
<li>I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection</li>
<li>We&rsquo;ve never used it but it could be worth looking at</li>
</ul>
<h2 id="2017-10-27">2017-10-27</h2>
<ul>
<li>Linode alerted about high CPU usage again (twice) on CGSpace in the last 24 hours, around 2AM and 2PM</li>
</ul>
<h2 id="2017-10-28">2017-10-28</h2>
<ul>
<li>Linode alerted about high CPU usage again on CGSpace around 2AM this morning</li>
</ul>
<h2 id="2017-10-29">2017-10-29</h2>
<ul>
<li>Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM</li>
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly the past week</li>
<li>I don&rsquo;t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:</li>
</ul>
<pre tabindex="0"><code># grep &#39;2017-10-29 02:&#39; dspace.log.2017-10-29 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
<li>Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts</li>
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &#34;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&#34; 200 7776 &#34;-&#34; &#34;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&#34;
</code></pre><ul>
<li>CORE seems to be some bot that is &ldquo;Aggregating the world&rsquo;s open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect, correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
<li>I will check the logs in a few days to see if they are harvesting us regularly, then add their bot&rsquo;s user agent to Tomcat&rsquo;s Crawler Session Manager Valve</li>
<li>After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org redirects to us now</li>
<li>For now I will just contact them to have them update their contact info in the bot&rsquo;s user agent, but eventually I think I&rsquo;ll tell them to swap out the CGIAR Library entry for CGSpace</li>
</ul>
<h2 id="2017-10-30">2017-10-30</h2>
<ul>
<li>Like clock work, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)</li>
<li>Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log
26475
# grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to Tomcat&rsquo;s Crawler Session Manager Valve but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
<li>&hellip; and most of their requests are for dynamic discover pages:</li>
</ul>
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &#34;GET /discover&#34;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
763 157.55.39.231
782 207.46.13.90
998 66.249.66.90
1948 104.196.152.243
4247 190.19.92.5
31602 137.108.70.6
31636 137.108.70.7
</code></pre><ul>
<li>At least we know the top two are CORE, but who are the others?</li>
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2811
</code></pre><ul>
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
<li>I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them</li>
<li>Actually, according to the Tomcat docs, we could use an IP with <code>crawlerIps</code>: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve</a></li>
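<li>The valve stanza in Tomcat&rsquo;s <code>server.xml</code> would look roughly like this (a sketch, keeping the default user agent regex and adding the two IPs):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*&#34;
       crawlerIps=&#34;190\.19\.92\.5|104\.196\.152\.243&#34; /&gt;
</code></pre><ul>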
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property &#39;crawlerIps&#39; to &#39;190\.19\.92\.5|104\.196\.152\.243&#39; did not find a matching property.
</code></pre><ul>
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)&#39; dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
</code></pre><ul>
<li>I will check again tomorrow</li>
</ul>
<h2 id="2017-10-31">2017-10-31</h2>
<ul>
<li>Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again</li>
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
<li>I&rsquo;ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace</li>
<li>Also, I asked if they could perhaps use the <code>sitemap.xml</code>, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets</li>
<li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of packages to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
</ul>
<pre tabindex="0"><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
</code></pre><ul>
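<li>You can also pipe in the rotated logs to get a longer view, something like this (assuming goaccess will read the combined format from stdin):</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | goaccess --log-format=COMBINED -
</code></pre><ul>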
<li>According to Uptime Robot CGSpace went down and up a few times</li>
<li>I had a look at goaccess and I saw that CORE was actively indexing</li>
<li>Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)</li>
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | grep -o -E &#34;GET /(discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
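<li>For reference, the relevant rules in our <code>robots.txt</code> are roughly:</li>
</ul>
<pre tabindex="0"><code>User-agent: *
Disallow: /discover
Disallow: /search-filter
</code></pre><ul>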
<li>I tested a URL of pattern <code>/discover</code> in Google&rsquo;s webmaster tools and it was indeed identified as blocked</li>
<li>I will send feedback to the CORE bot team</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

998
docs/2017-11/index.html Normal file
View File

@ -0,0 +1,998 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2017" />
<meta property="og:description" content="2017-11-01
The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-11/" />
<meta property="article:published_time" content="2017-11-02T09:37:54+02:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2017"/>
<meta name="twitter:description" content="2017-11-01
The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
"wordCount": "5428",
"datePublished": "2017-11-02T09:37:54+02:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-11/">
<title>November, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-11/">November, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-11-02T09:37:54+02:00">Thu Nov 02, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2017-11-01">2017-11-01</h2>
<ul>
<li>The CORE developers responded to say they are looking into their bot not respecting our robots.txt</li>
</ul>
<h2 id="2017-11-02">2017-11-02</h2>
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre><ul>
<li>Abenet asked if it would be possible to generate a report of items in Listing and Reports that had &ldquo;International Fund for Agricultural Development&rdquo; as the <em>only</em> investor</li>
<li>I opened a ticket with Atmire to ask if this was possible: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540</a></li>
<li>Work on making the thumbnails in the item view clickable</li>
<li>Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link</li>
</ul>
<pre tabindex="0"><code>//mets:fileSec/mets:fileGrp[@USE=&#39;CONTENT&#39;]/mets:file/mets:FLocat[@LOCTYPE=&#39;URL&#39;]/@xlink:href
</code></pre><ul>
<li>METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml</li>
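<li>A quick way to test that XPath from the command line is something like this sketch (the <code>local-name()</code> trick avoids having to register the METS and XLink namespaces with xmllint):</li>
</ul>
<pre tabindex="0"><code>$ curl -s https://cgspace.cgiar.org/metadata/handle/10568/95947/mets.xml | xmllint --xpath &#34;string(//*[local-name()=&#39;fileGrp&#39;][@USE=&#39;CONTENT&#39;]/*[local-name()=&#39;file&#39;]/*[local-name()=&#39;FLocat&#39;]/@*[local-name()=&#39;href&#39;])&#34; -
</code></pre><ul>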
<li>I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!</li>
<li>Help proof fifty-three CIAT records for Sisay: <a href="https://dspacetest.cgiar.org/handle/10568/95895">https://dspacetest.cgiar.org/handle/10568/95895</a></li>
<li>A handful of issues with <code>cg.place</code> using a format like &ldquo;Lima, PE&rdquo; instead of &ldquo;Lima, Peru&rdquo;</li>
<li>Also, some dates with completely invalid formats like &ldquo;2010- 06&rdquo; and &ldquo;2011-3-28&rdquo;</li>
<li>I also collapsed some consecutive whitespace on a handful of fields</li>
</ul>
<h2 id="2017-11-03">2017-11-03</h2>
<ul>
<li>Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports</li>
<li>I said I&rsquo;d ask Abenet if she wants that feature</li>
</ul>
<h2 id="2017-11-04">2017-11-04</h2>
<ul>
<li>I finished looking through Sisay&rsquo;s CIAT records for the &ldquo;Alianzas de Aprendizaje&rdquo; data</li>
<li>I corrected about half of the authors to standardize them</li>
<li>Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM</li>
<li>It&rsquo;s the first time in a few days that this has happened</li>
<li>I had a look to see what was going on, but it isn&rsquo;t the CORE bot:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
507 40.77.167.16
618 157.55.39.161
652 207.46.13.103
666 157.55.39.254
1173 104.196.152.243
1737 66.249.66.90
23101 138.201.52.218
</code></pre><ul>
<li>138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:</li>
</ul>
<pre tabindex="0"><code># zgrep -c 138.201.52.218 /var/log/nginx/access.log*
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:0
/var/log/nginx/access.log.6.gz:0
</code></pre><ul>
<li>It&rsquo;s clearly a bot as it&rsquo;s making tens of thousands of requests, but it&rsquo;s using a &ldquo;normal&rdquo; user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
</code></pre><ul>
<li>For now I don&rsquo;t know what this user is!</li>
</ul>
<h2 id="2017-11-05">2017-11-05</h2>
<ul>
<li>Peter asked if I could fix the appearance of &ldquo;International Livestock Research Institute&rdquo; in the author lookup during item submission</li>
<li>It looks to be just an issue with the user interface expecting authors to have both a first and last name:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/author-lookup.png" alt="Author lookup">
<img src="/cgspace-notes/2017/11/add-author.png" alt="Add author"></p>
<ul>
<li>But in the database the authors are correct (none with weird <code>, /</code> characters):</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like &#39;International Livestock Research Institute%&#39;;
text_value | authority | confidence
--------------------------------------------+--------------------------------------+------------
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | 0
International Livestock Research Institute | | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 600
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | -1
International Livestock Research Institute | | 600
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 500
(8 rows)
</code></pre><ul>
<li>So I&rsquo;m not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval</li>
<li>Looking at monitoring Tomcat&rsquo;s JVM heap with Prometheus, it looks like we need to use JMX + <a href="https://github.com/prometheus/jmx_exporter">jmx_exporter</a></li>
<li>This guide shows how to <a href="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">enable JMX in Tomcat</a> by modifying <code>CATALINA_OPTS</code></li>
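<li>On my local Tomcat I added something like this (a sketch; no SSL or authentication because it is only listening locally):</li>
</ul>
<pre tabindex="0"><code>CATALINA_OPTS=&#34;$CATALINA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9000 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false&#34;
</code></pre><ul>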
<li>I was able to successfully connect to my local Tomcat with jconsole!</li>
</ul>
<h2 id="2017-11-07">2017-11-07</h2>
<ul>
<li>CGSpace went down and up a few times this morning, first around 3AM, then around 7</li>
<li>Tsega had to restart Tomcat 7 to fix it temporarily</li>
<li>I will start by looking at bot usage (access.log.1 includes usage until 6AM today):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
1131 68.180.229.254
1583 66.249.66.90
1953 207.46.13.103
1999 207.46.13.80
2021 157.55.39.161
2034 207.46.13.36
4681 104.196.152.243
</code></pre><ul>
<li>104.196.152.243 seems to be a top scraper for a few weeks now:</li>
</ul>
<pre tabindex="0"><code># zgrep -c 104.196.152.243 /var/log/nginx/access.log*
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
/var/log/nginx/access.log.3.gz:3532
/var/log/nginx/access.log.4.gz:5786
/var/log/nginx/access.log.5.gz:8542
/var/log/nginx/access.log.6.gz:6988
/var/log/nginx/access.log.7.gz:7517
/var/log/nginx/access.log.8.gz:7211
/var/log/nginx/access.log.9.gz:2763
</code></pre><ul>
<li>This user is responsible for hundreds and sometimes thousands of Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
6199
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
7051
</code></pre><ul>
<li>The worst thing is that this user never specifies a user agent string so we can&rsquo;t lump it in with the other bots using Tomcat&rsquo;s Crawler Session Manager Valve</li>
<li>They don&rsquo;t request dynamic URLs like &ldquo;/discover&rdquo; but they seem to be fetching handles from XMLUI instead of REST (and some with <code>//handle</code>, note the regex below):</li>
</ul>
<pre tabindex="0"><code># grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P &#39;GET //?handle&#39;
4618
</code></pre><ul>
<li>I just realized that <code>ciat.cgiar.org</code> points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior</li>
<li>The next IP (207.46.13.36) seem to be Microsoft&rsquo;s bingbot, but all its requests specify the &ldquo;bingbot&rdquo; user agent and there are no requests for dynamic URLs that are forbidden, like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code>$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:</li>
</ul>
<pre tabindex="0"><code># grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
5997
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;bingbot&#34;
5988
# grep -E &#39;207.46.13.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1
3048
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c Google
3048
# grep -E &#39;66.249.66.[0-9]{2,3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like &ldquo;/discover&rdquo;:</li>
</ul>
<pre tabindex="0"><code># grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
0
</code></pre><ul>
<li>The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:</li>
</ul>
<pre tabindex="0"><code># grep -c -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1
2950
# grep -E &#39;65.49.68.[0-9]{3}&#39; /var/log/nginx/access.log.1 | grep -c &#34;GET /discover&#34;
330
</code></pre><ul>
<li>Their user agents vary, ie:
<ul>
<li><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36</code></li>
<li><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11</code></li>
<li><code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)</code></li>
</ul>
</li>
<li>I&rsquo;ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs</li>
<li>While it&rsquo;s not in the top ten, Baidu is one bot that seems to not give a fuck:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;7/Nov/2017&#34; | grep Baiduspider | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
2521
</code></pre><ul>
<li>According to their documentation their bot <a href="http://www.baidu.com/search/robots_english.html">respects <code>robots.txt</code></a>, but I don&rsquo;t see this being the case</li>
<li>I think I will end up blocking Baidu as well&hellip;</li>
<li>Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed</li>
<li>I should look in nginx access.log, rest.log, oai.log, and DSpace&rsquo;s dspace.log.2017-11-07</li>
<li>Here are the top IPs making requests to XMLUI from 2 to 8 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
470 104.196.152.243
470 197.210.168.174
598 207.46.13.103
603 157.55.39.161
637 207.46.13.80
703 207.46.13.36
724 66.249.66.90
</code></pre><ul>
<li>Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot</li>
<li>Here are the top IPs making requests to REST from 2 to 8 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
8 207.241.229.237
10 66.249.66.90
16 104.196.152.243
25 41.60.238.61
26 157.55.39.161
27 207.46.13.103
27 207.46.13.80
31 207.46.13.36
1498 50.116.102.77
</code></pre><ul>
<li>The OAI requests during that same time period are nothing to worry about:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#39;07/Nov/2017:0[2-8]&#39; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
1 66.249.66.92
4 66.249.66.90
6 68.180.229.254
</code></pre><ul>
<li>The top IPs from dspace.log during the 2 to 8 AM period:</li>
</ul>
<pre tabindex="0"><code>$ grep -E &#39;2017-11-07 0[2-8]&#39; dspace.log.2017-11-07 | grep -o -E &#39;ip_addr=[0-9.]+&#39; | sort -n | uniq -c | sort -h | tail
143 ip_addr=213.55.99.121
181 ip_addr=66.249.66.91
223 ip_addr=157.55.39.161
248 ip_addr=207.46.13.80
251 ip_addr=207.46.13.103
291 ip_addr=207.46.13.36
297 ip_addr=197.210.168.174
312 ip_addr=65.49.68.199
462 ip_addr=104.196.152.243
488 ip_addr=66.249.66.90
</code></pre><ul>
<li>These aren&rsquo;t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers</li>
<li>The number of requests isn&rsquo;t even that high to be honest</li>
<li>As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:</li>
</ul>
<pre tabindex="0"><code># zgrep -c 124.17.34.59 /var/log/nginx/access.log*
/var/log/nginx/access.log:22581
/var/log/nginx/access.log.1:0
/var/log/nginx/access.log.2.gz:14
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:3
/var/log/nginx/access.log.6.gz:0
/var/log/nginx/access.log.7.gz:0
/var/log/nginx/access.log.8.gz:0
/var/log/nginx/access.log.9.gz:1
</code></pre><ul>
<li>The whois data shows the IP is from China, but the user agent doesn&rsquo;t really give any clues:</li>
</ul>
<pre tabindex="0"><code># grep 124.17.34.59 /var/log/nginx/access.log | awk -F&#39;&#34; &#39; &#39;{print $3}&#39; | sort | uniq -c | sort -h
210 &#34;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36&#34;
22610 &#34;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)&#34;
</code></pre><ul>
<li>A Google search for &ldquo;LCTE bot&rdquo; doesn&rsquo;t return anything interesting, but this <a href="https://stackoverflow.com/questions/42500881/what-is-lcte-in-user-agent">Stack Overflow discussion</a> references the lack of information</li>
<li>So basically after a few hours of looking at the log files I am not closer to understanding what is going on!</li>
<li>I do know that we want to block Baidu, though, as it does not respect <code>robots.txt</code></li>
<li>And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12:00 to 14:00)</li>
<li>At least for now it seems to be that new Chinese IP (124.17.34.59):</li>
</ul>
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
198 207.46.13.103
203 207.46.13.80
205 207.46.13.36
218 157.55.39.161
249 45.5.184.221
258 45.5.187.130
386 66.249.66.90
410 197.210.168.174
1896 104.196.152.243
11005 124.17.34.59
</code></pre><ul>
<li>Seems 124.17.34.59 really is downloading all our PDFs, compared to the next most active IPs during this time!</li>
</ul>
<pre tabindex="0"><code># grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
5948
# grep -E &#34;07/Nov/2017:1[234]:&#34; /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
0
</code></pre><ul>
<li>About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat session and they are creating thousands of sessions per day</li>
<li>All CIAT requests vs unique ones:</li>
</ul>
<pre tabindex="0"><code>$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | wc -l
3506
$ grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243&#39; dspace.log.2017-11-07 | sort | uniq | wc -l
3506
</code></pre><ul>
<li>I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API</li>
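<li>For example, they could page through the items of a collection via REST instead of hitting thousands of XMLUI handle pages, something like this sketch (the collection ID is just a placeholder, and the descriptive user agent is what I asked them to add):</li>
</ul>
<pre tabindex="0"><code>$ http &#39;https://cgspace.cgiar.org/rest/collections/1234/items?limit=100&amp;offset=0&#39; User-Agent:&#39;CIAT-harvester&#39;
</code></pre><ul>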
<li>About Baidu, I found a link to their <a href="http://ziyuan.baidu.com/robots/">robots.txt tester tool</a></li>
<li>It seems like our robots.txt file is valid, and they claim to recognize that URLs like <code>/discover</code> should be forbidden (不允许, aka &ldquo;not allowed&rdquo;):</li>
</ul>
<p><img src="/cgspace-notes/2017/11/baidu-robotstxt.png" alt="Baidu robots.txt tester"></p>
<ul>
<li>But they literally just made this request today:</li>
</ul>
<pre tabindex="0"><code>180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] &#34;GET /discover?filtertype_0=crpsubject&amp;filter_relational_operator_0=equals&amp;filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&amp;filtertype=subject&amp;filter_relational_operator=equals&amp;filter=WATER+RESOURCES HTTP/1.1&#34; 200 82265 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34;
</code></pre><ul>
<li>Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:</li>
</ul>
<pre tabindex="0"><code># grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
1085
</code></pre><ul>
<li>I will think about blocking their IPs but they have 164 of them!</li>
</ul>
<pre tabindex="0"><code># grep &#34;Baiduspider/2.0&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq | wc -l
164
</code></pre><h2 id="2017-11-08">2017-11-08</h2>
<ul>
<li>Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM</li>
<li>Linode sent another alert about CPU usage in the morning at 6:12AM</li>
<li>Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;0[78]/Nov/2017:&#34; | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
</code></pre><ul>
<li>This is about 20,000 Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E &#39;session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59&#39; | sort | uniq | wc -l
20733
</code></pre><ul>
<li>I&rsquo;m getting really sick of this</li>
<li>Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections</li>
<li>I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (<a href="https://github.com/ilri/DSpace/pull/346">#346</a>)</li>
<li>I figured out a way to use nginx&rsquo;s map function to assign a &ldquo;bot&rdquo; user agent to misbehaving clients who don&rsquo;t define a user agent</li>
<li>Most bots are automatically lumped into one generic session by <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Tomcat&rsquo;s Crawler Session Manager Valve</a> but this only works if their user agent matches a pre-defined regular expression like <code>.*[bB]ot.*</code></li>
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre tabindex="0"><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 &#39;ChineseBot&#39;;
default $http_user_agent;
}
</code></pre><ul>
<li>If the client&rsquo;s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server&rsquo;s <code>/</code> block we pass this header to Tomcat:</li>
</ul>
<pre tabindex="0"><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre><ul>
<li>Note to self: the <code>$ua</code> variable won&rsquo;t show up in nginx access logs because the default <code>combined</code> log format doesn&rsquo;t show it, so don&rsquo;t run around pulling your hair out wondering why the modified user agents aren&rsquo;t showing in the logs!</li>
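<li>If we ever do want to see the mapped value we could define a custom log format that logs <code>$ua</code> instead of <code>$http_user_agent</code>, roughly:</li>
</ul>
<pre tabindex="0"><code>log_format mapped_ua &#39;$remote_addr - $remote_user [$time_local] &#39;
                     &#39;&#34;$request&#34; $status $body_bytes_sent &#39;
                     &#39;&#34;$http_referer&#34; &#34;$ua&#34;&#39;;

access_log /var/log/nginx/access.log mapped_ua;
</code></pre><ul>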
<li>If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve</li>
<li>You can verify by cross referencing nginx&rsquo;s <code>access.log</code> and DSpace&rsquo;s <code>dspace.log.2017-11-08</code>, for example</li>
<li>I will deploy this on CGSpace later this week</li>
<li>I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on <a href="#2017-11-07">2017-11-07</a> for example)</li>
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in <code>robots.txt</code>:</li>
</ul>
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E &#34;GET /(browse|discover|search-filter)&#34;
0
</code></pre><ul>
<li>It seems that they rarely even bother checking <code>robots.txt</code>, but Google does multiple times per day!</li>
</ul>
<pre tabindex="0"><code># zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
</code></pre><ul>
<li>I have been looking for a reason to ban Baidu and this is definitely a good one</li>
<li>Disallowing <code>Baiduspider</code> in <code>robots.txt</code> probably won&rsquo;t work because this bot doesn&rsquo;t seem to respect the robot exclusion standard anyways!</li>
<li>I will whip up something in nginx later</li>
<li>Run system updates on CGSpace and reboot the server</li>
<li>Re-deploy latest <code>5_x-prod</code> branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)</li>
</ul>
<h2 id="2017-11-09">2017-11-09</h2>
<ul>
<li>Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;09/Nov/2017&#39; | grep -c 104.196.152.243
8956
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
223
</code></pre><ul>
<li>Versus the same stats for yesterday and the day before:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#39;08/Nov/2017&#39; | grep -c 104.196.152.243
10216
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2592
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep &#39;07/Nov/2017&#39; | grep -c 104.196.152.243
8120
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
3506
</code></pre><ul>
<li>The number of sessions is over <em>ten times less</em>!</li>
<li>This gets me thinking, I wonder if I can use something like nginx&rsquo;s rate limiter to automatically change the user agent of clients who make too many requests</li>
<li>Perhaps using a combination of geo and map, like illustrated here: <a href="https://www.nginx.com/blog/rate-limiting-nginx/">https://www.nginx.com/blog/rate-limiting-nginx/</a></li>
</ul>
<h2 id="2017-11-11">2017-11-11</h2>
<ul>
<li>I was looking at the Google index and noticed there are 4,090 search results for dspace.ilri.org but only seven for mahider.ilri.org</li>
<li>Search with something like: inurl:dspace.ilri.org inurl:https</li>
<li>I want to get rid of those legacy domains eventually!</li>
</ul>
<h2 id="2017-11-12">2017-11-12</h2>
<ul>
<li>Update the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure templates</a> to be a little more modular and flexible</li>
<li>Looking at the top client IPs on CGSpace so far this morning, even though it&rsquo;s only been eight hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;12/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
243 5.83.120.111
335 40.77.167.103
424 66.249.66.91
529 207.46.13.36
554 40.77.167.129
604 207.46.13.53
754 104.196.152.243
883 66.249.66.90
1150 95.108.181.88
1381 5.9.6.51
</code></pre><ul>
<li>5.9.6.51 seems to be a Russian bot:</li>
</ul>
<pre tabindex="0"><code># grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] &#34;GET /handle/10568/16515/recent-submissions HTTP/1.1&#34; 200 5097 &#34;-&#34; &#34;Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#34;
</code></pre><ul>
<li>What&rsquo;s amazing is that it seems to reuse its Java session across all requests:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2017-11-12
1558
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>Bravo to MegaIndex.ru!</li>
<li>The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat&rsquo;s Crawler Session Manager valve regex should match &lsquo;YandexBot&rsquo;:</li>
</ul>
<pre tabindex="0"><code># grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] &#34;GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1&#34; 200 972019 &#34;-&#34; &#34;Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34;
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88&#39; dspace.log.2017-11-12
991
</code></pre><ul>
<li>Move some items and collections on CGSpace for Peter Ballantyne, running <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move_collections.sh</code></a> with the following configuration:</li>
</ul>
<pre tabindex="0"><code>10947/6 10947/1 10568/83389
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
</code></pre><ul>
<li>I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn&rsquo;t seem to respect disallowed URLs in robots.txt</li>
<li>There&rsquo;s an interesting <a href="https://www.nginx.com/blog/rate-limiting-nginx/">blog post from Nginx&rsquo;s team about rate limiting</a> as well as a <a href="https://gist.github.com/arosenhagen/8aaf5d7f94171778c0e9">clever use of mapping with rate limits</a></li>
<li>The solution <a href="https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156">I came up with</a> uses tricks from both of those</li>
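<li>The gist of the technique is to map misbehaving user agents into a rate limit zone and leave everyone else unlimited, roughly like this simplified sketch (the real values are in the Ansible commit linked above):</li>
</ul>
<pre tabindex="0"><code># sketch: an empty key means no limit, Baidu requests get limited per client IP
map $http_user_agent $limit_bots {
    default         &#39;&#39;;
    ~*Baiduspider   $binary_remote_addr;
}

limit_req_zone $limit_bots zone=bots:10m rate=12r/m;

# then inside the server / location block:
limit_req zone=bots burst=5;
</code></pre><ul>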
<li>I deployed the limit on CGSpace and DSpace Test and it seems to work well:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Sun, 12 Nov 2017 16:30:19 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:&#39;Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
Content-Type: text/html
Date: Sun, 12 Nov 2017 16:30:21 GMT
Server: nginx
</code></pre><ul>
<li>The first request works, second is denied with an HTTP 503!</li>
<li>I need to remember to check the Munin graphs for PostgreSQL and JVM next week to see how this affects them</li>
</ul>
<h2 id="2017-11-13">2017-11-13</h2>
<ul>
<li>At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:</li>
</ul>
<pre tabindex="0"><code># zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 200 &#34;
1132
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep &#34;13/Nov/2017&#34; | grep &#34;Baiduspider&#34; | grep -c &#34; 503 &#34;
10105
</code></pre><ul>
<li>Helping Sisay proof 47 records for IITA: <a href="https://dspacetest.cgiar.org/handle/10568/97029">https://dspacetest.cgiar.org/handle/10568/97029</a></li>
<li>From looking at the data in OpenRefine I found:
<ul>
<li>Errors in <code>cg.authorship.types</code></li>
<li>Errors in <code>cg.coverage.country</code> (smart quote in &ldquo;COTE DIVOIRE&rdquo;, &ldquo;HAWAII&rdquo; is not a country)</li>
<li>Whitespace issues in some <code>cg.contributor.affiliation</code></li>
<li>Whitespace issues in some <code>cg.identifier.doi</code> fields and most values are using HTTP instead of HTTPS</li>
<li>Whitespace issues in some <code>dc.contributor.author</code> fields</li>
<li>Issue with invalid <code>dc.date.issued</code> value &ldquo;2011-3&rdquo;</li>
<li>Description fields are poorly copy-pasted</li>
<li>Whitespace issues in <code>dc.description.sponsorship</code></li>
<li>Lots of inconsistency in <code>dc.format.extent</code> (mixed dash style, periods at the end of values)</li>
<li>Whitespace errors in <code>dc.identifier.citation</code></li>
<li>Whitespace errors in <code>dc.subject</code></li>
<li>Whitespace errors in <code>dc.title</code></li>
</ul>
</li>
<li>After uploading and looking at the data in DSpace Test I saw more errors with CRPs, subjects (one item had four copies of all of its subjects, another had a &ldquo;.&rdquo; in it), affiliations, sponsors, etc.</li>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510">ticket about ORCID stuff</a> a few days ago, today I told them that I need to talk to Peter and the partners to see what we would like to do</li>
</ul>
<h2 id="2017-11-14">2017-11-14</h2>
<ul>
<li>Deploy some nginx configuration updates to CGSpace</li>
<li>They had been waiting on a branch for a few months and I think I just forgot about them</li>
<li>I have been running them on DSpace Test for a few days and haven&rsquo;t seen any issues there</li>
<li>Started testing DSpace 6.2 and a few things have changed</li>
<li>Now PostgreSQL needs <code>pgcrypto</code>:</li>
</ul>
<pre tabindex="0"><code>$ psql dspace6
dspace6=# CREATE EXTENSION pgcrypto;
</code></pre><ul>
<li>Also, local settings are no longer in <code>build.properties</code>, they are now in <code>local.cfg</code></li>
<li>I&rsquo;m not sure if we can use separate profiles like we did before with <code>mvn -Denv=blah</code> to use blah.properties</li>
<li>It seems we need to use &ldquo;system properties&rdquo; to override settings, ie: <code>-Ddspace.dir=/Users/aorth/dspace6</code></li>
</ul>
<h2 id="2017-11-15">2017-11-15</h2>
<ul>
<li>Send Adam Hunt an invite to the DSpace Developers network on Yammer</li>
<li>He is the new head of communications at WLE, since Michael left</li>
<li>Merge changes to item view&rsquo;s wording of link metadata (<a href="https://github.com/ilri/DSpace/pull/348">#348</a>)</li>
</ul>
<h2 id="2017-11-17">2017-11-17</h2>
<ul>
<li>Uptime Robot said that CGSpace went down today and I see lots of <code>Timeout waiting for idle object</code> errors in the DSpace logs</li>
<li>I looked in PostgreSQL using <code>SELECT * FROM pg_stat_activity;</code> and saw that there were 73 active connections</li>
<li>After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat</li>
<li>Looking at the REST and XMLUI log files, I don&rsquo;t see anything too crazy:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
13 66.249.66.223
14 207.46.13.36
17 207.46.13.137
22 207.46.13.23
23 66.249.66.221
92 66.249.66.219
187 104.196.152.243
1400 70.32.83.92
1503 50.116.102.77
6037 45.5.184.196
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#34;17/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
325 139.162.247.24
354 66.249.66.223
422 207.46.13.36
434 207.46.13.23
501 207.46.13.137
647 66.249.66.221
662 34.192.116.178
762 213.55.99.121
1867 104.196.152.243
2020 66.249.66.219
</code></pre><ul>
<li>I need to look into using JMX to analyze active sessions I think, rather than looking at log files</li>
<li>After adding appropriate <a href="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">JMX listener options to Tomcat&rsquo;s JAVA_OPTS</a> and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:</li>
</ul>
<pre tabindex="0"><code>$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
</code></pre><ul>
<li>Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc</li>
<li>I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example</li>
<li>If I hit the server with some requests as a normal user I see the session counter increase, but if I specify a bot user agent then the sessions seem to be reused (meaning the Crawler Session Manager is working)</li>
<li>Here is the Jconsole screen after looping <code>http --print Hh https://dspacetest.cgiar.org/handle/10568/1</code> for a few minutes:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/jconsole-sessions.png" alt="Jconsole sessions for XMLUI"></p>
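<ul>
<li>For reference, the JMX listener options I mentioned above are roughly the following added to Tomcat&rsquo;s JAVA_OPTS (a sketch; port 9000 to match the jconsole command above, with auth and SSL disabled for simplicity):</li>
</ul>
<pre tabindex="0"><code>-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9000
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Djava.rmi.server.hostname=localhost
</code></pre>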
<ul>
<li>Switch DSpace Test to using the G1GC for JVM so I can see what the JVM graph looks like eventually, and start evaluating it for production</li>
</ul>
<h2 id="2017-11-19">2017-11-19</h2>
<ul>
<li>Linode sent an alert that CGSpace was using a lot of CPU around 4 to 6 AM</li>
<li>Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;19/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
111 66.249.66.155
171 5.9.6.51
188 54.162.241.40
229 207.46.13.23
233 207.46.13.137
247 40.77.167.6
251 207.46.13.36
275 68.180.229.254
325 104.196.152.243
1610 66.249.66.153
</code></pre><ul>
<li>66.249.66.153 appears to be Googlebot:</li>
</ul>
<pre tabindex="0"><code>66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] &#34;GET /handle/10568/2203 HTTP/1.1&#34; 200 6309 &#34;-&#34; &#34;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34;
</code></pre><ul>
<li>We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity</li>
<li>In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)</li>
</ul>
<pre tabindex="0"><code>$ wc -l dspace.log.2017-11-19
388472 dspace.log.2017-11-19
$ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
267494
</code></pre><ul>
<li>WTF is this process doing every day, and for so many hours?</li>
<li>In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:</li>
</ul>
<pre tabindex="0"><code>2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
</code></pre><ul>
<li>It&rsquo;s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/tomcat-jvm-g1gc.png" alt="Tomcat G1GC"></p>
<h2 id="2017-11-20">2017-11-20</h2>
<ul>
<li>I found <a href="https://www.cakesolutions.net/teamblogs/low-pause-gc-on-the-jvm">an article about JVM tuning</a> that gives some pointers how to enable logging and tools to analyze logs for you</li>
<li>Also notes on <a href="https://blog.gceasy.io/2016/11/15/rotating-gc-log-files/">rotating GC logs</a></li>
<li>I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven&rsquo;t even tried to monitor or tune it</li>
</ul>
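<ul>
<li>For reference, the kind of flags involved would be something like this in Tomcat&rsquo;s JAVA_OPTS (a sketch of the usual HotSpot options for CMS plus rotating GC logs, not necessarily exactly what I deployed):</li>
</ul>
<pre tabindex="0"><code>-XX:+UseConcMarkSweepGC
-Xloggc:/path/to/tomcat/logs/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=10M
</code></pre>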
<h2 id="2017-11-21">2017-11-21</h2>
<ul>
<li>Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:</li>
</ul>
<pre tabindex="0"><code>2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
</code></pre><h2 id="2017-11-22">2017-11-22</h2>
<ul>
<li>Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM</li>
<li>The logs don&rsquo;t show anything particularly abnormal between those hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;22/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
136 31.6.77.23
174 68.180.229.254
217 66.249.66.91
256 157.55.39.79
268 54.144.57.183
281 207.46.13.137
282 207.46.13.36
290 207.46.13.23
696 66.249.66.90
707 104.196.152.243
</code></pre><ul>
<li>I haven&rsquo;t seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org</li>
<li>In other news, it looks like the JVM garbage collection pattern is back to its standard sawtooth pattern after switching back to CMS a few days ago:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/tomcat-jvm-cms.png" alt="Tomcat JVM with CMS GC"></p>
<h2 id="2017-11-23">2017-11-23</h2>
<ul>
<li>Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM</li>
<li>I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
88 66.249.66.91
140 68.180.229.254
155 54.196.2.131
182 54.224.164.166
301 157.55.39.79
315 207.46.13.36
331 207.46.13.23
358 207.46.13.137
565 104.196.152.243
1570 66.249.66.90
</code></pre><ul>
<li>&hellip; and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;23/Nov/2017:0[456]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
5 190.120.6.219
6 104.198.9.108
14 104.196.152.243
21 112.134.150.6
22 157.55.39.79
22 207.46.13.137
23 207.46.13.36
26 207.46.13.23
942 45.5.184.196
3995 70.32.83.92
</code></pre><ul>
<li>These IPs crawling the REST API don&rsquo;t specify user agents and I&rsquo;d assume they are creating many Tomcat sessions</li>
<li>I would catch them in nginx to assign a &ldquo;bot&rdquo; user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don&rsquo;t really seem to create any, at least not in the dspace.log:</li>
</ul>
<pre tabindex="0"><code>$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2
</code></pre><ul>
<li>I&rsquo;m wondering if REST works differently, or just doesn&rsquo;t log these sessions?</li>
<li>I wonder if they are measurable via JMX MBeans?</li>
<li>I did some tests locally and I don&rsquo;t see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI</li>
<li>I came across some interesting PostgreSQL tuning advice for SSDs: <a href="https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0">https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0</a></li>
<li>Apparently setting <code>random_page_cost</code> to 1 is &ldquo;common&rdquo; advice for systems running PostgreSQL on SSD (the default is 4)</li>
<li>So I deployed this on DSpace Test and will check the Munin PostgreSQL graphs in a few days to see if anything changes</li>
</ul>
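<ul>
<li>For reference, this boils down to a one-line change (a sketch; I could set it in <code>postgresql.conf</code> and reload, or use <code>ALTER SYSTEM</code> on PostgreSQL 9.4+):</li>
</ul>
<pre tabindex="0"><code>postgres=# ALTER SYSTEM SET random_page_cost = 1;
postgres=# SELECT pg_reload_conf();
</code></pre>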
<h2 id="2017-11-24">2017-11-24</h2>
<ul>
<li>It&rsquo;s too early to tell for sure, but after I made the <code>random_page_cost</code> change on DSpace Test&rsquo;s PostgreSQL yesterday the number of connections dropped drastically:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/postgres-connections-week.png" alt="PostgreSQL connections after tweak (week)"></p>
<ul>
<li>There have been other temporary drops before, but if I look at the past month and actually the whole year, the trend is that connections are four or five times higher on average:</li>
</ul>
<p><img src="/cgspace-notes/2017/11/postgres-connections-month.png" alt="PostgreSQL connections after tweak (month)"></p>
<ul>
<li>I just realized that we&rsquo;re not logging access requests to other vhosts on CGSpace, so it&rsquo;s possible I have no idea that we&rsquo;re getting slammed at 4AM on another domain that we&rsquo;re just silently redirecting to cgspace.cgiar.org</li>
<li>I&rsquo;ve enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there</li>
<li>In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)</li>
<li>I also noticed that CGNET appears to be monitoring the old domain every few minutes:</li>
</ul>
<pre tabindex="0"><code>192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] &#34;HEAD / HTTP/1.1&#34; 301 0 &#34;-&#34; &#34;curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2&#34;
</code></pre><ul>
<li>I should probably tell CGIAR people to have CGNET stop that</li>
</ul>
<h2 id="2017-11-26">2017-11-26</h2>
<ul>
<li>Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM</li>
<li>Yet another mystery because the load for all domains looks fine at that time:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;26/Nov/2017:0[567]&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 66.249.66.83
195 104.196.152.243
220 40.77.167.82
246 207.46.13.137
247 68.180.229.254
257 157.55.39.214
289 66.249.66.91
298 157.55.39.206
379 66.249.66.70
1855 66.249.66.90
</code></pre><h2 id="2017-11-29">2017-11-29</h2>
<ul>
<li>Linode alerted that CGSpace was using 279% CPU from 6 to 8 AM this morning</li>
<li>About an hour later Uptime Robot said that the server was down</li>
<li>Here are all the top XMLUI and REST users from today:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;29/Nov/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
540 66.249.66.83
659 40.77.167.36
663 157.55.39.214
681 157.55.39.206
733 157.55.39.158
850 66.249.66.70
1311 66.249.66.90
1340 104.196.152.243
4008 70.32.83.92
6053 45.5.184.196
</code></pre><ul>
<li>PostgreSQL activity shows 69 connections</li>
<li>I don&rsquo;t have time to troubleshoot more as I&rsquo;m in Nairobi working on the HPC so I just restarted Tomcat for now</li>
<li>A few hours later Uptime Robot says the server is down again</li>
<li>I don&rsquo;t see much activity in the logs but there are 87 PostgreSQL connections</li>
<li>But shit, there were 10,000 unique Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-29 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
10037
</code></pre><ul>
<li>Although maybe that&rsquo;s not much, as the previous two days had more:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2017-11-27 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
16984
</code></pre><ul>
<li>I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it&rsquo;s the most common source of crashes we have</li>
<li>I will bump DSpace&rsquo;s <code>db.maxconnections</code> from 60 to 90, and PostgreSQL&rsquo;s <code>max_connections</code> from 183 to 273 (which is using my loose formula of 90 * webapps + 3)</li>
<li>I really need to figure out how to get DSpace to use a PostgreSQL connection pool</li>
</ul>
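<ul>
<li>For reference, the changes amount to something like this sketch (the formula seems to count three webapps, since 90 * 3 + 3 = 273 and the old values match too: 60 * 3 + 3 = 183):</li>
</ul>
<pre tabindex="0"><code># in dspace.cfg (per-webapp pool size):
db.maxconnections = 90

# in postgresql.conf:
max_connections = 273
</code></pre>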
<h2 id="2017-11-30">2017-11-30</h2>
<ul>
<li>Linode alerted about high CPU usage on CGSpace again around 6 to 8 AM</li>
<li>Then Uptime Robot said CGSpace was down a few minutes later, but it resolved itself I think (or Tsega restarted Tomcat, I don&rsquo;t know)</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

837
docs/2017-12/index.html Normal file
View File

@ -0,0 +1,837 @@
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="December, 2017" />
<meta property="og:description" content="2017-12-01
Uptime Robot noticed that CGSpace went down
The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-12/" />
<meta property="article:published_time" content="2017-12-01T13:53:54+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2017"/>
<meta name="twitter:description" content="2017-12-01
Uptime Robot noticed that CGSpace went down
The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.112.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "December, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-12/",
"wordCount": "4088",
"datePublished": "2017-12-01T13:53:54+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-12/">
<title>December, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-12/">December, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-12-01T13:53:54+03:00">Fri Dec 01, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2017-12-01">2017-12-01</h2>
<ul>
<li>Uptime Robot noticed that CGSpace went down</li>
<li>The logs say &ldquo;Timeout waiting for idle object&rdquo;</li>
<li>PostgreSQL activity says there are 115 connections currently</li>
<li>The list of connections to XMLUI and REST API for today:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
763 2.86.122.76
907 207.46.13.94
1018 157.55.39.206
1021 157.55.39.235
1407 66.249.66.70
1411 104.196.152.243
1503 50.116.102.77
1805 66.249.66.90
4007 70.32.83.92
6061 45.5.184.196
</code></pre><ul>
<li>The number of DSpace sessions isn&rsquo;t even that high:</li>
</ul>
<pre tabindex="0"><code>$ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
5815
</code></pre><ul>
<li>Connections in the last two hours:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017:(09|10)&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
78 93.160.60.22
101 40.77.167.122
113 66.249.66.70
129 157.55.39.206
130 157.55.39.235
135 40.77.167.58
164 68.180.229.254
177 87.100.118.220
188 66.249.66.90
314 2.86.122.76
</code></pre><ul>
<li>What the fuck is going on?</li>
<li>I&rsquo;ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:</li>
</ul>
<pre tabindex="0"><code>$ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
822
</code></pre><ul>
<li>Appears to be some new bot:</li>
</ul>
<pre tabindex="0"><code>2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] &#34;GET /handle/10568/78444?show=full HTTP/1.1&#34; 200 29307 &#34;-&#34; &#34;Mozilla/3.0 (compatible; Indy Library)&#34;
</code></pre><ul>
<li>I restarted Tomcat and everything came back up</li>
<li>I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx</li>
<li>I will also add &lsquo;Drupal&rsquo; to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots</li>
</ul>
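<ul>
<li>For reference, that amounts to extending the valve&rsquo;s <code>crawlerUserAgents</code> regex in Tomcat&rsquo;s <em>server.xml</em>, something like this sketch (the default regex may differ slightly between Tomcat versions):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Indy Library.*|.*Drupal.*&#34; /&gt;
</code></pre><ul>
<li>Here are the Drupal-based harvesters in today&rsquo;s logs:</li>
</ul>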
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;1/Dec/2017&#34; | grep Drupal | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
3 54.75.205.145
6 70.32.83.92
14 2a01:7e00::f03c:91ff:fe18:7396
46 2001:4b99:1:1:216:3eff:fe2c:dc6c
319 2001:4b99:1:1:216:3eff:fe76:205b
</code></pre><h2 id="2017-12-03">2017-12-03</h2>
<ul>
<li>Linode alerted that CGSpace&rsquo;s load was 327.5% from 6 to 8 AM again</li>
</ul>
<h2 id="2017-12-04">2017-12-04</h2>
<ul>
<li>Linode alerted that CGSpace&rsquo;s load was 255.5% from 8 to 10 AM again</li>
<li>I looked at the Munin stats on DSpace Test (linode02) again to see how the PostgreSQL tweaks from a few weeks ago were holding up:</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-month.png" alt="DSpace Test PostgreSQL connections month"></p>
<ul>
<li>The results look fantastic! So the <code>random_page_cost</code> tweak is massively important for informing the PostgreSQL query planner that there is no &ldquo;cost&rdquo; to accessing random pages, as we&rsquo;re on an SSD!</li>
<li>I guess we could probably even reduce the PostgreSQL connections in DSpace / PostgreSQL after using this</li>
<li>Run system updates on DSpace Test (linode02) and reboot it</li>
<li>I&rsquo;m going to enable the PostgreSQL <code>random_page_cost</code> tweak on CGSpace</li>
<li>For reference, here is the past month&rsquo;s connections:</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-month-cgspace.png" alt="CGSpace PostgreSQL connections month"></p>
<h2 id="2017-12-05">2017-12-05</h2>
<ul>
<li>Linode alerted again that the CPU usage on CGSpace was high this morning from 8 to 10 AM</li>
<li>CORE updated the entry for CGSpace on their index: <a href="https://core.ac.uk/search?q=repositories.id:(1016)&amp;fullTextOnly=false">https://core.ac.uk/search?q=repositories.id:(1016)&amp;fullTextOnly=false</a></li>
<li>Linode alerted again that the CPU usage on CGSpace was high this evening from 8 to 10 PM</li>
</ul>
<h2 id="2017-12-06">2017-12-06</h2>
<ul>
<li>Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM</li>
<li>Uptime Robot alerted that the server went down and up around 8:53 this morning</li>
<li>Uptime Robot alerted that CGSpace was down and up again a few minutes later</li>
<li>I don&rsquo;t see any errors in the DSpace logs but I see in nginx&rsquo;s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)</li>
<li>Looking at the REST API logs I see some new client IP I haven&rsquo;t noticed before:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;6/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
18 95.108.181.88
19 68.180.229.254
30 207.46.13.151
33 207.46.13.110
38 40.77.167.20
41 157.55.39.223
82 104.196.152.243
1529 50.116.102.77
4005 70.32.83.92
6045 45.5.184.196
</code></pre><ul>
<li>50.116.102.77 is apparently in the US on websitewelcome.com</li>
</ul>
<h2 id="2017-12-07">2017-12-07</h2>
<ul>
<li>Uptime Robot reported a few times today that CGSpace was down and then up</li>
<li>At one point Tsega restarted Tomcat</li>
<li>I never got any alerts about high load from Linode though&hellip;</li>
<li>I looked just now and see that there are 121 PostgreSQL connections!</li>
<li>The top users right now are:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;7/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
838 40.77.167.11
939 66.249.66.223
1149 66.249.66.206
1316 207.46.13.110
1322 207.46.13.151
1323 2001:da8:203:2224:c912:1106:d94f:9189
1414 157.55.39.223
2378 104.196.152.243
2662 66.249.66.219
5110 124.17.34.60
</code></pre><ul>
<li>We haven&rsquo;t seen 124.17.34.60 before, but it&rsquo;s really hammering us!</li>
<li>Apparently it is from China, and here is one of its user agents:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
</code></pre><ul>
<li>It is responsible for 4,500 Tomcat sessions today alone:</li>
</ul>
<pre tabindex="0"><code>$ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
4574
</code></pre><ul>
<li>I&rsquo;ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it&rsquo;s the same bot on the same subnet</li>
<li>I was running the DSpace cleanup task manually and it hit an error:</li>
</ul>
<pre tabindex="0"><code>$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
...
Error: ERROR: update or delete on table &#34;bitstream&#34; violates foreign key constraint &#34;bundle_primary_bitstream_id_fkey&#34; on table &#34;bundle&#34;
Detail: Key (bitstream_id)=(144666) is still referenced from table &#34;bundle&#34;.
</code></pre><ul>
<li>The solution is like I discovered in <a href="/cgspace-notes/2017-04">2017-04</a>, to set the <code>primary_bitstream_id</code> to null:</li>
</ul>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
UPDATE 1
</code></pre><h2 id="2017-12-13">2017-12-13</h2>
<ul>
<li>Linode alerted that CGSpace was using high CPU from 10:13 to 12:13 this morning</li>
</ul>
<h2 id="2017-12-16">2017-12-16</h2>
<ul>
<li>Re-work the XMLUI base theme to allow child themes to override the header logo&rsquo;s image and link destination: <a href="https://github.com/ilri/DSpace/pull/349">#349</a></li>
<li>This required a little bit of work to restructure the XSL templates</li>
<li>Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: <a href="https://github.com/ilri/DSpace/pull/350">#350</a></li>
</ul>
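<ul>
<li>For reference, the optimization is basically just running the two tools over the theme&rsquo;s image assets, something like this rough sketch (paths are placeholders, not the exact invocation):</li>
</ul>
<pre tabindex="0"><code>$ find . -iname &#39;*.png&#39; -exec pngquant --ext .png --force {} \;
$ find . -iname &#39;*.svg&#39; -exec svgo {} \;
</code></pre>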
<h2 id="2017-12-17">2017-12-17</h2>
<ul>
<li>Reboot DSpace Test to get new Linode Linux kernel</li>
<li>Looking at CCAFS bulk import for Magdalena Haman (she originally sent them in November but some of the thumbnails were missing and dates were messed up so she resent them now)</li>
<li>A few issues with the data and thumbnails:
<ul>
<li>Her thumbnail files all use capital JPG so I had to rename them to lowercase: <code>rename -fc *.JPG</code></li>
<li>thumbnail20.jpg is 1.7MB so I have to resize it</li>
<li>I also had to add the .jpg to the thumbnail string in the CSV</li>
<li>The thumbnail11.jpg is missing</li>
<li>The dates are in super long ISO8601 format (from Excel?) like <code>2016-02-07T00:00:00Z</code> so I converted them to simpler forms in GREL: <code>value.toString(&quot;yyyy-MM-dd&quot;)</code></li>
<li>I trimmed the whitespaces in a few fields but it wasn&rsquo;t many</li>
<li>Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: <code>value + &quot;__bundle:THUMBNAIL&quot;</code></li>
<li>Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url</li>
<li>Item 4 has weird characters in citation, ie: Nagoya et de Trait</li>
<li>Some author names need normalization, ie: <code>Aggarwal, Pramod</code> and <code>Aggarwal, Pramod K.</code></li>
<li>Something weird going on with duplicate authors that have the same text value, like <code>Berto, Jayson C.</code> and <code>Balmeo, Katherine P.</code></li>
<li>I will send her feedback on some author names like UNEP and ICRISAT and ask her for the missing thumbnail11.jpg</li>
</ul>
</li>
<li>I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the <code>collection</code> field)</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &amp;&gt; /tmp/ccafs.log
</code></pre><ul>
<li>It&rsquo;s the same on DSpace Test, I can&rsquo;t import the SAF bundle without specifying the collection:</li>
</ul>
<pre tabindex="0"><code>$ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
No collections given. Assuming &#39;collections&#39; file inside item directory
Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
Generating mapfile: /tmp/ccafs.map
Processing collections file: collections
Adding item from directory item_1
java.lang.NullPointerException
at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:865)
at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:736)
at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:498)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
java.lang.NullPointerException
Started: 1513521856014
Ended: 1513521858573
Elapsed time: 2 secs (2559 msecs)
</code></pre><ul>
<li>I even tried to debug it by adding verbose logging to the <code>JAVA_OPTS</code>:</li>
</ul>
<pre tabindex="0"><code>-Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
</code></pre><ul>
<li>&hellip; but the error message was the same, just with more INFO noise around it</li>
<li>For now I&rsquo;ll import into a collection in DSpace Test but I&rsquo;m really not sure what&rsquo;s up with this!</li>
<li>Linode alerted that CGSpace was using high CPU from 4 to 6 PM</li>
<li>The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
671 66.249.66.70
885 95.108.181.88
904 157.55.39.96
923 157.55.39.179
1159 207.46.13.107
1184 104.196.152.243
1230 66.249.66.91
1414 68.180.229.254
4137 66.249.66.90
46401 137.108.70.7
</code></pre><ul>
<li>And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;17/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
33 68.180.229.254
48 157.55.39.96
51 157.55.39.179
56 207.46.13.107
102 104.196.152.243
102 66.249.66.90
691 137.108.70.7
1531 50.116.102.77
4014 70.32.83.92
11030 45.5.184.196
</code></pre><ul>
<li>That&rsquo;s probably ok, as I don&rsquo;t think the REST API connections use up a Tomcat session&hellip;</li>
<li>CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API</li>
<li>Regarding the import issue above it seems to be a known issue that has a patch in DSpace 5.7:
<ul>
<li><a href="https://jira.duraspace.org/browse/DS-2633">https://jira.duraspace.org/browse/DS-2633</a></li>
<li><a href="https://jira.duraspace.org/browse/DS-3583">https://jira.duraspace.org/browse/DS-3583</a></li>
</ul>
</li>
<li>We&rsquo;re on DSpace 5.5 but there is a one-word fix to the addItem() function here: <a href="https://github.com/DSpace/DSpace/pull/1731">https://github.com/DSpace/DSpace/pull/1731</a></li>
<li>I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase onto the latest 5.x upstream later</li>
<li>Pull request: <a href="https://github.com/ilri/DSpace/pull/351">#351</a></li>
</ul>
<h2 id="2017-12-18">2017-12-18</h2>
<ul>
<li>Linode alerted this morning that there was high outbound traffic from 6 to 8 AM</li>
<li>The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
190 207.46.13.146
191 197.210.168.174
202 86.101.203.216
268 157.55.39.134
297 66.249.66.91
314 213.55.99.121
402 66.249.66.90
532 68.180.229.254
644 104.196.152.243
32220 137.108.70.7
</code></pre><ul>
<li>On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
7 104.198.9.108
8 185.29.8.111
8 40.77.167.176
9 66.249.66.91
9 68.180.229.254
10 157.55.39.134
15 66.249.66.90
59 104.196.152.243
4014 70.32.83.92
8619 45.5.184.196
</code></pre><ul>
<li>I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: <a href="https://jira.duraspace.org/browse/DS-3551">https://jira.duraspace.org/browse/DS-3551</a></li>
<li>Update text on CGSpace about page to give some tips to developers about using the resources more wisely (<a href="https://github.com/ilri/DSpace/pull/352">#352</a>)</li>
<li>Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM</li>
<li>The REST and OAI API logs look pretty much the same as earlier this morning, but there&rsquo;s a new IP harvesting XMLUI:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;18/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
360 95.108.181.88
477 66.249.66.90
526 86.101.203.216
691 207.46.13.13
698 197.210.168.174
819 207.46.13.146
878 68.180.229.254
1965 104.196.152.243
17701 2.86.72.181
52532 137.108.70.7
</code></pre><ul>
<li>2.86.72.181 appears to be from Greece, and has the following user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/3.0 (compatible; Indy Library)
</code></pre><ul>
<li>Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:</li>
</ul>
<pre tabindex="0"><code>$ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>I guess there&rsquo;s nothing I can do to them for now</li>
<li>In other news, I am curious how many PostgreSQL connection pool errors we&rsquo;ve had in the last month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;Cannot get a connection, pool error Timeout waiting for idle object&#34; dspace.log.2017-1* | grep -v :0
dspace.log.2017-11-07:15695
dspace.log.2017-11-08:135
dspace.log.2017-11-17:1298
dspace.log.2017-11-26:4160
dspace.log.2017-11-28:107
dspace.log.2017-11-29:3972
dspace.log.2017-12-01:1601
dspace.log.2017-12-02:1274
dspace.log.2017-12-07:2769
</code></pre><ul>
<li>I made a small fix to my <code>move-collections.sh</code> script so that it handles the case when a &ldquo;to&rdquo; or &ldquo;from&rdquo; community doesn&rsquo;t exist</li>
<li>The script lives here: <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515">https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515</a></li>
<li>Major reorganization of four of CTA&rsquo;s French collections</li>
<li>Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities</li>
<li>Move collection 10568/51821 from 10568/42212 to 10568/42211</li>
<li>Move collection 10568/51400 from 10568/42214 to 10568/42211</li>
<li>Move collection 10568/56992 from 10568/42216 to 10568/42211</li>
<li>Move collection 10568/42218 from 10568/42217 to 10568/42211</li>
<li>Export CSV of collection 10568/63484 and move items to collection 10568/51400</li>
<li>Export CSV of collection 10568/64403 and move items to collection 10568/56992</li>
<li>Export CSV of collection 10568/56994 and move items to collection 10568/42218</li>
<li>There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs</li>
<li>I had to use OpenRefine to remove all columns from the CSV except <code>id</code> and <code>collection</code>, and then update the <code>collection</code> field for the new mappings</li>
<li>Remove empty sub-communities: 10568/42212, 10568/42214, 10568/42216, 10568/42217</li>
<li>I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections&hellip;</li>
<li>There were 128 PostgreSQL connections at the time&hellip; grrrr.</li>
<li>So I restarted Tomcat 7 and restarted the imports</li>
<li>I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:</li>
</ul>
<pre tabindex="0"><code>$ dspace index-discovery -r 10568/42211
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
</code></pre><ul>
<li>The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!</li>
</ul>
<h2 id="2017-12-19">2017-12-19</h2>
<ul>
<li>Briefly had PostgreSQL connection issues on CGSpace for the millionth time</li>
<li>I&rsquo;m fucking sick of this!</li>
<li>The connection graph on CGSpace shows shit tons of connections idle</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-month-cgspace-2.png" alt="Idle PostgreSQL connections on CGSpace"></p>
<ul>
<li>And I only now just realized that DSpace&rsquo;s <code>db.maxidle</code> parameter is not seconds, but number of idle connections to allow.</li>
<li>So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections!</li>
<li>I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>So I&rsquo;m going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI</li>
<li>I re-deployed the <code>5_x-prod</code> branch on CGSpace, applied all system updates, and restarted the server</li>
<li>Looking through the dspace.log I see this error:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore &#39;statistics-2010&#39;: Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
</code></pre><ul>
<li>I don&rsquo;t have time now to look into this but the Solr sharding has long been an issue!</li>
<li>Looking into using JDBC / JNDI to provide a database pool to DSpace</li>
<li>The <a href="https://wiki.lyrasis.org/display/DSDOC6x/Configuration+Reference">DSpace 6.x configuration docs</a> have more notes about setting up the database pool than the 5.x ones (which actually have none!)</li>
<li>First, I uncomment <code>db.jndi</code> in <em>dspace/config/dspace.cfg</em></li>
<li>Then I create a global <code>Resource</code> in the main Tomcat <em>server.xml</em> (inside <code>GlobalNamingResources</code>):</li>
</ul>
<pre tabindex="0"><code>&lt;Resource name=&#34;jdbc/dspace&#34; auth=&#34;Container&#34; type=&#34;javax.sql.DataSource&#34;
driverClassName=&#34;org.postgresql.Driver&#34;
url=&#34;jdbc:postgresql://localhost:5432/dspace&#34;
username=&#34;dspace&#34;
password=&#34;dspace&#34;
initialSize=&#39;5&#39;
maxActive=&#39;50&#39;
maxIdle=&#39;15&#39;
minIdle=&#39;5&#39;
maxWait=&#39;5000&#39;
validationQuery=&#39;SELECT 1&#39;
testOnBorrow=&#39;true&#39; /&gt;
</code></pre><ul>
<li>Most of the parameters are from comments by Mark Wood about his JNDI setup: <a href="https://jira.duraspace.org/browse/DS-3564">https://jira.duraspace.org/browse/DS-3564</a></li>
<li>Then I add a <code>ResourceLink</code> to each web application context:</li>
</ul>
<pre tabindex="0"><code>&lt;ResourceLink global=&#34;jdbc/dspace&#34; name=&#34;jdbc/dspace&#34; type=&#34;javax.sql.DataSource&#34;/&gt;
</code></pre><ul>
<li>I am not sure why several guides show configuration snippets for <em>server.xml</em> and web application contexts that use a Local and Global jdbc&hellip;</li>
<li>When DSpace can&rsquo;t find the JNDI context (for whatever reason) you will see this in the dspace logs:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1414)
at org.dspace.storage.rdbms.DatabaseManager.initialize(DatabaseManager.java:1331)
at org.dspace.storage.rdbms.DatabaseManager.getDataSource(DatabaseManager.java:648)
at org.dspace.storage.rdbms.DatabaseManager.getConnection(DatabaseManager.java:627)
at org.dspace.core.Context.init(Context.java:121)
at org.dspace.core.Context.&lt;init&gt;(Context.java:95)
at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:79)
at org.dspace.app.util.DSpaceContextListener.contextInitialized(DSpaceContextListener.java:128)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:5110)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5633)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:145)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:1015)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:991)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:652)
at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:712)
at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:2002)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2017-12-19 13:12:08,798 INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
2017-12-19 13:12:08,798 INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
</code></pre><ul>
<li>And indeed the Catalina logs show that it failed to set up the JDBC driver:</li>
</ul>
<pre tabindex="0"><code>org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class &#39;org.postgresql.Driver&#39;
</code></pre><ul>
<li>There are several copies of the PostgreSQL driver installed by DSpace:</li>
</ul>
<pre tabindex="0"><code>$ find ~/dspace/ -iname &#34;postgresql*jdbc*.jar&#34;
/Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/webapps/rest/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
/Users/aorth/dspace/lib/postgresql-9.1-901-1.jdbc4.jar
</code></pre><ul>
<li>These apparently come from the main DSpace <code>pom.xml</code>:</li>
</ul>
<pre tabindex="0"><code>&lt;dependency&gt;
&lt;groupId&gt;postgresql&lt;/groupId&gt;
&lt;artifactId&gt;postgresql&lt;/artifactId&gt;
&lt;version&gt;9.1-901-1.jdbc4&lt;/version&gt;
&lt;/dependency&gt;
</code></pre><ul>
<li>So WTF? Let&rsquo;s try copying one to Tomcat&rsquo;s lib folder and restarting Tomcat:</li>
</ul>
<pre tabindex="0"><code>$ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
</code></pre><ul>
<li>Oh that&rsquo;s fantastic, now at least Tomcat doesn&rsquo;t print an error during startup so I guess it succeeds in creating the JNDI pool</li>
<li>DSpace starts up but I have no idea if it&rsquo;s using the JNDI configuration because I see this in the logs:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 13:26:54,271 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS is &#39;{}&#39;PostgreSQL
2017-12-19 13:26:54,277 INFO org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is &#39;{}&#39;9.5.10
2017-12-19 13:26:54,293 INFO org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
2017-12-19 13:26:54,306 INFO org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
</code></pre><ul>
<li>Let&rsquo;s try again, but this time explicitly blank the PostgreSQL connection parameters in dspace.cfg and see if DSpace starts&hellip;</li>
<li>Wow, ok, that works, but having to copy the PostgreSQL JDBC JAR to Tomcat&rsquo;s lib folder totally blows</li>
<li>Also, it&rsquo;s likely this is only a problem on my local macOS + Tomcat test environment</li>
<li>Ubuntu&rsquo;s Tomcat distribution will probably handle this differently</li>
<li>So for reference I have:
<ul>
<li>a <code>&lt;Resource&gt;</code> defined globally in server.xml</li>
<li>a <code>&lt;ResourceLink&gt;</code> defined in each web application&rsquo;s context XML</li>
<li>unset the <code>db.url</code>, <code>db.username</code>, and <code>db.password</code> parameters in dspace.cfg</li>
<li>set the <code>db.jndi</code> in dspace.cfg to the name specified in the web application context</li>
</ul>
</li>
<li>After adding the <code>Resource</code> to <em>server.xml</em> on Ubuntu I get this in Catalina&rsquo;s logs:</li>
</ul>
<pre tabindex="0"><code>SEVERE: Unable to create initial connections of pool.
java.sql.SQLException: org.postgresql.Driver
...
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
</code></pre><ul>
<li>The username and password are correct, but maybe I need to copy the fucking lib there too?</li>
<li>I tried installing Ubuntu&rsquo;s <code>libpostgresql-jdbc-java</code> package but Tomcat still can&rsquo;t find the class</li>
<li>Let me try to symlink the lib into Tomcat&rsquo;s libs:</li>
</ul>
<pre tabindex="0"><code># ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
</code></pre><ul>
<li>Now Tomcat starts but the localhost container has errors:</li>
</ul>
<pre tabindex="0"><code>SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
</code></pre><ul>
<li>Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace&rsquo;s are 9.1&hellip;</li>
<li>Let me try to remove it and copy in DSpace&rsquo;s:</li>
</ul>
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql.jar
# cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
</code></pre><ul>
<li>Wow, I think that actually works&hellip;</li>
<li>I wonder if I could get the JDBC driver from postgresql.org instead of relying on the one from the DSpace build: <a href="https://jdbc.postgresql.org/">https://jdbc.postgresql.org/</a></li>
<li>I notice our version is 9.1-901, which isn&rsquo;t even available anymore! The latest in the archived versions is 9.1-903</li>
<li>Also, since I commented out all the db parameters in DSpace.cfg, how does the command line <code>dspace</code> tool work?</li>
<li>Let&rsquo;s try the upstream JDBC driver first:</li>
</ul>
<pre tabindex="0"><code># rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
# wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
</code></pre><ul>
<li>DSpace command line fails unless db settings are present in dspace.cfg:</li>
</ul>
<pre tabindex="0"><code>$ dspace database info
Caught exception:
java.sql.SQLException: java.lang.ClassNotFoundException:
at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1438)
at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:41)
... 8 more
</code></pre><ul>
<li>And in the logs:</li>
</ul>
<pre tabindex="0"><code>2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file: java.naming.factory.initial
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
at javax.naming.InitialContext.getURLOrDefaultInitCtx(InitialContext.java:350)
at javax.naming.InitialContext.lookup(InitialContext.java:417)
at org.dspace.storage.rdbms.DatabaseManager.initDataSource(DatabaseManager.java:1413)
at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2017-12-19 18:26:56,983 INFO org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspace
2017-12-19 18:26:56,983 INFO org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
2017-12-19 18:26:56,992 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxconnections
2017-12-19 18:26:56,992 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxwait
2017-12-19 18:26:56,993 WARN org.dspace.core.ConfigurationManager @ Warning: Number format error in property: db.maxidle
</code></pre><ul>
<li>If I add the db values back to dspace.cfg the <code>dspace database info</code> command succeeds but the log still shows errors retrieving the JNDI connection</li>
<li>Perhaps something to report to the dspace-tech mailing list when I finally send my comments</li>
<li>Oh cool! <code>select * from pg_stat_activity</code> shows &ldquo;PostgreSQL JDBC Driver&rdquo; for the application name! That&rsquo;s how you know it&rsquo;s working!</li>
<li>If you monitor the <code>pg_stat_activity</code> while you run <code>dspace database info</code> you can see that it doesn&rsquo;t use the JNDI and creates ~9 extra PostgreSQL connections!</li>
<li>And in the middle of all of this Linode sends an alert that CGSpace has high CPU usage from 2 to 4 PM</li>
</ul>
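<ul>
<li>For reference, a quick sketch of a query to see which connections are coming from where, grouped by application name:</li>
</ul>
<pre tabindex="0"><code>$ psql -c &#39;SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;&#39;
</code></pre>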
<h2 id="2017-12-20">2017-12-20</h2>
<ul>
<li>The database connection pooling is definitely better!</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-week-dspacetest.png" alt="PostgreSQL connection pooling on DSpace Test"></p>
<ul>
<li>Now there is only one set of idle connections shared among all the web applications, instead of 10+ per application</li>
<li>There are short bursts of connections up to 10, but it generally stays around 5</li>
<li>Test and import 13 records to CGSpace for Abenet:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS=&#34;-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1&#34;
$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
</code></pre><ul>
<li>The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.</li>
<li>Since I had to restart Tomcat anyways, I decided to just deploy the new JNDI connection pooling stuff on CGSpace</li>
<li>There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7</li>
<li>After that CGSpace came up fine and I was able to import the 13 items just fine:</li>
</ul>
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &amp;&gt; systemoffice.log
$ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
</code></pre><ul>
<li>The final code for the JNDI work in the Ansible infrastructure scripts is here: <a href="https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b">https://github.com/ilri/rmg-ansible-public/commit/1959d9cb7a0e7a7318c77f769253e5e029bdfa3b</a></li>
</ul>
<h2 id="2017-12-24">2017-12-24</h2>
<ul>
<li>Linode alerted that CGSpace was using high CPU this morning around 6 AM</li>
<li>I&rsquo;m playing with reading all of a month&rsquo;s nginx logs into goaccess:</li>
</ul>
<pre tabindex="0"><code># find /var/log/nginx -type f -newermt &#34;2017-12-01&#34; | xargs zcat --force | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I can see interesting things using this approach, for example:
<ul>
<li>50.116.102.77 checked our status almost 40,000 times so far this month—I think it&rsquo;s the CGNet uptime tool</li>
<li>Also, we&rsquo;ve handled 2.9 million requests this month from 172,000 unique IP addresses!</li>
<li>Total bandwidth so far this month is 640GiB</li>
<li>The user that made the most requests so far this month is 45.5.184.196 (267,000 requests)</li>
</ul>
</li>
</ul>
<h2 id="2017-12-25">2017-12-25</h2>
<ul>
<li>The PostgreSQL connection pooling is much better when using the Tomcat JNDI pool</li>
<li>Here are the Munin stats for the past week on CGSpace:</li>
</ul>
<p><img src="/cgspace-notes/2017/12/postgres-connections-cgspace.png" alt="CGSpace PostgreSQL connections week"></p>
<h2 id="2017-12-29">2017-12-29</h2>
<ul>
<li>Looking at some old notes for metadata to clean up, I found a few hundred corrections in <code>cg.fulltextstatus</code> and <code>dc.language.iso</code>:</li>
</ul>
<pre tabindex="0"><code># update metadatavalue set text_value=&#39;Formally Published&#39; where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;Formally published&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like &#39;NO&#39;;
DELETE 17
# update metadatavalue set text_value=&#39;en&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(En|English)&#39;;
UPDATE 49
# update metadatavalue set text_value=&#39;fr&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(fre|frn|French)&#39;;
UPDATE 4
# update metadatavalue set text_value=&#39;es&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(Spanish|spa)&#39;;
UPDATE 16
# update metadatavalue set text_value=&#39;vi&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Vietnamese&#39;;
UPDATE 9
# update metadatavalue set text_value=&#39;ru&#39; where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;Ru&#39;;
UPDATE 1
# update metadatavalue set text_value=&#39;in&#39; where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(IN|In)&#39;;
UPDATE 5
# delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ &#39;(dc.language.iso|CGIAR Challenge Program on Water and Food)&#39;;
DELETE 20
</code></pre><ul>
<li>I need to figure out why we have records with language <code>in</code> because that&rsquo;s not a language!</li>
</ul>
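<ul>
<li>A quick way to find the affected items later would be something like this (a sketch using the same <code>metadata_field_id</code> as the updates above):</li>
</ul>
<pre tabindex="0"><code># select resource_id, text_value from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value=&#39;in&#39;;
</code></pre>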
<h2 id="2017-12-30">2017-12-30</h2>
<ul>
<li>Linode alerted that CGSpace was using 259% CPU from 4 to 6 AM</li>
<li>Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM</li>
<li>Here&rsquo;s the XMLUI logs:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E &#34;30/Dec/2017&#34; | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h | tail
637 207.46.13.106
641 157.55.39.186
715 68.180.229.254
924 104.196.152.243
1012 66.249.64.95
1060 216.244.66.245
1120 54.175.208.220
1287 66.249.64.93
1586 66.249.64.78
3653 66.249.64.91
</code></pre><ul>
<li>Looks pretty normal actually, but I don&rsquo;t know who 54.175.208.220 is</li>
<li>They identify as &ldquo;com.plumanalytics&rdquo;, which Google says is associated with Elsevier</li>
<li>They only seem to have used one Tomcat session, so that&rsquo;s good; I guess I don&rsquo;t need to add them to the Tomcat Crawler Session Manager valve (a config sketch is below in case we ever need it):</li>
</ul>
<pre tabindex="0"><code>$ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1
</code></pre><ul>
<li>216.244.66.245 seems to be moz.com&rsquo;s DotBot</li>
</ul>
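<ul>
<li>For reference, a sketch of how a pattern could be added to the Crawler Session Manager Valve in Tomcat&rsquo;s <code>server.xml</code>, assuming their requests contain &ldquo;plumanalytics&rdquo; in the user agent (the exact string would need to be confirmed from the logs):</li>
</ul>
<pre tabindex="0"><code>&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
       crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*plumanalytics.*&#34; /&gt;
</code></pre>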
<h2 id="2017-12-31">2017-12-31</h2>
<ul>
<li>I finished working on the 42 records for CCAFS after Magdalena sent the remaining corrections</li>
<li>After that I uploaded them to CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &amp;&gt; ccafs.log
</code></pre>
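<ul>
<li>If anything had gone wrong I could theoretically remove them again using the map file (a sketch, assuming the standard behavior of DSpace&rsquo;s import tool):</li>
</ul>
<pre tabindex="0"><code>$ dspace import -d -e aorth@mjanja.ch -m ccafs.map
</code></pre>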
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
<li><a href="/cgspace-notes/2023-05/">May, 2023</a></li>
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>

BIN docs/2017/02/cpu-week.png (new binary file, 32 KiB, not shown)
BIN docs/2017/04/cplace.png (new binary file, 84 KiB, not shown)
BIN docs/2017/04/dc-rights.png (new binary file, 20 KiB, not shown)
BIN docs/2017/11/add-author.png (new binary file, 156 KiB, not shown)
Some files were not shown because too many files have changed in this diff