<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="February, 2018" />
<meta property="og:description" content="2018-02-01
Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-02/" />
<meta property="article:published_time" content="2018-02-01T16:28:54&#43;02:00"/>
<meta property="article:modified_time" content="2018-02-15T14:00:34&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2018"/>
<meta name="twitter:description" content="2018-02-01
Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
<meta name="generator" content="Hugo 0.36" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "February, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-02/",
"wordCount": "3674",
"datePublished": "2018-02-01T16:28:54&#43;02:00",
"dateModified": "2018-02-15T14:00:34&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-02/">
<title>February, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-HjEPigLMLBzVQsUi6JWp9tmxJtBimdClDBxwZrwZR&#43;VE3s11/PtFYOrLClxIv2SG" crossorigin="anonymous">
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-02/">February, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-02-01T16:28:54&#43;02:00">Thu Feb 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-02-01">2018-02-01</h2>
<ul>
<li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li>
<li>We don&rsquo;t need to distinguish between internal and external works, so that makes it just a simple list</li>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu&rsquo;s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="/cgspace-notes/2018-01/">in 2018-01</a></li>
</ul>
<p></p>
<p><img src="/cgspace-notes/2018/02/jmx_dspace_sessions-day.png" alt="DSpace Sessions" /></p>
<ul>
<li>Run all system updates and reboot DSpace Test</li>
<li>Wow, I packaged up the <code>jmx_dspace_sessions</code> stuff in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> and deployed it on CGSpace and it totally works:</li>
</ul>
<pre><code># munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
</code></pre>
<h2 id="2018-02-03">2018-02-03</h2>
<ul>
<li>Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566</a></li>
<li>We will close that ticket for now and wait for the 5.8 stuff: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560</a></li>
<li>I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January</li>
<li>After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:</li>
</ul>
<pre><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>Then I started a full Discovery reindex:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 96m39.823s
user 14m10.975s
sys 2m29.088s
</code></pre>
<ul>
<li>Generate a new list of affiliations for Peter to sort through:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
</code></pre>
<ul>
<li>Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in <a href="/cgspace-notes/2017-12/">December</a>:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2018&quot;
3126109
real 0m23.839s
user 0m27.225s
sys 0m1.905s
</code></pre>
<h2 id="2018-02-05">2018-02-05</h2>
<ul>
<li>Toying with correcting authors with trailing spaces via PostgreSQL:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
</code></pre>
<ul>
<li>I tried the <code>TRIM(TRAILING from text_value)</code> function and it said it changed 20 items but the spaces didn&rsquo;t go away</li>
<li>This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.</li>
<li>Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
</code></pre>
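<ul>
<li>A possible explanation for the <code>TRIM(TRAILING from text_value)</code> result above, assuming the trailing characters were tabs, newlines, or similar rather than plain spaces: with no character list <code>TRIM</code> only removes spaces, while <code>\s</code> in the regular expression matches other whitespace too. A minimal Python sketch of the difference (the sample name and the function names are made up for illustration):</li>
</ul>
<pre><code>import re


def strip_trailing_spaces_only(value):
    # Mirrors TRIM(TRAILING FROM ...) with no character list: only U+0020 is removed
    return value.rstrip(" ")


def strip_trailing_whitespace(value):
    # Mirrors REGEXP_REPLACE(text_value, '\s+$', ''): any trailing whitespace is removed
    return re.sub(r"\s+$", "", value)


name = "Orth, Alan\t"  # hypothetical value ending in a tab, not a space
print(repr(strip_trailing_spaces_only(name)))  # the tab survives
print(repr(strip_trailing_whitespace(name)))   # the tab is gone
</code></pre>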
<h2 id="2018-02-06">2018-02-06</h2>
<ul>
<li>UptimeRobot says CGSpace is down this morning around 9:15</li>
<li>I see 308 PostgreSQL connections in <code>pg_stat_activity</code></li>
<li>The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:</li>
</ul>
<pre><code># date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;6/Feb/2018:(08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 223.185.41.40
2 66.249.64.14
2 77.246.52.40
4 157.55.39.82
4 193.205.105.8
5 207.46.13.63
5 207.46.13.64
6 154.68.16.34
7 207.46.13.66
1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E &quot;6/Feb/2018:(08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
77 213.55.99.121
86 66.249.64.14
101 104.196.152.243
103 207.46.13.64
118 157.55.39.82
133 207.46.13.66
136 207.46.13.63
156 68.180.228.157
295 197.210.168.174
752 144.76.64.79
</code></pre>
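<ul>
<li>The <code>grep | awk | sort | uniq -c</code> pipeline above could also be sketched in Python. This assumes access log lines that start with the client address and carry a timestamp like <code>[06/Feb/2018:08:15:01 +0000]</code>; the function name and parameters are illustrative only:</li>
</ul>
<pre><code>import re
from collections import Counter


def top_clients(lines, day, hours, limit=10):
    """Count requests per client address for the given day and hours."""
    # Anchor on the opening bracket so that "06/Feb/2018" cannot match "16/Feb/2018"
    stamp = re.compile(r"\[" + re.escape(day) + ":(?:" + "|".join(hours) + "):")
    counts = Counter()
    for line in lines:
        if stamp.search(line):
            # Assumes the client address is the first space-separated field
            counts[line.split(" ", 1)[0]] += 1
    # Busiest clients first, the reverse of "sort -n | tail"
    return counts.most_common(limit)
</code></pre>
<ul>
<li>Something like <code>top_clients(open("/var/log/nginx/rest.log"), "06/Feb/2018", ["08", "09"])</code> would then give the busiest addresses for those two hours. Lines from <code>error.log</code> do not start with a client address, so they would need separate handling.</li>
</ul>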
<ul>
<li>I did notice in <code>/var/log/tomcat7/catalina.out</code> that Atmire&rsquo;s update thing was running though</li>
<li>So I restarted Tomcat and now everything is fine</li>
<li>Next time I see that many database connections I need to save the output so I can analyze it later</li>
<li>I&rsquo;m going to re-schedule the taskUpdateSolrStatsMetadata task as <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">Bram detailed in ticket 566</a> to see if it makes CGSpace stop crashing every morning</li>
<li>If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or start crashing ~12 hours later</li>
<li>Atmire has said that there will eventually be a fix for this high load caused by their script, but it will come with the 5.8 compatibility work they are already doing</li>
<li>I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server</li>
<li>Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called &lsquo;dspaceCli&rsquo; so that I can distinguish it in <code>pg_stat_activity</code></li>
<li>I implemented some changes to the pooling in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> so that each DSpace web application can use its own pool (web, api, and solr)</li>
<li>Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down</li>
<li>Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine</li>
<li>I&rsquo;m not actually sure if the Solr web application uses the database though, so I&rsquo;ll have to check later and remove it if necessary</li>
<li>I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week</li>
</ul>
<h2 id="2018-02-07">2018-02-07</h2>
<ul>
<li>Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace</li>
<li>I tried on DSpace Test and indeed the lookup just doesn&rsquo;t work!</li>
<li>The ORCiD code in DSpace appears to be using <code>http://pub.orcid.org/</code>, but when I go there in the browser it redirects me to <code>https://pub.orcid.org/v2.0/</code></li>
<li>According to <a href="https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk">the announcement</a> the v1 API was moved from <code>http://pub.orcid.org/</code> to <code>https://pub.orcid.org/v1.2</code> until March 1st when it will be discontinued for good</li>
<li>But the old URL is hard coded in DSpace and it doesn&rsquo;t work anyways, because it currently redirects you to <code>https://pub.orcid.org/v2.0/v1.2</code></li>
<li>So I guess we have to disable that shit once and for all and switch to a controlled vocabulary</li>
<li>CGSpace crashed again, this time around <code>Wed Feb 7 11:20:28 UTC 2018</code></li>
<li>I took a few snapshots of the PostgreSQL activity at the time; the number of connections was very high at first but dropped on its own as the minutes went on:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' &gt; /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6
</code></pre>
<ul>
<li>Interestingly, all of those 751 connections were idle!</li>
</ul>
<pre><code>$ grep &quot;PostgreSQL JDBC&quot; /tmp/pg_stat_activity* | grep -c idle
751
</code></pre>
<ul>
<li>Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps</li>
<li>Looking at the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):</li>
</ul>
<p><img src="/cgspace-notes/2018/02/jmx_dspace-sessions-day.png" alt="DSpace Sessions" /></p>
<ul>
<li>Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
</code></pre>
<ul>
<li>CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)</li>
<li>What&rsquo;s interesting is that the DSpace log says the connections are all busy:</li>
</ul>
<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre>
<ul>
<li>&hellip; but in PostgreSQL I see them <code>idle</code> or <code>idle in transaction</code>:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c &quot;idle in transaction&quot;
187
</code></pre>
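<ul>
<li>Note that <code>grep -c idle</code> also matches <code>idle in transaction</code>, so the 250 above includes the 187. A rough Python sketch that separates the two states per pool, assuming the text output of <code>select * from pg_stat_activity</code> with one connection per line (the function name is made up, and a query whose text happens to contain the word idle would be miscounted):</li>
</ul>
<pre><code>import re
from collections import Counter

POOL = re.compile(r"dspace(?:Web|Api|Cli)")


def pool_states(lines):
    """Count pg_stat_activity rows per (pool name, connection state)."""
    counts = Counter()
    for line in lines:
        pool = POOL.search(line)
        if pool is None:
            continue
        # Check the longer state first because it contains the word "idle"
        if "idle in transaction" in line:
            state = "idle in transaction"
        elif re.search(r"\bidle\b", line):
            state = "idle"
        else:
            state = "other"
        counts[(pool.group(0), state)] += 1
    return counts
</code></pre>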
<ul>
<li>What the fuck, does DSpace think all connections are busy?</li>
<li>I suspect these are issues with abandoned connections or maybe a leak, so I&rsquo;m going to try adding the <code>removeAbandoned='true'</code> parameter which is apparently off by default</li>
<li>I will try <code>testOnReturn='true'</code> too, just to add more validation, because I&rsquo;m fucking grasping at straws</li>
<li>Also, WTF, there was a heap space error randomly in catalina.out:</li>
</ul>
<pre><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread &quot;http-bio-127.0.0.1-8081-exec-58&quot; java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>I&rsquo;m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!</li>
<li>Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:</li>
</ul>
<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
34 ip_addr=46.229.168.67
34 ip_addr=46.229.168.73
37 ip_addr=46.229.168.76
40 ip_addr=34.232.65.41
41 ip_addr=46.229.168.71
44 ip_addr=197.210.168.174
55 ip_addr=181.137.2.214
55 ip_addr=213.55.99.121
58 ip_addr=46.229.168.65
64 ip_addr=66.249.66.91
67 ip_addr=66.249.66.90
71 ip_addr=207.46.13.54
78 ip_addr=130.82.1.40
104 ip_addr=40.77.167.36
151 ip_addr=68.180.228.157
174 ip_addr=207.46.13.135
194 ip_addr=54.83.138.123
198 ip_addr=40.77.167.62
210 ip_addr=207.46.13.71
214 ip_addr=104.196.152.243
</code></pre>
<ul>
<li>These IPs made thousands of sessions today:</li>
</ul>
<pre><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
992
</code></pre>
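<ul>
<li>A sketch of counting sessions per address in one pass, including IPv6. It assumes DSpace log lines contain <code>session_id=...</code> and <code>ip_addr=...</code> as colon-separated fields, as in the greps above, and uses a heuristic: take the longest colon-separated prefix after <code>ip_addr=</code> that Python&rsquo;s <code>ipaddress</code> module accepts. The function names are made up, and a following field made only of hex digits would be absorbed into an IPv6 address:</li>
</ul>
<pre><code>import ipaddress
import re
from collections import defaultdict

SESSION = re.compile(r"session_id=([A-Z0-9]{32})")


def extract_ip(line):
    """Return the address after ip_addr= as a string, or None."""
    _, marker, rest = line.partition("ip_addr=")
    if not marker:
        return None
    tokens = rest.split(":")
    # An IPv6 address has at most 8 colon-separated groups; try the longest prefix first
    for size in range(min(8, len(tokens)), 0, -1):
        candidate = ":".join(tokens[:size])
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            continue
    return None


def sessions_per_ip(lines):
    """Map each client address to its number of distinct session IDs."""
    sessions = defaultdict(set)
    for line in lines:
        ip = extract_ip(line.strip())
        match = SESSION.search(line)
        if ip and match:
            sessions[ip].add(match.group(1))
    return {ip: len(ids) for ip, ids in sessions.items()}
</code></pre>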
<ul>
<li>Let&rsquo;s investigate who these IPs belong to:
<ul>
<li>104.196.152.243 is CIAT, which is already marked as a bot via nginx!</li>
<li>207.46.13.71 is Bing, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>40.77.167.62 is Bing, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>207.46.13.135 is Bing, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>40.77.167.36 is Bing, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>207.46.13.54 is Bing, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
<li>46.229.168.x is Semrush, which is already marked as a bot in Tomcat&rsquo;s Crawler Session Manager Valve!</li>
</ul></li>
<li>Nice, so these are all known bots that are already crammed into one session by Tomcat&rsquo;s Crawler Session Manager Valve.</li>
<li>What in the actual fuck, why is our load doing this? It&rsquo;s gotta be something fucked up with the database pool being &ldquo;busy&rdquo; but everything is fucking idle</li>
<li>One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:</li>
</ul>
<pre><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
</code></pre>
<ul>
<li>This one makes two thousand requests per day or so recently:</li>
</ul>
<pre><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
</code></pre>
<ul>
<li>And they have 30 IPs, so fuck that shit I&rsquo;m going to add them to the Tomcat Crawler Session Manager Valve nowwww</li>
<li>Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace</li>
<li>Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker</li>
<li>This is how the connections looked when it crashed this afternoon:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
290 dspaceWeb
</code></pre>
<ul>
<li>This is how it is right now:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
5 dspaceWeb
</code></pre>
<ul>
<li>So is this just some fucked up XMLUI database leaking?</li>
<li>I notice there is an issue (that I&rsquo;ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: <a href="https://jira.duraspace.org/browse/DS-3551">https://jira.duraspace.org/browse/DS-3551</a></li>
<li>I seriously doubt this leaking shit is fixed for sure, but I&rsquo;m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I&rsquo;m fed up with this shit</li>
<li>I cherry-picked all the commits for DS-3551 but it won&rsquo;t build on our current DSpace 5.5!</li>
<li>I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle</li>
</ul>
<h2 id="2018-02-10">2018-02-10</h2>
<ul>
<li>I tried to disable ORCID lookups but keep the existing authorities</li>
<li>This item has an ORCID for Ralf Kiese: <a href="http://localhost:8080/handle/10568/89897">http://localhost:8080/handle/10568/89897</a></li>
<li>Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn&rsquo;t show up on the item</li>
<li>Leave all settings but change choices.presentation to lookup: the ORCID badge is there, but item submission uses LC Name Authority and it breaks with this error:</li>
</ul>
<pre><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
</code></pre>
<ul>
<li>If I change choices.presentation to suggest it gives this error:</li>
</ul>
<pre><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
</code></pre>
<ul>
<li>So I don&rsquo;t think we can disable the ORCID lookup function and keep the ORCID badges</li>
</ul>
<h2 id="2018-02-11">2018-02-11</h2>
<ul>
<li>Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: <a href="https://cgspace.cgiar.org/handle/10568/90735">10568/90735</a></li>
</ul>
<p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.pdf.jpg" alt="Weird thumbnail" /></p>
<ul>
<li>I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:</li>
</ul>
<pre><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
</code></pre>
<p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.jpg" alt="Manual thumbnail" /></p>
<ul>
<li>Peter sent me corrected author names last week but the file encoding is messed up:</li>
</ul>
<pre><code>$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
</code></pre>
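<ul>
<li>The same kind of check can be sketched in Python without <code>isutf8</code>. The function name is made up. As a guess only: the byte 0xE8 is &egrave; in Latin-1 and falls in the E1 to EC range named in the message above, which would be consistent with the file having been saved as Latin-1 or Windows-1252:</li>
</ul>
<pre><code>def invalid_utf8_lines(data):
    """Return the 1-based numbers of lines in data (bytes) that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside a multi-byte sequence
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad
</code></pre>
<ul>
<li>For example <code>invalid_utf8_lines(open("authors-2018-02-05.csv", "rb").read())</code> would list every offending line rather than only the first one.</li>
</ul>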
<ul>
<li>The <code>isutf8</code> program comes from <code>moreutils</code></li>
<li>Line 100 contains: Galiè, Alessandra</li>
<li>In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use <code>pip install psycopg2-binary</code></li>
<li>See: <a href="http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/">http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/</a></li>
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre>
<ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan&rsquo;s names in December</li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&rdquo;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 | -1
Duncan, Alan J. | | -1
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | -1
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(8 rows)
dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(1 row)
dspace=# commit;
</code></pre>
<ul>
<li>Run all system updates on DSpace Test (linode02) and reboot it</li>
<li>I wrote a Python script (<a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b"><code>resolve-orcids-from-solr.py</code></a>) using SolrClient to parse the Solr authority cache for ORCID IDs</li>
<li>We currently have 1562 authority records with ORCID IDs, and 624 unique IDs</li>
<li>We can use this to build a controlled vocabulary of ORCID IDs for new item submissions</li>
<li>I don&rsquo;t know how to add ORCID IDs to existing items yet&hellip; some more querying of PostgreSQL for authority values perhaps?</li>
<li>I added the script to the <a href="https://github.com/ilri/DSpace/wiki/Scripts">ILRI DSpace wiki on GitHub</a></li>
</ul>
<h2 id="2018-02-12">2018-02-12</h2>
<ul>
<li>Follow up with Atmire on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 Compatibility ticket</a> to ask again if they want me to send them a DSpace 5.8 branch to work on</li>
<li>Abenet asked if there was a way to get the number of submissions she and Bizuwork did</li>
<li>I said that the Atmire Workflow Statistics module was supposed to be able to do that</li>
<li>We had tried it in <a href="/cgspace-notes/2017-06/">June, 2017</a> and found that it didn&rsquo;t work</li>
<li>Atmire sent us some fixes but they didn&rsquo;t work either</li>
<li>I just tried the branch with the fixes again and it indeed does not work:</li>
</ul>
<p><img src="/cgspace-notes/2018/02/atmire-workflow-statistics.png" alt="Atmire Workflow Statistics No Data Available" /></p>
<ul>
<li>I see that in <a href="/cgspace-notes/2017-04/">April, 2017</a> I just used a SQL query to get a user&rsquo;s submissions by checking the <code>dc.description.provenance</code> field</li>
<li>So for Abenet, I can check her submissions in December, 2017 with:</li>
</ul>
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
</code></pre>
<ul>
<li>I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it</li>
<li>This would be using <a href="https://www.linode.com/blockstorage">Linode&rsquo;s new block storage volumes</a></li>
<li>I think our current $40/month Linode has enough CPU and memory capacity, but we need more disk space</li>
<li>I think I&rsquo;d probably just attach the block storage volume and mount it on /home/dspace</li>
<li>Ask Peter about <code>dc.rights</code> on DSpace Test again, if he likes it then we should move it to CGSpace soon</li>
</ul>
<h2 id="2018-02-13">2018-02-13</h2>
<ul>
<li>Peter said he was getting a &ldquo;socket closed&rdquo; error on CGSpace</li>
<li>I looked in the dspace.log.2018-02-13 and saw one recent one:</li>
</ul>
<pre><code>2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
...
Caused by: java.net.SocketException: Socket closed
</code></pre>
<ul>
<li>Could be because of the <code>removeAbandoned=&quot;true&quot;</code> that I enabled in the JDBC connection pool last week?</li>
</ul>
<pre><code>$ grep -c &quot;java.net.SocketException: Socket closed&quot; dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
dspace.log.2018-02-04:0
dspace.log.2018-02-05:0
dspace.log.2018-02-06:0
dspace.log.2018-02-07:0
dspace.log.2018-02-08:1
dspace.log.2018-02-09:6
dspace.log.2018-02-10:0
dspace.log.2018-02-11:3
dspace.log.2018-02-12:0
dspace.log.2018-02-13:4
</code></pre>
<ul>
<li>I apparently added that on 2018-02-07 so it could be, as I don&rsquo;t see any of those socket closed errors in 2018-01&rsquo;s logs!</li>
<li>I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned</li>
<li>Peter hit this issue one more time, and this is apparently what Tomcat&rsquo;s catalina.out log says when an abandoned connection is removed:</li>
</ul>
<pre><code>Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
</code></pre>
<h2 id="2018-02-14">2018-02-14</h2>
<ul>
<li>Skype with Peter and the Addis team to discuss what we need to do for the ORCIDs in the immediate future</li>
<li>We said we&rsquo;d start with a controlled vocabulary for <code>cg.creator.id</code> on the DSpace Test submission form, where we store the author name and the ORCID in some format like: Alan S. Orth (0000-0002-1735-7458)</li>
<li>Eventually we need to find a way to print the author names with links to their ORCID profiles</li>
<li>Abenet will send an email to the partners to give us ORCID IDs for their authors and to stress that they update their name format on ORCID.org if they want it in a special way</li>
<li>I sent the Codeobia guys a question to ask how they prefer that we store the IDs, i.e. one of:
<ul>
<li>Alan Orth - 0000-0002-1735-7458</li>
<li>Alan Orth: 0000-0002-1735-7458</li>
<li>Alan S. Orth (0000-0002-1735-7458)</li>
</ul></li>
<li>Atmire responded on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">DSpace 5.8 compatibility ticket</a> and said they will let me know if they want me to give them a clean 5.8 branch</li>
<li>I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:</li>
</ul>
<pre><code>$ sort cgspace-orcids.txt &gt; dspace/config/controlled-vocabularies/cg-creator-id.xml
$ add XML formatting...
$ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre>
<ul>
<li>It seems that tidy fucks up accents, for example it garbles the <code>ñ</code> in <code>Adriana Tofiño (0000-0001-7115-7169)</code></li>
<li>We need to force UTF-8:</li>
</ul>
<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre>
<ul>
<li>This preserves special accent characters</li>
<li>I tested the display and store of these in the XMLUI and PostgreSQL and it looks good</li>
<li>Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+</li>
<li>Peter combined it with mine and we have 1204 unique ORCIDs!</li>
</ul>
<pre><code>$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
1204
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
1204
</code></pre>
<ul>
<li>Also, save that regex for the future because it will be very useful!</li>
<li>CIAT sent a list of their authors&rsquo; ORCIDs and combined with ours there are now 1227:</li>
</ul>
<pre><code>$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1227
</code></pre>
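<p>The same extraction can be sketched in Python. Note that <code>grep -c</code> counts matching lines rather than individual matches, so the first count above only equals the number of ORCIDs when each line holds exactly one. The pattern here is an assumption that is slightly stricter than the one used with grep: it only accepts digits, with an optional <code>X</code> as the final check character. The function name is hypothetical:</p>

```python
import re

# Four groups of four digits; only the final check character may be "X".
# This is stricter than the [A-Z0-9]{4} groups used with grep above.
ORCID_RE = re.compile(r"\b[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]\b")


def unique_orcids(*texts):
    """Return the sorted list of distinct ORCID iDs found in the given strings."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


if __name__ == "__main__":
    combined = (
        "Alan S. Orth (0000-0002-1735-7458)\n"
        "Alan Orth: 0000-0002-1735-7458\n"
    )
    ciat = "Adriana Tofiño (0000-0001-7115-7169)\n"
    # The duplicate iD in "combined" is only counted once.
    print(len(unique_orcids(combined, ciat)))
```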
<ul>
<li>There are some formatting issues with names in Peter&rsquo;s list, so I should remember to re-generate the list of names from ORCID&rsquo;s API once we&rsquo;re done</li>
<li>The <code>dspace cleanup -v</code> currently fails on CGSpace with the following:</li>
</ul>
<pre><code> - Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
Detail: Key (bitstream_id)=(149473) is still referenced from table &quot;bundle&quot;.
</code></pre>
<ul>
<li>The solution is to clear the reference in the bundle table, as I&rsquo;ve discovered several other times in 2016 and 2017:</li>
</ul>
<pre><code>$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
UPDATE 1
</code></pre>
<ul>
<li>Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all</li>
</ul>
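<p>Resolving each conflict by hand gets tedious, so here is a minimal sketch that pulls the bitstream IDs out of the cleanup output and builds the matching <code>UPDATE</code> statement. It assumes the error text has exactly the format quoted above; the function name is hypothetical, and the generated SQL should be reviewed before running it with <code>psql</code>:</p>

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation.
DETAIL_RE = re.compile(r"Key \(bitstream_id\)=\((\d+)\) is still referenced")


def unlink_statement(cleanup_output):
    """Build the UPDATE that clears the bundle references, or None if no error."""
    ids = sorted({int(m) for m in DETAIL_RE.findall(cleanup_output)})
    if not ids:
        return None
    id_list = ", ".join(str(i) for i in ids)
    return (
        "UPDATE bundle SET primary_bitstream_id=NULL "
        f"WHERE primary_bitstream_id IN ({id_list});"
    )


if __name__ == "__main__":
    sample = (
        'Detail: Key (bitstream_id)=(149473) is still referenced '
        'from table "bundle".'
    )
    print(unlink_statement(sample))
```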
<h2 id="2018-02-15">2018-02-15</h2>
<ul>
<li>Altmetric seems to be indexing DSpace Test for some reason:
<ul>
<li>See this item on DSpace Test: <a href="https://dspacetest.cgiar.org/handle/10568/78450">https://dspacetest.cgiar.org/handle/10568/78450</a></li>
<li>See the corresponding page on Altmetric: <a href="https://www.altmetric.com/details/handle/10568/78450">https://www.altmetric.com/details/handle/10568/78450</a></li>
</ul></li>
<li>And this item doesn&rsquo;t even exist on CGSpace!</li>
<li>Start working on XMLUI item display code for ORCIDs</li>
<li>Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata</li>
<li>CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!</li>
<li>I only looked quickly in the logs but saw a bunch of database errors</li>
<li>PostgreSQL connections are currently:</li>
</ul>
<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
2 dspaceApi
1 dspaceWeb
3 dspaceApi
</code></pre>
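<p><code>dspaceApi</code> appears twice in that output because <code>uniq -c</code> only merges adjacent identical lines and the input was not sorted first, so the real total is five <code>dspaceApi</code> connections. A minimal sketch of an order-independent count, assuming the <code>pg_stat_activity</code> output is available as text (the function name is hypothetical):</p>

```python
import re
from collections import Counter

# The three pool names searched for in the grep above.
POOL_RE = re.compile(r"dspaceWeb|dspaceApi|dspaceCli")


def connections_per_pool(pg_stat_activity_output):
    """Count connections per pool name regardless of the order of the rows."""
    return Counter(POOL_RE.findall(pg_stat_activity_output))


if __name__ == "__main__":
    # Same ordering as the rows behind the output above: 2 Api, 1 Web, 3 Api.
    rows = "\n".join(["dspaceApi"] * 2 + ["dspaceWeb"] + ["dspaceApi"] * 3)
    for pool, count in sorted(connections_per_pool(rows).items()):
        print(count, pool)
```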
<ul>
<li>I see shitloads of memory errors in Tomcat&rsquo;s logs:</li>
</ul>
<pre><code># grep -c &quot;Java heap space&quot; /var/log/tomcat7/catalina.out
56
</code></pre>
<ul>
<li>And shit tons of database connections abandoned:</li>
</ul>
<pre><code># grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
612
</code></pre>
<ul>
<li>I have no fucking idea why it crashed</li>
</ul>
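<p>Both log counts can be collected in a single pass over the file. This is a minimal sketch that mirrors the two <code>grep -c</code> calls above by counting lines that contain each substring; the labels and the function name are hypothetical:</p>

```python
from collections import Counter

# Substrings taken from the two grep commands above; the labels are made up.
PATTERNS = {
    "heap": "Java heap space",
    "abandoned": "org.apache.tomcat.jdbc.pool.ConnectionPool abandon",
}


def count_matching_lines(lines, patterns=PATTERNS):
    """Count, per label, how many lines contain the corresponding substring."""
    counts = Counter({label: 0 for label in patterns})
    for line in lines:
        for label, needle in patterns.items():
            if needle in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "java.lang.OutOfMemoryError: Java heap space",
        "WARNING: org.apache.tomcat.jdbc.pool.ConnectionPool abandon",
        "INFO: Server startup",
    ]
    print(dict(count_matching_lines(sample)))
```

<p>To run it against the real log, pass an open file handle for <code>/var/log/tomcat7/catalina.out</code> instead of the sample list.</p>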
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>
<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>
<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>
<li><a href="/cgspace-notes/2017-11/">November, 2017</a></li>
<li><a href="/cgspace-notes/2017-10/">October, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>