<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="February, 2018" />
<meta property="og:description" content="2018-02-01

Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don’t need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-02/" />

<meta property="article:published_time" content="2018-02-01T16:28:54+02:00"/>
<meta property="article:modified_time" content="2018-02-11T18:21:39+02:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2018"/>
<meta name="twitter:description" content="2018-02-01

Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don’t need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

"/>
<meta name="generator" content="Hugo 0.36" />

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "February, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-02/",
  "wordCount": "2666",
  "datePublished": "2018-02-01T16:28:54+02:00",
  "dateModified": "2018-02-11T18:21:39+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-02/">

<title>February, 2018 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-HjEPigLMLBzVQsUi6JWp9tmxJtBimdClDBxwZrwZR+VE3s11/PtFYOrLClxIv2SG" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">

<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-02/">February, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-02-01T16:28:54+02:00">Thu Feb 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>

<h2 id="2018-02-01">2018-02-01</h2>

<ul>
<li>Peter gave feedback on the <code>dc.rights</code> proof of concept that I had sent him last week</li>
<li>We don’t need to distinguish between internal and external works, so that makes it just a simple list</li>
<li>Yesterday I figured out how to monitor DSpace sessions using JMX</li>
<li>I copied the logic in the <code>jmx_tomcat_dbpools</code> provided by Ubuntu’s <code>munin-plugins-java</code> package and used the stuff I discovered about JMX <a href="/cgspace-notes/2018-01/">in 2018-01</a></li>
</ul>

<p><img src="/cgspace-notes/2018/02/jmx_dspace_sessions-day.png" alt="DSpace Sessions" /></p>

<ul>
<li>Run all system updates and reboot DSpace Test</li>
<li>Wow, I packaged up the <code>jmx_dspace_sessions</code> stuff in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> and deployed it on CGSpace and it totally works:</li>
</ul>

<pre><code># munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
</code></pre>

<h2 id="2018-02-03">2018-02-03</h2>

<ul>
<li>Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566</a></li>
<li>We will close that ticket for now and wait for the 5.8 stuff: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560</a></li>
<li>I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January</li>
<li>After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:</li>
</ul>

<pre><code>$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
</code></pre>
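
<ul>
<li>The real <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> live on the scripts wiki; as a rough, hedged sketch of the kind of logic the fix script implements (a CSV of old/new values driving UPDATEs on <code>metadatavalue</code> via psycopg2 — the table and column names come from the queries in these notes, but the CSV column names here are assumptions):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch only, not the real fix-metadata-values.py. It assumes a CSV
# with the original value in one column and the replacement in a "correct"
# column, and updates metadatavalue rows for a given metadata_field_id.
import csv
import psycopg2

FIELD_ID = 211            # cg.contributor.affiliation in our registry (the -m 211 above)
CSV_PATH = '/tmp/2018-02-03-Affiliations-1116-corrections.csv'

conn = psycopg2.connect(dbname='dspace', user='dspace', password='fuuu', host='localhost')

with conn, conn.cursor() as cursor, open(CSV_PATH, newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        old, new = row['cg.contributor.affiliation'], row['correct']
        if not new or new == old:
            continue
        cursor.execute(
            'UPDATE metadatavalue SET text_value=%s '
            'WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s',
            (new, FIELD_ID, old),
        )
        print(f'{old} -> {new}: {cursor.rowcount} rows')

conn.close()
</code></pre>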

<ul>
<li>Then I started a full Discovery reindex:</li>
</ul>

<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    96m39.823s
user    14m10.975s
sys     2m29.088s
</code></pre>

<ul>
<li>Generate a new list of affiliations for Peter to sort through:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
</code></pre>

<ul>
<li>Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in <a href="/cgspace-notes/2017-12/">December</a>:</li>
</ul>

<pre><code># time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109

real    0m23.839s
user    0m27.225s
sys     0m1.905s
</code></pre>

<h2 id="2018-02-05">2018-02-05</h2>

<ul>
<li>Toying with correcting authors with trailing spaces via PostgreSQL:</li>
</ul>

<pre><code>dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
</code></pre>

<ul>
<li>I tried the <code>TRIM(TRAILING from text_value)</code> function and it said it changed 20 items but the spaces didn’t go away</li>
<li>This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.</li>
<li>Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
</code></pre>

<h2 id="2018-02-06">2018-02-06</h2>

<ul>
<li>UptimeRobot says CGSpace is down this morning around 9:15</li>
<li>I see 308 PostgreSQL connections in <code>pg_stat_activity</code></li>
<li>The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:</li>
</ul>

<pre><code># date
Tue Feb 6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 223.185.41.40
      2 66.249.64.14
      2 77.246.52.40
      4 157.55.39.82
      4 193.205.105.8
      5 207.46.13.63
      5 207.46.13.64
      6 154.68.16.34
      7 207.46.13.66
   1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     77 213.55.99.121
     86 66.249.64.14
    101 104.196.152.243
    103 207.46.13.64
    118 157.55.39.82
    133 207.46.13.66
    136 207.46.13.63
    156 68.180.228.157
    295 197.210.168.174
    752 144.76.64.79
</code></pre>
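
<ul>
<li>Those pipelines get awkward once the logs rotate and gzip, so here is a hedged Python sketch of the same per-IP tally across plain and gzipped nginx logs (the log glob and date filter are just the ones used above; it only assumes the client IP is the first whitespace-separated field of the log line):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: count requests per client IP across plain and gzipped nginx
# logs, equivalent to the cat | grep | awk | sort | uniq -c | sort -n pipeline.
import glob
import gzip
from collections import Counter

LOG_GLOB = '/var/log/nginx/rest.log*'   # adjust to oai.log*, access.log*, etc.
DATE_FILTERS = ('6/Feb/2018:08', '6/Feb/2018:09')

hits = Counter()
for path in glob.glob(LOG_GLOB):
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', errors='replace') as f:
        for line in f:
            if any(stamp in line for stamp in DATE_FILTERS):
                hits[line.split(' ', 1)[0]] += 1

for ip, count in hits.most_common(10):
    print(f'{count:>7} {ip}')
</code></pre>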

<ul>
<li>I did notice in <code>/var/log/tomcat7/catalina.out</code> that Atmire’s update thing was running though</li>
<li>So I restarted Tomcat and now everything is fine</li>
<li>Next time I see that many database connections I need to save the output so I can analyze it later (see the monitoring sketch after this list)</li>
<li>I’m going to re-schedule the taskUpdateSolrStatsMetadata task as <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566">Bram detailed in ticket 566</a> to see if it makes CGSpace stop crashing every morning</li>
<li>If I move the task from 3AM to 3PM, ideally CGSpace will stop crashing in the morning, or start crashing ~12 hours later</li>
<li>Atmire has said that there will eventually be a fix for the high load caused by their script, but it will come with the 5.8 compatibility they are already working on</li>
<li>I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server</li>
<li>Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called ‘dspaceCli’ so that I can distinguish it in <code>pg_stat_activity</code></li>
<li>I implemented some changes to the pooling in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> so that each DSpace web application can use its own pool (web, api, and solr)</li>
<li>Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down</li>
<li>Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine</li>
<li>I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary</li>
<li>I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week</li>
</ul>
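
<ul>
<li>As a rough sketch of the monitoring idea above — save a <code>pg_stat_activity</code> snapshot when the connection count spikes so it can be analyzed later, and summarize connections per pool name and state (hedged; the pool names dspaceWeb/dspaceApi/dspaceCli and the idle states are from these notes, the threshold and file paths are assumptions):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: summarize pg_stat_activity by application_name and state, and
# save a raw snapshot to /tmp when the total connection count looks suspicious.
import time
from collections import Counter

import psycopg2

THRESHOLD = 200   # assumption: save a snapshot when we see this many connections

conn = psycopg2.connect(dbname='dspace', user='dspace', password='fuuu', host='localhost')
conn.autocommit = True

with conn.cursor() as cursor:
    cursor.execute('SELECT application_name, state FROM pg_stat_activity')
    rows = cursor.fetchall()

summary = Counter((app or 'unknown', state or 'unknown') for app, state in rows)
for (app, state), count in summary.most_common():
    print(f'{count:>5} {app} {state}')

if len(rows) >= THRESHOLD:
    stamp = time.strftime('%Y-%m-%d-%H%M%S')
    with open(f'/tmp/pg_stat_activity-{stamp}.txt', 'w') as f:
        for app, state in rows:
            f.write(f'{app}\t{state}\n')

conn.close()
</code></pre>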

<h2 id="2018-02-07">2018-02-07</h2>

<ul>
<li>Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace</li>
<li>I tried on DSpace Test and indeed the lookup just doesn’t work!</li>
<li>The ORCiD code in DSpace appears to be using <code>http://pub.orcid.org/</code>, but when I go there in the browser it redirects me to <code>https://pub.orcid.org/v2.0/</code></li>
<li>According to <a href="https://groups.google.com/forum/#!topic/orcid-api-users/qfg-HwAB1bk">the announcement</a> the v1 API was moved from <code>http://pub.orcid.org/</code> to <code>https://pub.orcid.org/v1.2</code> until March 1st when it will be discontinued for good</li>
<li>But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to <code>https://pub.orcid.org/v2.0/v1.2</code></li>
<li>So I guess we have to disable that shit once and for all and switch to a controlled vocabulary</li>
<li>CGSpace crashed again, this time around <code>Wed Feb 7 11:20:28 UTC 2018</code></li>
<li>I took a few snapshots of the PostgreSQL activity at the time; the connections were very high at first but reduced on their own as the minutes went on:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6
</code></pre>

<ul>
<li>Interestingly, all of those 751 connections were idle!</li>
</ul>

<pre><code>$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751
</code></pre>

<ul>
<li>Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps</li>
<li>Looking at the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):</li>
</ul>

<p><img src="/cgspace-notes/2018/02/jmx_dspace-sessions-day.png" alt="DSpace Sessions" /></p>

<ul>
<li>Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:</li>
</ul>

<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
</code></pre>

<ul>
<li>CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)</li>
<li>What’s interesting is that the DSpace log says the connections are all busy:</li>
</ul>

<pre><code>org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre>

<ul>
<li>… but in PostgreSQL I see them <code>idle</code> or <code>idle in transaction</code>:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187
</code></pre>

<ul>
<li>What the fuck, does DSpace think all connections are busy?</li>
<li>I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the <code>removeAbandoned='true'</code> parameter which is apparently off by default</li>
<li>I will try <code>testOnReturn='true'</code> too, just to add more validation, because I’m fucking grasping at straws</li>
<li>Also, WTF, there was a heap space error randomly in catalina.out:</li>
</ul>

<pre><code>Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
</code></pre>

<ul>
<li>I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!</li>
<li>Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:</li>
</ul>

<pre><code>$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
     34 ip_addr=46.229.168.67
     34 ip_addr=46.229.168.73
     37 ip_addr=46.229.168.76
     40 ip_addr=34.232.65.41
     41 ip_addr=46.229.168.71
     44 ip_addr=197.210.168.174
     55 ip_addr=181.137.2.214
     55 ip_addr=213.55.99.121
     58 ip_addr=46.229.168.65
     64 ip_addr=66.249.66.91
     67 ip_addr=66.249.66.90
     71 ip_addr=207.46.13.54
     78 ip_addr=130.82.1.40
    104 ip_addr=40.77.167.36
    151 ip_addr=68.180.228.157
    174 ip_addr=207.46.13.135
    194 ip_addr=54.83.138.123
    198 ip_addr=40.77.167.62
    210 ip_addr=207.46.13.71
    214 ip_addr=104.196.152.243
</code></pre>

<ul>
<li>These IPs made thousands of sessions today:</li>
</ul>

<pre><code>$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
992
</code></pre>
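
<ul>
<li>Running one grep per IP doesn’t scale, and as noted above IPv6 addresses make naive log parsing painful; here is a hedged sketch that counts distinct session IDs per client IP in one pass over dspace.log (it only assumes the <code>ip_addr=</code> and <code>session_id=</code> fields that appear in the greps above, and takes whatever follows <code>ip_addr=</code> up to the next comma or space as the address, so IPv6 works too):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: one pass over a DSpace log, tallying distinct session IDs per
# client IP, instead of one grep | sort | uniq | wc -l pipeline per address.
import re
from collections import defaultdict

LOG_PATH = 'dspace.log.2018-02-07'
IP_RE = re.compile(r'ip_addr=([^,\s]+)')            # IPv4 or IPv6
SESSION_RE = re.compile(r'session_id=([A-Z0-9]{32})')

sessions = defaultdict(set)
with open(LOG_PATH, errors='replace') as f:
    for line in f:
        ip = IP_RE.search(line)
        session = SESSION_RE.search(line)
        if ip and session:
            sessions[ip.group(1)].add(session.group(1))

for ip, ids in sorted(sessions.items(), key=lambda kv: len(kv[1]), reverse=True)[:20]:
    print(f'{len(ids):>6} {ip}')
</code></pre>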

<ul>
<li>Let’s investigate who these IPs belong to:

<ul>
<li>104.196.152.243 is CIAT, which is already marked as a bot via nginx!</li>
<li>207.46.13.71 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>40.77.167.62 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>207.46.13.135 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>40.77.167.36 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>207.46.13.54 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
<li>46.229.168.x is Semrush, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!</li>
</ul></li>
<li>Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.</li>
<li>What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle</li>
<li>One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:</li>
</ul>

<pre><code>BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
</code></pre>

<ul>
<li>This one makes two thousand requests per day or so recently:</li>
</ul>

<pre><code># grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
</code></pre>
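
<ul>
<li>To spot crawlers like this one before they cause trouble, a hedged sketch that tallies requests and distinct client IPs per user agent across the nginx access logs (it assumes the standard combined log format, where the user agent is the last double-quoted field on the line):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: requests and distinct client IPs per user agent, to find
# crawlers worth adding to the Tomcat Crawler Session Manager Valve or nginx.
import glob
import gzip
import re
from collections import Counter, defaultdict

LINE_RE = re.compile(r'^(\S+) .* "([^"]*)"\s*$')  # client IP ... trailing "user agent"

requests = Counter()
ips = defaultdict(set)

for path in glob.glob('/var/log/nginx/access.log*'):
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt', errors='replace') as f:
        for line in f:
            match = LINE_RE.match(line)
            if not match:
                continue
            ip, agent = match.groups()
            requests[agent] += 1
            ips[agent].add(ip)

for agent, count in requests.most_common(15):
    print(f'{count:>8} requests from {len(ips[agent]):>4} IPs  {agent}')
</code></pre>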

<ul>
<li>And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww</li>
<li>Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace</li>
<li>Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker</li>
<li>This is how the connections looked when it crashed this afternoon:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
    290 dspaceWeb
</code></pre>

<ul>
<li>This is how it is right now:</li>
</ul>

<pre><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      5 dspaceWeb
</code></pre>

<ul>
<li>So is this just some fucked up XMLUI database leaking?</li>
<li>I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: <a href="https://jira.duraspace.org/browse/DS-3551">https://jira.duraspace.org/browse/DS-3551</a></li>
<li>I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit</li>
<li>I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!</li>
<li>I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle</li>
</ul>

<h2 id="2018-02-10">2018-02-10</h2>

<ul>
<li>I tried to disable ORCID lookups but keep the existing authorities</li>
<li>This item has an ORCID for Ralf Kiese: <a href="http://localhost:8080/handle/10568/89897">http://localhost:8080/handle/10568/89897</a></li>
<li>Switching <code>authority.controlled</code> off and changing <code>authorLookup</code> to <code>lookup</code>, the ORCID badge doesn’t show up on the item</li>
<li>Leaving all settings as they are but changing <code>choices.presentation</code> to <code>lookup</code>, the ORCID badge is there, but item submission uses the LC Name Authority and breaks with this error:</li>
</ul>

<pre><code>Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
</code></pre>

<ul>
<li>If I change <code>choices.presentation</code> to <code>suggest</code> it gives this error:</li>
</ul>

<pre><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
</code></pre>

<ul>
<li>So I don’t think we can disable the ORCID lookup function and keep the ORCID badges</li>
</ul>

<h2 id="2018-02-11">2018-02-11</h2>

<ul>
<li>Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: <a href="https://cgspace.cgiar.org/handle/10568/90735">10568/90735</a></li>
</ul>

<p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.pdf.jpg" alt="Weird thumbnail" /></p>

<ul>
<li>I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:</li>
</ul>

<pre><code>$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
</code></pre>

<p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.jpg" alt="Manual thumbnail" /></p>
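
<ul>
<li>If more CCAFS items turn out to have the same problem, a hedged sketch for regenerating thumbnails in bulk by shelling out to the same convert invocation (the ICC profile paths are just the ones from the command above and will differ per Ghostscript version):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: regenerate thumbnails for a directory of PDFs by calling the
# same ImageMagick convert invocation used above for CCAFS_WP_223.pdf.
import pathlib
import subprocess

PROFILES = '/usr/local/share/ghostscript/9.22/iccprofiles'  # adjust to your Ghostscript

for pdf in pathlib.Path('.').glob('*.pdf'):
    thumbnail = pdf.with_suffix('.jpg')
    subprocess.run(
        [
            'convert', f'{pdf}[0]',                      # first page only
            '-profile', f'{PROFILES}/default_cmyk.icc',
            '-thumbnail', '600x600',
            '-flatten',
            '-profile', f'{PROFILES}/default_rgb.icc',
            str(thumbnail),
        ],
        check=True,
    )
    print(f'{pdf} -> {thumbnail}')
</code></pre>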

<ul>
<li>Peter sent me corrected author names last week but the file encoding is messed up:</li>
</ul>

<pre><code>$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
</code></pre>
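
<ul>
<li><code>isutf8</code> only reports the first offending byte; a hedged Python sketch to list every line in the CSV that is not valid UTF-8, with the byte offset and a replacement-character preview, which makes files like this easier to fix by hand:</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: report every line that fails UTF-8 decoding, similar to
# isutf8 from moreutils but listing all offending lines instead of the first.
import sys

path = sys.argv[1] if len(sys.argv) > 1 else 'authors-2018-02-05.csv'

with open(path, 'rb') as f:
    for number, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            preview = raw.decode('utf-8', errors='replace').strip()
            print(f'line {number}, byte {err.start}: {err.reason}: {preview}')
</code></pre>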

<ul>
<li>The <code>isutf8</code> program comes from <code>moreutils</code></li>
<li>Line 100 contains: Galiè, Alessandra</li>
<li>In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use <code>pip install psycopg2-binary</code></li>
<li>See: <a href="http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/">http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/</a></li>
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre>

<ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan’s names in December</li>
<li>I see he actually has some variations with “Duncan, Alan J.”: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=">https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>

<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
   text_value    |              authority               | confidence
-----------------+--------------------------------------+------------
 Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
 Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
 Duncan, Alan J. |                                      |         -1
 Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
 Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
 Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
 Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
 Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(8 rows)

dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
  text_value  |              authority               | confidence
--------------+--------------------------------------+------------
 Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(1 row)

dspace=# commit;
</code></pre>

<ul>
<li>Run all system updates on DSpace Test (linode02) and reboot it</li>
<li>I wrote a Python script (<a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b"><code>resolve-orcids-from-solr.py</code></a>) using SolrClient to parse the Solr authority cache for ORCID IDs</li>
<li>We currently have 1562 authority records with ORCID IDs, and 624 unique IDs</li>
<li>We can use this to build a controlled vocabulary of ORCID IDs for new item submissions</li>
<li>I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?</li>
<li>I added the script to the <a href="https://github.com/ilri/DSpace/wiki/Scripts">ILRI DSpace wiki on GitHub</a></li>
</ul>
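
<ul>
<li>The real <code>resolve-orcids-from-solr.py</code> is in the gist linked above and uses SolrClient; as a hedged illustration of the underlying idea, this sketch queries the authority core’s select handler directly with requests and collects the unique ORCID iDs (the core name and the <code>orcid_id</code> field are assumptions based on how the authority cache is described here):</li>
</ul>

<pre><code>#!/usr/bin/env python3
# Hedged sketch: pull ORCID identifiers out of the DSpace Solr authority core.
# The real script is resolve-orcids-from-solr.py (linked above); the core and
# field names here are assumptions.
import requests

SOLR_SELECT = 'http://localhost:8080/solr/authority/select'

params = {
    'q': 'orcid_id:*',       # assumption: authority records store the iD in orcid_id
    'fl': 'orcid_id,value',
    'rows': 10000,
    'wt': 'json',
}

docs = requests.get(SOLR_SELECT, params=params).json()['response']['docs']

orcids = sorted({doc['orcid_id'] for doc in docs if 'orcid_id' in doc})
print(f'{len(docs)} authority records with ORCID iDs, {len(orcids)} unique')
for orcid in orcids:
    print(orcid)
</code></pre>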

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>
<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>
<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>
<li><a href="/cgspace-notes/2017-11/">November, 2017</a></li>
<li><a href="/cgspace-notes/2017-10/">October, 2017</a></li>
</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>
</html>