cgspace-notes/docs/2019-05/index.html

629 lines
26 KiB
HTML
Raw Normal View History

2019-05-01 10:53:26 +02:00
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2019" />
<meta property="og:description" content="2019-05-01
Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
Apparently if the item is in the workflowitem table it is submitted to a workflow
And if it is in the workspaceitem table it is in the pre-submitted state
2019-05-05 15:45:12 +02:00
The item seems to be in a pre-submitted state, so I tried to delete it from there:
2019-05-01 10:53:26 +02:00
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-05/" />
<meta property="article:published_time" content="2019-05-01T07:37:43&#43;03:00"/>
2019-05-14 20:20:49 +02:00
<meta property="article:modified_time" content="2019-05-12T13:59:44&#43;03:00"/>
2019-05-01 10:53:26 +02:00
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2019"/>
<meta name="twitter:description" content="2019-05-01
Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
Apparently if the item is in the workflowitem table it is submitted to a workflow
And if it is in the workspaceitem table it is in the pre-submitted state
2019-05-05 15:45:12 +02:00
The item seems to be in a pre-submitted state, so I tried to delete it from there:
2019-05-01 10:53:26 +02:00
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
2019-05-05 15:45:12 +02:00
<meta name="generator" content="Hugo 0.55.5" />
2019-05-01 10:53:26 +02:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/",
2019-05-14 20:20:49 +02:00
"wordCount": "2387",
2019-05-01 10:53:26 +02:00
"datePublished": "2019-05-01T07:37:43\x2b03:00",
2019-05-14 20:20:49 +02:00
"dateModified": "2019-05-12T13:59:44\x2b03:00",
2019-05-01 10:53:26 +02:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-05/">
<title>May, 2019 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-05/">May, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-05-01T07:37:43&#43;03:00">Wed May 01, 2019</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2019-05-01">2019-05-01</h2>
<ul>
<li>Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace</li>
<li>A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
<ul>
<li>Apparently if the item is in the <code>workflowitem</code> table it is submitted to a workflow</li>
<li>And if it is in the <code>workspaceitem</code> table it is in the pre-submitted state</li>
</ul></li>
2019-05-05 15:45:12 +02:00
<li><p>The item seems to be in a pre-submitted state, so I tried to delete it from there:</p>
2019-05-01 10:53:26 +02:00
<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
2019-05-05 15:45:12 +02:00
</code></pre></li>
2019-05-01 10:53:26 +02:00
2019-05-05 15:45:12 +02:00
<li><p>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</p></li>
2019-05-01 10:53:26 +02:00
</ul>
<ul>
2019-05-05 15:45:12 +02:00
<li><p>I managed to delete the problematic item from the database</p>
2019-05-01 10:53:26 +02:00
<ul>
<li>First I deleted the item&rsquo;s bitstream in XMLUI and then ran <code>dspace cleanup -v</code> to remove it from the assetstore</li>
2019-05-05 15:45:12 +02:00
<li><p>Then I ran the following SQL:</p>
2019-05-01 10:53:26 +02:00
<pre><code>dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
dspace=# DELETE FROM item WHERE item_id=74648;
2019-05-05 15:45:12 +02:00
</code></pre></li>
</ul></li>
2019-05-01 10:53:26 +02:00
2019-05-05 15:45:12 +02:00
<li><p>Now the item is (hopefully) really gone and I can continue to troubleshoot the issue with REST API&rsquo;s <code>/items/find-by-metadata-value</code> endpoint</p>
2019-05-01 10:53:26 +02:00
<ul>
2019-05-05 15:45:12 +02:00
<li><p>Of course I run into another HTTP 401 error when I continue trying the LandPortal search from last month:</p>
2019-05-01 10:53:26 +02:00
<pre><code>$ curl -f -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;:&quot;cg.subject.cpwf&quot;, &quot;value&quot;:&quot;WATER MANAGEMENT&quot;,&quot;language&quot;: &quot;en_US&quot;}'
curl: (22) The requested URL returned error: 401 Unauthorized
2019-05-05 15:45:12 +02:00
</code></pre></li>
</ul></li>
2019-05-01 10:53:26 +02:00
2019-05-05 15:45:12 +02:00
<li><p>The DSpace log shows the item ID (because I modified the error text):</p>
2019-05-01 10:53:26 +02:00
<pre><code>2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
2019-05-05 15:45:12 +02:00
</code></pre></li>
2019-05-01 10:53:26 +02:00
2019-05-05 15:45:12 +02:00
<li><p>If I delete that one I get another, making the list of item IDs so far:</p>
2019-05-01 10:53:26 +02:00
<ul>
<li>74648</li>
<li>77708</li>
<li>85079</li>
</ul></li>
2019-05-05 15:45:12 +02:00
<li><p>Some are in the <code>workspaceitem</code> table (pre-submission), others are in the <code>workflowitem</code> table (submitted), and others are actually approved, but withdrawn&hellip;</p>
2019-05-01 10:53:26 +02:00
<ul>
<li>This is actually a worthless exercise because the real issue is that the <code>/items/find-by-metadata-value</code> endpoint is simply designed flawed and shouldn&rsquo;t be fatally erroring when the search returns items the user doesn&rsquo;t have permission to access</li>
<li>It would take way too much time to try to fix the fucked up items that are in limbo by deleting them in SQL, but also, it doesn&rsquo;t actually fix the problem because some items are <em>submitted</em> but <em>withdrawn</em>, so they actually have handles and everything</li>
<li>I think the solution is to recommend people don&rsquo;t use the <code>/items/find-by-metadata-value</code> endpoint</li>
</ul></li>
2019-05-05 15:45:12 +02:00
<li><p>CIP is asking about embedding PDF thumbnail images in their RSS feeds again</p>
2019-05-01 10:53:26 +02:00
<ul>
<li>They asked in 2018-09 as well and I told them it wasn&rsquo;t possible</li>
<li>To make sure, I looked at <a href="https://wiki.duraspace.org/display/DSPACE/Enable+Media+RSS+Feeds">the documentation for RSS media feeds</a> and tried it, but couldn&rsquo;t get it to work</li>
<li>It seems to be geared towards iTunes and Podcasts&hellip; I dunno</li>
</ul></li>
2019-05-05 15:45:12 +02:00
<li><p>CIP also asked for a way to get an XML file of all their RTB journal articles on CGSpace</p>
2019-05-01 11:24:01 +02:00
<ul>
2019-05-05 15:45:12 +02:00
<li><p>I told them to use the REST API like (where <code>1179</code> is the id of the RTB journal articles collection):</p>
2019-05-01 10:53:26 +02:00
2019-05-01 11:24:01 +02:00
<pre><code>https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&amp;expand=metadata
2019-05-05 15:45:12 +02:00
</code></pre></li>
</ul></li>
</ul>
2019-05-01 11:24:01 +02:00
2019-05-03 09:29:01 +02:00
<h2 id="2019-05-03">2019-05-03</h2>
<ul>
2019-05-05 15:45:12 +02:00
<li><p>A user from CIAT emailed to say that CGSpace submission emails have not been working the last few weeks</p>
2019-05-03 09:29:01 +02:00
<ul>
2019-05-05 15:45:12 +02:00
<li><p>I checked the <code>dspace test-email</code> script on CGSpace and they are indeed failing:</p>
2019-05-03 09:29:01 +02:00
<pre><code>$ dspace test-email
About to send test email:
2019-05-05 15:45:12 +02:00
- To: woohoo@cgiar.org
- Subject: DSpace test email
- Server: smtp.office365.com
2019-05-03 09:29:01 +02:00
Error sending email:
2019-05-05 15:45:12 +02:00
- Error: javax.mail.AuthenticationFailedException
2019-05-03 09:29:01 +02:00
Please see the DSpace documentation for assistance.
2019-05-05 15:45:12 +02:00
</code></pre></li>
</ul></li>
2019-05-03 09:29:01 +02:00
2019-05-05 15:45:12 +02:00
<li><p>I will ask ILRI ICT to reset the password</p>
2019-05-03 15:33:34 +02:00
<ul>
<li>They reset the password and I tested it on CGSpace</li>
</ul></li>
2019-05-03 09:29:01 +02:00
</ul>
2019-05-05 15:45:12 +02:00
<h2 id="2019-05-05">2019-05-05</h2>
<ul>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Merge changes into the <code>5_x-prod</code> branch of CGSpace:
<ul>
<li>Updates to remove deprecated social media websites (Google+ and Delicious), update Twitter share intent, and add item title to Twitter and email links (<a href="https://github.com/ilri/DSpace/pull/421">#421</a>)</li>
<li>Add new CCAFS Phase II project tags (<a href="https://github.com/ilri/DSpace/pull/420">#420</a>)</li>
<li>Add item ID to REST API error logging (<a href="https://github.com/ilri/DSpace/pull/422">#422</a>)</li>
</ul></li>
<li>Re-deploy CGSpace from <code>5_x-prod</code> branch</li>
<li>Run all system updates on CGSpace (linode18) and reboot it</li>
2019-05-05 22:53:42 +02:00
<li>Tag version 1.1.0 of the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> (with Falcon 2.0.0)
<ul>
<li>Deploy on DSpace Test</li>
</ul></li>
2019-05-05 15:45:12 +02:00
</ul>
2019-05-06 10:50:57 +02:00
<h2 id="2019-05-06">2019-05-06</h2>
<ul>
<li><p>Peter pointed out that Solr stats are only showing 2019 stats</p>
<ul>
<li><p>I looked at the Solr Admin UI and I see:</p>
<pre><code>statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre></li>
</ul></li>
<li><p>As well as this error in the logs:</p>
<pre><code>Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
</code></pre></li>
<li><p>Strangely enough, I <em>do</em> see the statistics-2018, statistics-2017, etc cores in the Admin UI&hellip;</p></li>
<li><p>I restarted Tomcat a few times (and even deleted all the Solr write locks) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete</p>
<ul>
<li>Also, I tried to increase the <code>writeLockTimeout</code> in <code>solrconfig.xml</code> from the default of 1000ms to 10000ms</li>
<li>Eventually the Atmire stats started working, despite errors about &ldquo;Error opening new searcher&rdquo; in the Solr Admin UI</li>
<li>I wrote to the dspace-tech mailing list again on the thread from March, 2019</li>
</ul></li>
2019-05-06 14:41:40 +02:00
<li><p>There were a few alerts from UptimeRobot about CGSpace going up and down this morning, along with an alert from Linode about 596% load</p>
<ul>
<li>Looking at the Munin stats I see an exponential rise in DSpace XMLUI sessions, firewall activity, and PostgreSQL connections this morning:</li>
</ul></li>
</ul>
<p><img src="/cgspace-notes/2019/05/2019-05-06-jmx_dspace_sessions-day.png" alt="CGSpace XMLUI sessions day" /></p>
<p><img src="/cgspace-notes/2019/05/2019-05-06-fw_conntrack-day.png" alt="linode18 firewall connections day" /></p>
<p><img src="/cgspace-notes/2019/05/2019-05-06-postgres_connections_db-day.png" alt="linode18 postgres connections day" /></p>
<p><img src="/cgspace-notes/2019/05/2019-05-06-cpu-day.png" alt="linode18 CPU day" /></p>
<ul>
<li><p>The number of unique sessions today is <em>ridiculously</em> high compared to the last few days considering it&rsquo;s only 12:30PM right now:</p>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
101108
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
14618
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-04 | sort | uniq | wc -l
14946
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-03 | sort | uniq | wc -l
6410
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-02 | sort | uniq | wc -l
7758
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc -l
20528
</code></pre></li>
<li><p>The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:</p>
<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
7127
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1231
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '04/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1255
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '03/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1736
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '02/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1573
# zcat --force /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.6.gz | grep -E '01/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
1410
</code></pre></li>
<li><p>Just this morning between the hours of 2 and 6 the number of unique sessions was <em>very</em> high compared to previous mornings:</p>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
83650
$ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2547
$ cat dspace.log.2019-05-04 | grep -E '2019-05-04 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2574
$ cat dspace.log.2019-05-03 | grep -E '2019-05-03 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2911
$ cat dspace.log.2019-05-02 | grep -E '2019-05-02 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2704
$ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3699
</code></pre></li>
<li><p>Most of the requests were GETs:</p>
<pre><code># cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot;(GET|HEAD|POST|PUT)&quot; | sort | uniq -c | sort -n
1 PUT
98 POST
2845 HEAD
98121 GET
</code></pre></li>
<li><p>I&rsquo;m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?</p></li>
<li><p>Looking again, I see 84,000 requests to <code>/handle</code> this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in <code>access.log</code>):</p>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E &quot; /handle/[0-9]+/[0-9]+&quot;
84350
</code></pre></li>
<li><p>But it would be difficult to find a pattern for those requests because they cover 78,000 <em>unique</em> Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):</p>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+ HTTP&quot; | sort | uniq | wc -l
78104
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E &quot; /handle/[0-9]+/[0-9]+/(discover|browse)&quot; | wc -l
2492
</code></pre></li>
<li><p>In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:</p>
<pre><code># grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
3 2a01:7e00::f03c:91ff:fe0a:d645
113 63.32.242.35
</code></pre></li>
<li><p>According to <a href="https://viewdns.info/reverseip/?host=63.32.242.35&amp;t=1">viewdns.info</a> that server belongs to Macaroni Brothers&rsquo;</p>
<ul>
<li>The user agent of their non-REST API requests from the same IP is Drupal</li>
<li>This is one very good reason to limit REST API requests, and perhaps to enable caching via nginx</li>
</ul></li>
2019-05-06 10:50:57 +02:00
</ul>
2019-05-07 15:51:55 +02:00
<h2 id="2019-05-07">2019-05-07</h2>
<ul>
<li><p>The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:</p>
<pre><code># zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
13969
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
5936
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz | grep -E '04/May/2019' | awk '{print $1}' | sort | uniq | wc -l
6229
# zcat --force /var/log/nginx/access.log.4.gz /var/log/nginx/access.log.5.gz | grep -E '03/May/2019' | awk '{print $1}' | sort | uniq | wc -l
8051
</code></pre></li>
<li><p>Total number of sessions yesterday was <em>much</em> higher compared to days last week:</p>
<pre><code>$ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
144160
$ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
57269
$ cat dspace.log.2019-05-04 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
58648
$ cat dspace.log.2019-05-03 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
27883
$ cat dspace.log.2019-05-02 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
26996
$ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
61866
</code></pre></li>
<li><p>The usage statistics seem to agree that yesterday was crazy:</p></li>
</ul>
<p><img src="/cgspace-notes/2019/05/2019-05-07-atmire-usage-week.png" alt="Atmire Usage statistics spike 2019-05-06" /></p>
<ul>
<li>Sarah from RTB asked me about the RSS / XML link for the the CGIAR.org website again
<ul>
<li>Apparently Sam Stacey is trying to add an RSS feed so the items get automatically syndicated to the CGIAR website</li>
<li>I send her the link to the collection RSS feed</li>
</ul></li>
<li>Add requests cache to <code>resolve-addresses.py</code> script</li>
</ul>
2019-05-08 14:33:15 +02:00
<h2 id="2019-05-08">2019-05-08</h2>
<ul>
<li><p>A user said that CGSpace emails have stopped sending again</p>
<ul>
<li><p>Indeed, the <code>dspace test-email</code> script is showing an authentication failure:</p>
<pre><code>$ dspace test-email
About to send test email:
- To: wooooo@cgiar.org
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: javax.mail.AuthenticationFailedException
Please see the DSpace documentation for assistance.
</code></pre></li>
</ul></li>
<li><p>I checked the settings and apparently I had updated it incorrectly last week after ICT reset the password</p></li>
<li><p>Help Moayad with certbot-auto for Let&rsquo;s Encrypt scripts on the new AReS server (linode20)</p></li>
<li><p>Normalize all <code>text_lang</code> values for metadata on CGSpace and DSpace Test (as I had tested last month):</p>
<pre><code>UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
</code></pre></li>
<li><p>Send Francesca Giampieri from Bioversity a CSV export of all their items issued in 2018</p>
<ul>
<li>They will be doing a migration of 1500 items from their TYPO3 database into CGSpace soon and want an example CSV with all required metadata columns</li>
</ul></li>
</ul>
2019-05-10 16:27:11 +02:00
<h2 id="2019-05-10">2019-05-10</h2>
<ul>
<li>I finally had time to analyze the 7,000 IPs from the major traffic spike on 2019-05-06 after several runs of my <code>resolve-addresses.py</code> script (ipapi.co has a limit of 1,000 requests per day)</li>
<li>Resolving the unique IP addresses to organization and AS names reveals some pretty big abusers:
<ul>
<li>1213 from Region40 LLC (AS200557)</li>
<li>697 from Trusov Ilya Igorevych (AS50896)</li>
<li>687 from UGB Hosting OU (AS206485)</li>
<li>620 from UAB Rakrejus (AS62282)</li>
<li>491 from Dedipath (AS35913)</li>
<li>476 from Global Layer B.V. (AS49453)</li>
<li>333 from QuadraNet Enterprises LLC (AS8100)</li>
<li>278 from GigeNET (AS32181)</li>
<li>261 from Psychz Networks (AS40676)</li>
<li>196 from Cogent Communications (AS174)</li>
<li>125 from Blockchain Network Solutions Ltd (AS43444)</li>
<li>118 from Silverstar Invest Limited (AS35624)</li>
</ul></li>
<li><p>All of the IPs from these networks are using generic user agents like this, but MANY more, and they change many times:</p>
<pre><code>&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36&quot;
</code></pre></li>
<li><p>I found a <a href="https://www.qurium.org/alerts/azerbaijan/azerbaijan-and-the-region40-ddos-service/">blog post from 2018 detailing an attack from a DDoS service</a> that matches our pattern exactly</p></li>
<li><p>They specifically mention:</p></li>
</ul>
<pre>The attack that targeted the “Search” functionality of the website, aimed to bypass our mitigation by performing slow but simultaneous searches from 5500 IP addresses.</pre>
<ul>
2019-05-12 09:39:10 +02:00
<li>So this was definitely an attack of some sort&hellip; only God knows why</li>
<li>I noticed a few new bots that don&rsquo;t use the word &ldquo;bot&rdquo; in their user agent and therefore don&rsquo;t match Tomcat&rsquo;s Crawler Session Manager Valve:
2019-05-10 16:27:11 +02:00
<ul>
<li><code>Blackboard Safeassign</code></li>
2019-05-12 12:59:44 +02:00
<li><code>Unpaywall</code></li>
2019-05-10 16:27:11 +02:00
</ul></li>
</ul>
2019-05-12 09:39:10 +02:00
<h2 id="2019-05-12">2019-05-12</h2>
<ul>
2019-05-12 12:59:44 +02:00
<li><p>I see that the Unpaywall bot is resonsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):</p>
<pre><code>$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2206
</code></pre></li>
<li><p>I added &ldquo;Unpaywall&rdquo; to the list of bots in the Tomcat Crawler Session Manager Valve</p></li>
<li><p>Set up nginx to use TLS and proxy pass to NodeJS on the AReS development server (linode20)</p></li>
<li><p>Run all system updates on linode20 and reboot it</p></li>
<li><p>Also, there is 10 to 20% CPU steal on that VM, so I will ask Linode to move it to another host</p></li>
<li><p>Commit changes to the <code>resolve-addresses.py</code> script to add proper CSV output support</p></li>
2019-05-12 09:39:10 +02:00
</ul>
2019-05-14 20:20:49 +02:00
<h2 id="2019-05-14">2019-05-14</h2>
<ul>
<li>Skype with Peter and AgroKnow about CTA story telling modification they want to do on the CTA ICT Update collection on CGSpace
<ul>
<li>I told them they should aim for modifying the collection theme and insert some custom HTML / JS</li>
<li>I need to send Panagis some documentation about Mirage 2 and the DSpace build process, as well as the Maven settings for build</li>
</ul></li>
</ul>
2019-05-01 10:53:26 +02:00
<!-- vim: set sw=2 ts=2: -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2019-05/">May, 2019</a></li>
<li><a href="/cgspace-notes/posts/">Posts</a></li>
<li><a href="/cgspace-notes/2019-04/">April, 2019</a></li>
<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>
<li><a href="/cgspace-notes/2019-02/">February, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>