<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2017" />
<meta property="og:description" content="2017-10-01
Peter emailed to point out that many items in the ILRI archive collection have multiple handles:
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-10/" />
<meta property="article:published_time" content="2017-10-01T08:07:54+03:00" />
<meta property="article:modified_time" content="2019-10-28T13:39:25+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2017"/>
<meta name="twitter:description" content="2017-10-01
Peter emailed to point out that many items in the ILRI archive collection have multiple handles:
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.102.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-10/",
"wordCount": "2613",
"datePublished": "2017-10-01T08:07:54+03:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-10/">
<title>October, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-10/">October, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-10-01T08:07:54+03:00">Sun Oct 01, 2017</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2017-10-01">2017-10-01</h2>
<ul>
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
</ul>
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
</code></pre><ul>
<li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
</ul>
<h2 id="2017-10-02">2017-10-02</h2>
<ul>
<li>Peter Ballantyne said he was having problems logging into CGSpace with &ldquo;both&rdquo; of his accounts (CGIAR LDAP and personal, apparently)</li>
<li>I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a &ldquo;no DN found&rdquo; error:</li>
</ul>
<pre tabindex="0"><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException: svcgroot2.cgiarad.org:3269 [Root exception is java.net.ConnectException: Connection timed out (Connection timed out)]
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
</code></pre><ul>
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &#34;ldap_authentication:type=failed_auth&#34; dspace.log.2017-10-01
14
</code></pre><ul>
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
<li>Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks</li>
</ul>
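<p>To back up the claim that other recent days were clean, the same count can be looped over several log files at once. This is only a minimal sketch: it assumes the logs follow the <code>dspace.log.YYYY-MM-DD</code> naming used above, and the function name <code>count_ldap_failures</code> is made up for illustration.</p>

```shell
# Hypothetical helper: print "<file> <count>" for each DSpace log given as an argument.
count_ldap_failures() {
    for log in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches, so ignore the exit status
        count=$(grep -c 'ldap_authentication:type=failed_auth' "$log" || true)
        printf '%s %s\n' "$log" "$count"
    done
}

# Example: count_ldap_failures dspace.log.2017-09-3* dspace.log.2017-10-01
```

<p>A day with a network problem should then stand out as the only file with a non-zero count.</p>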
<h2 id="2017-10-04">2017-10-04</h2>
<ul>
<li>Twice in the last twenty-four hours Linode has alerted about high CPU usage on CGSpace (linode2533629)</li>
<li>Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace</li>
<li>The first is a link to a browse page that should be handled better in nginx:</li>
</ul>
<pre tabindex="0"><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&amp;type=subject
</code></pre><ul>
<li>We&rsquo;ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn&rsquo;t exist in Discovery yet, but we&rsquo;ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
<li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
<li>Help Sisay proof sixty-two IITA records on DSpace Test</li>
<li>Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries</li>
<li>Merge the Discovery search changes for ISI Journal (<a href="https://github.com/ilri/DSpace/pull/341">#341</a>)</li>
</ul>
<h2 id="2017-10-05">2017-10-05</h2>
<ul>
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
181 66.249.66.95
211 66.249.66.91
312 66.249.66.94
384 66.249.66.90
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
8 41.84.227.85
8 66.249.66.92
17 66.249.66.65
24 66.249.66.91
38 66.249.66.95
69 66.249.66.90
148 66.249.66.94
</code></pre><ul>
<li>Working on the nginx redirects for CGIAR Library</li>
<li>We should start using 301 redirects and also allow for <code>/sitemap</code> to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way</li>
<li>Remove eleven occurrences of <code>ACP</code> in IITA&rsquo;s <code>cg.coverage.region</code> using the Atmire batch edit module from Discovery</li>
<li>Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods</li>
<li>Run corrections on 143 ILRI Archive items that had two <code>dc.identifier.uri</code> values (Handle) that Peter had pointed out earlier this week</li>
<li>I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace</li>
<li>I manually checked a dozen of them and it appeared that the correct handle was always the second one, so I just deleted the first one</li>
</ul>
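<p>The actual cleanup was done in OpenRefine, but the rule itself (keep the second of two <code>||</code>-separated handles) is simple enough to express in shell. A rough sketch, assuming one metadata value per line on standard input; the function name <code>keep_second_handle</code> is hypothetical:</p>

```shell
# Hypothetical helper: for each "first||second" value on stdin, keep only the part
# after the last "||". Values without a separator pass through unchanged.
keep_second_handle() {
    sed 's/^.*||//'
}

# Example:
#   echo 'http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336' | keep_second_handle
```

<p>Because the match is greedy, a value with three or more handles would keep only the last one, so it is worth spot-checking the input before relying on this.</p>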
<h2 id="2017-10-06">2017-10-06</h2>
<ul>
<li>I saw a nice tweak to thumbnail presentation on the Cardiff Metropolitan University DSpace: <a href="https://repository.cardiffmet.ac.uk/handle/10369/8780">https://repository.cardiffmet.ac.uk/handle/10369/8780</a></li>
<li>It adds a subtle border and box shadow, before and after:</li>
</ul>
<p><img src="/cgspace-notes/2017/10/dspace-thumbnail-original.png" alt="Original flat thumbnails">
<img src="/cgspace-notes/2017/10/dspace-thumbnail-box-shadow.png" alt="Tweaked with border and box shadow"></p>
<ul>
<li>I&rsquo;ll post it to the Yammer group to see what people think</li>
<li>I figured out a way to do the HTML verification for Google Search Console for library.cgiar.org</li>
<li>We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install</li>
<li>Then we add an nginx alias for that URL in the library.cgiar.org vhost</li>
<li>This method is kinda a hack but at least we can put all the pieces into git to be reproducible</li>
<li>I will tell Tunji to send me the verification file</li>
</ul>
<h2 id="2017-10-10">2017-10-10</h2>
<ul>
<li>Deploy logic to allow verification of the library.cgiar.org domain in the Google Search Console (<a href="https://github.com/ilri/DSpace/pull/343">#343</a>)</li>
<li>After verifying both the HTTP and HTTPS domains and submitting a sitemap it will be interesting to see how the stats in the console as well as the search results change (currently 28,500 results):</li>
</ul>
<p><img src="/cgspace-notes/2017/10/google-search-console.png" alt="Google Search Console">
<img src="/cgspace-notes/2017/10/google-search-console-2.png" alt="Google Search Console 2">
<img src="/cgspace-notes/2017/10/google-search-results.png" alt="Google Search results"></p>
<ul>
<li>I tried to submit a &ldquo;Change of Address&rdquo; request in the Google Search Console but I need to be an owner on CGSpace&rsquo;s console (currently I&rsquo;m just a user) in order to do that</li>
<li>Manually clean up some communities and collections that Peter had requested a few weeks ago</li>
<li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
<li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
</ul>
<pre tabindex="0"><code>10568/1637 10568/174 10568/27629
10568/1642 10568/174 10568/27629
10568/1614 10568/174 10568/27629
10568/75561 10568/150 10568/27629
10568/183 10568/230 10568/27629
</code></pre><ul>
<li>Delete community 10568/174 (Sustainable livestock futures)</li>
<li>Delete collections in 10568/27629 that have zero items (33 of them!)</li>
</ul>
<h2 id="2017-10-11">2017-10-11</h2>
<ul>
<li>Peter added me as an owner on the CGSpace property on Google Search Console and I tried to submit a &ldquo;Change of Address&rdquo; request for the CGIAR Library but got an error:</li>
</ul>
<p><img src="/cgspace-notes/2017/10/search-console-change-address-error.png" alt="Change of Address error"></p>
<ul>
<li>We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won&rsquo;t work—we&rsquo;ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects</li>
<li>Also the Google Search Console doesn&rsquo;t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the &ldquo;Change of Address&rdquo; tool to work!</li>
</ul>
<h2 id="2017-10-12">2017-10-12</h2>
<ul>
<li>Finally finish (I think) working on the myriad nginx redirects for all the CGIAR Library browse stuff—it ended up getting pretty complicated!</li>
<li>I still need to commit the DSpace changes (add browse index, XMLUI strings, Discovery index, etc), but I should be able to deploy that on CGSpace soon</li>
</ul>
<h2 id="2017-10-14">2017-10-14</h2>
<ul>
<li>Run system updates on DSpace Test and reboot server</li>
<li>Merge changes adding a search/browse index for CGIAR System subject to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/344">#344</a>)</li>
<li>I checked the top browse links in Google&rsquo;s search results for <code>site:library.cgiar.org inurl:browse</code> and they are all redirected appropriately by the nginx rewrites I worked on last week</li>
</ul>
<h2 id="2017-10-22">2017-10-22</h2>
<ul>
<li>Run system updates on DSpace Test and reboot server</li>
<li>Re-deploy CGSpace from latest <code>5_x-prod</code> (adds ISI Journal to search filters and adds Discovery index for CGIAR Library <code>systemsubject</code>)</li>
<li>Deploy nginx redirect fixes to catch CGIAR Library browse links (redirect to their community and translate subject→systemsubject)</li>
<li>Run migration of CGSpace server (linode18) for Linode security alert, which took 42 minutes of downtime</li>
</ul>
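<p>The nginx rules themselves are not reproduced in these notes, so the following is only an illustration of the mapping they are described as performing: send library.cgiar.org browse links to CGSpace and translate <code>type=subject</code> to <code>type=systemsubject</code>. It is a shell sketch, not the nginx configuration; the function name <code>translate_browse_url</code> is hypothetical, and the redirect into the CGIAR Library community is left out because the target path is not given here.</p>

```shell
# Hypothetical illustration of the URL mapping only; the real rules live in nginx.
translate_browse_url() {
    sed -E \
        -e 's#^https?://library\.cgiar\.org/#https://cgspace.cgiar.org/#' \
        -e 's#([?&])type=subject(&|$)#\1type=systemsubject\2#'
}

# Example:
#   echo 'http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject' | translate_browse_url
```

<p>A list of old browse URLs (for example, copied from search results) can be piped through this to see what each one is expected to become before checking the live redirects.</p>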
<h2 id="2017-10-26">2017-10-26</h2>
<ul>
<li>In the last 24 hours we&rsquo;ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace</li>
<li>Uptime Robot even noticed CGSpace go &ldquo;down&rdquo; for a few minutes</li>
<li>In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down because the SQL connection pool was exhausted</li>
<li>Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up</li>
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were between two and six times as many unique sessions yesterday!</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
<li>I still have no idea what was causing the load to go up today</li>
<li>I finally investigated Magdalena&rsquo;s issue with the item download stats and now I can&rsquo;t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the &ldquo;Most Popular Items&rdquo; page, and in Usage Stats</li>
<li>I think it might have been an issue with the statistics not being fresh</li>
<li>I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten</li>
<li>Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data</li>
<li>I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection</li>
<li>We&rsquo;ve never used it but it could be worth looking at</li>
</ul>
<h2 id="2017-10-27">2017-10-27</h2>
<ul>
<li>Linode alerted about high CPU usage again (twice) on CGSpace in the last 24 hours, around 2AM and 2PM</li>
</ul>
<h2 id="2017-10-28">2017-10-28</h2>
<ul>
<li>Linode alerted about high CPU usage again on CGSpace around 2AM this morning</li>
</ul>
<h2 id="2017-10-29">2017-10-29</h2>
<ul>
<li>Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM</li>
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly the past week</li>
<li>I don&rsquo;t see any telltale signs in the REST or OAI logs, so I&rsquo;m trying to do some rudimentary analysis in the DSpace logs:</li>
</ul>
<pre tabindex="0"><code># grep &#39;2017-10-29 02:&#39; dspace.log.2017-10-29 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
<li>Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts</li>
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &#34;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&#34; 200 7776 &#34;-&#34; &#34;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&#34;
</code></pre><ul>
<li>CORE seems to be some bot that is &ldquo;Aggregating the worlds open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect; the correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
<li>I will check the logs in a few days to see if they are harvesting us regularly, then add their bot&rsquo;s user agent to the Tomcat Crawler Session Manager Valve</li>
<li>After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now</li>
<li>For now I will just contact them to have them update their contact info in the bot&rsquo;s user agent, but eventually I think I&rsquo;ll tell them to swap out the CGIAR Library entry for CGSpace</li>
</ul>
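<p>Comparing the 2AM hour against other hours by hand gets tedious, so a per-hour breakdown of unique sessions for a whole day can help. A minimal sketch, assuming each DSpace log line starts with <code>YYYY-MM-DD HH:MM:SS</code> and carries a <code>session_id=</code> token as in the excerpts above; the function name <code>sessions_per_hour</code> is made up for illustration:</p>

```shell
# Hypothetical helper: print "<hour> <unique sessions>" for a DSpace log read from stdin.
sessions_per_hour() {
    awk '
        match($0, /session_id=[A-Z0-9]+/) {
            hour = substr($2, 1, 2)                        # HH from the time field
            key = hour SUBSEP substr($0, RSTART, RLENGTH)  # one entry per (hour, session)
            if (!(key in seen)) { seen[key] = 1; count[hour]++ }
        }
        END { for (h in count) print h, count[h] }
    ' | sort
}

# Example: sessions_per_hour < dspace.log.2017-10-29
```

<p>A session that spans two hours is counted once in each, so the hourly figures can add up to more than the daily total.</p>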
<h2 id="2017-10-30">2017-10-30</h2>
<ul>
<li>Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)</li>
<li>Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM pg_stat_activity;
...
(93 rows)
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre tabindex="0"><code># grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log
26475
# grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Crawler Session Manager Valve but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
<li>&hellip; and most of their requests are for dynamic discover pages:</li>
</ul>
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &#34;GET /discover&#34;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
763 157.55.39.231
782 207.46.13.90
998 66.249.66.90
1948 104.196.152.243
4247 190.19.92.5
31602 137.108.70.6
31636 137.108.70.7
</code></pre><ul>
<li>At least we know the top two are CORE, but who are the others?</li>
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2811
</code></pre><ul>
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
<li>I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them</li>
<li>Actually, according to the Tomcat docs, we could use an IP with <code>crawlerIps</code>: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve</a></li>
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property &#39;crawlerIps&#39; to &#39;190\.19\.92\.5|104\.196\.152\.243&#39; did not find a matching property.
</code></pre><ul>
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)&#39; dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
</code></pre><ul>
<li>I will check again tomorrow</li>
</ul>
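<p>Rather than grepping one IP at a time, the session counts for every client can be tabulated in one pass, which makes session-hungry scrapers stand out immediately. A rough sketch, assuming the <code>session_id=...:ip_addr=...</code> token layout seen in the DSpace log excerpts above and IPv4 addresses only; the function name <code>sessions_per_ip</code> is hypothetical:</p>

```shell
# Hypothetical helper: print "<unique sessions> <ip>" per client IP, busiest last.
# Reads a DSpace log on stdin.
sessions_per_ip() {
    grep -o -E 'session_id=[A-Z0-9]+:ip_addr=[0-9.]+' \
        | sort -u \
        | sed 's/^.*ip_addr=//' \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}' \
        | sort -n
}

# Example: sessions_per_ip < dspace.log.2017-10-30 | tail
```

<p>Clients that reuse their session show up with a count of one or two no matter how many requests they make, while clients that discard cookies end up with one session per request.</p>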
<h2 id="2017-10-31">2017-10-31</h2>
<ul>
<li>Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again</li>
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
<li>I&rsquo;ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace</li>
<li>Also, I asked if they could perhaps use the <code>sitemap.xml</code>, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets</li>
<li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of packages to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
</ul>
<pre tabindex="0"><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
</code></pre><ul>
<li>According to Uptime Robot CGSpace went down and up a few times</li>
<li>I had a look at goaccess and I saw that CORE was actively indexing</li>
<li>Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)</li>
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | grep -o -E &#34;GET /(discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
<li>I tested a URL of pattern <code>/discover</code> in Google&rsquo;s webmaster tools and it was indeed identified as blocked</li>
<li>I will send feedback to the CORE bot team</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>