Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<li>Abenet asked if it would be possible to generate a report of items in Listing and Reports that had “International Fund for Agricultural Development” as the <em>only</em> investor</li>
<li>I opened a ticket with Atmire to ask if this was possible: <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540</a></li>
<li>Work on making the thumbnails in the item view clickable</li>
<li>Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link</li>
<li>METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml</li>
<li>I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!</li>
<li>Help proof fifty-three CIAT records for Sisay: <ahref="https://dspacetest.cgiar.org/handle/10568/95895">https://dspacetest.cgiar.org/handle/10568/95895</a></li>
<li>A handful of issues with <code>cg.place</code> using format like “Lima, PE” instead of “Lima, Peru”</li>
<li>Also, some dates like with completely invalid format like “2010- 06” and “2011-3-28”</li>
<li>I also collapsed some consecutive whitespace on a handful of fields</li>
<li>Peter asked if I could fix the appearance of “International Livestock Research Institute” in the author lookup during item submission</li>
<li>It looks to be just an issue with the user interface expecting authors to have both a first and last name:</li>
<li>But in the database the authors are correct (none with weird <code>, /</code> characters):</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
<li>Looking at monitoring Tomcat’s JVM heap with Prometheus, it looks like we need to use JMX + <ahref="https://github.com/prometheus/jmx_exporter">jmx_exporter</a></li>
<li>This guide shows how to <ahref="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">enable JMX in Tomcat</a> by modifying <code>CATALINA_OPTS</code></li>
<li>I was able to successfully connect to my local Tomcat with jconsole!</li>
<li>The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Session Crawler Manager Valve</li>
<li>They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with <code>//handle</code>, note the regex below):</li>
<li>I just realized that <code>ciat.cgiar.org</code> points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior</li>
<li>The next IP (207.46.13.36) seem to be Microsoft’s bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:</li>
<li><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36</code></li>
<li><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11</code></li>
<li><code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)</code></li>
</ul></li>
<li>I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs</li>
<li>While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:</li>
<li>According to their documentation their bot <ahref="http://www.baidu.com/search/robots_english.html">respects <code>robots.txt</code></a>, but I don’t see this being the case</li>
<li>I think I will end up blocking Baidu as well…</li>
<li>Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed</li>
<li>I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07</li>
<li>These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers</li>
<li>The number of requests isn’t even that high to be honest</li>
<li>As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:</li>
210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
</code></pre>
<ul>
<li>A Google search for “LCTE bot” doesn’t return anything interesting, but this <ahref="https://stackoverflow.com/questions/42500881/what-is-lcte-in-user-agent">Stack Overflow discussion</a> references the lack of information</li>
<li>So basically after a few hours of looking at the log files I am not closer to understanding what is going on!</li>
<li>I do know that we want to block Baidu, though, as it does not respect <code>robots.txt</code></li>
<li>And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12–14 hours)</li>
<li>At least for now it seems to be that new Chinese IP (124.17.34.59):</li>
<li>About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reuising their Tomcat session and they are creating thousands of sessions per day</li>
<li>I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API</li>
<li>It seems like our robots.txt file is valid, and they claim to recognize that URLs like <code>/discover</code> should be forbidden (不允许, aka “not allowed”):</li>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (<ahref="https://github.com/ilri/DSpace/pull/346">#346</a>)</li>
<li>I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent</li>
<li>Most bots are automatically lumped into one generic session by <ahref="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Tomcat’s Crawler Session Manager Valve</a> but this only works if their user agent matches a pre-defined regular expression like <code>.*[bB]ot.*</code></li>
<li>Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process</li>
<li>Basically, we modify the nginx config to add a mapping with a modified user agent <code>$ua</code>:</li>
</ul>
<pre><code>map $remote_addr $ua {
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
124.17.34.59 'ChineseBot';
default $http_user_agent;
}
</code></pre>
<ul>
<li>If the client’s address matches then the user agent is set, otherwise the default <code>$http_user_agent</code> variable is used</li>
<li>Then, in the server’s <code>/</code> block we pass this header to Tomcat:</li>
</ul>
<pre><code>proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
</code></pre>
<ul>
<li>Note to self: the <code>$ua</code> variable won’t show up in nginx access logs because the default <code>combined</code> log format doesn’t show it, so don’t run around pulling your hair out wondering with the modified user agents aren’t showing in the logs!</li>
<li>If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve</li>
<li>You can verify by cross referencing nginx’s <code>access.log</code> and DSpace’s <code>dspace.log.2017-11-08</code>, for example</li>
<li>I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on <ahref="#2017-11-07">2017-11-07</a> for example)</li>
<li>I merged the clickable thumbnails code to <code>5_x-prod</code> (<ahref="https://github.com/ilri/DSpace/pull/347">#347</a>) and will deploy it later along with the new bot mapping stuff (and re-run the Asible <code>nginx</code> and <code>tomcat</code> tags)</li>
<li>I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in <code>robots.txt</code>:</li>
<li>I have been looking for a reason to ban Baidu and this is definitely a good one</li>
<li>Disallowing <code>Baiduspider</code> in <code>robots.txt</code> probably won’t work because this bot doesn’t seem to respect the robot exclusion standard anyways!</li>
<li>Run system updates on CGSpace and reboot the server</li>
<li>Re-deploy latest <code>5_x-prod</code> branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)</li>
<li>Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:</li>
<li>This gets me thinking, I wonder if I can use something like nginx’s rate limiter to automatically change the user agent of clients who make too many requests</li>
<li>Perhaps using a combination of geo and map, like illustrated here: <ahref="https://www.nginx.com/blog/rate-limiting-nginx/">https://www.nginx.com/blog/rate-limiting-nginx/</a></li>
<li>The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:</li>
<li>Move some items and collections on CGSpace for Peter Ballantyne, running <ahref="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move_collections.sh</code></a> with the following configuration:</li>
</ul>
<pre><code>10947/6 10947/1 10568/83389
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
</code></pre>
<ul>
<li>I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn’t seem to respect disallowed URLs in robots.txt</li>
<li>There’s an interesting <ahref="https://www.nginx.com/blog/rate-limiting-nginx/">blog post from Nginx’s team about rate limiting</a> as well as a <ahref="https://gist.github.com/arosenhagen/8aaf5d7f94171778c0e9">clever use of mapping with rate limits</a></li>
<li>The solution <ahref="https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156">I came up with</a> uses tricks from both of those</li>
<li>I deployed the limit on CGSpace and DSpace Test and it seems to work well:</li>
</ul>
<pre><code>$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Sun, 12 Nov 2017 16:30:19 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
Content-Type: text/html
Date: Sun, 12 Nov 2017 16:30:21 GMT
Server: nginx
</code></pre>
<ul>
<li>The first request works, second is denied with an HTTP 503!</li>
<li>I need to remember to check the Munin graphs for PostgreSQL and JVM next week to see how this affects them</li>
<li>Helping Sisay proof 47 records for IITA: <ahref="https://dspacetest.cgiar.org/handle/10568/97029">https://dspacetest.cgiar.org/handle/10568/97029</a></li>
<li>From looking at the data in OpenRefine I found:
<ul>
<li>Errors in <code>cg.authorship.types</code></li>
<li>Errors in <code>cg.coverage.country</code> (smart quote in “COTE D’IVOIRE”, “HAWAII” is not a country)</li>
<li>Whitespace issues in some <code>cg.identifier.doi</code> fields and most values are using HTTP instead of HTTPS</li>
<li>Whitespace issues in some <code>dc.contributor.author</code> fields</li>
<li>Issue with invalid <code>dc.date.issued</code> value “2011-3”</li>
<li>Description fields are poorly copy–pasted</li>
<li>Whitespace issues in <code>dc.description.sponsorship</code></li>
<li>Lots of inconsistency in <code>dc.format.extent</code> (mixed dash style, periods at the end of values)</li>
<li>Whitespace errors in <code>dc.identifier.citation</code></li>
<li>Whitespace errors in <code>dc.subject</code></li>
<li>Whitespace errors in <code>dc.title</code></li>
</ul></li>
<li>After uploading and looking at the data in DSpace Test I saw more errors with CRPs, subjects (one item had four copies of all of its subjects, another had a “.” in it), affiliations, sponsors, etc.</li>
<li>Atmire responded to the <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510">ticket about ORCID stuff</a> a few days ago, today I told them that I need to talk to Peter and the partners to see what we would like to do</li>
</ul>
<h2id="2017-11-14">2017-11-14</h2>
<ul>
<li>Deploy some nginx configuration updates to CGSpace</li>
<li>They had been waiting on a branch for a few months and I think I just forgot about them</li>
<li>I have been running them on DSpace Test for a few days and haven’t seen any issues there</li>
<li>I need to look into using JMX to analyze active sessions I think, rather than looking at log files</li>
<li>After adding appropriate <ahref="https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/">JMX listener options to Tomcat’s JAVA_OPTS</a> and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:</li>
<li>Looking at the MBeans you can drill down in Catalina→Manager→<webapp>→localhost→Attributes and see active sessions, etc</li>
<li>I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example</li>
<li>If I hit the server with some requests as a normal user I see the session counter increase, but if I specify a bot user agent then the sessions seem to be reused (meaning the Crawler Session Manager is working)</li>
<li>Here is the Jconsole screen after looping <code>http --print Hh https://dspacetest.cgiar.org/handle/10568/1</code> for a few minutes:</li>
</ul>
<p><imgsrc="/cgspace-notes/2017/11/jconsole-sessions.png"alt="Jconsole sessions for XMLUI"/></p>
<li>We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity</li>
<li>In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)</li>
<li>I found <ahref="https://www.cakesolutions.net/teamblogs/low-pause-gc-on-the-jvm">an article about JVM tuning</a> that gives some pointers how to enable logging and tools to analyze logs for you</li>
<li>Also notes on <ahref="https://blog.gceasy.io/2016/11/15/rotating-gc-log-files/">rotating GC logs</a></li>
<li>I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven’t even tried to monitor or tune it</li>