November, 2017
2017-11-01
- The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
```
# grep -c "CORE" /var/log/nginx/access.log
0
```
Generate list of authors on CGSpace for Peter to go through and correct:
```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
```
- Abenet asked if it would be possible to generate a report of items in Listings and Reports that had “International Fund for Agricultural Development” as the only investor
- I opened a ticket with Atmire to ask if this was possible: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540
- Work on making the thumbnails in the item view clickable
Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link:

```
//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
```

METS XML is available for all items with this pattern: `/metadata/handle/10568/95947/mets.xml`
I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!
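For example, here is one way to sanity check that XPath from the command line (a rough sketch, assuming `xmlstarlet` is installed; the handle is the same one as above):

```
$ curl -s 'https://dspacetest.cgiar.org/metadata/handle/10568/95947/mets.xml' > mets.xml
$ xmlstarlet sel -N mets='http://www.loc.gov/METS/' -N xlink='http://www.w3.org/1999/xlink' \
    -t -v "//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href" -n mets.xml
```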
Help proof fifty-three CIAT records for Sisay: https://dspacetest.cgiar.org/handle/10568/95895
A handful of issues with `cg.place` using the format “Lima, PE” instead of “Lima, Peru”
Also, some dates with completely invalid formats, like “2010- 06” and “2011-3-28”
I also collapsed some consecutive whitespace on a handful of fields
2017-11-03
- Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports
- I said I’d ask Abenet if she wants that feature
2017-11-04
- I finished looking through Sisay’s CIAT records for the “Alianzas de Aprendizaje” data
- I corrected about half of the authors to standardize them
- Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM
- It’s the first time in a few days that this has happened
I had a look to see what was going on, but it isn’t the CORE bot:
```
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    306 68.180.229.31
    323 61.148.244.116
    414 66.249.66.91
    507 40.77.167.16
    618 157.55.39.161
    652 207.46.13.103
    666 157.55.39.254
   1173 104.196.152.243
   1737 66.249.66.90
  23101 138.201.52.218
```
138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:
```
# zgrep -c 138.201.52.218 /var/log/nginx/access.log*
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:0
/var/log/nginx/access.log.6.gz:0
```
It’s clearly a bot as it’s making tens of thousands of requests, but it’s using a “normal” user agent:
```
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
```
For now I don’t know what this user is!
2017-11-05
- Peter asked if I could fix the appearance of “International Livestock Research Institute” in the author lookup during item submission
- It looks to be just an issue with the user interface expecting authors to have both a first and last name:
But in the database the authors are correct (none with weird `,` or `/` characters):

```
dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
                 text_value                 |              authority               | confidence
--------------------------------------------+--------------------------------------+------------
 International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |          0
 International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |          0
 International Livestock Research Institute |                                      |         -1
 International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        600
 International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc |         -1
 International Livestock Research Institute |                                      |        600
 International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |         -1
 International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |        500
(8 rows)
```
So I’m not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
Looking at monitoring Tomcat’s JVM heap with Prometheus, it looks like we need to use JMX + jmx_exporter
This guide shows how to enable JMX in Tomcat by modifying `CATALINA_OPTS`
I was able to successfully connect to my local Tomcat with jconsole!
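As a rough sketch of where the Prometheus part is headed: the jmx_exporter runs as a Java agent inside Tomcat, so it would also be appended to `CATALINA_OPTS` (the jar path, port, and config file name below are illustrative, not something that is deployed yet):

```
# e.g. in Tomcat's setenv.sh (jar path, port, and config file are illustrative)
CATALINA_OPTS="$CATALINA_OPTS \
  -javaagent:/opt/jmx_prometheus_javaagent.jar=9100:/opt/jmx_exporter.yaml"
```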
2017-11-07
- CGSpace went down and came back up a few times this morning, first around 3AM, then again around 7AM
- Tsega had to restart Tomcat 7 to fix it temporarily
I will start by looking at bot usage (access.log.1 includes usage until 6AM today):
```
# cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    619 65.49.68.184
    840 65.49.68.199
    924 66.249.66.91
   1131 68.180.229.254
   1583 66.249.66.90
   1953 207.46.13.103
   1999 207.46.13.80
   2021 157.55.39.161
   2034 207.46.13.36
   4681 104.196.152.243
```
104.196.152.243 seems to be a top scraper for a few weeks now:
```
# zgrep -c 104.196.152.243 /var/log/nginx/access.log*
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
/var/log/nginx/access.log.3.gz:3532
/var/log/nginx/access.log.4.gz:5786
/var/log/nginx/access.log.5.gz:8542
/var/log/nginx/access.log.6.gz:6988
/var/log/nginx/access.log.7.gz:7517
/var/log/nginx/access.log.8.gz:7211
/var/log/nginx/access.log.9.gz:2763
```
This user is responsible for hundreds and sometimes thousands of Tomcat sessions:
```
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
6199
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
7051
```
The worst thing is that this user never specifies a user agent string, so we can’t lump it in with the other bots using Tomcat’s Crawler Session Manager Valve
They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with `//handle`, note the regex below):

```
# grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
4618
```
I just realized that `ciat.cgiar.org` points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
The next IP (207.46.13.36) seems to be Microsoft’s bingbot, but all of its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
```
$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:
```
# grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
```
# grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
5997
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "bingbot"
5988
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
```
# grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like “/discover”:
```
# grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:
```
# grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
2950
# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
330
```
Their user agents vary, e.g.:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)
```
I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
8912
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
2521
```
According to their documentation their bot respects `robots.txt`, but I don’t see this being the case
I think I will end up blocking Baidu as well…
Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed
I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
Here are the top IPs making requests to XMLUI from 2 to 8 AM:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    279 66.249.66.91
    373 65.49.68.199
    446 68.180.229.254
    470 104.196.152.243
    470 197.210.168.174
    598 207.46.13.103
    603 157.55.39.161
    637 207.46.13.80
    703 207.46.13.36
    724 66.249.66.90
```
Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot
Here are the top IPs making requests to REST from 2 to 8 AM:
```
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      8 207.241.229.237
     10 66.249.66.90
     16 104.196.152.243
     25 41.60.238.61
     26 157.55.39.161
     27 207.46.13.103
     27 207.46.13.80
     31 207.46.13.36
   1498 50.116.102.77
```
The OAI requests during that same time period are nothing to worry about:
```
# cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      1 66.249.66.92
      4 66.249.66.90
      6 68.180.229.254
```
The top IPs from dspace.log during the 2–8 AM period:
```
$ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
    143 ip_addr=213.55.99.121
    181 ip_addr=66.249.66.91
    223 ip_addr=157.55.39.161
    248 ip_addr=207.46.13.80
    251 ip_addr=207.46.13.103
    291 ip_addr=207.46.13.36
    297 ip_addr=197.210.168.174
    312 ip_addr=65.49.68.199
    462 ip_addr=104.196.152.243
    488 ip_addr=66.249.66.90
```
These aren’t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers
The number of requests isn’t even that high to be honest
As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:
```
# zgrep -c 124.17.34.59 /var/log/nginx/access.log*
/var/log/nginx/access.log:22581
/var/log/nginx/access.log.1:0
/var/log/nginx/access.log.2.gz:14
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:3
/var/log/nginx/access.log.6.gz:0
/var/log/nginx/access.log.7.gz:0
/var/log/nginx/access.log.8.gz:0
/var/log/nginx/access.log.9.gz:1
```
The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
```
# grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
    210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
  22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
```
A Google search for “LCTE bot” doesn’t return anything interesting, but this Stack Overflow discussion references the lack of information
So basically after a few hours of looking at the log files I am not closer to understanding what is going on!
I do know that we want to block Baidu, though, as it does not respect `robots.txt`
And as we speak Linode alerted that the outbound traffic rate has been very high for the past two hours (roughly 12:00–14:00)
At least for now it seems to be that new Chinese IP (124.17.34.59):
```
# grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    198 207.46.13.103
    203 207.46.13.80
    205 207.46.13.36
    218 157.55.39.161
    249 45.5.184.221
    258 45.5.187.130
    386 66.249.66.90
    410 197.210.168.174
   1896 104.196.152.243
  11005 124.17.34.59
```
It seems 124.17.34.59 is really downloading all our PDFs, compared to the next most active IPs during this time!
```
# grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
5948
# grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
0
```
About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat sessions and they are creating thousands of sessions per day
All CIAT requests vs unique ones:
```
$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
3506
$ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
3506
```
I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API
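For reference, paging through a collection’s items with the REST API looks something like this (the collection ID here is purely illustrative):

```
# fetch items 100 at a time instead of scraping the XMLUI pages
$ curl -s -H "Accept: application/json" 'https://cgspace.cgiar.org/rest/collections/1179/items?limit=100&offset=0'
```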
About Baidu, I found a link to their robots.txt tester tool
It seems like our robots.txt file is valid, and they claim to recognize that URLs like `/discover` should be forbidden (不允许, aka “not allowed”)
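Just to be sure the rule is actually being served, it’s easy to check from the command line:

```
$ curl -s https://cgspace.cgiar.org/robots.txt | grep -i discover
```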
But they literally just made this request today:
```
180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
```
Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:
```
# grep -c Baiduspider /var/log/nginx/access.log
3806
# grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
1085
```
I will think about blocking their IPs but they have 164 of them!
```
# grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
164
```
2017-11-08
- Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
- Linode sent another alert about CPU usage in the morning at 6:12AM
Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
24981
```
This is about 20,000 Tomcat sessions:
```
$ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
20733
```
I’m getting really sick of this
Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
Run system updates on DSpace Test and reboot the server
Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (#346)
I figured out a way to use nginx’s map function to assign a “bot” user agent to misbehaving clients who don’t define a user agent
Most bots are automatically lumped into one generic session by Tomcat’s Crawler Session Manager Valve, but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
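For reference, the Valve I mean is configured in Tomcat’s server.xml, something like this (placement and the regex shown are the Tomcat defaults, as far as I recall):

```
<!-- e.g. inside the <Host> element of Tomcat's server.xml; the regex is the default -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*" />
```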
Basically, we modify the nginx config to add a mapping with a modified user agent, `$ua`:

```
map $remote_addr $ua {
    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
    124.17.34.59     'ChineseBot';
    default          $http_user_agent;
}
```
If the client’s address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
Then, in the server’s `/` block we pass this header to Tomcat:

```
proxy_pass http://tomcat_http;
proxy_set_header User-Agent $ua;
```
Note to self: the `$ua` variable won’t show up in nginx access logs because the default `combined` log format doesn’t show it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs!
If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
You can verify by cross referencing nginx’s `access.log` and DSpace’s `dspace.log.2017-11-08`, for example
I will deploy this on CGSpace later this week
I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on 2017-11-07 for example)
I merged the clickable thumbnails code to `5_x-prod` (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible `nginx` and `tomcat` tags)
I was thinking about Baidu again and decided to see how many requests they have made versus Google to URL paths that are explicitly forbidden in `robots.txt`:

```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
22229
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
0
```
It seems that they rarely even bother checking `robots.txt`, but Google does so multiple times per day!

```
# zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
```
I have been looking for a reason to ban Baidu and this is definitely a good one
Disallowing `Baiduspider` in `robots.txt` probably won’t work because this bot doesn’t seem to respect the robot exclusion standard anyways!
I will whip up something in nginx later
Run system updates on CGSpace and reboot the server
Re-deploy latest `5_x-prod` branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)
2017-11-09
Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today; here are the total requests and unique sessions:
```
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
8956
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
223
```
Versus the same stats for yesterday and the day before:
```
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243
10216
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2592
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep '07/Nov/2017' | grep -c 104.196.152.243
8120
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
3506
```
The number of sessions is more than ten times lower!
This gets me thinking, I wonder if I can use something like nginx’s rate limiter to automatically change the user agent of clients who make too many requests
Perhaps using a combination of geo and map, like illustrated here: https://www.nginx.com/blog/rate-limiting-nginx/
2017-11-11
- I was looking at the Google index and noticed there are 4,090 search results for dspace.ilri.org but only seven for mahider.ilri.org
- Search with something like: inurl:dspace.ilri.org inurl:https
- I want to get rid of those legacy domains eventually!
2017-11-12
- Update the Ansible infrastructure templates to be a little more modular and flexible
Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    243 5.83.120.111
    335 40.77.167.103
    424 66.249.66.91
    529 207.46.13.36
    554 40.77.167.129
    604 207.46.13.53
    754 104.196.152.243
    883 66.249.66.90
   1150 95.108.181.88
   1381 5.9.6.51
```
5.9.6.51 seems to be a Russian bot:
```
# grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
```
What’s amazing is that it seems to reuse its Java session across all requests:
```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
1558
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
```
Bravo to MegaIndex.ru!
The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
```
# grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
991
```
Move some items and collections on CGSpace for Peter Ballantyne, running `move_collections.sh` with the following configuration:

```
10947/6    10947/1 10568/83389
10947/34   10947/1 10568/83389
10947/2512 10947/1 10568/83389
```
I explored nginx rate limits as a way to aggressively throttle the Baidu bot, which doesn’t seem to respect disallowed URLs in robots.txt
There’s an interesting blog post from Nginx’s team about rate limiting as well as a clever use of mapping with rate limits
The solution I came up with uses tricks from both of those
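Roughly, the combined approach looks something like this (a simplified sketch rather than the exact config I deployed; the zone name, rate, and burst are illustrative):

```
# key is empty for normal clients, and requests with an empty key are not rate limited
map $http_user_agent $limit_bots {
    default          '';
    ~*Baiduspider    $binary_remote_addr;
}

# clients with a non-empty key share this zone: roughly one request per minute each
limit_req_zone $limit_bots zone=badbots:10m rate=1r/m;

server {
    location / {
        limit_req zone=badbots burst=5;
        proxy_pass http://tomcat_http;
    }
}
```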
I deployed the limit on CGSpace and DSpace Test and it seems to work well:
```
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Sun, 12 Nov 2017 16:30:19 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
Content-Type: text/html
Date: Sun, 12 Nov 2017 16:30:21 GMT
Server: nginx
```
The first request works, but the second is denied with an HTTP 503!
I need to remember to check the Munin graphs for PostgreSQL and JVM next week to see how this affects them
2017-11-13
At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:
```
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
1132
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
10105
```
Helping Sisay proof 47 records for IITA: https://dspacetest.cgiar.org/handle/10568/97029
From looking at the data in OpenRefine I found:
- Errors in `cg.authorship.types`
- Errors in `cg.coverage.country` (smart quote in “COTE D’IVOIRE”, “HAWAII” is not a country)
- Whitespace issues in some `cg.contributor.affiliation`
- Whitespace issues in some `cg.identifier.doi` fields, and most values are using HTTP instead of HTTPS
- Whitespace issues in some `dc.contributor.author` fields
- Issue with invalid `dc.date.issued` value “2011-3”
- Description fields are poorly copy–pasted
- Whitespace issues in `dc.description.sponsorship`
- Lots of inconsistency in `dc.format.extent` (mixed dash style, periods at the end of values)
- Whitespace errors in `dc.identifier.citation`
- Whitespace errors in `dc.subject`
- Whitespace errors in `dc.title`
After uploading and looking at the data in DSpace Test I saw more errors with CRPs, subjects (one item had four copies of all of its subjects, another had a “.” in it), affiliations, sponsors, etc.
Atmire responded to the ticket about ORCID stuff a few days ago, today I told them that I need to talk to Peter and the partners to see what we would like to do
2017-11-14
- Deploy some nginx configuration updates to CGSpace
- They had been waiting on a branch for a few months and I think I just forgot about them
- I have been running them on DSpace Test for a few days and haven’t seen any issues there
- Started testing DSpace 6.2 and a few things have changed
Now PostgreSQL needs `pgcrypto`:

```
$ psql dspace6
dspace6=# CREATE EXTENSION pgcrypto;
```
Also, local settings are no longer in `build.properties`; they are now in `local.cfg`
I’m not sure if we can use separate profiles like we did before with `mvn -Denv=blah` to use `blah.properties`
It seems we need to use “system properties” to override settings, e.g. `-Ddspace.dir=/Users/aorth/dspace6`
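As far as I can tell the basic workflow is something like this (a sketch; paths and the exact build steps may differ):

```
# in the DSpace 6 source directory; the example file ships with the source
$ cp dspace/config/local.cfg.EXAMPLE local.cfg
$ vim local.cfg                        # set dspace.dir, db.url, db.username, etc.
$ mvn -U clean package
$ cd dspace/target/dspace-installer && ant update
```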
2017-11-15
- Send Adam Hunt an invite to the DSpace Developers network on Yammer
- He is the new head of communications at WLE, since Michael left
- Merge changes to item view’s wording of link metadata (#348)
2017-11-17
- Uptime Robot said that CGSpace went down today and I see lots of `Timeout waiting for idle object` errors in the DSpace logs
- I looked in PostgreSQL using `SELECT * FROM pg_stat_activity;` and saw that there were 73 active connections
- After a few minutes the connections went down to 44 and CGSpace was kinda back up; it seems like Tsega restarted Tomcat
Looking at the REST and XMLUI log files, I don’t see anything too crazy:
```
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     13 66.249.66.223
     14 207.46.13.36
     17 207.46.13.137
     22 207.46.13.23
     23 66.249.66.221
     92 66.249.66.219
    187 104.196.152.243
   1400 70.32.83.92
   1503 50.116.102.77
   6037 45.5.184.196
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    325 139.162.247.24
    354 66.249.66.223
    422 207.46.13.36
    434 207.46.13.23
    501 207.46.13.137
    647 66.249.66.221
    662 34.192.116.178
    762 213.55.99.121
   1867 104.196.152.243
   2020 66.249.66.219
```
I need to look into using JMX to analyze active sessions I think, rather than looking at log files
After adding appropriate JMX listener options to Tomcat’s JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:
```
$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
```
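For reference, the JMX listener options themselves look roughly like this in Tomcat’s `JAVA_OPTS` (values are illustrative; disabling authentication is only reasonable here because the port is not exposed publicly and access goes over the SSH tunnel):

```
# e.g. in /etc/default/tomcat7 or Tomcat's setenv.sh (illustrative values)
JAVA_OPTS="$JAVA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9000 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Djava.rmi.server.hostname=localhost"
```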
Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc
I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example
If I hit the server with some requests as a normal user I see the session counter increase, but if I specify a bot user agent then the sessions seem to be reused (meaning the Crawler Session Manager is working)
Here is the Jconsole screen after looping `http --print Hh https://dspacetest.cgiar.org/handle/10568/1` for a few minutes:
- Switch DSpace Test to using G1GC for the JVM so I can see what the JVM graph looks like eventually, and start evaluating it for production
2017-11-19
- Linode sent an alert that CGSpace was using a lot of CPU around 4–6 AM
Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    111 66.249.66.155
    171 5.9.6.51
    188 54.162.241.40
    229 207.46.13.23
    233 207.46.13.137
    247 40.77.167.6
    251 207.46.13.36
    275 68.180.229.254
    325 104.196.152.243
   1610 66.249.66.153
```
66.249.66.153 appears to be Googlebot:
```
66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```
We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity
In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries):

```
$ wc -l dspace.log.2017-11-19
388472 dspace.log.2017-11-19
$ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
267494
```
WTF is this process doing every day, and for so many hours?
In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:
```
2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
```
It’s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:
2017-11-20
- I found an article about JVM tuning that gives some pointers how to enable logging and tools to analyze logs for you
- Also notes on rotating GC logs
- I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven’t even tried to monitor or tune it
2017-11-21
Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:
```
2017-11-21 11:11:09,621 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
```
2017-11-22
- Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM
The logs don’t show anything particularly abnormal between those hours:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    136 31.6.77.23
    174 68.180.229.254
    217 66.249.66.91
    256 157.55.39.79
    268 54.144.57.183
    281 207.46.13.137
    282 207.46.13.36
    290 207.46.13.23
    696 66.249.66.90
    707 104.196.152.243
```
I haven’t seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org
In other news, it looks like the JVM garbage collection pattern is back to its standard jigsaw pattern after switching back to CMS a few days ago:
2017-11-23
- Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM
I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs:

```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
     88 66.249.66.91
    140 68.180.229.254
    155 54.196.2.131
    182 54.224.164.166
    301 157.55.39.79
    315 207.46.13.36
    331 207.46.13.23
    358 207.46.13.137
    565 104.196.152.243
   1570 66.249.66.90
```
… and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):
```
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      5 190.120.6.219
      6 104.198.9.108
     14 104.196.152.243
     21 112.134.150.6
     22 157.55.39.79
     22 207.46.13.137
     23 207.46.13.36
     26 207.46.13.23
    942 45.5.184.196
   3995 70.32.83.92
```
These IPs crawling the REST API don’t specify user agents and I’d assume they are creating many Tomcat sessions
I would catch them in nginx and assign them a “bot” user agent so that the Tomcat Crawler Session Manager Valve could deal with them, but they don’t actually seem to be creating many sessions anyway, at least not according to dspace.log:
```
$ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2
```
I’m wondering if REST works differently, or just doesn’t log these sessions?
I wonder if they are measurable via JMX MBeans?
I did some tests locally and I don’t see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI
I came across some interesting PostgreSQL tuning advice for SSDs: https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0
Apparently setting `random_page_cost` to 1 is “common” advice for systems running PostgreSQL on SSD (the default is 4)
So I deployed this on DSpace Test and will check the Munin PostgreSQL graphs in a few days to see if anything changes
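For the record, one way to apply the change (a sketch, assuming PostgreSQL 9.4 or newer; it could just as well go in postgresql.conf):

```
$ sudo -u postgres psql -c "ALTER SYSTEM SET random_page_cost = 1;"
$ sudo -u postgres psql -c "SELECT pg_reload_conf();"
```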
2017-11-24
- It’s too early to tell for sure, but after I made the `random_page_cost` change on DSpace Test’s PostgreSQL yesterday the number of connections dropped drastically:
- There have been other temporary drops before, but if I look at the past month and actually the whole year, the trend is that connections are four or five times higher on average:
- I just realized that we’re not logging access requests to other vhosts on CGSpace, so it’s possible I have no idea that we’re getting slammed at 4AM on another domain that we’re just silently redirecting to cgspace.cgiar.org
- I’ve enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there
- In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)
I also noticed that CGNET appears to be monitoring the old domain every few minutes:
```
192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
```
I should probably tell CGIAR people to have CGNET stop that
2017-11-26
- Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM
Yet another mystery because the load for all domains looks fine at that time:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    190 66.249.66.83
    195 104.196.152.243
    220 40.77.167.82
    246 207.46.13.137
    247 68.180.229.254
    257 157.55.39.214
    289 66.249.66.91
    298 157.55.39.206
    379 66.249.66.70
   1855 66.249.66.90
```
2017-11-29
- Linode alerted that CGSpace was using 279% CPU from 6 to 8 AM this morning
- About an hour later Uptime Robot said that the server was down
Here are all the top XMLUI and REST users from today:
```
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    540 66.249.66.83
    659 40.77.167.36
    663 157.55.39.214
    681 157.55.39.206
    733 157.55.39.158
    850 66.249.66.70
   1311 66.249.66.90
   1340 104.196.152.243
   4008 70.32.83.92
   6053 45.5.184.196
```
PostgreSQL activity shows 69 connections
I don’t have time to troubleshoot more as I’m in Nairobi working on the HPC so I just restarted Tomcat for now
A few hours later Uptime Robot says the server is down again
I don’t see much activity in the logs but there are 87 PostgreSQL connections
But shit, there were 10,000 unique Tomcat sessions today:
```
$ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
10037
```
Although maybe that’s not much, as the previous two days had more:
```
$ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
12377
$ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
16984
```
I think we just need to start increasing the number of allowed PostgreSQL connections instead of fighting this, as it’s the most common source of crashes we have
I will bump DSpace’s `db.maxconnections` from 60 to 90, and PostgreSQL’s `max_connections` from 183 to 273 (which is using my loose formula of 90 * webapps + 3)
I really need to figure out how to get DSpace to use a PostgreSQL connection pool
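Concretely, that means something like the following (a sketch; note that unlike `random_page_cost`, changing `max_connections` requires a PostgreSQL restart rather than a reload):

```
# DSpace 5.x dspace.cfg:  db.maxconnections = 90
$ sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 273;"
$ sudo systemctl restart postgresql
```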
2017-11-30
- Linode alerted about high CPU usage on CGSpace again around 6 to 8 AM
- Then Uptime Robot said CGSpace was down a few minutes later, but it resolved itself I think (or Tsega restarted Tomcat, I don’t know)