cgspace-notes/content/post/2017-11.md

273 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "November, 2017"
date: 2017-11-02T09:37:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---
## 2017-11-01
- The CORE developers responded to say they are looking into their bot not respecting our robots.txt
## 2017-11-02
- Today there have been no hits by CORE and no alerts from Linode (coincidence?)
```
# grep -c "CORE" /var/log/nginx/access.log
0
```
- Generate list of authors on CGSpace for Peter to go through and correct:
```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
```
<!--more-->
- Abenet asked if it would be possible to generate a report of items in Listing and Reports that had "International Fund for Agricultural Development" as the *only* investor
- I opened a ticket with Atmire to ask if this was possible: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540
- Work on making the thumbnails in the item view clickable
- Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link
```
//mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
```
- METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml
- I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!
- Help proof fifty-three CIAT records for Sisay: https://dspacetest.cgiar.org/handle/10568/95895
- A handful of issues with `cg.place` using format like "Lima, PE" instead of "Lima, Peru"
- Also, some dates like with completely invalid format like "2010- 06" and "2011-3-28"
- I also collapsed some consecutive whitespace on a handful of fields
## 2017-11-03
- Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports
- I said I'd ask Abenet if she wants that feature
## 2017-11-04
- I finished looking through Sisay's CIAT records for the "Alianzas de Aprendizaje" data
- I corrected about half of the authors to standardize them
- Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM
- It's the first time in a few days that this has happened
- I had a look to see what was going on, but it isn't the CORE bot:
```
# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
507 40.77.167.16
618 157.55.39.161
652 207.46.13.103
666 157.55.39.254
1173 104.196.152.243
1737 66.249.66.90
23101 138.201.52.218
```
- 138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:
```
# zgrep -c 138.201.52.218 /var/log/nginx/access.log*
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:0
/var/log/nginx/access.log.6.gz:0
```
- It's clearly a bot as it's making tens of thousands of requests, but it's using a "normal" user agent:
```
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
```
- For now I don't know what this user is!
## 2017-11-05
- Peter asked if I could fix the appearance of "International Livestock Research Institute" in the author lookup during item submission
- It looks to be just an issue with the user interface expecting authors to have both a first and last name:
![Author lookup](/cgspace-notes/2017/11/author-lookup.png)
![Add author](/cgspace-notes/2017/11/add-author.png)
- But in the database the authors are correct (none with weird `, /` characters):
```
dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
text_value | authority | confidence
--------------------------------------------+--------------------------------------+------------
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | 0
International Livestock Research Institute | | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 600
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | -1
International Livestock Research Institute | | 600
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 500
(8 rows)
```
- So I'm not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval
- Looking at monitoring Tomcat's JVM heap with Prometheus, it looks like we need to use JMX + [jmx_exporter](https://github.com/prometheus/jmx_exporter)
- This guide shows how to [enable JMX in Tomcat](https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/) by modifying `CATALINA_OPTS`
- I was able to successfully connect to my local Tomcat with jconsole!
## 2017-11-07
- CGSpace when down and up a few times this morning, first around 3AM, then around 7
- Tsega had to restart Tomcat 7 to fix it temporarily
- I will start by looking at bot usage (access.log.1 includes usage until 6AM today):
```
# cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
1131 68.180.229.254
1583 66.249.66.90
1953 207.46.13.103
1999 207.46.13.80
2021 157.55.39.161
2034 207.46.13.36
4681 104.196.152.243
```
- 104.196.152.243 seems to be a top scraper for a few weeks now:
```
# zgrep -c 104.196.152.243 /var/log/nginx/access.log*
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
/var/log/nginx/access.log.3.gz:3532
/var/log/nginx/access.log.4.gz:5786
/var/log/nginx/access.log.5.gz:8542
/var/log/nginx/access.log.6.gz:6988
/var/log/nginx/access.log.7.gz:7517
/var/log/nginx/access.log.8.gz:7211
/var/log/nginx/access.log.9.gz:2763
```
- This user is responsible for hundreds and sometimes thousands of Tomcat sessions:
```
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
954
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
6199
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
7051
```
- The worst thing is that this user never specifies a user agent string so we can't lump it in with the other bots using the Tomcat Session Crawler Manager Valve
- They don't request dynamic URLs like "/discover" but they seem to be fetching handles from XMLUI instead of REST (and some with `//handle`, note the regex below):
```
# grep -c 104.196.152.243 /var/log/nginx/access.log.1
4681
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
4618
```
- I just realized that `ciat.cgiar.org` points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
- The next IP (207.46.13.36) seem to be Microsoft's bingbot, but all its requests specify the "bingbot" user agent and there are no requests for dynamic URLs that are forbidden, like "/discover":
```
$ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2034
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
- The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:
```
# grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
- The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like "/discover":
```
# grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
5997
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "bingbot"
5988
# grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
- The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like "/discover":
```
# grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
3048
# grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
- The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like "/discover":
```
# grep -c 68.180.229.254 /var/log/nginx/access.log.1
1131
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
0
```
- The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:
```
# grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
2950
# grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
330
```
- Their user agents vary, ie:
- `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36`
- `Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11`
- `Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)`
- I'll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
- While it's not in the top ten, Baidu is one bot that seems to not give a fuck:
```
# grep -c Baiduspider /var/log/nginx/access.log.1
8068
# grep Baiduspider /var/log/nginx/access.log.1 | grep -c -E "GET /(browse|discover)"
1431
```
- According to their documentation their bot [respects `robots.txt`](http://www.baidu.com/search/robots_english.html), but I don't see this being the case
- I think I will end up blocking Baidu as well...
- Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed
- I should look in nginx access.log, rest.log, oai.log, and DSpace's dspace.log.2017-11-07
- Here are the top IPs during 210 AM:
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
470 104.196.152.243
470 197.210.168.174
598 207.46.13.103
603 157.55.39.161
637 207.46.13.80
703 207.46.13.36
724 66.249.66.90
```
- Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot