mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2017-09-13
@ -9,12 +9,12 @@ tags = ["Notes"]
- Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

<!--more-->

## 2017-09-07

- Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is in both the approvers step and the group

<!--more-->

## 2017-09-10

- Delete 58 blank metadata values from the CGSpace database:
@ -91,3 +91,90 @@ $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x
- Ideally there could also be a user interface for cleanup and merging of authorities
- He will prepare a quote for us, keeping in mind that this could be useful to contribute back to the community for a 5.x release
- As far as exposing ORCIDs as flat metadata alongside all other metadata, he says this should be possible and will work on a quote for us

## 2017-09-13

- Last night Linode sent an alert that CGSpace (linode18) had exceeded the outbound traffic rate threshold of 10Mb/s for the last two hours
- I wonder what was going on, and looking into the nginx logs I think maybe it's OAI...
- Here are yesterday's top ten IP addresses making requests to `/oai`:

```
# awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
1 213.136.89.78
1 66.249.66.90
1 66.249.66.92
3 68.180.229.31
4 35.187.22.255
13745 54.70.175.86
15814 34.211.17.113
15825 35.161.215.53
16704 54.70.51.7
```

- Compared to the previous day's logs it looks VERY high:

```
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
1 207.46.13.39
1 66.249.66.93
2 66.249.66.91
4 216.244.66.194
14 66.249.66.90
```
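
For comparing days like this, the pipeline above could be wrapped in a small shell function. This is only a sketch: the name `top_ips` is hypothetical, and it assumes the client IP is the first whitespace-separated field of each log line, as in the `awk` commands above.

```shell
# top_ips FILE [N]: print the N busiest client IPs in an access log (default 10).
# Hypothetical helper; assumes the client IP is the first field of each line.
# The body runs in a subshell so its variables do not leak into the caller.
top_ips() (
    file="$1"
    n="${2:-10}"
    # Count requests per IP, then sort by count so the busiest come last.
    awk '{print $1}' "$file" | sort | uniq -c | sort -n | tail -n "$n"
)
```

Running `top_ips /var/log/nginx/oai.log` and `top_ips /var/log/nginx/oai.log.1` should give the same counts as the two listings above, though IPs with equal counts may appear in a different order.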

- The user agents for those top IPs are:
    - 54.70.175.86: API scraper
    - 34.211.17.113: API scraper
    - 35.161.215.53: API scraper
    - 54.70.51.7: API scraper
- And this user agent has never been seen before today (or at least recently!):

```
# grep -c "API scraper" /var/log/nginx/oai.log
62088
# zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
/var/log/nginx/oai.log.10.gz:0
/var/log/nginx/oai.log.11.gz:0
/var/log/nginx/oai.log.12.gz:0
/var/log/nginx/oai.log.13.gz:0
/var/log/nginx/oai.log.14.gz:0
/var/log/nginx/oai.log.15.gz:0
/var/log/nginx/oai.log.16.gz:0
/var/log/nginx/oai.log.17.gz:0
/var/log/nginx/oai.log.18.gz:0
/var/log/nginx/oai.log.19.gz:0
/var/log/nginx/oai.log.20.gz:0
/var/log/nginx/oai.log.21.gz:0
/var/log/nginx/oai.log.22.gz:0
/var/log/nginx/oai.log.23.gz:0
/var/log/nginx/oai.log.24.gz:0
/var/log/nginx/oai.log.25.gz:0
/var/log/nginx/oai.log.26.gz:0
/var/log/nginx/oai.log.27.gz:0
/var/log/nginx/oai.log.28.gz:0
/var/log/nginx/oai.log.29.gz:0
/var/log/nginx/oai.log.2.gz:0
/var/log/nginx/oai.log.30.gz:0
/var/log/nginx/oai.log.3.gz:0
/var/log/nginx/oai.log.4.gz:0
/var/log/nginx/oai.log.5.gz:0
/var/log/nginx/oai.log.6.gz:0
/var/log/nginx/oai.log.7.gz:0
/var/log/nginx/oai.log.8.gz:0
/var/log/nginx/oai.log.9.gz:0
```
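
Rather than reading one zero per rotated file, the counts could be summed into a single number. A minimal sketch, assuming the rotated logs are gzip-compressed as above; `count_ua` is a hypothetical helper name, and `gzip -cdf` is used so that plain and compressed files can be mixed in one call.

```shell
# count_ua STRING FILE...: count log lines containing STRING across all FILEs.
# Hypothetical helper; FILEs may be plain text or gzip-compressed.
count_ua() (
    ua="$1"
    shift
    for f in "$@"; do
        # -c writes to stdout, -d decompresses, -f passes plain files through.
        gzip -cdf -- "$f"
    done | grep -a -c -F -- "$ua"
)
```

Something like `count_ua "API scraper" /var/log/nginx/oai.log.*.gz` would then print one total for all the rotated logs. Note that `grep -c` exits non-zero when the total is zero, which matters if the calling script uses `set -e`.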

- Some of these heavy users are also using XMLUI, and their user agent isn't matched by the [Tomcat Session Crawler valve](https://github.com/ilri/rmg-ansible-public/blob/master/roles/dspace/templates/tomcat/server-tomcat7.xml.j2#L158), so each request uses a different session
- Yesterday alone the IP addresses using the `API scraper` user agent were responsible for nearly 16,000 sessions in XMLUI:

```
# grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
15924
```
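
In the `grep -E` pattern above the dots in the IP addresses are unescaped, so each one matches any character. That is unlikely to change the count much, but a stricter variant could match the IPs as fixed strings. A sketch only: `count_sessions` is a hypothetical helper, and it assumes session IDs are logged as `session_id=` followed by 32 characters from `A-Z0-9`, as in the command above.

```shell
# count_sessions LOGFILE IP...: count distinct session IDs on lines mentioning
# any of the given IPs. Hypothetical helper; IPs are matched as fixed strings
# (-F), so dots are literal, but an IP can still match inside a longer address.
count_sessions() (
    log="$1"
    shift
    # grep accepts several patterns as one newline-separated list.
    patterns=$(printf '%s\n' "$@")
    grep -a -F -e "$patterns" "$log" \
        | grep -o -E 'session_id=[A-Z0-9]{32}' \
        | sort -u \
        | wc -l
)
```

For example, `count_sessions /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 54.70.51.7 35.161.215.53 34.211.17.113 54.70.175.86` would be the equivalent of the command above.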

- If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
- Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:

```
WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
```