Update notes for 2018-09-27

2018-09-27 10:52:23 +03:00
parent c96de57967
commit bf2221181b
3 changed files with 76 additions and 8 deletions

@@ -557,4 +557,36 @@ sys 2m18.485s
- I updated the dspace-statistics-api to use psycopg2's `execute_values()` to insert batches of 100 values into PostgreSQL instead of doing every insert individually
- On CGSpace this reduces the total run time of `indexer.py` from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)
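
The batching idea can be sketched as follows. This is only an illustration under stated assumptions, not the actual `indexer.py` code: it uses the standard library's `sqlite3` and `executemany()` as a stand-in for PostgreSQL and psycopg2, and the `items` table, its columns, and the helper names are hypothetical.

```python
import sqlite3
from itertools import islice


def batches(rows, size=100):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_batched(conn, rows, size=100):
    """Insert (id, views) rows one batch at a time; return the number of batches."""
    count = 0
    for chunk in batches(rows, size):
        # Stand-in for the psycopg2 call. With psycopg2 the paging is done by
        # execute_values() itself through its page_size argument, roughly:
        #   execute_values(cursor, "INSERT INTO items (id, views) VALUES %s",
        #                  rows, page_size=100)
        conn.executemany("INSERT INTO items (id, views) VALUES (?, ?)", chunk)
        count += 1
    conn.commit()
    return count


# Hypothetical table and data, only to exercise the helpers
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INTEGER)")
n_batches = insert_batched(conn, ((i, i * 2) for i in range(250)))
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Here 250 rows go in as three batches (100, 100, 50) instead of 250 separate statements. The saving is modest when most of the run time is spent elsewhere, as noted above for Solr.
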
## 2018-09-27
- Linode emailed to say that CGSpace's (linode19) CPU load was high for a few hours last night
- Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
295 34.218.226.147
296 66.249.64.95
350 157.55.39.185
359 207.46.13.28
371 157.55.39.85
388 40.77.167.148
444 66.249.64.93
544 68.6.87.12
834 66.249.64.91
902 35.237.175.180
```
- `35.237.175.180` is on Google Cloud
- `68.6.87.12` is on Cox Communications in the US (?)
- These hosts are not using proper user agents and do not appear to be re-using their Tomcat sessions (each count below is the number of log lines carrying a session ID for that IP):
```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
758
```
- I will add their IPs to the list of bad bots in nginx so that their requests get a "bot" user agent and Tomcat's Crawler Session Manager Valve can handle them
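
`grep -c` counts matching log lines, so on its own it does not show whether a client re-uses sessions. A rough way to check is to compare the number of matching lines with the number of distinct session IDs per IP. The sketch below assumes only the `session_id=<32 chars>:ip_addr=<ip>` fragment used in the grep patterns above; the sample lines, the rest of the line format, and the function name are made up for illustration.

```python
import re
from collections import defaultdict

# Same fragment the grep patterns above match, with capture groups added
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=([0-9.]+)")


def sessions_per_ip(lines):
    """Return {ip: (matching_lines, distinct_session_ids)} for the given log lines."""
    hits = defaultdict(int)
    sessions = defaultdict(set)
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            session_id, ip = match.groups()
            hits[ip] += 1
            sessions[ip].add(session_id)
    return {ip: (hits[ip], len(sessions[ip])) for ip in hits}


# Made-up sample lines using documentation-only IP addresses
sample_lines = [
    f"INFO session_id={'A' * 32}:ip_addr=192.0.2.10:view_item",
    f"INFO session_id={'B' * 32}:ip_addr=192.0.2.10:view_item",
    f"INFO session_id={'C' * 32}:ip_addr=192.0.2.10:view_item",
    f"INFO session_id={'D' * 32}:ip_addr=192.0.2.20:view_item",
    f"INFO session_id={'D' * 32}:ip_addr=192.0.2.20:view_item",
    f"INFO session_id={'D' * 32}:ip_addr=192.0.2.20:view_item",
    "INFO a line without any session information",
]
result = sessions_per_ip(sample_lines)
```

A client whose distinct-session count is close to its line count is creating a new session for nearly every request, which is the pattern the Crawler Session Manager Valve is meant to contain. To run it against a real log, pass an open file object, for example `sessions_per_ip(open("dspace.log.2018-09-26"))`.
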
<!-- vim: set sw=2 ts=2: -->