mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2017-11-08
@@ -407,11 +407,11 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-

 - Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM
 - Linode sent another alert about CPU usage in the morning at 6:12AM
-- Jesus, the new Chinese IP (124.17.34.59) has downloaded 22,000 PDFs in the last 24 hours:
+- Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:

 ```
-# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -c pdf
-22510
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
+24981
 ```

 - This is about 20,000 Tomcat sessions:
@@ -424,3 +424,29 @@ $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z
 - I'm getting really sick of this
 - Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections
 - I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test
+- Run system updates on DSpace Test and reboot the server
+- Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them ([#346](https://github.com/ilri/DSpace/pull/346))
+- I figured out a way to use nginx's map function to assign a "bot" user agent to misbehaving clients who don't define a user agent
+- Most bots are automatically lumped into one generic session by [Tomcat's Crawler Session Manager Valve](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve) but this only works if their user agent matches a pre-defined regular expression like `.*[bB]ot.*`
+- Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
+- Basically, we modify the nginx config to add a mapping with a modified user agent `$ua`:
+
+```
+map $remote_addr $ua {
+    # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
+    124.17.34.59     'ChineseBot';
+    default          $http_user_agent;
+}
+```
+
+- If the client's address matches then the user agent is set, otherwise the default `$http_user_agent` variable is used
+- Then, in the server's `location /` block we pass this header to Tomcat:
+
+```
+proxy_pass http://tomcat_http;
+proxy_set_header User-Agent $ua;
+```
+
+- Note to self: the `$ua` variable won't show up in nginx access logs because the default `combined` log format doesn't show it, so don't run around pulling your hair out wondering why the modified user agents aren't showing in the logs!
+- If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve
+- You can verify by cross referencing nginx's `access.log` and DSpace's `dspace.log.2017-11-08`, for example
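
The cross-referencing step in the last bullet could be sketched roughly as follows. This is not part of the commit: the helper names `count_requests` and `count_sessions` are hypothetical, the nginx log is assumed to use the default `combined` format (client address in the first field), and the DSpace log is assumed to contain `session_id=<32 chars>:ip_addr=<address>` pairs, as in the `grep` patterns visible in the hunk headers.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and log paths are assumptions.

# Count requests from one client address in nginx access logs.
# Assumes the client address is the first whitespace-separated field.
count_requests() {
    local ip="$1"; shift
    awk -v ip="$ip" '$1 == ip { n++ } END { print n + 0 }' "$@"
}

# Count distinct Tomcat session IDs that DSpace logged for one client address.
count_sessions() {
    local ip="$1"; shift
    local ip_re="${ip//./\\.}"   # escape dots so they match literally
    # The trailing group stops 1.2.3.4 from also matching 1.2.3.45.
    grep -hoE "session_id=[A-Z0-9]{32}:ip_addr=${ip_re}([^0-9]|\$)" "$@" \
        | cut -d: -f1 | sort -u | wc -l | tr -d ' '
}

# Example (paths are assumptions):
#   count_requests 124.17.34.59 /var/log/nginx/access.log
#   count_sessions 124.17.34.59 dspace.log.2017-11-08
```

If the nginx mapping is doing its job, the request count for a mapped address should stay high while the number of distinct sessions for that address stays small.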