Add notes for 2018-04-10

commit 6f3b199d9f (parent f45ab64261)
Date: 2018-04-10 08:27:55 +03:00
61 changed files with 366 additions and 70 deletions

@@ -540,9 +540,9 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
- What's amazing is that it seems to reuse its Java session across all requests:
```
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' /home/cgspace.cgiar.org/log/dspace.log.2017-11-12
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
1558
-$ grep 5.9.6.51 /home/cgspace.cgiar.org/log/dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
+$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1
```
@@ -552,7 +552,7 @@ $ grep 5.9.6.51 /home/cgspace.cgiar.org/log/dspace.log.2017-11-12 | grep -o -E '
```
# grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
-$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' /home/cgspace.cgiar.org/log/dspace.log.2017-11-12
+$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
991
```

@@ -78,3 +78,145 @@ $ git rebase -i dspace-5.8
- DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
- ... but somehow git knew, and didn't include them in my interactive rebase!
- I need to send this branch to Atmire and also arrange payment (see [ticket #560](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560) in their tracker)
- Fix Sisay's SSH access to the new DSpace Test server (linode19)
## 2018-04-05
- Fix Sisay's sudo access on the new DSpace Test server (linode19)
- The reindexing process on DSpace Test took _forever_ yesterday:
```
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 599m32.961s
user 9m3.947s
sys 2m52.585s
```
- So we really should not use this Linode block storage for Solr
- Assetstore might be fine but would complicate things with configuration and deployment (ughhh)
- Better to use Linode block storage only for backup
- Help Peter with the GDPR compliance / reporting form for CGSpace
- DSpace Test crashed due to memory issues again:
```
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
```
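The heap limit being exhausted here is set via `JAVA_OPTS`; a minimal sketch of the Debian/Ubuntu convention, assuming `/etc/default/tomcat7` (the path and the sizes are illustrative assumptions, not values taken from this server):

```shell
# /etc/default/tomcat7 (assumed path; heap sizes illustrative only)
JAVA_OPTS="-Djava.awt.headless=true -Xms1024m -Xmx3072m -XX:+UseConcMarkSweepGC"
```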
- I ran all system updates on DSpace Test and rebooted it
- Proof some records on DSpace Test for Udana from IWMI
- He has done better with the small syntax and consistency issues, but there are larger concerns: not linking to DOIs, copying titles incorrectly, etc
## 2018-04-10
- I got a notice that CGSpace CPU usage was very high this morning
- Looking at the nginx logs, here are the top users today so far:
```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
298 66.249.66.153
322 207.46.13.114
780 104.196.152.243
3994 178.154.200.38
4295 70.32.83.92
4388 95.108.181.88
7653 45.5.186.2
```
- 45.5.186.2 is of course CIAT
- 95.108.181.88 appears to be Yandex:
```
95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
```
- And for some reason Yandex created a lot of Tomcat sessions today:
```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
```
- 70.32.83.92 appears to be some harvester we've seen before, but on a new IP
- They are not creating new Tomcat sessions so there is no problem there
- 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
```
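The per-IP greps above generalize to a single pass over the log; a sketch, run here against a synthetic log fragment (the real file would be `dspace.log.2018-04-10`):

```shell
# A synthetic log fragment standing in for dspace.log.2018-04-10
cat > /tmp/dspace.log.sample <<'EOF'
session_id=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA:ip_addr=95.108.181.88
session_id=BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB:ip_addr=95.108.181.88
session_id=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC:ip_addr=178.154.200.38
EOF

# Count session-bearing log lines per IP in one pass instead of one grep per IP
grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=[0-9.]+' /tmp/dspace.log.sample \
    | awk -F'ip_addr=' '{print $2}' | sort | uniq -c | sort -rn
```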
- I'm not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
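For context, the valve is configured in Tomcat's `conf/server.xml`; a sketch using what I believe are Tomcat's documented defaults, whose `crawlerUserAgents` pattern should match `YandexBot/3.0`:

```xml
<!-- Inside the <Host> element of conf/server.xml; attribute values shown are Tomcat's defaults -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
       sessionInactiveInterval="60" />
```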
- Let's try a manual request with and without their user agent:
```
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
```
- So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve
- And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)
- Indeed the number of Tomcat sessions appears to be normal:
![Tomcat sessions week](/cgspace-notes/2018/04/jmx_dspace_sessions-week.png)
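One way to confirm the session sharing is to pick a session id and list every IP that used it; a sketch against a synthetic log (the `DDDD…` session id is a placeholder, and the real file would be `dspace.log.2018-04-10`):

```shell
# A synthetic log where one crawler session id is shared by two IPs
cat > /tmp/dspace.log.shared <<'EOF'
session_id=DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD:ip_addr=95.108.181.88
session_id=DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD:ip_addr=66.249.66.153
EOF

# List the distinct IPs attached to one session id; a shared crawler session shows several
grep 'session_id=DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD' /tmp/dspace.log.shared \
    | grep -o -E 'ip_addr=[0-9.]+' | sort | uniq
```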
- Looks like the number of total requests processed by nginx in March went down from the previous months:
```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
2266594
real 0m13.658s
user 0m16.533s
sys 0m1.087s
```
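To compare months side by side rather than one grep at a time, the same count can be looped; a sketch, shown here against a synthetic access-log fragment instead of `/var/log/nginx/*`:

```shell
# A synthetic access-log fragment standing in for `zcat --force /var/log/nginx/*`
cat > /tmp/access.log.sample <<'EOF'
1.2.3.4 - - [05/Jan/2018:00:00:01 +0000] "GET / HTTP/1.1" 200 1
1.2.3.4 - - [06/Jan/2018:00:00:01 +0000] "GET / HTTP/1.1" 200 1
1.2.3.4 - - [07/Feb/2018:00:00:01 +0000] "GET / HTTP/1.1" 200 1
1.2.3.4 - - [08/Feb/2018:00:00:01 +0000] "GET / HTTP/1.1" 200 1
1.2.3.4 - - [09/Mar/2018:00:00:01 +0000] "GET / HTTP/1.1" 200 1
EOF

# Count requests per month in one loop instead of one grep per month
for month in Jan Feb Mar; do
    printf '%s: ' "$month"
    grep -cE "[0-9]{1,2}/$month/2018" /tmp/access.log.sample
done
```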