Add notes for 2022-02-10

2022-02-10 20:35:40 +03:00
parent 9a1280a7ed
commit 564bb11984
118 changed files with 1590 additions and 849 deletions

@ -219,4 +219,125 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo
- Fix some occurrences of "Hammond, Jim" to be "Hammond, James" on CGSpace
- Start a full index on AReS
## 2022-02-09
- UptimeRobot said that CGSpace was down yesterday evening, but when I looked it was up and I didn't see a high database load or anything wrong
- Maria from Bioversity wrote to say that CGSpace was very slow also...
## 2022-02-10
- Looking at the Munin graphs on CGSpace I see several metrics showing that there was likely just increased load...
![Firewall packets day](/cgspace-notes/2022/02/fw_packets-day-fs8.png)
![DSpace sessions day](/cgspace-notes/2022/02/jmx_dspace_sessions-day-fs8.png)
![Tomcat pool day](/cgspace-notes/2022/02/jmx_tomcat_dbpools-day-fs8.png)
![PostgreSQL connections day](/cgspace-notes/2022/02/postgres_connections_db-day-fs8.png)
- I extract the logs from nginx for yesterday so I can analyze the traffic:
```console
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
# awk '{print $1}' /tmp/feb9-* | sort -u > /tmp/feb9-ips.txt
# wc -l /tmp/feb9-ips.txt
11636 /tmp/feb9-ips.txt
```
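For reference, the same extraction can be sketched in Python. This is only an illustration of the `zcat | grep | awk | sort -u` pipeline above, not a script from the repository. The helper names `open_maybe_gzip` and `unique_ips` are made up, and it assumes the client IP is the first whitespace-separated field of each log line:

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a log file as text, decompressing it if it is gzip-compressed."""
    path = Path(path)
    with path.open("rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"  # gzip magic bytes
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open("rt", encoding="utf-8", errors="replace")


def unique_ips(paths, date="09/Feb/2022"):
    """Return the sorted unique first fields of all log lines containing `date`."""
    ips = set()
    for path in paths:
        with open_maybe_gzip(path) as log:
            for line in log:
                if date in line:
                    fields = line.split()
                    if fields:  # skip blank lines
                        ips.add(fields[0])
    return sorted(ips)  # sorted as strings, not numerically by octet
```

Calling `unique_ips(["/var/log/nginx/access.log.1", "/var/log/nginx/access.log.2.gz"])` would read both the plain and the compressed log, much like `zcat --force` does.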
- I started resolving them with my `resolve-addresses-geoip2.py` script
- In the meantime I am looking at the requests and I see a new user agent: `1science Resolver 1.0.0`
  - Seems to be a defunct project from Elsevier (website down, Twitter account inactive since 2020)
- I also see 3,400 requests from `EyeMonIT_bot_version_0.1_(http://www.eyemon.it/)`, but because it has "bot" in the name it gets heavily throttled...
  - I wonder who is monitoring CGSpace with that service...
- Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:
```console
$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
79 24940
89 36908
100 9299
107 2635
110 44546
111 16509
118 7552
120 4837
123 50245
123 55836
147 45899
173 33771
192 39832
202 32934
235 29465
260 15169
466 14618
607 24757
768 714
1214 8075
```
- The same information, but by org name:
```console
$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
92 Orange
100 Hetzner Online GmbH
100 Philippine Long Distance Telephone Company
107 AUTOMATTIC
110 ALFA TELECOM s.r.o.
111 AMAZON-02
118 Viettel Group
120 CHINA UNICOM China169 Backbone
123 Reliance Jio Infocomm Limited
123 Serverel Inc.
147 VNPT Corp
173 SAFARICOM-LIMITED
192 Opera Software AS
202 FACEBOOK
235 MTN NIGERIA Communication limited
260 GOOGLE
466 AMAZON-AES
607 Ethiopian Telecommunication Corporation
768 APPLE-ENGINEERING
1214 MICROSOFT-CORP-MSN-AS-BLOCK
```
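The two `csvcut | sort | uniq -c` tallies above could also be done in one small Python helper. This is a minimal sketch, assuming the resolved CSV has a header row with `asn` and `org` columns (as the `csvcut -c` calls imply). The function name `top_values` is hypothetical:

```python
import csv
from collections import Counter


def top_values(csv_path, column, n=20):
    """Return the n most common values in one CSV column as (value, count) pairs."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    # most_common() sorts by descending count, the reverse of `sort -h | tail`
    return counts.most_common(n)
```

For example, `top_values("/tmp/feb9-ips.csv", "org")` would list the busiest organizations first, rather than last as in the console output.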
- Most of these are pretty normal except "Serverel" and Hetzner perhaps, but their user agents are pretending to be normal users so who knows...
- I decided to look in the Solr stats with `facet.limit=1000&facet.mincount=1` and found a few more definitely non-human agents:
  - scalaj-http/2.4.2
  - scpitspi-rs
  - lua-resty-http
  - AHC/2.1
  - acebookexternalhit <---- typo, but purge it!!!
  - Iframely/1.3.1 (+https://iframely.com/docs/about) Atlassian
  - qbhttp/1.0.0
  - got (https://github.com/sindresorhus/got)
  - colly - https://github.com/gocolly/colly/v2
  - article-parser/4.2.10
  - SomeRandomText
  - adreview/1.0
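A facet query like the one used to find these agents can be sketched as follows. The `facet.limit=1000` and `facet.mincount=1` parameters come from the notes. The field name `userAgent`, the `q=*:*` and `rows=0` parameters, and whatever base URL is passed in are assumptions for illustration only. The parser relies on Solr's default JSON layout for facet fields, a flat list that alternates value and count:

```python
from urllib.parse import urlencode


def facet_url(base_url, field="userAgent", limit=1000, mincount=1):
    """Build a Solr select URL that facets on one field and returns no documents."""
    params = {
        "q": "*:*",   # assumption: match all statistics documents
        "rows": 0,    # only the facet counts are wanted
        "wt": "json",
        "facet": "true",
        "facet.field": field,  # assumption: user agents are stored in this field
        "facet.limit": limit,
        "facet.mincount": mincount,
    }
    return f"{base_url}?{urlencode(params)}"


def facet_counts(response, field="userAgent"):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))
```

Fetching the URL and decoding the JSON body is left out on purpose, since the Solr host and port depend on the deployment.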
- I added them to the ILRI override in the DSpace spider list and ran the `check-spider-hits.sh` script:
```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 234 hits from randint in statistics
Purging 337 hits from Koha in statistics
Purging 1164 hits from scalaj-http in statistics
Purging 1528 hits from scpitspi-rs in statistics
Purging 3050 hits from lua-resty-http in statistics
Purging 1683 hits from AHC in statistics
Purging 1129 hits from acebookexternalhit in statistics
Purging 534 hits from Iframely in statistics
Purging 1022 hits from qbhttp in statistics
Purging 330 hits from ^got in statistics
Purging 156 hits from ^colly in statistics
Purging 38 hits from article-parser in statistics
Purging 1148 hits from SomeRandomText in statistics
Purging 3126 hits from adreview in statistics
Total number of bot hits purged: 14479
```
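The `^got` and `^colly` entries in the output above suggest that the agent patterns are treated as regular expressions. As a rough illustration of how such patterns select user agents (this is not how `check-spider-hits.sh` itself works, and the function name is hypothetical), a minimal sketch that tallies hits per pattern from a mapping of user agent to hit count:

```python
import re


def hits_per_pattern(patterns, agent_counts):
    """Sum the hit counts of every user agent that matches each regex pattern."""
    totals = {}
    for pattern in patterns:
        regex = re.compile(pattern)  # case-sensitive here, which is a choice
        totals[pattern] = sum(
            count for agent, count in agent_counts.items() if regex.search(agent)
        )
    return totals
```

Anchoring with `^` matters for short names like `got`, which would otherwise also match any agent that merely contains those letters somewhere in the string.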
- I don't have time right now to add any of these to the COUNTER-Robots list...
- Peter asked me to add a new item type on CGSpace: Opinion Piece
- Map an item on CGSpace for Maria since she couldn't find it in the item mapper
<!-- vim: set sw=2 ts=2: -->