mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-02-10
@ -219,4 +219,125 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo
- Fix some occurrences of "Hammond, Jim" to be "Hammond, James" on CGSpace
- Start a full index on AReS

## 2022-02-09

- UptimeRobot said that CGSpace was down yesterday evening, but when I looked it was up and I didn't see a high database load or anything wrong
- Maria from Bioversity wrote to say that CGSpace was very slow also...

## 2022-02-10

- Looking at the Munin graphs on CGSpace I see several metrics showing that there was likely just increased load...

- I extract the logs from nginx for yesterday so I can analyze the traffic:

```console
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
# awk '{print $1}' /tmp/feb9-* | less | sort -u > /tmp/feb9-ips.txt
# wc -l /tmp/feb9-ips.txt
11636 /tmp/feb9-ips.txt
```
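
The same extraction can be approximated in Python when the filtering needs more than `grep` and `awk`. This is a minimal sketch, assuming the nginx logs use a format where the client IP is the first whitespace-separated field and the date appears in each line as `09/Feb/2022`; the names `open_log` and `unique_ips` are hypothetical and not part of any CGSpace tooling:

```python
import gzip
from pathlib import Path


def open_log(path):
    """Open a plain or gzip-compressed log as text, similar to `zcat --force`."""
    path = Path(path)
    with path.open("rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic bytes
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open("rt", encoding="utf-8", errors="replace")


def unique_ips(paths, date):
    """Return the sorted unique first fields of all lines containing `date`."""
    ips = set()
    for path in paths:
        with open_log(path) as log:
            for line in log:
                if date in line:
                    fields = line.split()
                    if fields:  # skip blank lines
                        ips.add(fields[0])
    return sorted(ips)
```

Calling `unique_ips(["/var/log/nginx/access.log.1", "/var/log/nginx/access.log.2.gz"], "09/Feb/2022")` would return a list comparable to the contents of `/tmp/feb9-ips.txt`, though the sort order may differ from `sort -u` depending on locale.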

- I started resolving them with my `resolve-addresses-geoip2.py` script
- In the meantime I am looking at the requests and I see a new user agent: `1science Resolver 1.0.0`
  - Seems to be a defunct project from Elsevier (website down, Twitter account inactive since 2020)
- I also see 3,400 requests from `EyeMonIT_bot_version_0.1_(http://www.eyemon.it/)`, but because it has "bot" in the name it gets heavily throttled...
  - I wonder who is monitoring CGSpace with that service...
- Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious:

```console
$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
     79 24940
     89 36908
    100 9299
    107 2635
    110 44546
    111 16509
    118 7552
    120 4837
    123 50245
    123 55836
    147 45899
    173 33771
    192 39832
    202 32934
    235 29465
    260 15169
    466 14618
    607 24757
    768 714
   1214 8075
```
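
The `csvcut | sort | uniq -c` tally above can also be sketched with Python's standard library. This assumes the resolved CSV has a header row containing the column being counted (here `asn`); the function name `top_values` is hypothetical:

```python
import csv
from collections import Counter


def top_values(csv_path, column, limit=20):
    """Count the values in one CSV column and return the `limit` most common."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        counts = Counter(row[column] for row in csv.DictReader(handle))
    # most_common() sorts by descending count, the reverse of `sort -h | tail`
    return counts.most_common(limit)
```

Unlike the shell pipeline, this does not count the header line as a value, and it returns the largest counts first. Passing `"org"` instead of `"asn"` as `column` would tally by organization name, provided that column exists in the file.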

- The same information, but by org name:

```console
$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
     92 Orange
    100 Hetzner Online GmbH
    100 Philippine Long Distance Telephone Company
    107 AUTOMATTIC
    110 ALFA TELECOM s.r.o.
    111 AMAZON-02
    118 Viettel Group
    120 CHINA UNICOM China169 Backbone
    123 Reliance Jio Infocomm Limited
    123 Serverel Inc.
    147 VNPT Corp
    173 SAFARICOM-LIMITED
    192 Opera Software AS
    202 FACEBOOK
    235 MTN NIGERIA Communication limited
    260 GOOGLE
    466 AMAZON-AES
    607 Ethiopian Telecommunication Corporation
    768 APPLE-ENGINEERING
   1214 MICROSOFT-CORP-MSN-AS-BLOCK
```

- Most of these are pretty normal except "Serverel" and Hetzner perhaps, but their user agents are pretending to be normal users so who knows...
- I decided to look in the Solr stats with `facet.limit=1000&facet.mincount=1` and found a few more definitely non-human agents:
  - scalaj-http/2.4.2
  - scpitspi-rs
  - lua-resty-http
  - AHC/2.1
  - acebookexternalhit <---- typo, but purge it!!!
  - Iframely/1.3.1 (+https://iframely.com/docs/about) Atlassian
  - qbhttp/1.0.0
  - got (https://github.com/sindresorhus/got)
  - colly - https://github.com/gocolly/colly/v2
  - article-parser/4.2.10
  - SomeRandomText
  - adreview/1.0
- I added them to the ILRI override in the DSpace spider list and ran the `check-spider-hits.sh` script:

```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 234 hits from randint in statistics
Purging 337 hits from Koha in statistics
Purging 1164 hits from scalaj-http in statistics
Purging 1528 hits from scpitspi-rs in statistics
Purging 3050 hits from lua-resty-http in statistics
Purging 1683 hits from AHC in statistics
Purging 1129 hits from acebookexternalhit in statistics
Purging 534 hits from Iframely in statistics
Purging 1022 hits from qbhttp in statistics
Purging 330 hits from ^got in statistics
Purging 156 hits from ^colly in statistics
Purging 38 hits from article-parser in statistics
Purging 1148 hits from SomeRandomText in statistics
Purging 3126 hits from adreview in statistics

Total number of bot hits purged: 14479
```
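
Entries such as `^got` and `^colly` in the output above suggest that the agents file holds regular expressions, not literal strings. The following is only an illustrative sketch of that matching idea, not how `check-spider-hits.sh` actually works: it assumes one pattern per line, `#` for comments, and a case-sensitive search anywhere in the user agent unless the pattern is anchored. The names `load_patterns` and `matching_pattern` are hypothetical:

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty, non-comment line."""
    return [
        re.compile(line.strip())
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


def matching_pattern(user_agent, patterns):
    """Return the first pattern string that matches `user_agent`, or None."""
    for pattern in patterns:
        if pattern.search(user_agent):
            return pattern.pattern
    return None
```

Anchoring a short pattern like `^got` keeps it from matching unrelated agents that merely contain the letters "got" somewhere in the string.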

- I don't have time right now to add any of these to the COUNTER-Robots list...
- Peter asked me to add a new item type on CGSpace: Opinion Piece
- Map an item on CGSpace for Maria since she couldn't find it in the item mapper

<!-- vim: set sw=2 ts=2: -->