mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-09-05
@ -26,4 +26,77 @@ categories: ["Notes"]
- I also pruned and updated all the Python dependencies
- Then I released [version 0.6.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0) now that the excludes and region matching support is working way better

## 2022-09-05

- Started a harvest on AReS last night
- Looking over the Solr statistics from last month I see many user agents that look suspicious:
  - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
  - Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
  - Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
  - Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
  - Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
  - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
  - Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
  - curb
  - bitdiscovery
  - omgili/0.5 +http://omgili.com
  - Mozilla/5.0 (compatible)
  - Vizzit
  - Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
  - Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
  - Java/17-ea
  - AdobeUxTechC4-Async/3.0.12 (win32)
  - ZaloPC-win32-24v473
  - Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
  - Scoop.it
  - Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
  - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
  - ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
  - WebAPIClient
  - Mozilla/5.0 Firefox/26.0
  - Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (`Mozilla / 5.0`)
- Tons of hosts making requests like this:

```console
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
```
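
The tells in the user agents above (spaces around the slash, a capitalized or spaced `rv:` token, very old Firefox versions) could be checked mechanically. This is only a rough sketch, not how the list above was produced: the function name `suspicious_reasons` and the cutoff `MIN_PLAUSIBLE_FIREFOX` are hypothetical, and the cutoff value is an arbitrary assumption.

```python
import re

# Hypothetical cutoff: Firefox majors below this are treated as implausibly old.
# The value is an assumption for illustration, not a threshold from these notes.
MIN_PLAUSIBLE_FIREFOX = 60


def suspicious_reasons(user_agent: str) -> list[str]:
    """Return the heuristic reasons a user agent string looks suspicious."""
    reasons = []
    # Whitespace next to a slash, as in "Mozilla / 5.0".
    if re.search(r"\s/|/\s", user_agent):
        reasons.append("spaces around slash")
    # Firefox writes the token as lowercase "rv:" followed directly by the version.
    if re.search(r"\bRv:|\brv:\s", user_agent):
        reasons.append("malformed rv token")
    # A Firefox major version far behind current releases.
    match = re.search(r"Firefox/(\d+)", user_agent)
    if match and int(match.group(1)) < MIN_PLAUSIBLE_FIREFOX:
        reasons.append("outdated Firefox")
    return reasons


if __name__ == "__main__":
    agent = "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0"
    print(suspicious_reasons(agent))
```

Heuristics like these only shortlist candidates; short agents such as `curb` or `Vizzit` still need to be matched by name.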

- I got a list of hosts making requests like that so I can purge their hits:

```console
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt
```
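
For reference, the same filter as the pipeline above can be sketched in Python. It assumes, like `awk '{print $1}'` does, that the client IP is the first whitespace-separated field of each log line. The helper names `probe_ips` and `read_gzipped` are made up for this sketch.

```python
import glob
import gzip


def read_gzipped(pattern):
    """Yield lines from every gzipped file matching a glob pattern."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", errors="replace") as handle:
            yield from handle


def probe_ips(log_lines, marker="String.fromCharCode("):
    """Return sorted, de-duplicated client IPs of lines containing the marker."""
    ips = set()
    for line in log_lines:
        if marker in line:
            fields = line.split()
            if fields:
                ips.add(fields[0])  # the field awk prints as $1
    return sorted(ips)


if __name__ == "__main__":
    # Unlike the brace expansion above, this reads one log family at a time.
    for ip in probe_ips(read_gzipped("/var/log/nginx/access.log.[123]*.gz")):
        print(ip)
```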

- I purged 4,718 hits from those IPs
- I see some new Hetzner ranges that I hadn't blocked yet apparently?
- I got a [list of Hetzner's IPs from IP Quality Score](https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh) then added them to the existing ones in my Ansible playbooks:

```console
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
```
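
`sort -u` only drops exact duplicate lines. If the combined list contains adjacent or overlapping ranges, Python's standard `ipaddress` module can collapse them further. A minimal sketch, assuming each line starts with a network in CIDR notation (the function name `merge_networks` is hypothetical):

```python
import ipaddress


def merge_networks(*lists):
    """Merge several lists of CIDR strings, collapsing duplicates and overlaps."""
    networks = set()
    for entries in lists:
        for entry in entries:
            fields = entry.split()
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and comments
            # Only the first column is used, like awk '{print $1}'.
            networks.add(ipaddress.ip_network(fields[0], strict=False))
    # collapse_addresses() rejects mixed IP versions, so handle them separately.
    v4 = [n for n in networks if n.version == 4]
    v6 = [n for n in networks if n.version == 6]
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in merged]


if __name__ == "__main__":
    existing = ["192.0.2.0/25", "198.51.100.0/24"]
    new = ["192.0.2.128/25", "198.51.100.0/24"]
    print(merge_networks(existing, new))
```

Because adjacent ranges are merged into larger ones, the resulting count can be lower than what `sort -u | wc -l` reports.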

- I will add this new list to nginx's `bot-networks.conf` so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
- Then I purged hits from the following user agents:

```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics

Total number of hits from bots: 12220
```
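
As a sanity check on the output above, the per-agent counts can be tallied and compared with the reported total. This sketch parses the `Found N hits from X in statistics` lines exactly as printed; the names `HIT_LINE` and `tally_hits` are made up for illustration.

```python
import re

HIT_LINE = re.compile(r"^Found (\d+) hits from (.+) in statistics$")


def tally_hits(output: str) -> dict[str, int]:
    """Map each user agent in the script output to its hit count."""
    hits = {}
    for line in output.splitlines():
        match = HIT_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


report = """\
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics
"""

if __name__ == "__main__":
    hits = tally_hits(report)
    print(sum(hits.values()))  # 12220, matching the reported total
```

The counts add up to the reported 12220, and `AdobeUxTechC4-Async` alone accounts for 9125 of them, roughly three quarters.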

- Then I will add these user agents to the ILRI spider override in DSpace

<!-- vim: set sw=2 ts=2: -->