- The default scrape interval is 60 seconds, so if we scrape more often than that the metrics will be stale
- From what I've seen this returns in less than one second, so it should be safe to reduce the scrape interval
## 2024-10-19
- Heavy load on CGSpace today
- There is a noticeable increase just before 4PM local time
- I extracted a list of IPs:
```console
# grep -E '19/Oct/2024:1[567]' /var/log/nginx/api-access.log | awk '{print $1}' | sort -u > /tmp/ips.txt
```
- I looked them up and found several data center networks using normal user agents across hundreds of IPs, for example:
- 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)
- 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.
- 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
- 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
- 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)
- 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)
- 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet
- 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs
- 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC
- I added some network blocks to the nginx conf
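- For reference, the blocks look something like this (a sketch; the CIDRs here are illustrative guesses around the IPs above, not my actual list):
```
# Hypothetical deny rules for abusive data center networks
deny 185.210.207.0/24;  # AS209709 CODE200-ISP1 - UAB code200, LT
deny 154.47.29.0/24;    # AS212238 CDNEXT - Datacamp Limited, GB
```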
- Interestingly, I see a large number of IPs using the same user agent today:
```console
# grep "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" /var/log/nginx/api-access.log | awk '{print $1}' | sort -u | wc -l
767
```
- For reference, the current Chrome version is 129 or so...
- This is definitely worth looking into because it seems like one massive botnet
<!-- vim: set sw=2 ts=2: -->

---
title: "November, 2024"
date: 2024-11-11T09:47:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-11-11
- Some IP in India is making tons of requests this morning with a normal user agent:
```console
# awk '{print $1}' /var/log/nginx/api-access.log | sort | uniq -c | sort -h | tail -n 40
...
513743 49.207.196.249
```
<!--more-->
- They are using this user agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3
```
## 2024-11-16
- I switched CGSpace to Node.js v20 since I've been using it in dev and test for months
## 2024-11-18
- I see a bot (188.34.177.10) on Hetzner has made 35,000 requests this morning and is pretending to be Googlebot, GoogleOther, etc
- Google also publishes their IP ranges: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
- Our nginx config doesn't rate limit the API but perhaps that needs to change...
- In DSpace 4/5/6 the API was separate from the user interface, so we didn't need to rate limit it; we encouraged using the API instead of scraping the UI
- In DSpace 7 the API is used by the frontend and perhaps should have the same IP- and UA-based rate limiting
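- In nginx that would look something like this (a sketch only; the zone size and rate are guesses, not tuned values):
```
# Hypothetical per-IP rate limit for the DSpace 7 REST API
limit_req_zone $binary_remote_addr zone=api:10m rate=5r/s;

location /server {
    limit_req zone=api burst=20 nodelay;
}
```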
## 2024-11-19
- I noticed 10,000 requests from a new bot yesterday:
```
20.38.174.208 - - [18/Nov/2024:07:02:50 +0100] "GET /server/oai/request?verb=ListRecords&resumptionToken=oai_dc%2F2024-10-18T13%3A00%3A49Z%2F%2F%2F400 HTTP/1.1" 503 190 "-" "Laminas_Http_Client"
```
- Laminas is a PHP framework (formerly Zend Framework), so this seems to be some library using its HTTP client
- Yesterday one IP in Argentina made nearly 1,000,000 requests using a normal user agent: 181.4.143.40
- 188.34.177.10 ended up making 700,000 requests using various Googlebot, GoogleOther, and even normal Chrome user agents
<!-- vim: set sw=2 ts=2: -->