- Start a harvest on AReS
## 2023-01-16
- Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems
- Batch import another twenty-eight items for IFPRI across several Initiatives
  - On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc.
  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
  - Then I checked for duplicates and ran it through csv-metadata-quality (sketched below) to make sure the countries/regions matched and there were no duplicate metadata values
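- As a minimal sketch of that step, with placeholder file paths (the `-u` flag for the unsafe fixes is from memory, so treat it as an assumption):

```console
$ # note: paths are placeholders and the -u (unsafe fixes) flag is an assumption
$ csv-metadata-quality -i /tmp/ifpri-batch.csv -o /tmp/ifpri-batch-cleaned.csv -u
```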
## 2023-01-17
- Batch import another twenty-three items for IFPRI across several Initiatives
  - I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc.
  - I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts
  - Then I found and removed one duplicate within these items, as well as another that was already on CGSpace (!): 10568/126669
  - Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values
- I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality
- I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace (sketched below)
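- A sketch of the tagging step with `ilri/update-orcids.py`; the input path, database credentials, and flags here are placeholders/assumptions, not the exact invocation:

```console
$ # note: the ORCID list path, database credentials, and flags are assumptions
$ ./ilri/update-orcids.py -i /tmp/2023-01-17-orcids.txt -db dspace -u dspace -p 'fuuu'
```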
- There is a high load on CGSpace pretty regularly
  - Looking at Munin I see a marked increase in DSpace sessions over the last few weeks:

- Is this attributable to all the PRMS harvesting?
- I also see some PostgreSQL locks starting earlier today:

- I'm curious to see what kinds of IPs have been connecting, so I will look at the last few weeks:
```console
# zcat --force /var/log/nginx/{rest,access,library-access,oai}.log /var/log/nginx/{rest,access,library-access,oai}.log.1 /var/log/nginx/{rest,access,library-access,oai}.log.{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}.gz | awk '{print $1}' | sort | uniq > /tmp/2023-01-17-cgspace-ips.txt
# wc -l /tmp/2023-01-17-cgspace-ips.txt
129446 /tmp/2023-01-17-cgspace-ips.txt
```
- I ran the IPs through my `resolve-addresses-geoip2.py` script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):
```console
$ csvgrep -c asn -r '^(8075|714|16276|15169|23576|24940|13238|32934|14061|12876|55286|203020|204287|7922|50245|6939|16509|14618)$' \
    /tmp/2023-01-17-cgspace-ips.csv | csvcut -c network | \
    sed 1d | sort | uniq > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
776 /tmp/networks-to-block.txt
```
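- For reference, the resolution step itself might have been invoked something like this; the flags for `resolve-addresses-geoip2.py` are an assumption on my part, though the file names match the ones used above:

```console
$ # note: the -i/-o flags are assumptions; the paths are the ones used above
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2023-01-17-cgspace-ips.txt -o /tmp/2023-01-17-cgspace-ips.csv
```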
- I added the list of networks to nginx's `bot-networks.conf` so they will all be heavily rate limited
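- A minimal sketch of appending those networks, assuming `bot-networks.conf` is an nginx geo-style map where each CIDR maps to `1` (the file format and path are assumptions):

```console
$ # note: the "CIDR 1;" map format and the conf path are assumptions
$ sed -e 's/$/ 1;/' /tmp/networks-to-block.txt | sudo tee -a /etc/nginx/bot-networks.conf
$ sudo nginx -t && sudo systemctl reload nginx
```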
- Looking at the Munin stats again I see the load has been extra high since yesterday morning:

- But still, it's suspicious that there are so many PostgreSQL locks
- Looking at the Solr stats to check the hits for the last month (actually I skipped December because I was so busy):
  - I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)
  - I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request
  - I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request
  - I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request
  - I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request
  - I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamalytics, but its user agent is all lower case and it's a data center ISP, so nope
  - I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request
  - I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request
  - I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request
  - I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request
  - I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request
  - I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request
  - I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request
  - I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request
  - I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request
  - I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, scammy on everything I checked
  - I see 79.110.73.54 is on ALFA TELCOM / Serverel and is using a different, weird user agent with each request
  - There are too many to count... so I will purge these and then move on to user agents
- I purged hits from those IPs:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 439185 hits from 31.148.223.10 in statistics
Purging 2151 hits from 18.203.245.60 in statistics
Purging 1990 hits from 3.249.192.212 in statistics
Purging 1975 hits from 34.244.160.145 in statistics
Purging 1969 hits from 52.213.59.101 in statistics
Purging 2540 hits from 91.209.8.29 in statistics
Purging 1624 hits from 54.78.176.127 in statistics
Purging 1236 hits from 54.74.197.53 in statistics
Purging 1327 hits from 54.246.128.111 in statistics
Purging 1108 hits from 52.16.103.133 in statistics
Purging 1045 hits from 63.32.99.252 in statistics
Purging 999 hits from 176.34.141.181 in statistics
Purging 997 hits from 34.243.17.80 in statistics
Purging 985 hits from 34.240.206.16 in statistics
Purging 862 hits from 18.203.81.120 in statistics
Purging 1654 hits from 176.97.210.106 in statistics
Purging 1628 hits from 51.81.193.200 in statistics
Purging 1020 hits from 79.110.73.54 in statistics
Purging 842 hits from 35.153.105.213 in statistics
Purging 1689 hits from 54.164.237.125 in statistics

Total number of bot hits purged: 466826
```
- Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
  - `azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0`
  - `Gov employment data scraper ([[your email]])`
  - `Microsoft.Data.Mashup (https://go.microsoft.com/fwlink/?LinkID=304225)`
  - `crownpeak`
  - `Mozilla/5.0 (compatible)`
- Also, a ton of them are lower case, which I've never seen before... it might be legitimate, but it looks super fishy to me:
  - `mozilla/5.0 (x11; ubuntu; linux x86_64; rv:84.0) gecko/20100101 firefox/86.0`
  - `mozilla/5.0 (macintosh; intel mac os x 11_3) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
  - `mozilla/5.0 (windows nt 10.0; win64; x64; rv:86.0) gecko/20100101 firefox/86.0`
  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/92.0.4515.159 safari/537.36`
  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/88.0.4324.104 safari/537.36`
  - `mozilla/5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, like gecko) chrome/86.0.4240.75 safari/537.36`
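- As a sketch, these agents can be pulled out of the Solr statistics core with a facet query on `userAgent`; the field names are from the standard DSpace statistics schema, but the core URL here is an assumption:

```console
$ # note: the core URL is an assumption; userAgent and time are standard DSpace statistics fields
$ curl -s 'http://localhost:8983/solr/statistics/select?q=time:%5B2022-12-01T00:00:00Z%20TO%20*%5D&rows=0&facet=true&facet.field=userAgent&facet.limit=100&wt=json'
```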
- I purged some of those:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 1658 hits from azure-logic-apps\/1.0 in statistics
Purging 948 hits from Gov employment data scraper in statistics
Purging 786 hits from Microsoft\.Data\.Mashup in statistics
Purging 303 hits from crownpeak in statistics
Purging 332 hits from Mozilla\/5.0 (compatible) in statistics

Total number of bot hits purged: 4027
```
- Then I ran all system updates on the server and rebooted it
- Hopefully this clears the locks and the nginx mitigation helps with the load from non-human hosts in large data centers
- I need to re-work how I'm doing this whitelisting and blacklisting... it's way too complicated now
- Export entire CGSpace to check Initiative mappings, and add nineteen...
- Start a harvest on AReS
<!-- vim: set sw=2 ts=2: -->