mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-08-08
@@ -89,4 +89,94 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
```

## 2021-08-08

- Advise IWMI colleagues on best practices for thumbnails
- Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
- I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
  - WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
  - WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
- Looking at the July Solr statistics to find the top IPs and user agents, looking for anything strange
- 35.174.144.154 made 11,000 requests last month with the following user agent:

```console
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
```

- That IP is on Amazon, and from looking at the DSpace logs I don't see them logging in at all, only scraping... so I will purge hits from that IP
- I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but it never logs in and requests thousands of XMLUI pages, so I will purge their hits too
- Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
- Same deal with 93.158.90.91, also in Sweden
- 3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart
- 61.143.40.50 is in China and uses this hilarious user agent:

```console
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
```
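
The literal `{random.randint(0, 9999)}` in that user agent looks like a Python f-string template that was sent without ever being evaluated, presumably because the `f` prefix was left off. That is only a guess at the cause. The function names below are hypothetical and just illustrate the difference:

```python
import random


def broken_user_agent() -> str:
    # Plain string literal: the braces are NOT evaluated, so the
    # placeholder text is sent to the server verbatim.
    return "Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"


def intended_user_agent() -> str:
    # f-string: each placeholder is replaced by a random integer,
    # which is presumably what the bot author wanted.
    return f"Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"


if __name__ == "__main__":
    print(broken_user_agent())
    print(intended_user_agent())
```

The first form is easy to spot in the logs precisely because the braces survive.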

- 47.252.80.214 is owned by Alibaba in the US and has the same user agent
- 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
- 95.87.154.12 seems to be a new bot with the following user agent:

```console
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
```

- They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
- I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots
- I see a new bot using this user agent:

```console
nettle (+https://www.nettle.sk)
```

- 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period
- 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
- 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
- There are probably more, but that's most of the ones with over 1,000 hits last month, so I will purge them:

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics

Total number of bot hits purged: 90485
```
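
As a sanity check on that total, here is a minimal Python sketch that parses the `Purging N hits from IP in statistics` lines and sums them. The function name `tally_purged` is hypothetical and not part of the ILRI scripts; it only assumes the output format shown above:

```python
import re

# Matches lines like: "Purging 10796 hits from 35.174.144.154 in statistics"
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (\S+) in statistics$")


def tally_purged(output: str) -> dict[str, int]:
    """Map each IP to the number of hits reported as purged."""
    hits = {}
    for line in output.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


OUTPUT = """\
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
"""

hits = tally_purged(OUTPUT)
print(sum(hits.values()))  # 90485, matching the script's reported total
```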

- Then I purged a few thousand more by user agent:

```console
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics

Total number of hits from bots: 4492
```
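
To illustrate the idea of matching by user agent, here is a rough Python sketch. It assumes the agents file holds one regular expression per line, which is how I understand DSpace spider agent files to work; the file's real contents are not shown in these notes, so the `PATTERNS` list and the `matching_pattern` helper are hypothetical:

```python
import re

# Hypothetical stand-in for the patterns in dspace/config/spiders/agents/ilri
PATTERNS = ["MaCoCu", "nettle"]


def matching_pattern(user_agent, patterns=PATTERNS):
    """Return the first pattern found in the user agent, or None."""
    for pattern in patterns:
        if re.search(pattern, user_agent):
            return pattern
    return None


if __name__ == "__main__":
    for agent in [
        "nettle (+https://www.nettle.sk)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    ]:
        print(agent[:40], "->", matching_pattern(agent))
```

Unlike the IP purge, this catches a bot regardless of which address it connects from, as long as it keeps announcing itself honestly.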

- I found some CGSpace metadata in the wrong fields:
  - Seven metadata values in dc.subject (57) should be in dcterms.subject (187)
  - Twelve metadata values in cg.title.journal (202) should be in cg.journal (251)
  - Three metadata values in dc.identifier.isbn (20) should be in cg.isbn (252)
  - Three metadata values in dc.identifier.issn (21) should be in cg.issn (253)
- I re-ran the `migrate-fields.sh` script on CGSpace
- I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:

```console
$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
```

- Then in OpenRefine I merged all null, blank, and `en` fields into the `en_US` one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
- In total it was a few thousand metadata entries or so, so I had to split the CSV with `xsv split` in order to process it
- I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes...
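
The OpenRefine cleanup described above could be approximated in Python roughly as follows. This is a sketch, not what I actually ran: the `merge_issn_columns` helper is hypothetical, it only handles the ISSN columns from the `csvcut` command, it assumes DSpace's `||` multi-value separator, and it does not validate ISSN check digits:

```python
import re

# Language variants of the ISSN column, as exported in the CSV
ISSN_COLUMNS = ["cg.issn", "cg.issn[]", "cg.issn[en]", "cg.issn[en_US]"]

# Four digits, a hyphen (possibly surrounded by stray spaces), then three
# digits and a final digit or X
ISSN = re.compile(r"\b(\d{4})\s*-\s*(\d{3}[\dXx])\b")


def merge_issn_columns(row: dict) -> str:
    """Collect ISSNs from every language variant into one cg.issn[en_US] value."""
    found = []
    for column in ISSN_COLUMNS:
        value = row.get(column) or ""  # treat missing/None cells as blank
        for first, second in ISSN.findall(value):
            issn = f"{first}-{second.upper()}"
            if issn not in found:  # drop duplicates, keep first-seen order
                found.append(issn)
    # Anything that is not an ISSN (labels, bad separators) is discarded
    return "||".join(found)


if __name__ == "__main__":
    row = {
        "cg.issn": "",
        "cg.issn[]": "1234-5678",
        "cg.issn[en]": "2345 - 678x; 1234-5678",
        "cg.issn[en_US]": "ISSN: 0378-5955",
    }
    print(merge_issn_columns(row))
```

The merged value would then go into the `cg.issn[en_US]` column, with the other variants blanked before re-importing.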

<!-- vim: set sw=2 ts=2: -->