---
title: "September, 2022"
date: 2022-09-01T09:41:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-09-01

- A bit of work on the "Mapping CG Core–CGSpace–MEL–MARLO Types" spreadsheet
- I tested an item submission on DSpace Test with the Cocoon `org.apache.cocoon.uploads.autosave=false` change
  - The submission works as expected
- Start debugging some region-related issues with csv-metadata-quality
  - I created a new test file `test-geography.csv` with some different scenarios
  - I also fixed a few bugs and improved the region-matching logic
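- The gist of the region matching can be sketched like this, using a tiny hardcoded country-to-region lookup (an illustration only; the real tool resolves regions via the country_converter library):

```python
# Minimal sketch of region matching: derive the UN M.49 regions implied
# by an item's countries and report any missing from the region field.
# The lookup table is a hardcoded stand-in, not the real data source.
COUNTRY_TO_REGION = {
    "Kenya": "Eastern Africa",
    "Viet Nam": "South-eastern Asia",
    "Peru": "South America",
}

def missing_regions(countries: list[str], regions: list[str]) -> list[str]:
    """Return regions implied by the countries but absent from the record."""
    expected = {COUNTRY_TO_REGION[c] for c in countries if c in COUNTRY_TO_REGION}
    return sorted(expected - set(regions))

print(missing_regions(["Kenya", "Peru"], ["Eastern Africa"]))
# ['South America']
```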

<!--more-->

- I filed [an issue for the "South-eastern Asia" case mismatch in country_converter](https://github.com/konstantinstadler/country_converter/issues/115) on GitHub
- Meeting with Moayad to discuss OpenRXV developments
  - He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more

## 2022-09-02

- I worked a bit more on exclusion and skipping logic in csv-metadata-quality
- I also pruned and updated all the Python dependencies
- Then I released [version 0.6.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0) now that the excludes and region matching are working much better
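- The exclusion logic essentially amounts to dropping the configured fields from each record before any checks run. A rough sketch (illustrative, not the tool's actual code):

```python
# Illustrative sketch of "exclude fields": strip configured metadata
# fields from each CSV row before running quality checks on it.
def drop_excluded(rows: list[dict], exclude: set[str]) -> list[dict]:
    """Return copies of the rows without the excluded metadata fields."""
    return [{k: v for k, v in row.items() if k not in exclude} for row in rows]

rows = [{"dc.title": "Testing", "dcterms.issued": "2022-09"}]
print(drop_excluded(rows, {"dcterms.issued"}))
# [{'dc.title': 'Testing'}]
```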

## 2022-09-05

- Started a harvest on AReS last night
- Looking over the Solr statistics from last month I see many user agents that look suspicious:
  - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
  - Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
  - Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
  - Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
  - Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
  - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
  - Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
  - curb
  - bitdiscovery
  - omgili/0.5 +http://omgili.com
  - Mozilla/5.0 (compatible)
  - Vizzit
  - Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
  - Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
  - Java/17-ea
  - AdobeUxTechC4-Async/3.0.12 (win32)
  - ZaloPC-win32-24v473
  - Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
  - Scoop.it
  - Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
  - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
  - ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
  - WebAPIClient
  - Mozilla/5.0 Firefox/26.0
  - Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents but making typos (`Mozilla / 5.0`)
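- A few cheap heuristics catch many of these (a hand-rolled sketch, not the COUNTER-Robots patterns normally used for spider detection):

```python
import re

# Cheap heuristics for obviously fake user agents seen above:
# spaces around slashes, miscapitalized "Rv:", or an obfuscated email.
def looks_suspicious(ua: str) -> bool:
    return any((
        " / " in ua,                    # typos like "Mozilla / 5.0"
        bool(re.search(r"\bRv:", ua)),  # "Rv:" instead of "rv:"
        "(at)" in ua,                   # obfuscated email address in the UA
    ))

print(looks_suspicious("Mozilla / 5.0(Windows NT 10.0; Win64; x64)"))  # True
print(looks_suspicious("Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0"))  # False
```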

- Tons of hosts making requests like this:

```console
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
```

- I got a list of hosts making requests like that so I can purge their hits:

```console
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt
```
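- The purge then boils down to a Solr delete-by-query against the statistics core. A sketch of building the payload (illustrative; the actual purging here was done with my existing ilri helper scripts):

```python
# Sketch: build a Solr delete-by-query payload to purge statistics hits
# from a list of IPs. This would be POSTed to /solr/statistics/update.
def solr_delete_ips(ips: list[str]) -> str:
    clause = " OR ".join(f"ip:{ip}" for ip in ips)
    return f"<delete><query>{clause}</query></delete>"

print(solr_delete_ips(["1.2.3.4", "5.6.7.8"]))
# <delete><query>ip:1.2.3.4 OR ip:5.6.7.8</query></delete>
```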

- I purged 4,718 hits from those IPs
- Apparently there are some new Hetzner ranges that I hadn't blocked yet
- I got a [list of Hetzner's IPs from IP Quality Score](https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh) then added them to the existing ones in my Ansible playbooks:

```console
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
```
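- Python's `ipaddress` module is handy for deduplicating combined network lists like this, since it also collapses ranges already covered by a larger one (the ranges below are examples, not the actual Hetzner list):

```python
import ipaddress

# Combine an old and a new list of networks, collapsing duplicates and
# any range already covered by a larger one (example ranges only).
old = ["5.9.0.0/16", "46.4.0.0/16"]
new = ["46.4.0.0/17", "136.243.0.0/16"]
nets = [ipaddress.ip_network(n) for n in old + new]
combined = [str(n) for n in ipaddress.collapse_addresses(nets)]
print(combined)
# ['5.9.0.0/16', '46.4.0.0/16', '136.243.0.0/16']
```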

- I will add this new list to nginx's `bot-networks.conf` so they get throttled when scraping XMLUI and get classified as bots in Solr statistics
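- The `bot-networks.conf` include has roughly this shape: an nginx `geo` block mapping client addresses to a flag that the throttling and bot classification key off of (an illustrative sketch, not the actual file):

```nginx
# Illustrative shape of bot-networks.conf (not the actual file): map
# client IPs to a flag that later drives throttling and bot classification.
geo $bot_network {
    default        0;
    5.9.0.0/16     1;   # example Hetzner range
    46.4.0.0/16    1;   # example Hetzner range
}
```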

- Then I purged hits from the following user agents:

```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics

Total number of hits from bots: 12220
```

- Then I will add these user agents to the ILRI spider override in DSpace

## 2022-09-06

- I'm testing dspace-statistics-api with our DSpace 7 test server
  - After setting up the environment and the database, `python -m dspace_statistics_api.indexer` runs without issues
- While playing with Solr I tried to search for statistics from this month using `time:2022-09*` but I get this error: "Can't run prefix queries on numeric fields"
  - I guess the syntax in Solr changed since 4.10...
  - This works, but is super annoying: `time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]`
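- Building those verbose month ranges is less annoying with a small helper (a sketch; `calendar.monthrange` gives the last day of the month):

```python
import calendar

def solr_month_range(field: str, year: int, month: int) -> str:
    """Build an inclusive Solr range query covering a whole month."""
    last_day = calendar.monthrange(year, month)[1]
    return (
        f"{field}:[{year}-{month:02d}-01T00:00:00Z "
        f"TO {year}-{month:02d}-{last_day:02d}T23:59:59Z]"
    )

print(solr_month_range("time", 2022, 9))
# time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]
```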

## 2022-09-07

- I tested the controlled-vocabulary changes on DSpace 6 and they work fine
  - Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values
  - This is a pain because it means I have to re-do the IDs in each file every time I update them
  - If I add `id="0000"` to each, then I can use [this vim expression](https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers) `let i=0001 | g/0000/s//\=i/ | let i=i+1` to replace the numbers with increments starting from 1
- Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week
  - We made progress with concrete actions and will continue next week
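- The vim renumbering trick above can also be mimicked outside vim; a quick Python sketch of the same idea (a hypothetical helper, assuming the `id="0000"` placeholders):

```python
import re

def renumber_ids(xml: str) -> str:
    """Replace each id="0000" placeholder with the next number, starting at 1."""
    counter = 0
    def repl(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f'id="{counter}"'
    return re.sub(r'id="0000"', repl, xml)

print(renumber_ids('<node id="0000"/><node id="0000"/>'))
# <node id="1"/><node id="2"/>
```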

## 2022-09-08

- I had a meeting with Nicky from UNEP to discuss issues they are having with their DSpace
  - I told her about the meeting of DSpace community people that we're planning at ILRI in the next few weeks
## 2022-09-09

- Add some value mappings to AReS because I see a lot of incorrect regions and countries
- I also found some values that were blank in CGSpace so I deleted them:

```console
dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
DELETE 70
dspace=# COMMIT;
COMMIT
```

- Start a full Discovery index on CGSpace to catch these changes in Discovery

## 2022-09-11

- Today is Sunday and I see the load on the server is high
  - Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it's not from them!
  - Looking at the top IPs this morning:

```console
# cat /var/log/nginx/{access,library-access,oai,rest}.log /var/log/nginx/{access,library-access,oai,rest}.log.1 | grep '11/Sep/2022' | awk '{print $1}' | sort | uniq -c | sort -h | tail -n 40
...
    165 64.233.172.79
    166 87.250.224.34
    200 69.162.124.231
    202 216.244.66.198
    385 207.46.13.149
    398 207.46.13.147
    421 66.249.64.185
    422 157.55.39.81
    442 2a01:4f8:1c17:5550::1
    451 64.124.8.36
    578 137.184.159.211
    597 136.243.228.195
   1185 66.249.64.183
   1201 157.55.39.80
   3135 80.248.237.167
   4794 54.195.118.125
   5486 45.5.186.2
   6322 2a01:7e00::f03c:91ff:fe9a:3a37
   9556 66.249.64.181
```

- The top is still Google, but all their requests are HTTP 503 because I classified them as bots, on XMLUI at least
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
  - That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as 'bot' for XMLUI so most of these requests are HTTP 503
- On another note, I'm curious to explore enabling caching of certain REST API responses
  - For requests that come from harvesters rather than from actual clients fetching bitstreams or thumbnails, there might be a benefit to speeding up responses for subsequent requestors:

```console
# awk '{print $7}' /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n 10
      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams
      4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata
      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams
      4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata
      5 /rest/handle/10568/110310?expand=all
      5 /rest/handle/10568/89980?expand=all
      5 /rest/handle/10568/97614?expand=all
      6 /rest/handle/10568/107086?expand=all
      6 /rest/handle/10568/108503?expand=all
      6 /rest/handle/10568/98424?expand=all
```

- I specifically must not cache requests for bitstreams because those come from actual users and we need to keep the real requests so we get the statistics hit
- Will be interesting to check the results above as the day goes on (now 10AM)
- To estimate the potential savings from caching I checked how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday's log):

```console
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l
33733
# awk '{print $7}' /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk '$1 > 1' | wc -l
5637
```
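- Those two numbers make the potential benefit easy to ballpark: of the 33,733 unique non-bitstream paths, 5,637 were requested more than once, so roughly one in six paths could have been served from cache after the first hit:

```python
# Ballpark the cache benefit from the two counts above: paths requested
# more than once could be served from cache after the first request.
unique_paths = 33733
repeated_paths = 5637
print(f"{repeated_paths / unique_paths:.1%} of paths were requested more than once")
# 16.7% of paths were requested more than once
```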

- In the afternoon I started a harvest on AReS (which should affect the numbers above also)
- I enabled an nginx proxy cache on DSpace Test for this location regex: `location ~ /rest/(handle|items|collections|communities)/.+`
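- The cache configuration is roughly along these lines (a sketch: the zone name, sizes, validity period, and the `tomcat_http` upstream name are my assumptions, not the exact values):

```nginx
# Sketch of the REST proxy cache (zone name, sizes, validity, and the
# "tomcat_http" upstream are assumptions, not the exact configuration).
proxy_cache_path /var/cache/nginx/rest_cache levels=1:2
                 keys_zone=rest_cache:10m max_size=1g inactive=24h;

location ~ /rest/(handle|items|collections|communities)/.+ {
    proxy_cache rest_cache;
    proxy_cache_valid 200 24h;
    proxy_pass http://tomcat_http;
}
```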

## 2022-09-12

- I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled

<!-- vim: set sw=2 ts=2: -->