2022-09-01
- A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
- I tested an item submission on DSpace Test with the Cocoon
org.apache.cocoon.uploads.autosave=false
change
- The submission works as expected
- Start debugging some region-related issues with csv-metadata-quality
- I created a new test file
test-geography.csv
with some different scenarios
- I also fixed a few bugs and improved the region-matching logic
2022-09-02
- I worked a bit more on exclusion and skipping logic in csv-metadata-quality
- I also pruned and updated all the Python dependencies
- Then I released version 0.6.0 now that the excludes and region-matching support are working way better
2022-09-05
- Started a harvest on AReS last night
- Looking over the Solr statistics from last month I see many user agents that look suspicious:
- Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)
- Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36
- Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
- Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
- Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)
- curb
- bitdiscovery
- omgili/0.5 +http://omgili.com
- Mozilla/5.0 (compatible)
- Vizzit
- Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0
- Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0
- Java/17-ea
- AdobeUxTechC4-Async/3.0.12 (win32)
- ZaloPC-win32-24v473
- Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com
- Scoop.it
- Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
- ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0
- WebAPIClient
- Mozilla/5.0 Firefox/26.0
- Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (
Mozilla / 5.0
)
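The two signals mentioned above (spaces around the slash and very old Firefox versions) can be checked mechanically. This is only a rough sketch: the function name `suspicious_reason` and the cutoff `OLD_FIREFOX_MAX = 13` (Firefox 13 dates from 2012) are assumptions chosen for illustration, not something used on the server.

```python
import re
from typing import Optional

# Assumed cutoff for "over ten years old": Firefox 13 dates from 2012
OLD_FIREFOX_MAX = 13


def suspicious_reason(user_agent: str) -> Optional[str]:
    """Return a short reason if the user agent looks suspicious, else None."""
    # Genuine browsers write "Mozilla/5.0" with no whitespace around the slash
    if re.search(r"\s/|/\s", user_agent):
        return "spaces around slash"
    # Flag Firefox major versions at or below the assumed cutoff
    match = re.search(r"Firefox/(\d+)", user_agent)
    if match and int(match.group(1)) <= OLD_FIREFOX_MAX:
        return "very old Firefox"
    return None
```

A heuristic like this would only narrow down candidates; agents such as `curb` or `Vizzit` still need to be spotted by eye.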
- Tons of hosts making requests like this:
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
- I got a list of hosts making requests like that so I can purge their hits:
# zcat /var/log/nginx/{access,library-access,oai,rest}.log.[123]*.gz | grep 'String.fromCharCode(' | awk '{print $1}' | sort -u > /tmp/ips.txt
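For reference, the same filtering could be done in Python, roughly as below. It is a sketch of the pipeline above, not a replacement for it: the function names `ips_with_marker` and `read_gzipped_logs` are hypothetical, and it assumes the client IP is the first whitespace-separated field of each log line, as in the `awk '{print $1}'` step.

```python
import glob
import gzip

MARKER = "String.fromCharCode("


def ips_with_marker(lines, marker=MARKER):
    """Return the sorted, de-duplicated first field of every line containing marker."""
    ips = set()
    for line in lines:
        if marker in line:
            fields = line.split()
            if fields:
                # Assumes the client IP is the first field of the log line
                ips.add(fields[0])
    return sorted(ips)


def read_gzipped_logs(pattern):
    """Yield text lines from every gzipped file matching a glob pattern."""
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps going when a request contains invalid bytes
        with gzip.open(path, "rt", errors="replace") as handle:
            yield from handle
```

Usage would be something like `ips_with_marker(read_gzipped_logs("/var/log/nginx/access.log.[123]*.gz"))`, repeated for each log prefix, since Python's `glob` does not expand the `{a,b}` brace syntax that the shell does.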
- I purged 4,718 hits from those IPs
- I see some new Hetzner ranges that I hadn’t blocked yet apparently?
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
- I will add this new list to nginx’s
bot-networks.conf
so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
- Then I purged hits from the following user agents:
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics
Total number of hits from bots: 12220
- Then I will add these user agents to the ILRI spider override in DSpace
2022-09-06
- I’m testing dspace-statistics-api with our DSpace 7 test server
- After setting up the environment and the database, the
python -m dspace_statistics_api.indexer
runs without issues
- While playing with Solr I tried to search for statistics from this month using
time:2022-09*
but I get this error: “Can’t run prefix queries on numeric fields”
- I guess that the syntax in Solr changed since 4.10…
- This works, but is super annoying:
time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]
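To avoid typing the range by hand each month, the query string can be generated. This is a minimal sketch: the helper name `solr_month_range` is hypothetical and it only builds the explicit range syntax shown above, using the standard library to find the last day of the month.

```python
import calendar


def solr_month_range(year: int, month: int, field: str = "time") -> str:
    """Build an explicit Solr range query covering one calendar month (UTC)."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"


print(solr_month_range(2022, 9))
```

For September 2022 this prints the same range as the one above, and it handles month lengths and leap years without manual lookup.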