- A bit of work on the "Mapping CG Core–CGSpace–MEL–MARLO Types" spreadsheet
- I tested an item submission on DSpace Test with the Cocoon `org.apache.cocoon.uploads.autosave=false` change
- The submission works as expected
- Start debugging some region-related issues with csv-metadata-quality
- I created a new test file `test-geography.csv` with some different scenarios
- I also fixed a few bugs and improved the region-matching logic
<!--more-->
- I filed [an issue for the "South-eastern Asia" case mismatch in country_converter](https://github.com/konstantinstadler/country_converter/issues/115) on GitHub
- Meeting with Moayad to discuss OpenRXV developments
- He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more
## 2022-09-02
- I worked a bit more on exclusion and skipping logic in csv-metadata-quality
- I also pruned and updated all the Python dependencies
- Then I released [version 0.6.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0) now that the excludes and region matching support is working way better
- For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (`Mozilla / 5.0`)
- Tons of hosts making requests like this:
```console
GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&isAllowed=\x22><script%20>alert(String.fromCharCode(88,83,83))</script> HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
```
- I got a list of hosts making requests like that so I can purge their hits:
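- A minimal sketch of how such hosts could be pulled out of an access log, assuming the client IP is the first space-separated field of each line (as in nginx's combined format); `probing_hosts` and `PROBE_RE` are hypothetical names, not part of any existing script:

```python
import re
from urllib.parse import unquote

# Markers taken from the probe shown above; extend as needed (assumption)
PROBE_RE = re.compile(r"<\s*script|String\.fromCharCode", re.IGNORECASE)


def probing_hosts(log_lines):
    """Return the sorted set of client hosts whose log line looks like an XSS probe."""
    hosts = set()
    for line in log_lines:
        parts = line.split(" ", 1)
        if len(parts) < 2:
            continue  # blank or malformed line
        host, rest = parts
        # Decode percent-escapes so "%3Cscript" and "<script" are treated alike
        if PROBE_RE.search(unquote(rest)):
            hosts.add(host)
    return sorted(hosts)
```

- Calling `probing_hosts(open("/path/to/access.log"))` would give one host per entry; the log path is a placeholder.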
- I see some new Hetzner ranges that I hadn't blocked yet apparently?
- I got a [list of Hetzner's IPs from IP Quality Score](https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh) then added them to the existing ones in my Ansible playbooks:
```console
$ awk '{print $1}' /tmp/hetzner.txt | wc -l
36
$ sort -u /tmp/hetzner-combined.txt | wc -l
49
```
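- `sort -u` only removes exact duplicate lines. As a sketch of a stricter merge, Python's standard `ipaddress` module can also collapse adjacent or overlapping CIDR ranges; `merge_networks` is a hypothetical helper and the ranges in the usage note are documentation addresses, not Hetzner's:

```python
import ipaddress


def merge_networks(*lists):
    """Deduplicate CIDR strings and collapse adjacent or overlapping ranges."""
    v4, v6 = set(), set()
    for entries in lists:
        for entry in entries:
            entry = entry.strip()
            if not entry or entry.startswith("#"):
                continue  # skip blank lines and comments
            net = ipaddress.ip_network(entry, strict=False)
            (v4 if net.version == 4 else v6).add(net)
    # collapse_addresses() needs a single IP version per call
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in merged]
```

- For example, `merge_networks(["192.0.2.0/25", "192.0.2.128/25"], ["192.0.2.0/25", "198.51.100.0/24"])` yields `192.0.2.0/24` and `198.51.100.0/24`, so the merged list can be shorter than what `sort -u` reports.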
- I will add this new list to nginx's `bot-networks.conf` so they get throttled on scraping XMLUI and get classified as bots in Solr statistics
- Then I purged hits from the following user agents:
```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents
Found 374 hits from curb in statistics
Found 350 hits from bitdiscovery in statistics
Found 564 hits from omgili in statistics
Found 390 hits from Vizzit in statistics
Found 9125 hits from AdobeUxTechC4-Async in statistics
Found 97 hits from ZaloPC-win32-24v473 in statistics
Found 518 hits from nbertaupete95 in statistics
Found 218 hits from Scoop.it in statistics
Found 584 hits from WebAPIClient in statistics
Total number of hits from bots: 12220
```
- Then I will add these user agents to the ILRI spider override in DSpace
- I'm testing dspace-statistics-api with our DSpace 7 test server
- After setting up the env and the database, `python -m dspace_statistics_api.indexer` runs without issues
- While playing with Solr I tried to search for statistics from this month using `time:2022-09*`, but I got this error: "Can't run prefix queries on numeric fields"
- I guess that the syntax in Solr changed since 4.10...
- This works, but is super annoying: `time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]`
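- To avoid typing that range by hand, it can be generated. A minimal sketch; `solr_month_range` is a hypothetical helper (not part of dspace-statistics-api) and the field name defaults to the `time` field used above:

```python
import calendar


def solr_month_range(year, month, field="time"):
    """Build an inclusive Solr range query covering one calendar month (UTC)."""
    # monthrange() returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"
```

- `solr_month_range(2022, 9)` reproduces the query above. Solr's date math should also accept an exclusive upper bound such as `time:[2022-09-01T00:00:00Z TO 2022-09-01T00:00:00Z+1MONTH}`, which would avoid computing the last day; treat that as untested here.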
- I tested the controlled-vocabulary changes on DSpace 6 and they work fine
- Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values
- This is a pain because it means I have to re-do the IDs in each file every time I update them
- If I add `id="0000"` to each, then I can use [this vim expression](https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers) `let i=0001 | g/0000/s//\=i/ | let i=i+1` to replace the numbers with increments starting from 1
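- The same renumbering could be scripted outside the editor. A minimal sketch, assuming every node carries the literal placeholder `id="0000"`; `renumber_ids` is a hypothetical helper, and unlike the vim expression (one substitution per matching line) it replaces every occurrence:

```python
import re
from itertools import count


def renumber_ids(xml_text, placeholder="0000", start=1):
    """Replace each id="<placeholder>" with an ascending number, in document order."""
    counter = count(start)
    pattern = re.compile(r'id="' + re.escape(placeholder) + r'"')
    # The replacement function is called once per match, left to right
    return pattern.sub(lambda match: f'id="{next(counter)}"', xml_text)
```

- Running it over a vocabulary file's text and writing the result back gives IDs 1, 2, 3, and so on; real IDs that merely contain `0000` are left alone because the whole attribute is matched.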
- Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week
- We made progress with concrete actions and will continue next week
- The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least
- Then there's 80.248.237.167, which is using a normal user agent and scraping Discovery
- That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as 'bot' for XMLUI so most of these requests are HTTP 503
- On another note, I'm curious to explore enabling caching of certain REST API responses
- For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:
- I specifically must not cache requests for bitstreams, because those come from actual users and we need the real requests to reach DSpace so that they register as statistics hits
- Will be interesting to check the results above as the day goes on (now 10AM)
- To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday's log):
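- A rough sketch of that kind of estimate, assuming access log lines that contain a quoted request such as `"GET /path HTTP/1.1"`; the helper name `repeat_stats` and the path fragments used to skip bitstream requests are assumptions, not the actual command used here:

```python
import re
from collections import Counter

# Matches the quoted request part of a combined-format log line
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[0-9.]+"')


def repeat_stats(log_lines, skip_substrings=("/bitstreams/", "/bitstream/")):
    """Count non-bitstream requests and how many of them repeat an earlier URI."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        uri = match.group(1)
        if any(fragment in uri for fragment in skip_substrings):
            continue  # bitstream downloads must never be served from cache
        counts[uri] += 1
    total = sum(counts.values())
    return {
        "requests": total,
        "unique": len(counts),
        "repeated": sum(1 for n in counts.values() if n > 1),
        # Every request after the first for a URI could have been a cache hit
        "potential_cache_hits": total - len(counts),
    }
```

- The ratio of `potential_cache_hits` to `requests` is an upper bound on the hit rate, since it ignores cache expiry and any differences in request headers.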
- I estimate that it will take about 1GB of cache to harvest 100,000 items from CGSpace with OpenRXV (10,000 pages)
- Basically all but 4 and 5 (bitstreams) should match
- Upload 682 OICRs from MARLO to CGSpace
- We had tested these on DSpace Test last month along with the MELIAs, Policies, and Innovations, but we decided to upload the OICRs first so that other things can link against them as related items