Add notes for 2021-06-16

This commit is contained in:
2021-06-16 18:31:15 +03:00
parent a41fb6d970
commit fb2bd040a7
25 changed files with 68 additions and 31 deletions

View File

@ -68,4 +68,21 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
- Dump and re-create indexes on AReS (as above) so I can do a harvest
## 2021-06-16
- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
- For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
- I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
- I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
- Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
- I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
- It will hopefully also fix the duplicate and missing items issues
- I had a Skype with him to discuss
- I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
```console
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
```
<!-- vim: set sw=2 ts=2: -->