mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-06-16
This commit is contained in:
@ -68,4 +68,21 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
|
||||
|
||||
- Dump and re-create indexes on AReS (as above) so I can do a harvest
|
||||
|
||||
## 2021-06-16
|
||||
|
||||
- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
|
||||
- For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
|
||||
- I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
|
||||
- I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
|
||||
- Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
|
||||
- I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
|
||||
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
|
||||
- It will hopefully also fix the duplicate and missing items issues
|
||||
- I had a Skype with him to discuss
|
||||
- I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
|
||||
|
||||
```console
|
||||
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
|
||||
```
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
Reference in New Issue
Block a user