Add notes for 2020-08-11

2020-08-11 11:35:05 +03:00
parent cb03863647
commit ccecd63eb0
20 changed files with 62 additions and 25 deletions

@@ -367,5 +367,22 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
- In Twitter's case they were also getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter
- I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my [CountryCodeTagger](https://github.com/ilri/cgspace-java-helpers) curation task
- I still need to set up a cron job for it...
- This tagged over 50,000 country codes!
```
dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
count
-------
50812
(1 row)
```
## 2020-08-11
- I noticed some more hits from Macaroni's WordPress harvester that I hadn't caught last week
- 104.198.13.34 made many requests without a user agent, with a "WordPress" user agent, and with their new "RTB website BOT" user agent: about 100,000 in total in 2020, and maybe another 70,000 in other years
- I will purge them and add them to the Tomcat Crawler Session Manager Valve and the DSpace bots list so they don't get logged in Solr
- I noticed a bunch of user agents with "Crawl" in the Solr stats, which is strange because the DSpace spider agents file has had "crawl" for a long time (and it is case insensitive)
- In any case I will purge them and add them to the Tomcat Crawler Session Manager Valve so that at least their sessions get re-used
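
As a rough illustration of the case-insensitive matching mentioned above, here is a minimal Python sketch. It is not how DSpace or Tomcat actually evaluate their agent lists; the sample user agents (apart from the "WordPress" and "RTB website BOT" strings from these notes), the pattern list, and the `matching_agents` helper are all hypothetical.

```python
import re

# Hypothetical sample of user agents; "ExampleCrawl" and the Firefox string
# are made up for illustration and do not come from the real Solr stats
user_agents = [
    "Mozilla/5.0 (compatible; ExampleCrawl/1.0)",
    "WordPress/5.4; https://example.org",
    "RTB website BOT",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/79.0",
]

# Hypothetical patterns, similar in spirit to entries in a spider agents file
patterns = ["crawl", "bot", "wordpress"]


def matching_agents(agents, patterns):
    """Return the agents that match any pattern, ignoring case."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [a for a in agents if any(c.search(a) for c in compiled)]


flagged = matching_agents(user_agents, patterns)
for agent in flagged:
    print(agent)
```

With case-insensitive matching, an agent containing "Crawl" is caught by the lowercase "crawl" pattern, so agents like that showing up in the stats suggests they were logged some other way (for example before the pattern applied to them).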
<!-- vim: set sw=2 ts=2: -->