mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-08-11
@ -367,5 +367,22 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
- In Twitter's case they were getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter
- I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my [CountryCodeTagger](https://github.com/ilri/cgspace-java-helpers) curation task
- I still need to set up a cron job for it...
- This tagged more than 50,000 country code values!

```
dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
 count
-------
 50812
(1 row)
```
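
- A minimal sketch of running the same count non-interactively, for example to re-check the number after a later tagging run. The `count_query` helper and the `RUN_QUERY` switch are hypothetical names, and the `dspace` database name is assumed from the prompt above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the count query for one metadata field ID
# (243 is the field ID used in the query above)
count_query() {
    local field_id="$1"
    printf 'SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = %d AND resource_type_id = 2;' "$field_id"
}

# Only contact the database when explicitly asked to
# (RUN_QUERY and the "dspace" database name are assumptions)
if [ "${RUN_QUERY:-0}" = "1" ]; then
    # -A: unaligned output, -t: tuples only, so only the number is printed
    psql -d dspace -At -c "$(count_query 243)"
fi
```

- Keeping the query in a function makes it easy to reuse for another field ID without retyping the statement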

## 2020-08-11

- I noticed some more hits from Macaroni's WordPress harvester that I hadn't caught last week
- 104.198.13.34 made many requests without a user agent, with a "WordPress" user agent, and with their new "RTB website BOT" user agent: about 100,000 in total in 2020, and maybe another 70,000 in the other years
- I will purge them and add them to the Tomcat Crawler Session Manager Valve and the DSpace bots list so they don't get logged in Solr
- I noticed a bunch of user agents with "Crawl" in the Solr stats, which is strange because the DSpace spider agents file has had "crawl" for a long time (and it is case-insensitive)
- In any case I will purge them and add them to the Tomcat Crawler Session Manager Valve so that at least their sessions get re-used
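
- A minimal sketch of what such a purge could look like using Solr's delete-by-query, not necessarily the commands actually used here. The `solr_delete_payload` and `purge` helpers, the `DRY_RUN` switch, and the `ip` and `userAgent` field names are assumptions; the URL follows the `localhost:8081` pattern from the earlier `curl` command, and each yearly core (such as `statistics-2010`) would need the same request:

```shell
#!/usr/bin/env bash
# Base URL of the Solr core to purge (assumed; override for yearly cores)
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr/statistics}"

# Hypothetical helper: build a delete-by-query XML payload for one field and value
solr_delete_payload() {
    local field="$1" value
    # Escape the characters that are special in XML text
    value="$(printf '%s' "$2" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')"
    printf '<delete><query>%s:%s</query></delete>' "$field" "$value"
}

# Hypothetical helper: print the request by default, send it only when DRY_RUN=0
purge() {
    local payload
    payload="$(solr_delete_payload "$1" "$2")"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would POST to $SOLR_URL/update: $payload"
    else
        curl -s "$SOLR_URL/update?softCommit=true" \
            -H "Content-Type: text/xml" --data-binary "$payload"
    fi
}

# The IP and one of the user agents mentioned above
# (field names "ip" and "userAgent" are assumptions about the statistics schema)
purge ip 104.198.13.34
purge userAgent '"RTB website BOT"'
```

- Defaulting to a dry run means nothing is deleted until the generated queries have been checked by eye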
<!-- vim: set sw=2 ts=2: -->