Update notes for 2020-09-10

2020-09-10 15:00:40 +03:00
parent 9d0f0cbfde
commit 7b3aa58055
22 changed files with 66 additions and 27 deletions

@@ -190,5 +190,20 @@ Would fix 3 occurences of: SOUTHWEST ASIA
- I think we need to wait for the web team, though, as they need to update their mappings
- Not to mention that we'll need to give WLE and CCAFS time to update their harvesters as well... hmmm
- Looking at the top user agents active on CGSpace in 2020-08 and I see:
  - `Delphi 2009`: 235353 (this is the GARDIAN harvester I guess, as the IP is in Greece)
  - `Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)`: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA's content)
  - `RTB website BOT`: 12282
  - `ILRI Livestock Website Publications importer BOT`: 9393
- Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn't commit the change
- HTTrack is in the agents list so I'm not sure why DSpace registers a hit from that request
- Also, I am surprised to see the RTB and ILRI bots here because they have "BOT" in their names, so they should be dropped as well
- I also see hits from `curl` and `Java/1.8.0_66` and `Apache-HttpClient` so WTF... those are supposed to be dropped by the default agents list
- Some IP `2607:f298:5:101d:f816:3eff:fed9:a484` made 9,000 requests with the `RI/1.0` user agent this year...
  - That's on DreamHost...?
- I purged 448658 hits from these agents and added `Delphi` to our local agents overload for Solr, as well as to Tomcat's Crawler Session Manager Valve so that it forces these clients to re-use a single session
- I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
  - This bot made 8,000 requests to CGSpace this year
  - I purged about 20,000 total requests from this bot from our Solr stats for the last few years
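
A rough sketch of the kind of check I have in mind when comparing the top user agents against a spider agents list. The pattern list here is a short, hypothetical stand-in (the real DSpace and COUNTER-Robots lists are much longer), and I have not confirmed how DSpace compiles its patterns, so the `ignore_case` flag is only there to show how much that choice would matter for agents like `RTB website BOT`:

```python
import re

# Hypothetical, shortened pattern list for illustration only; NOT the real
# DSpace or COUNTER-Robots agents file.
AGENT_PATTERNS = ["Delphi", "HTTrack", "bot", "curl", r"Java/", "Apache-HttpClient"]

# Hit counts copied from the notes above (top user agents in 2020-08).
HITS = {
    "Delphi 2009": 235353,
    "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)": 57004,
    "RTB website BOT": 12282,
    "ILRI Livestock Website Publications importer BOT": 9393,
}


def compile_patterns(patterns, ignore_case=True):
    """Compile each pattern as a regex, optionally case-insensitive."""
    flags = re.IGNORECASE if ignore_case else 0
    return [re.compile(p, flags) for p in patterns]


def is_spider(user_agent, compiled):
    """True if any pattern matches anywhere in the user agent string."""
    return any(p.search(user_agent) for p in compiled)


def spider_hits(hits, compiled):
    """Return only the agents (and their hit counts) that match a pattern."""
    return {agent: n for agent, n in hits.items() if is_spider(agent, compiled)}


if __name__ == "__main__":
    for ignore_case in (True, False):
        matched = spider_hits(HITS, compile_patterns(AGENT_PATTERNS, ignore_case))
        print(f"ignore_case={ignore_case}: {len(matched)} agents, {sum(matched.values())} hits")
```

With these made-up patterns, a case-sensitive `bot` pattern misses both of the `BOT` agents, while `Delphi` and `HTTrack` match either way. That is only one possible explanation for the hits above, not something I have verified against DSpace itself.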
<!-- vim: set sw=2 ts=2: -->