+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]
+++
## 2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
  - This means our Tomcat Crawler Session Manager Valve is working
- But many of the bots are browsing dynamic URLs like:
  - /handle/10568/3353/discover
  - /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi-resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
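A quick way to check whether the crawlers really are sharing a session is to count distinct session IDs in the DSpace log. This is a sketch: the sample log lines below are hypothetical and only approximate the `session_id=` field layout, so adjust the pattern to the real `dspace.log` format:

```shell
# Write a tiny sample file mimicking a dspace.log layout (hypothetical lines)
printf '%s\n' \
  'INFO  org.dspace.app.xmlui ... session_id=AAAA1111BBBB2222CCCC3333DDDD4444 ip_addr=66.249.66.1' \
  'INFO  org.dspace.app.xmlui ... session_id=AAAA1111BBBB2222CCCC3333DDDD4444 ip_addr=66.249.66.2' \
  'INFO  org.dspace.app.xmlui ... session_id=EEEE5555FFFF6666AAAA7777BBBB8888 ip_addr=180.76.15.5' \
  > dspace.log.sample

# Extract session IDs and count the unique ones; a working crawler
# session valve should keep this number low even under heavy bot load
grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.sample | sort -u | wc -l
```

Run against the real `dspace.log.2017-08-01`, a count near 1 for the crawler traffic would confirm the valve is doing its job.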
|
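The valve in question is Tomcat's `CrawlerSessionManagerValve`, which forces all requests whose `User-Agent` matches a crawler pattern to share a single session instead of creating a new one per request. A minimal sketch of how it might be enabled in `server.xml` (the `crawlerUserAgents` regex and interval here are illustrative, not the site's actual configuration):

```xml
<!-- Inside the <Host> (or <Engine>) element of server.xml -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Baiduspider.*"
       sessionInactiveInterval="60" />
```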
||||
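One possible stopgap for the dynamic URLs is wildcard patterns in `robots.txt`. Wildcards are not part of the original robots exclusion standard, but Google, Bing, and Baidu document support for them, so rules like the following could cover the per-handle `/discover` and `/browse` pages (a sketch, not the deployed file):

```text
User-agent: *
Disallow: /discover
Disallow: /browse
Disallow: /handle/*/discover
Disallow: /handle/*/browse
```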
|
||||
<!--more-->