+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]
+++
2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
  - `/handle/10568/3353/discover`
  - `/handle/10568/16510/browse`
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
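
To illustrate why the top-level `robots.txt` rules don't cover the per-handle URLs, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ASSUMED_ROBOTS_TXT` are an assumption based on the description above (the real file on the server may contain more), and `allowed_paths` is a hypothetical helper written just for this check:

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: only the top-level paths are disallowed, as described above.
# The real robots.txt on the server may differ.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def allowed_paths(robots_txt, paths, user_agent="*"):
    """Map each path to whether the given robots.txt lets user_agent fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


if __name__ == "__main__":
    paths = [
        "/discover",
        "/browse",
        "/handle/10568/3353/discover",
        "/handle/10568/16510/browse",
    ]
    for path, allowed in allowed_paths(ASSUMED_ROBOTS_TXT, paths).items():
        print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Python's parser matches `Disallow` values as plain path prefixes, so `/discover` never matches a path that begins with `/handle/`. Some crawlers honor wildcard patterns in `robots.txt`, but this parser does not evaluate them, so any wildcard rule would need to be checked against the actual bots.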