2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
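The valve's effect can be spot-checked from the log itself: if crawlers are being folded into shared sessions, the number of distinct session IDs stays small even under heavy bot traffic. A sketch (the sample log lines below are invented for illustration; real DSpace logs record a `session_id=` token on each request):

```shell
# Build a tiny sample log (illustrative lines, not real CGSpace traffic).
cat > /tmp/dspace.log.sample <<'EOF'
2017-08-01 10:00:01 INFO  org.dspace.app.xmlui... session_id=ABCDEF0123456789ABCDEF0123456789
2017-08-01 10:00:02 INFO  org.dspace.app.xmlui... session_id=ABCDEF0123456789ABCDEF0123456789
2017-08-01 10:00:03 INFO  org.dspace.app.xmlui... session_id=FFFFFF0123456789ABCDEF0123456789
EOF

# Count distinct Tomcat session IDs: three requests, two sessions here.
grep -o -E 'session_id=[A-Z0-9]{32}' /tmp/dspace.log.sample | sort | uniq | wc -l
```

Run against the real `dspace.log.2017-08-01`, a small count relative to the request volume confirms the valve is grouping crawlers into shared sessions.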
- But many of the bots are browsing dynamic URLs like:
  - /handle/10568/3353/discover
  - /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids search engines from indexing the page, not from crawling it!
- Also, the bot has to successfully request the page before it even receives the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
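One way to do that, assuming nginx sits in front of Tomcat (a sketch only: the user-agent list, server name, and ports are illustrative, not our actual configuration):

```nginx
# Flag known crawler user agents (illustrative list).
map $http_user_agent $is_crawler {
    default                                  0;
    ~*(baiduspider|googlebot|bingbot|yahoo)  1;
}

server {
    listen 80;
    server_name cgspace.cgiar.org;

    # Return 403 to crawlers requesting per-handle dynamic pages like
    # /handle/10568/3353/discover or /handle/10568/16510/browse.
    location ~ ^/handle/[0-9]+/[0-9]+/(discover|browse) {
        if ($is_crawler) {
            return 403;
        }
        proxy_pass http://localhost:8080;
    }

    location / {
        proxy_pass http://localhost:8080;
    }
}
```

Unlike the `X-Robots-Tag` header, this stops the expensive Discovery/browse requests before they ever reach Tomcat.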
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent on July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the stray newlines in vim, deleting the resulting empty lines with `g/^$/d`
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
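The embedded-newline problem could also be fixed programmatically instead of by hand in vim. A sketch in Python (the function name is mine; Python's `csv` module parses quoted fields containing newlines correctly, so we only have to flatten them inside the one column):

```python
import csv
import io


def clean_abstracts(csv_text, column="dc.description.abstract"):
    """Replace embedded newlines in one CSV column with single spaces.

    Newlines hidden inside quoted abstract fields are what made the
    July 20th export look like ~100 rows instead of 2415.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        value = row.get(column) or ""
        # Collapse all runs of whitespace (including newlines) to one space.
        row[column] = " ".join(value.split())
        writer.writerow(row)
    return out.getvalue()
```

This keeps every record on a single physical line, so OpenRefine sees the full row count when it re-imports the CSV.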