CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

August, 2017

2017-08-01

  • Linode sent an alert that CGSpace (linode18) had been using 350% CPU for the past two hours
  • I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
  • The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
  • This means our Tomcat Crawler Session Valve is working
  • But many of the bots are browsing dynamic URLs like:
    • /handle/10568/3353/discover
    • /handle/10568/16510/browse
  • The robots.txt only blocks the top-level /discover and /browse URLs, so it does not cover the per-handle URLs above… we will need to find a way to forbid the bots from accessing them!
  • Relevant issue from DSpace Jira (semi-resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
  • It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
  • Also, the bot has to successfully browse the page first so it can receive the HTTP header…
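
To illustrate why the per-handle URLs slip past robots.txt, here is a minimal sketch using Python's standard-library robots.txt parser. The ROBOTS_TXT content is an assumption reconstructed from the description above (only the top-level /discover and /browse rules), not a copy of the real CGSpace file, and the is_allowed helper is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: only the top-level rules described in the notes,
# NOT the actual CGSpace file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def is_allowed(path, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


paths = [
    "/discover",
    "/browse",
    "/handle/10568/3353/discover",
    "/handle/10568/16510/browse",
]

# Map each path to whether a crawler is allowed to fetch it.
results = {path: is_allowed(path) for path in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

The standard-library parser matches rules as plain path prefixes, so the two top-level URLs come out as blocked while the /handle/… variants are still allowed. Some crawlers also honor wildcard patterns in robots.txt, but that support varies by crawler and is not modeled here.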