+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]
+++
## 2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
  - /handle/10568/3353/discover
  - /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
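
One option (just a sketch, nothing deployed yet) would be to extend the `robots.txt` with wildcard rules that also cover the per-handle dynamic URLs; Google and Bing honor `*` in `Disallow` paths, though support among other crawlers varies:

```
User-agent: *
# Hypothetical additions: block dynamic discover/browse URLs under any handle
Disallow: /discover
Disallow: /handle/*/discover
Disallow: /handle/*/browse
```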
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
- We might actually have to block these requests with HTTP 403 depending on the user agent
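
If we do go the 403 route, here is a minimal sketch of how it could look in an nginx frontend (assuming nginx or a similar proxy sits in front of Tomcat; the user agent list, regex, and upstream name are illustrative assumptions, not our actual config):

```
# Illustrative sketch only: flag well-known crawlers by user agent
map $http_user_agent $dynamic_url_bot {
    default                                  0;
    ~*(googlebot|baiduspider|bingbot|slurp)  1;
}

server {
    # ... existing server configuration ...

    # Refuse crawler requests for per-handle discover/browse URLs
    location ~ ^/handle/[0-9]+/[0-9]+/(discover|browse) {
        if ($dynamic_url_bot) {
            return 403;
        }
        proxy_pass http://tomcat_http; # hypothetical upstream name
    }
}
```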
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the stray newlines in vim using `g/^$/d`
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
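
For next time, a possible way to avoid the problem entirely (untested here, just a sketch) would be to strip the embedded newlines inside OpenRefine before exporting, using a GREL transform on the `dc.description.abstract` column:

```
value.replace("\n", " ")
```

That keeps each abstract on a single line so the CSV export doesn't split rows.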