+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]
+++
## 2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
  - `/handle/10568/3353/discover`
  - `/handle/10568/16510/browse`
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent on July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the stray newlines in vim using `:g/^$/d`, which deletes empty lines
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
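
The newline cleanup above was done by hand in vim. As a rough alternative, a scripted version might look like the sketch below. It assumes the export is a standard quoted CSV with a `dc.description.abstract` column; the helper names `strip_newlines` and `clean_csv` are hypothetical and this is not the procedure that was actually used.

```python
import csv
import io


def strip_newlines(value):
    """Collapse any run of whitespace (including CR/LF) into a single space."""
    return " ".join(value.split())


def clean_csv(text, column="dc.description.abstract"):
    """Return CSV text with embedded newlines removed from one column.

    The csv module parses quoted fields that span several physical lines,
    so each logical record is read as a single row before being cleaned.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row.get(column):
            row[column] = strip_newlines(row[column])
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample data, only to illustrate the behaviour.
    sample = 'id,dc.description.abstract\n1,"First line\n\nsecond line"\n'
    print(clean_csv(sample))
```

Reading and writing through the `csv` module, rather than deleting lines with a regular expression, keeps quoted fields intact, so a record is never split in two by a newline inside an abstract.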
<!--more-->
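
To illustrate why the `robots.txt` rules mentioned above do not cover the per-handle URLs, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` content is an assumption for illustration (the real file on CGSpace is not reproduced here), and `build_parser` is a hypothetical helper. Python's parser matches `Disallow` rules as simple path prefixes, so a rule for `/discover` says nothing about `/handle/10568/3353/discover`.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only: block the top-level dynamic pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    paths = [
        "/discover",
        "/browse",
        "/handle/10568/3353/discover",
        "/handle/10568/16510/browse",
    ]
    for path in paths:
        allowed = parser.can_fetch("Googlebot", path)
        print(f"{path}: {'allowed' if allowed else 'disallowed'}")
```

Under these assumed rules, only the two top-level paths are disallowed. Some crawlers also honour wildcard patterns in `robots.txt`, but the standard library parser used here does not, so this sketch only demonstrates plain prefix matching.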