Update notes for 2017-08-01

This commit is contained in:
2017-08-01 16:31:58 +03:00
parent 5b11434f0f
commit e10b1fecb4
9 changed files with 59 additions and 8 deletions

View File

@ -18,5 +18,11 @@ tags = ["Notes"]
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using `g/^$/d`
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
<!--more-->