cgspace-notes/content/post/2017-08.md

+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]

+++
## 2017-08-01

- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
  - /handle/10568/3353/discover
  - /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using `:g/^$/d`
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet

<!--more-->
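
The user-agent-based blocking considered above could be decided with a small predicate like the following. This is only a sketch: the function name `should_block`, the list of bot substrings, and the path pattern are assumptions made for illustration, not something already deployed on CGSpace, and the real rule would more likely live in the web server configuration.

```python
import re

# Assumption: substrings identifying the crawlers named in the notes
# (Google, Baidu, Bing, Yahoo). Real user-agent strings vary, so this
# list is illustrative only.
BOT_PATTERN = re.compile(r"(googlebot|baiduspider|bingbot|slurp)", re.IGNORECASE)

# Dynamic per-handle pages such as /handle/10568/3353/discover
DYNAMIC_PATH = re.compile(r"^/handle/\d+/\d+/(discover|browse)(/|\?|$)")


def should_block(user_agent: str, path: str) -> bool:
    """Return True if the request should be answered with HTTP 403."""
    is_bot = bool(BOT_PATTERN.search(user_agent))
    is_dynamic = bool(DYNAMIC_PATH.match(path))
    return is_bot and is_dynamic
```

Only requests that are both from a known bot and for a dynamic `/discover` or `/browse` page under a handle are rejected, so bots can still crawl the item pages themselves.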

## 2017-08-02

- Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)
- I think Atmire's Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can't figure it out
- I had a look at the module configuration and couldn't figure out a way to do this, so I [opened a ticket on the Atmire tracker](https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets)
- Atmire responded about the [missing workflow statistics issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500) a few weeks ago but I didn't see it for some reason
- They said they added a publication and saw the workflow stat for the user, so I should try again and let them know

## 2017-08-05

- Usman from CIFOR emailed to ask about the status of our OAI tests for harvesting their DSpace repository
- I told him that the OAI appears to not be harvesting properly after the first sync, and that the control panel shows an "Internal error" for that collection:

![CIFOR OAI harvesting](/cgspace-notes/2017/08/cifor-oai-harvesting.png)

- I don't see anything related in our logs, so I asked him to check for our server's IP in their logs
- Also, in the meantime I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn't reset the collection, just the harvester status!)
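
To reproduce what happens after the first sync, it might help to issue the OAI-PMH requests by hand. The sketch below builds a `ListRecords` URL with the standard `from` argument, which is what an incremental harvest presumably sends. The helper name `list_records_url`, the example base URL, and the `oai_dc` metadata prefix are assumptions; the harvester configured in DSpace may use a different prefix and set.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for manual testing."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, for illustration only
print(list_records_url("https://example.org/oai/request", from_date="2017-08-01"))
```

Fetching the resulting URL with and without `from` would show whether the remote server answers incremental requests differently from the initial full harvest.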

## 2017-08-07

- Apply Abenet's corrections for the CGIAR Library's Consortium subcommunity (697 records)
- I had to fix a few small things: moving the `dc.title` column away from the beginning of the row, deleting blank lines in the abstract in vim using `:g/^$/d`, and adding the `dc.subject[en_US]` column back, since she had deleted it and DSpace didn't detect the changes made there (we needed to blank the values instead)
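
As an alternative to cleaning the abstracts by hand in vim, the embedded newlines could be collapsed with a CSV-aware script. This is a minimal sketch rather than the procedure used here: the function name `clean_abstracts` is hypothetical, and the column is assumed to be named exactly `dc.description.abstract` (a real export may carry a language suffix).

```python
import csv
import io
import re


def clean_abstracts(csv_text: str, column: str = "dc.description.abstract") -> str:
    """Collapse newlines and repeated whitespace inside one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row.get(column):
            # Replace any run of whitespace (including newlines) with one space
            row[column] = re.sub(r"\s+", " ", row[column]).strip()
        writer.writerow(row)
    return out.getvalue()
```

Because the `csv` module parses quoted fields, newlines inside an abstract are treated as part of the value rather than as row separators, so the row count is preserved.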