- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
- `/handle/10568/3353/discover`
- `/handle/10568/16510/browse`
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
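To illustrate why the handle-level URLs slip through, here is a minimal sketch using Python's standard `urllib.robotparser`, which matches `Disallow` rules as path prefixes. The rules below are an assumption modelled on the description above, not a copy of the real `robots.txt`, and `is_allowed` is a hypothetical helper name:

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, modelled on the notes above (not the real CGSpace robots.txt)
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def is_allowed(path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    for path in (
        "/discover",
        "/browse",
        "/handle/10568/3353/discover",
        "/handle/10568/16510/browse",
    ):
        # Prefix rules only catch paths that *start* with /discover or /browse
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under these assumed rules the two top-level paths are blocked while both `/handle/...` paths are still allowed, which matches what the bots are doing.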
- Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)
- I think Atmire's Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can't figure it out
- I had a look at the module configuration and couldn't figure out a way to do this, so I [opened a ticket on the Atmire tracker](https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets)
- Atmire responded about the [missing workflow statistics issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500) a few weeks ago but I didn't see it for some reason
- They said they added a publication and saw the workflow stat for the user, so I should try again and let them know
- Usman from CIFOR emailed to ask about the status of our OAI tests for harvesting their DSpace repository
- I told him that the OAI appears to not be harvesting properly after the first sync, and that the control panel shows an "Internal error" for that collection
- I don't see anything related in our logs, so I asked him to check for our server's IP in their logs
- Also, in the meantime I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn't reset the collection, just the harvester status!)
- Apply Abenet's corrections for the CGIAR Library's Consortium subcommunity (697 records)
- I had to fix a few small things: moving the `dc.title` column away from the beginning of the row, deleting blank lines in the abstract in vim using `:g/^$/d`, and adding the `dc.subject[en_US]` column back, as she had deleted it and DSpace didn't detect the changes made there (we needed to blank the values instead)
- Apply last updates to the CGIAR Library's Fund community (812 items)
- Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.
- Also I applied the HTML entities unescape transform on the abstract column in Open Refine
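For reference, the same kind of unescaping can be sketched outside Open Refine with Python's standard `html` module. The function name and the sample abstract are made up for illustration, and this is not the transform Open Refine itself runs:

```python
import html


def unescape_abstract(value: str) -> str:
    """Convert HTML entities such as &amp; or &gt; back to plain characters."""
    return html.unescape(value)


if __name__ == "__main__":
    # Hypothetical abstract text, not taken from the real records
    print(unescape_abstract("Maize &amp; beans: yield &gt; 2 t/ha"))
```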
- I need to get an author list from the database for only the CGIAR Library community to send to Peter
- Alan to follow up with ICARDA about depositing in CGSpace; we want ICARDA and Drylands legacy content but not duplicates
- Alan to follow up on dc.rights, where are we?
- Alan to follow up with Atmire about a dedicated field for ORCIDs, based on the discussion in the [June, 2017 DCAT meeting](https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+June+2017)
- Alan to ask about how to query external services like AGROVOC in the DSpace submission form
- CGSpace had load issues and was throwing errors related to PostgreSQL
- I told Tsega to reduce the max connections from 70 to 40, because each web application gets that limit separately: across xmlui, oai, jspui, rest, etc. that could mean 70 x 4 = 280 connections depending on the load, while the PostgreSQL config itself only allows 100!
- I learned this from a recent discussion on the DSpace wiki
- I need to either look into setting up a database pool through JNDI or increase the PostgreSQL max connections
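The arithmetic is worth spelling out. This is a rough sketch that assumes exactly the four web applications named above each open a full pool in the worst case; `worst_case_connections` is a hypothetical helper, not anything from DSpace:

```python
POSTGRES_MAX_CONNECTIONS = 100  # PostgreSQL limit mentioned above
WEBAPPS = ("xmlui", "oai", "jspui", "rest")  # assumed set of deployed webapps


def worst_case_connections(pool_size: int, webapps: int) -> int:
    """Each webapp has its own pool, so the worst case is pool_size per webapp."""
    return pool_size * webapps


if __name__ == "__main__":
    for pool_size in (70, 40):
        total = worst_case_connections(pool_size, len(WEBAPPS))
        status = "over" if total > POSTGRES_MAX_CONNECTIONS else "within"
        print(f"pool={pool_size}: up to {total} connections, "
              f"{status} the PostgreSQL limit of {POSTGRES_MAX_CONNECTIONS}")
```

Even at 40 per application the worst case is 160, still above PostgreSQL's 100, which is why a shared JNDI pool or a higher PostgreSQL limit is still needed.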