cgspace-notes/content/post/2017-08.md

84 lines
7.5 KiB
Markdown
Raw Normal View History

2017-08-01 10:57:37 +02:00
+++
date = "2017-08-01T11:51:52+03:00"
author = "Alan Orth"
title = "August, 2017"
tags = ["Notes"]
+++
## 2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to `dspace.log.2017-08-01`, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
- /handle/10568/3353/discover
- /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
2017-08-01 11:03:37 +02:00
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
2017-08-01 15:31:58 +02:00
- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using `g/^$/d`
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
2017-08-01 10:57:37 +02:00
<!--more-->
2017-08-01 15:31:58 +02:00
2017-08-02 15:52:58 +02:00
## 2017-08-02
- Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)
- I think Atmire's Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can't figure it out
- I had a look at the moduel configuration and couldn't figure out a way to do this, so I [opened a ticket on the Atmire tracker](https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets)
- Atmire responded about the [missing workflow statistics issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500) a few weeks ago but I didn't see it for some reason
- They said they added a publication and saw the workflow stat for the user, so I should try again and let them know
2017-08-05 16:24:05 +02:00
## 2017-08-05
- Usman from CIFOR emailed to ask about the status of our OAI tests for harvesting their DSpace repository
- I told him that the OAI appears to not be harvesting properly after the first sync, and that the control panel shows an "Internal error" for that collection:
![CIFOR OAI harvesting](/cgspace-notes/2017/08/cifor-oai-harvesting.png)
- I don't see anything related in our logs, so I asked him to check for our server's IP in their logs
- Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn't reset the collection, just the harvester status!)
2017-08-07 15:09:11 +02:00
## 2017-08-07
- Apply Abenet's corrections for the CGIAR Library's Consortium subcommunity (697 records)
- I had to fix a few small things, like moving the `dc.title` column away from the beginning of the row, delete blank spaces in the abstract in vim using `:g/^$/d`, add the `dc.subject[en_US]` column back, as she had deleted it and DSpace didn't detect the changes made there (we needed to blank the values instead)
2017-08-08 11:04:22 +02:00
## 2017-08-08
- Apply Abenet's corrections for the CGIAR Library's historic archive subcommunity (2415 records)
- I had to add the `dc.subject[en_US]` column back with blank values so that DSpace could detect the changes
- I applied the changes in 500 item batches
2017-08-09 16:27:46 +02:00
## 2017-08-09
- Run system updates on DSpace Test and reboot server
- Help ICARDA upgrade their MELSpace to DSpace 5.7 using the [docker-dspace](https://github.com/alanorth/docker-dspace) container
- We had to import the PostgreSQL dump to the PostgreSQL container using: `pg_restore -U postgres -d dspace blah.dump`
- Otherwise, when using `-O` it messes up the permissions on the schema and DSpace can't read it
2017-08-10 14:26:52 +02:00
## 2017-08-10
- Apply last updates to the CGIAR Library's Fund community (812 items)
- Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.
- Also I applied the HTML entities unescape transform on the abstract column in Open Refine
- I need to get an author list from the database for only the CGIAR Library community to send to Peter
2017-08-10 15:59:25 +02:00
- It turns out that I had already used this SQL query in [May, 2017](/cgspace-notes/2017-05) to get the authors from CGIAR Library:
```
dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
```
2017-08-10 14:26:52 +02:00
- Meeting with Peter and CGSpace team
- Alan to follow up with ICARDA about depositing in CGSpace, we want ICARD and Drylands legacy content but not duplicates
- Alan to follow up on dc.rights, where are we?
- Alan to follow up with Atmire about a dedicated field for ORCIDs, based on the discussion in the [June, 2017 DCAT meeting](https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+June+2017)
- Alan to ask about how to query external services like AGROVOC in the DSpace submission form
2017-08-10 15:59:25 +02:00
- Follow up with Atmire on the [ticket about ORCID metadata in DSpace](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510)
- Follow up with Lili and Andrea about the pending CCAFS metadata and flagship updates