2017-11-01
- The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c "CORE" /var/log/nginx/access.log
0
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
Read more →
2017-10-01
Peter emailed to point out that many items in the ILRI archive collection have multiple handles:
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
Read more →
Rough notes for importing the CGIAR Library content. It was decided that this content would go to a new top-level community called CGIAR System Organization.
Read more →
2017-09-06
- Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours
2017-09-07
- Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
Read more →
2017-08-01
- Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
- I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
- The good thing is that, according to
dspace.log.2017-08-01
, they are all using the same Tomcat session
- This means our Tomcat Crawler Session Valve is working
- But many of the bots are browsing dynamic URLs like:
- /handle/10568/3353/discover
- /handle/10568/16510/browse
- The
robots.txt
only blocks the top-level /discover
and /browse
URLs… we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we’re already adding the
X-Robots-Tag "none"
HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header…
- We might actually have to block these requests with HTTP 403 depending on the user agent
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
- This was due to newline characters in the
dc.description.abstract
column, which caused OpenRefine to choke when exporting the CSV
- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using
g/^$/d
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
Read more →
2017-07-01
- Run system updates and reboot DSpace Test
2017-07-04
- Merge changes for WLE Phase II theme rename (#329)
- Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
- We can use PostgreSQL’s extended output format (
-x
) plus sed
to format the output into quasi XML:
Read more →
2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we’ll create a new sub-community for Phase II and create collections for the research themes there The current “Research Themes” community will be renamed to “WLE Phase I Research Themes” Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg.
Read more →
2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it’s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire’s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items are not appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSPace Test) tomorrow: https://cgspace.
Read more →
2017-04-02
- Merge one change to CCAFS flagships that I had forgotten to remove last month (“MANAGING CLIMATE RISK”): https://github.com/ilri/DSpace/pull/317
- Quick proof-of-concept hack to add
dc.rights
to the input form, including some inline instructions/hints:
Read more →
2017-03-01
- Run the 279 CIAT author corrections on CGSpace
2017-03-02
- Skype with Michael and Peter, discussing moving the CGIAR Library to CGSpace
- CGIAR people possibly open to moving content, redirecting library.cgiar.org to CGSpace and letting CGSpace resolve their handles
- They might come in at the top level in one “CGIAR System” community, or with several communities
- I need to spend a bit of time looking at the multiple handle support in DSpace and see if new content can be minted in both handles, or just one?
- Need to send Peter and Michael some notes about this in a few days
- Also, need to consider talking to Atmire about hiring them to bring ORCiD metadata to REST / OAI
- Filed an issue on DSpace issue tracker for the
filter-media
bug that causes it to process JPGs even when limiting to the PDF thumbnail plugin: DS-3516
- Discovered that the ImageMagic
filter-media
plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568⁄51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
Read more →