CGSpace Notes

April, 2018

Sun Apr 01, 2018 by Alan Orth in Notes

2018-04-01

I tried to test something on DSpace Test but noticed that it has been down since god knows when
Catalina logs at least show some memory errors yesterday:

March, 2018

Fri Mar 02, 2018 by Alan Orth in Notes

2018-03-02

Export a CSV of the IITA community metadata for Martin Mueller

February, 2018

Thu Feb 01, 2018 by Alan Orth in Notes

2018-02-01

Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don't need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu's munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

January, 2018

Tue Jan 02, 2018 by Alan Orth in Notes

2018-01-02

Uptime Robot noticed that CGSpace went down and up a few times last night, for a few minutes each time
I didn't get any load alerts from Linode and the REST and XMLUI logs don't show anything out of the ordinary
The nginx logs show HTTP 200s until 02/Jan/2018:11:27:17 +0000 when Uptime Robot got an HTTP 500
In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
And just before that I see this:

Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].

Ah hah! So the pool was actually empty!
I need to increase that, so let's try bumping the pool size from 50 to 75
After that one client got an HTTP 499 but then the rest were HTTP 200, so I don't know what the hell Uptime Robot saw
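
To spot this condition in future logs, the bracketed counters in the PoolExhaustedException message can be pulled out with a regular expression. This is a minimal sketch that assumes the message keeps the `[size:…; busy:…; idle:…; lastwait:…]` layout seen above; the function names are made up for illustration.

```python
import re

# Matches the bracketed status block at the end of the Tomcat JDBC pool message.
# Assumption: the layout is the same as in the log line quoted in these notes.
POOL_STATUS = re.compile(r"\[size:(\d+); busy:(\d+); idle:(\d+); lastwait:(\d+)\]")


def parse_pool_status(line):
    """Return the pool counters as a dict, or None if the line has no status block."""
    match = POOL_STATUS.search(line)
    if match is None:
        return None
    size, busy, idle, lastwait = (int(group) for group in match.groups())
    return {"size": size, "busy": busy, "idle": idle, "lastwait": lastwait}


def is_exhausted(status):
    """Treat the pool as exhausted when every connection is busy and none are idle."""
    return status["busy"] >= status["size"] and status["idle"] == 0


line = (
    "Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: "
    "[http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a "
    "connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000]."
)
status = parse_pool_status(line)
print(status, is_exhausted(status))
```

The thread name in the first pair of brackets is ignored because it does not match the `size:` prefix.
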
I notice this error quite a few times in dspace.log:

2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
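
The rejected query has `+` where Solr's range syntax expects spaces (`[1976 TO 1979]`), which suggests the value reached Solr still form-encoded. That is my reading of the log line, not something confirmed here. A small sketch of what decoding would produce, using only the Python standard library:

```python
from urllib.parse import unquote_plus

# The query string exactly as it appears in the SyntaxError above.
raw = "dateIssued_keyword:[1976+TO+1979]"

# unquote_plus turns "+" into a space and decodes any %XX escapes.
decoded = unquote_plus(raw)
print(decoded)
```

If the decoded form parses cleanly in Solr, the fix probably belongs wherever the facet link is built or decoded, but I have not verified where that is.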

And there are many of these errors every day for the past month:

$ grep -c "Error while searching for sidebar facets" dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
dspace.log.2017-11-24:11
dspace.log.2017-11-25:0
dspace.log.2017-11-26:1
dspace.log.2017-11-27:7
dspace.log.2017-11-28:21
dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2017-12-02:20
dspace.log.2017-12-03:38
dspace.log.2017-12-04:65
dspace.log.2017-12-05:43
dspace.log.2017-12-06:72
dspace.log.2017-12-07:27
dspace.log.2017-12-08:15
dspace.log.2017-12-09:29
dspace.log.2017-12-10:35
dspace.log.2017-12-11:20
dspace.log.2017-12-12:44
dspace.log.2017-12-13:36
dspace.log.2017-12-14:59
dspace.log.2017-12-15:104
dspace.log.2017-12-16:53
dspace.log.2017-12-17:66
dspace.log.2017-12-18:83
dspace.log.2017-12-19:101
dspace.log.2017-12-20:74
dspace.log.2017-12-21:55
dspace.log.2017-12-22:66
dspace.log.2017-12-23:50
dspace.log.2017-12-24:85
dspace.log.2017-12-25:62
dspace.log.2017-12-26:49
dspace.log.2017-12-27:30
dspace.log.2017-12-28:54
dspace.log.2017-12-29:68
dspace.log.2017-12-30:89
dspace.log.2017-12-31:53
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34
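
The per-file counts are easier to compare when rolled up by month. A rough sketch that parses `grep -c` output in the `dspace.log.YYYY-MM-DD:count` form shown above; the function name is hypothetical and the sample is just a few lines copied from the listing.

```python
from collections import defaultdict


def monthly_totals(grep_output):
    """Sum `grep -c` counts per YYYY-MM, given lines like dspace.log.2017-12-01:15."""
    prefix = "dspace.log."
    totals = defaultdict(int)
    for line in grep_output.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip the command line itself and any blank lines
        name, _, count = line.rpartition(":")
        month = name[len(prefix):][:7]  # "2017-12-01" -> "2017-12"
        totals[month] += int(count)
    return dict(totals)


# A handful of lines taken from the output above, not the full listing.
sample = """dspace.log.2017-11-29:31
dspace.log.2017-11-30:15
dspace.log.2017-12-01:15
dspace.log.2018-01-01:45
dspace.log.2018-01-02:34"""

print(monthly_totals(sample))
```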

Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let's Encrypt if it's just a handful of domains

December, 2017

Fri Dec 01, 2017 by Alan Orth in Notes

2017-12-01

Uptime Robot noticed that CGSpace went down
The logs say “Timeout waiting for idle object”
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:

November, 2017

Thu Nov 02, 2017 by Alan Orth in Notes

2017-11-01

The CORE developers responded to say they are looking into their bot not respecting our robots.txt

2017-11-02

Today there have been no hits by CORE and no alerts from Linode (coincidence?)

# grep -c "CORE" /var/log/nginx/access.log
0

Generate list of authors on CGSpace for Peter to go through and correct:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701

October, 2017

Sun Oct 01, 2017 by Alan Orth in Notes

2017-10-01

Peter emailed to point out that many items in the ILRI archive collection have multiple handles:

http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336

There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
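
For the multiple-handle items noted above, a first pass could simply flag the rows whose handle field contains more than one value. This is only a sketch: it assumes a CSV export where multiple values are joined with `||` as in the example, and the column name `dc.identifier.uri` and the `id` values are assumptions for illustration.

```python
import csv
import io


def split_handles(value, separator="||"):
    """Split a multi-valued field into its individual handles."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def rows_with_multiple_handles(csv_file, column="dc.identifier.uri"):
    """Yield (id, handles) for every row with more than one handle in `column`."""
    for row in csv.DictReader(csv_file):
        handles = split_handles(row.get(column, ""))
        if len(handles) > 1:
            yield row["id"], handles


# Hypothetical two-row export; only the first handle pair comes from the notes.
sample = io.StringIO(
    "id,dc.identifier.uri\n"
    "1,http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336\n"
    "2,http://hdl.handle.net/10568/3353\n"
)
flagged = list(rows_with_multiple_handles(sample))
print(flagged)
```

Deciding which handle to keep still needs a closer look at the pattern, so the sketch stops at listing the candidates.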

CGIAR Library Migration

Mon Sep 18, 2017 by Alan Orth in Notes Migration

Rough notes for importing the CGIAR Library content. It was decided that this content would go to a new top-level community called CGIAR System Organization.

September, 2017

Thu Sep 07, 2017 by Alan Orth in Notes

2017-09-06

Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two hours

2017-09-07

Ask Sisay to clean up the WLE approvers a bit, as Marianne's user account is in both the approvers step and the group

August, 2017

Tue Aug 01, 2017 by Alan Orth in Notes

2017-08-01

Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
- /handle/10568/3353/discover
- /handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we're already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
We might actually have to block these requests with HTTP 403 depending on the user agent
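
The blocking rule could look something like the sketch below: refuse a request only when the path is one of the dynamic per-community or per-collection URLs and the user agent looks like a crawler. The user agent substrings are my assumption about how these bots identify themselves, and in practice the rule would more likely live in the web server configuration than in Python.

```python
import re

# Dynamic URLs like /handle/10568/3353/discover and /handle/10568/16510/browse.
DYNAMIC_PATH = re.compile(r"^/handle/\d+/\d+/(discover|browse)(/|\?|$)")

# Assumption: substrings for the crawlers named above; real user agent strings vary.
BOT_AGENT = re.compile(r"googlebot|baiduspider|bingbot|slurp", re.IGNORECASE)


def should_block(path, user_agent):
    """Return True when a crawler requests a dynamic discover or browse URL."""
    return bool(DYNAMIC_PATH.search(path) and BOT_AGENT.search(user_agent))


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(should_block("/handle/10568/3353/discover", googlebot))
print(should_block("/handle/10568/3353", googlebot))
```

Item pages such as `/handle/10568/3353` stay reachable for crawlers, and ordinary browsers are never blocked, so only the expensive dynamic pages would get the HTTP 403.
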
Abenet pointed out that the CGIAR Library Historical Archive collection I sent on July 20th only had ~100 entries, instead of 2415
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
I exported a new CSV from the collection on DSpace Test and then manually removed the blank lines they left behind in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
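
An alternative to the manual cleanup would be to flatten the newlines inside each field before loading the file into OpenRefine. A minimal sketch using Python's csv module, which reads quoted fields containing newlines as a single value; the sample row is made up, and note that this collapses every run of whitespace in a field to one space, not just newlines.

```python
import csv
import io


def flatten_newlines(src, dst):
    """Copy CSV rows from src to dst, collapsing whitespace runs inside each field."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # str.split() with no arguments splits on any whitespace, including newlines.
        writer.writerow([" ".join(field.split()) for field in row])


# Hypothetical row with a blank line embedded in the abstract.
sample = 'id,dc.description.abstract\n1,"First line\n\nsecond line"\n'
out = io.StringIO()
flatten_newlines(io.StringIO(sample), out)
print(out.getvalue())
```

With real files, open them with `newline=""` as the csv module documentation recommends, so embedded line endings are passed through to the reader untouched.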