CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2019

2019-03-01

  • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
  • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
  • Looking at the other half of Udana’s WLE records from 2018-11
    • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
    • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
    • Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
    • 68.15% � 9.45 instead of 68.15% ± 9.45
    • 2003�2013 instead of 2003–2013
  • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs

2019-03-03

  • Trying to finally upload IITA’s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:
$ mkdir 2019-03-03-IITA-Feb14
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
  • As I was inspecting the archive I noticed that there were some problems with the bitsreams:
    • First, Sisay didn’t include the bitstream descriptions
    • Second, only five items had bitstreams and I remember in the discussion with IITA that there should have been nine!
    • I had to refer to the original CSV from January to find the file names, then download and add them to the export contents manually!
  • After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:
$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
  • DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something
  • After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the dspace cleanup script
  • Merge the IITA research theme changes from last month to the 5_x-prod branch (#413)
    • I will deploy to CGSpace soon and then think about how to batch tag all IITA’s existing items with this metadata
  • Deploy Tomcat 7.0.93 on CGSpace (linode18) after having tested it on DSpace Test (linode19) for a week

2019-03-06

  • Abenet was having problems with a CIP user account, I think that the user could not register
  • I suspect it’s related to the email issue that ICT hasn’t responded about since last week
  • As I thought, I still cannot send emails from CGSpace:
$ dspace test-email

About to send test email:
 - To: blah@stfu.com
 - Subject: DSpace test email
 - Server: smtp.office365.com

Error sending email:
 - Error: javax.mail.AuthenticationFailedException
  • I will send a follow-up to ICT to ask them to reset the password

2019-03-07

  • ICT reset the email password and I confirmed that it is working now
  • Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using csvcut and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:
$ csvcut -c name 2019-02-22-subjects.csv > dspace/config/controlled-vocabularies/dc-contributor-author.xml
$ # apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
  • I tested the AGROVOC controlled vocabulary locally and will deploy it on DSpace Test soon so people can see it
  • Atmire noticed my message about the “solr_update_time_stamp” error on the dspace-tech mailing list and created an issue on their tracker to discuss it with me
    • They say the error is harmless, but has nevertheless been fixed in their newer module versions

2019-03-08

  • There’s an issue with CGSpace right now where all items are giving a blank page in the XMLUI
    • Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID! On second thought I realize I probably was just seeing the default view without any “expands”
    • I don’t see anything unusual in the Tomcat logs, though there are thousands of those solr_update_time_stamp errors:
# journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
1076
  • I restarted Tomcat and it’s OK now…
  • Skype meeting with Peter and Abenet and Sisay
    • We want to try to crowd source the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500
    • We will share a Google Docs spreadsheet with the partners and ask them to mark the deletions and corrections
    • Abenet and Alan to spend some time identifying correct DCTERMS fields to move to, with preference over CG Core 2.0 as we want to be globally compliant (use information from SEO crosswalks)
    • I need to follow up on the privacy page that Sisay worked on
    • We want to try to migrate the 600 Livestock CRP blog posts to CGSpace, Peter will try to export the XML from WordPress so I can try to parse it with a script

2019-03-09

  • I shared a post on Yammer informing our editors to try to AGROVOC controlled list
  • The SPDX legal committee had a meeting and discussed the addition of CC-BY-ND-3.0-IGO and other IGO licenses to their list, but it seems unlikely (spdx/license-list-XML/issues/767)
  • The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms
  • I see 46 occurrences of these with this query:
dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
  • I can replace these globally using the following SQL:
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
UPDATE 44
  • I ran the corrections on CGSpace and DSpace Test

2019-03-10

  • Working on tagging IITA’s items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
  • I exported the entire IITA community from CGSpace and then used csvcut to extract only the needed fields:
$ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
  • After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a ||)

  • I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:

if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
  • Then it’s more annoying because there are four IITA subject columns…
  • In total this would add research themes to 1,755 items
  • I want to double check one last time with Bosede that they would like to do this, because I also see that this will tag a few hundred items from the 1970s and 1980s

2019-03-11

  • Bosede said that she would like the IITA research theme tagging only for items since 2015, which would be 256 items

2019-03-12

  • I imported the changes to 256 of IITA’s records on CGSpace

2019-03-14

  • CGSpace had the same issue with blank items like earlier this month and I restarted Tomcat to fix it
  • Create a pull request to change Swaziland to Eswatini and Macedonia to North Macedonia (#414)
    • I see thirty-six items using Swaziland country metadata, and Peter says we should change only those from 2018 and 2019
    • I think that I could get the resource IDs from SQL and then export them using dspace metadata-export
  • This is a bit ugly, but it works (using the DSpace 5 SQL helper function to resolve ID to handle):
for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do

    echo "Getting handle for id: ${id}"

    handle=$(psql -U postgres -d dspacetest -h localhost -c "SELECT ds5_item2itemhandle($id)" | grep -oE '[0-9]{5}/[0-9]+')

    ~/dspace/bin/dspace metadata-export -f /tmp/${id}.csv -i $handle

done
  • Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
$ grep -oE '201[89]' /tmp/*.csv | sort -u
/tmp/94834.csv:2018
/tmp/95615.csv:2018
/tmp/96747.csv:2018
  • And looking at those items more closely, only one of them has an issue date of after 2018-04, so I will only update that one (as the countrie’s name only changed in 2018-04)
  • Run all system updates and reboot linode20
  • Follow up with Felix from Earlham to see if he’s done testing DSpace Test with COPO so I can re-sync the server from CGSpace

2019-03-15

  • CGSpace (linode18) has the blank page error again
  • I’m not sure if it’s related, but I see the following error in DSpace’s log:
2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
        at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
        at org.dspace.authorize.AuthorizeManager.getPolicies(AuthorizeManager.java:612)
        at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:154)
        at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300)
  • Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently
$ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
      5 dspace.log.2019-02-27
     11 dspace.log.2019-02-28
     29 dspace.log.2019-03-01
     24 dspace.log.2019-03-02
     41 dspace.log.2019-03-03
     11 dspace.log.2019-03-04
      9 dspace.log.2019-03-05
     15 dspace.log.2019-03-06
      7 dspace.log.2019-03-07
      9 dspace.log.2019-03-08
     22 dspace.log.2019-03-09
     23 dspace.log.2019-03-10
     18 dspace.log.2019-03-11
     13 dspace.log.2019-03-12
     10 dspace.log.2019-03-13
     25 dspace.log.2019-03-14
     12 dspace.log.2019-03-15
     67 dspace.log.2019-03-16
     72 dspace.log.2019-03-17
      8 dspace.log.2019-03-18
     15 dspace.log.2019-03-19
     21 dspace.log.2019-03-20
     29 dspace.log.2019-03-21
     41 dspace.log.2019-03-22
   4807 dspace.log.2019-03-23
  • (Update on 2019-03-23 to use correct grep query)
  • There are not too many connections currently in PostgreSQL:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      6 dspaceApi
     10 dspaceCli
     15 dspaceWeb
  • I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
        at java.util.Stack.peek(Stack.java:102)
        at java.util.Stack.pop(Stack.java:84)
        at org.apache.cocoon.callstack.CallStack.leave(CallStack.java:54)
        at org.apache.cocoon.servletservice.CallStackHelper.leaveServlet(CallStackHelper.java:85)
        at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:484)
        at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
        at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
        at com.sun.proxy.$Proxy90.service(Unknown Source)
        at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
        at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
        at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:624)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.dspace.rdf.negotiation.NegotiationFilter.doFilter(NegotiationFilter.java:59)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:494)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
        at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
  • For now I will just restart Tomcat…

2019-03-17

  • Last week Felix from Earlham said that they finished testing on DSpace Test (linode19) so I made backups of some things there and re-deployed the system on Ubuntu 18.04
    • During re-deployment I hit a few issues with the Ansible playbooks and made some minor improvements
    • There seems to be an issue with nodejs’s dependencies now, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)
    • I re-worked the playbooks to use Node.js from the upstream official repository for now
  • Create and merge pull request for the AGROVOC controlled list (#415)
    • Run all system updates on CGSpace (linode18) and re-deploy the 5_x-prod branch and reboot the server
  • Re-sync DSpace Test with a fresh database snapshot and assetstore from CGSpace
    • After restarting Tomcat, Solr was giving the “Error opening new searcher” error for all cores
    • I stopped Tomcat, added ulimit -v unlimited to the catalina.sh script and deleted all old locks in the DSpace solr directory and then DSpace started up normally
    • I’m still not exactly sure why I see this error and if the ulimit trick actually helps, as the tomcat7.service has LimitAS=infinity anyways (and from checking the PID’s limits file in /proc it seems to be applied)
    • Then I noticed that the item displays were blank… so I checked the database info and saw there were some unfinished migrations
    • I’m not entirely sure if it’s related, but I tried to delete the old migrations and then force running the ignored ones like when we upgraded to DSpace 5.8 in 2018-06 and then after restarting Tomcat I could see the item displays again
  • I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api
  • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
$ dspace cleanup -v
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(164496) is still referenced from table "bundle".
  • The solution is, as always:
# su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
UPDATE 1

2019-03-18

  • I noticed that the regular expression for validating lines from input files in my agrovoc-lookup.py script was skipping characters with accents, etc, so I changed it to use the \w character class for words instead of trying to match [A-Z] etc…
    • We have a Spanish and French subjects so this is very important
    • Also there were some subjects with apostrophes, dashes, and periods… these are probably invalid AGROVOC subject terms, but we should save them to the rejects file instead of skipping them nevertheless
  • Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my agrovoc-lookup.py script:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
dspace=# \q
$ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv > 2019-03-18-top-1500-subject.csv
$ ./agrovoc-lookup.py -l en -i 2019-03-18-top-1500-subject.csv -om /tmp/en-subjects-matched.txt -or /tmp/en-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l es -i 2019-03-18-top-1500-subject.csv -om /tmp/es-subjects-matched.txt -or /tmp/es-subjects-unmatched.txt
$ ./agrovoc-lookup.py -l fr -i 2019-03-18-top-1500-subject.csv -om /tmp/fr-subjects-matched.txt -or /tmp/fr-subjects-unmatched.txt
$ cat /tmp/*-subjects-matched.txt | sort -u > /tmp/subjects-matched-sorted.txt
$ wc -l /tmp/subjects-matched-sorted.txt                                                              
1318 /tmp/subjects-matched-sorted.txt
$ sort -u 2019-03-18-top-1500-subject.csv > /tmp/1500-subjects-sorted.txt
$ comm -13 /tmp/subjects-matched-sorted.txt /tmp/1500-subjects-sorted.txt > 2019-03-18-subjects-unmatched.txt
$ wc -l 2019-03-18-subjects-unmatched.txt
182 2019-03-18-subjects-unmatched.txt
  • So the new total of matched terms with the updated regex is 1317 and unmatched is 183 (previous number of matched terms was 1187)
  • Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (#416)
  • We are getting the blank page issue on CGSpace again today and I see a large number of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):
$ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
dspace.log.2019-03-15:929
dspace.log.2019-03-16:67
dspace.log.2019-03-17:72
dspace.log.2019-03-18:1038
  • Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the “binary file matches” result with -I:
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
8
$ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
      9 dspace.log.2019-03-08
     25 dspace.log.2019-03-14
     12 dspace.log.2019-03-15
     67 dspace.log.2019-03-16
     72 dspace.log.2019-03-17
      8 dspace.log.2019-03-18
  • It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match
  • Anyways, the full error in DSpace’s log is:
2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
        at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
        at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
        at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
  • There is a low number of connections to PostgreSQL currently:
$ psql -c 'select * from pg_stat_activity' | wc -l
33
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      6 dspaceApi
      7 dspaceCli
     15 dspaceWeb
  • I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:
2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
  • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
  • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
  • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
$ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
      2 2019-03-18 00:
      6 2019-03-18 02:
      3 2019-03-18 04:
      1 2019-03-18 05:
      1 2019-03-18 07:
      2 2019-03-18 08:
      4 2019-03-18 09:
      5 2019-03-18 10:
    863 2019-03-18 11:
    203 2019-03-18 12:
     14 2019-03-18 13:
      1 2019-03-18 14:
  • And a few days ago on 2019-03-15 when I happened last it was in the afternoon when it happened and the same pattern occurs then around 1–2PM:
$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
      4 2019-03-15 01:
      3 2019-03-15 02:
      1 2019-03-15 03:
     13 2019-03-15 04:
      1 2019-03-15 05:
      2 2019-03-15 06:
      3 2019-03-15 07:
     27 2019-03-15 09:
      9 2019-03-15 10:
      3 2019-03-15 11:
      2 2019-03-15 12:
    531 2019-03-15 13:
    274 2019-03-15 14:
      4 2019-03-15 15:
     75 2019-03-15 16:
      5 2019-03-15 17:
      5 2019-03-15 18:
      6 2019-03-15 19:
      2 2019-03-15 20:
      4 2019-03-15 21:
      3 2019-03-15 22:
      1 2019-03-15 23:
  • And again on 2019-03-08, surprise surprise, it happened in the morning:
$ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
     11 2019-03-08 01:
      3 2019-03-08 02:
      1 2019-03-08 03:
      2 2019-03-08 04:
      1 2019-03-08 05:
      1 2019-03-08 06:
      1 2019-03-08 08:
    425 2019-03-08 09:
    432 2019-03-08 10:
    717 2019-03-08 11:
     59 2019-03-08 12:
  • I’m not sure if it’s cocoon or that’s just a symptom of something else

2019-03-19

  • I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for a pretty confusing debugging…
  • I will replace these in the database immediately to save myself the headache later:
dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
 count 
-------
    84
(1 row)
  • Perhaps my agrovoc-lookup.py script could notify if it finds these because they potentially give false negatives
  • CGSpace (linode18) is having problems with Solr again, I’m seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years
  • Apparently the Solr statistics shards didn’t load properly when we restarted Tomcat yesterday:
2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
...
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
        ... 31 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
  • For reference, I don’t see the ulimit -v unlimited in the catalina.sh script, though the tomcat7 systemd service has LimitAS=infinity
  • The limits of the current Tomcat java process are:
# cat /proc/27182/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             128589               128589               processes 
Max open files            16384                16384                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       128589               128589               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us
  • I will try to add ulimit -v unlimited to the Catalina startup script and check the output of the limits to see if it’s different in practice, as some wisdom on Stack Overflow says this solves the Solr core issues and I’ve superstitiously tried it various times in the past
    • The result is the same before and after, so adding the ulimit directly is unneccessary (whether or not unlimited address space is useful or not is another question)
  • For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:
# systemctl stop tomcat7
# find /home/cgspace.cgiar.org/solr/ -iname "*.lock" -delete
# systemctl start tomcat7
  • After restarting I confirmed that all Solr statistics cores were loaded successfully…
  • Another avenue might be to look at point releases in Solr 4.10.x, as we’re running 4.10.2 and they released 4.10.3 and 4.10.4 back in 2014 or 2015
    • I see several issues regarding locks and IndexWriter that were fixed in Solr and Lucene 4.10.3 and 4.10.4…
  • I sent a mail to the dspace-tech mailing list to ask about Solr issues
  • Testing Solr 4.10.4 on DSpace 5.8:
    • Discovery indexing
    • dspace-statistics-api indexer
    • /solr admin UI

2019-03-20

  • Create a branch for Solr 4.10.4 changes so I can test on DSpace Test (linode19)
    • Deployed Solr 4.10.4 on DSpace Test and will leave it there for a few weeks, as well as on my local environment

2019-03-21

  • It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
$ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
      3 2019-03-20 00:
     12 2019-03-20 02:
$ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
      4 2019-03-21 00:
      1 2019-03-21 02:
      4 2019-03-21 03:
      1 2019-03-21 05:
      4 2019-03-21 06:
     11 2019-03-21 07:
     14 2019-03-21 08:
      3 2019-03-21 09:
      4 2019-03-21 10:
      5 2019-03-21 11:
      4 2019-03-21 12:
      3 2019-03-21 13:
      6 2019-03-21 14:
      2 2019-03-21 15:
      3 2019-03-21 16:
      3 2019-03-21 18:
      1 2019-03-21 19:
      6 2019-03-21 20:
  • To investigate the Solr lock issue I added a find command to the Tomcat 7 service with ExecStartPre and ExecStopPost and noticed that the lock files are always there…
    • Perhaps the lock files are less of an issue than I thought?
    • I will share my thoughts with the dspace-tech community
  • In other news, I notice that that systemd always thinks that Tomcat has failed when it stops because the JVM exits with code 143, which is apparently normal when processes gracefully receive a SIGTERM (128 + 15 == 143)
    • We can add SuccessExitStatus=143 to the systemd service so that it knows this is a successful exit

2019-03-22

  • Share the initial list of invalid AGROVOC terms on Yammer to ask the editors for help in correcting them
  • Advise Phanuel Ayuka from IITA about using controlled vocabularies in DSpace

2019-03-23

  • CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:
$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
      2 2019-03-22 00:
     69 2019-03-22 01:
      1 2019-03-22 02:
     13 2019-03-22 03:
      2 2019-03-22 05:
      2 2019-03-22 06:
      8 2019-03-22 07:
      4 2019-03-22 08:
     12 2019-03-22 09:
      7 2019-03-22 10:
      1 2019-03-22 11:
      2 2019-03-22 12:
     14 2019-03-22 13:
      4 2019-03-22 15:
      7 2019-03-22 16:
      7 2019-03-22 17:
      3 2019-03-22 18:
      3 2019-03-22 19:
      7 2019-03-22 20:
    323 2019-03-22 21:
    685 2019-03-22 22:
    357 2019-03-22 23:
$ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23 [0-9]{2}:' | sort | uniq -c
    575 2019-03-23 00:
    445 2019-03-23 01:
    518 2019-03-23 02:
    436 2019-03-23 03:
    387 2019-03-23 04:
    593 2019-03-23 05:
    468 2019-03-23 06:
    541 2019-03-23 07:
    440 2019-03-23 08:
    260 2019-03-23 09:
  • I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
  • Trying to drill down more, I see that the bulk of the errors started aroundi 21:20:
$ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
      1 2019-03-22 21:0
      1 2019-03-22 21:1
     59 2019-03-22 21:2
     69 2019-03-22 21:3
     89 2019-03-22 21:4
    104 2019-03-22 21:5
  • Looking at the Cocoon log around that time I see the full error is:
2019-03-22 21:21:34,378 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
  • A few milliseconds before that time I see this in the DSpace log:
2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
org.postgresql.util.PSQLException: This statement has been closed.
        at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
        at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
        at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
        at org.postgresql.jdbc.PgStatement$StatementResultHandler.handleResultRows(PgStatement.java:204)
        at org.postgresql.core.ResultHandlerDelegate.handleResultRows(ResultHandlerDelegate.java:29)
        at org.postgresql.core.v3.QueryExecutorImpl$1.handleResultRows(QueryExecutorImpl.java:528)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2120)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:143)
        at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:106)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
        at org.apache.tomcat.dbcp.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
        at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:224)
        at org.dspace.storage.rdbms.DatabaseManager.querySingleTable(DatabaseManager.java:375)
        at org.dspace.storage.rdbms.DatabaseManager.findByUnique(DatabaseManager.java:544)
        at org.dspace.storage.rdbms.DatabaseManager.find(DatabaseManager.java:501)
        at org.dspace.eperson.Group.find(Group.java:706)
...
2019-03-22 21:21:34,381 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
org.postgresql.util.PSQLException: This statement has been closed.
        at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
        at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
        at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
...
2019-03-22 21:21:34,386 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
org.postgresql.util.PSQLException: This statement has been closed.
        at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
        at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
        at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
...
2019-03-22 21:21:34,395 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
org.postgresql.util.PSQLException: This statement has been closed.
        at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
        at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
        at org.postgresql.jdbc.PgStatement.createResultSet(PgStatement.java:153)
        at org.postgresql.jdbc.PgStatement$StatementResultHandler.handleResultRows(PgStatement.java:204)
  • I restarted Tomcat and now the item displays are working again for now

  • I am wondering if this is an issue with removing abandoned connections in Tomcat’s JDBC pooling?

    • It’s hard to tell because we have logAbanded enabled, but I don’t see anything in the tomcat7 service logs in the systemd journal
  • I sent another mail to the dspace-tech mailing list with my observations

  • I spent some time trying to test and debug the Tomcat connection pool’s settings, but for some reason our logs are either messed up or no connections are actually getting abandoned

  • I compiled this TomcatJdbcConnectionTest and created a bunch of database connections and waited a few minutes but they never got abandoned until I created over maxActive (75), after which almost all were purged at once

    • So perhaps our settings are not working right, but at least I know the logging works now…

2019-03-24

  • I did some more tests with the TomcatJdbcConnectionTest thing and while monitoring the number of active connections in jconsole and after adjusting the limits quite low I eventually saw some connections get abandoned
  • I forgot that to connect to a remote JMX session with jconsole you need to use a dynamic SSH SOCKS proxy (as I originally discovered in 2017-11:
$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
  • I need to remember to check the active connections next time we have issues with blank item pages on CGSpace
  • In other news, I’ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
  • I deployed the latest 5_x-prod branch on CGSpace (linode18) and added more validation to the JDBC pool in our Tomcat config
    • This includes the new testWhileIdle and testOnConnect pool settings as well as the two new JDBC interceptors: StatementFinalizer and ConnectionState that should hopefully make sure our connections in the pool are valid
  • I spent one hour looking at the invalid AGROVOC terms from last week
    • It doesn’t seem like any of the editors did any work on this so I did most of them

2019-03-25

  • Finish looking over the 175 invalid AGROVOC terms
    • I need to apply the corrections and deletions this week
  • Looking at the DBCP status on CGSpace via jconsole and everything looks good, though I wonder why timeBetweenEvictionRunsMillis is -1, because the Tomcat 7.0 JDBC docs say the default is 5000…
    • Could be an error in the docs, as I see the Apache Commons DBCP has -1 as the default
    • Maybe I need to re-evaluate the “defauts” of Tomcat 7’s DBCP and set them explicitly in our config
    • From Tomcat 8 they seem to default to Apache Commons’ DBCP 2.x
  • Also, CGSpace doesn’t have many Cocoon errors yet this morning:
$ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
      4 2019-03-25 00:
      1 2019-03-25 01:
  • Holy shit I just realized we’ve been using the wrong DBCP pool in Tomcat
    • By default you get the Commons DBCP one unless you specify factory org.apache.tomcat.jdbc.pool.DataSourceFactory
    • Now I see all my interceptor settings etc in jconsole, where I didn’t see them before (also a new tomcat.jdbc mbean)!
    • No wonder our settings didn’t quite match the ones in the Tomcat DBCP Pool docs
  • Uptime Robot reported that CGSpace went down and I see the load is very high
  • The top IPs around the time in the nginx API and web logs were:
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      9 190.252.43.162
     12 157.55.39.140
     18 157.55.39.54
     21 66.249.66.211
     27 40.77.167.185
     29 138.220.87.165
     30 157.55.39.168
     36 157.55.39.9
     50 52.23.239.229
   2380 45.5.186.2
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    354 18.195.78.144
    363 190.216.179.100
    386 40.77.167.185
    484 157.55.39.168
    507 157.55.39.9
    536 2a01:4f8:140:3192::2
   1123 66.249.66.211
   1186 93.179.69.74
   1222 35.174.184.209
   1720 2a01:4f8:13b:1296::2
  • The IPs look pretty normal except we’ve never seen 93.179.69.74 before, and it uses the following user agent:
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
  • Surprisingly they are re-using their Tomcat session:
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
1
  • That’s weird because the total number of sessions today seems low compared to recent days:
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
5657
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
17710
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-23 | sort -u | wc -l
17179
$ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
7904
  • PostgreSQL seems to be pretty busy:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
     11 dspaceApi
     10 dspaceCli
     67 dspaceWeb
  • I restarted Tomcat and deployed the new Tomcat JDBC settings on CGSpace since I had to restart the server anyways
    • I need to watch this carefully though because I’ve read some places that Tomcat’s DBCP doesn’t track statements and might create memory leaks if an application doesn’t close statements before a connection gets returned back to the pool
  • According the Uptime Robot the server was up and down a few more times over the next hour so I restarted Tomcat again

2019-03-26

  • UptimeRobot says CGSpace went down again and I see the load is again at 14.0!
  • Here are the top IPs in nginx logs in the last hour:
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
      3 35.174.184.209
      3 66.249.66.81
      4 104.198.9.108
      4 154.77.98.122
      4 2.50.152.13
     10 196.188.12.245
     14 66.249.66.80
    414 45.5.184.72
    535 45.5.186.2
   2014 205.186.128.185
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    157 41.204.190.40
    160 18.194.46.84
    160 54.70.40.11
    168 31.6.77.23
    188 66.249.66.81
    284 3.91.79.74
    405 2a01:4f8:140:3192::2
    471 66.249.66.80
    712 35.174.184.209
    784 2a01:4f8:13b:1296::2
  • The two IPV6 addresses are something called BLEXBot, which seems to check the robots.txt file and then completely ignore it by making thousands of requests to dynamic pages like Browse and Discovery
  • Then 35.174.184.209 is MauiBot, which does the same thing
  • Also 3.91.79.74 does, which appears to be CCBot
  • I will add these three to the “bad bot” rate limiting that I originally used for Baidu
  • Going further, these are the IPs making requests to Discovery and Browse pages so far today:
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    120 34.207.146.166
    128 3.91.79.74
    132 108.179.57.67
    143 34.228.42.25
    185 216.244.66.198
    430 54.70.40.11
   1033 93.179.69.74
   1206 2a01:4f8:140:3192::2
   2678 2a01:4f8:13b:1296::2
   3790 35.174.184.209
  • 54.70.40.11 is SemanticScholarBot
  • 216.244.66.198 is DotBot
  • 93.179.69.74 is some IP in Ukraine, which I will add to the list of bot IPs in nginx
  • I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
  • Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
       5 dspaceApi
     10 dspaceCli
     13 dspaceWeb
  • Looking closer I see they are all idle… so at least I know the load isn’t coming from some background nightly task or something
  • Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT)
  • Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
$ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
  • UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
  • Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
# grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
2931
  • So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
  • Otherwise, these are the top users in the web and API logs the last hour (18–19):
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
     54 41.216.228.158
     65 199.47.87.140
     75 157.55.39.238
     77 157.55.39.237
     89 157.55.39.236
    100 18.196.196.108
    128 18.195.78.144
    277 2a01:4f8:13b:1296::2
    291 66.249.66.80
    328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 2409:4066:211:2caf:3c31:3fae:2212:19cc
      2 35.10.204.140
      2 45.251.231.45
      2 95.108.181.88
      2 95.137.190.2
      3 104.198.9.108
      3 107.167.109.88
      6 66.249.66.80
     13 41.89.230.156
   1860 45.5.184.2
  • For the XMLUI I see 18.195.78.144 and 18.196.196.108 requesting only CTA items and with no user agent
  • They are responsible for almost 1,000 XMLUI sessions today:
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
937
  • I will add their IPs to the list of bot IPs in nginx so I can tag them as bots to let Tomcat’s Crawler Session Manager Valve to force them to re-use their session
  • Another user agent behaving badly in Colombia is “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
  • I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely an automated read-only request
  • I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
119
  • What’s strange is that I can’t see any of their requests in the DSpace log…
$ grep -I -c 45.5.184.72 dspace.log.2019-03-26 
0

2019-03-28

  • Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery
  • What the hell is going on with this CTA publication?
# grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
      1 37.48.65.147
      1 80.113.172.162
      2 108.174.5.117
      2 83.110.14.208
      4 18.196.8.188
     84 18.195.78.144
    644 18.194.46.84
   1144 18.196.196.108
  • None of these 18.x.x.x IPs specify a user agent and they are all on Amazon!
  • Shortly after I started the re-indexing UptimeRobot began to complain that CGSpace was down, then up, then down, then up…
  • I see the load on the server is about 10.0 again for some reason though I don’t know WHAT is causing that load
    • It could be the CPU steal metric, as if Linode has oversold the CPU resources on this VM host…
  • Here are the Munin graphs of CPU usage for the last day, week, and year:

CPU day

CPU week

CPU year

  • What’s clear from this is that some other VM on our host has heavy usage for about four hours at 6AM and 6PM and that during that time the load on our server spikes
    • CPU steal has drastically increased since March 25th
    • It might be time to move to a dedicated CPU VM instances, or even real servers
    • For now I just sent a support ticket to bring this to Linode’s attention
  • In other news, I see that it’s not even the end of the month yet and we have 3.6 million hits already:
# zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
3654911
  • In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up

2019-03-29

  • Sent Linode more information from top and iostat about the resource usage on linode18
    • Linode agreed that the CPU steal percentage was high and migrated the VM to a new host
    • Now the resource contention is much lower according to iostat 1 10
  • I restarted Tomcat to see if I could fix the missing pre-2019 statistics (yes it fixed it)
    • Though I looked in the Solr Admin UI and noticed a logging dashboard that show warnings and errors, and the first concerning Solr cores was on 3/27/2019, 8:50:35 AM so I should check the logs around that time to see if something happened

2019-03-31

  • After a few days of the CGSpace VM (linode18) being migrated to a new host the CPU steal is gone and the site is much more responsive

linode18 CPU usage after migration

  • It is frustrating to see that the load spikes for own own legitimate load on the server were very aggravated and drawn out by the contention for CPU on this host
  • We had 4.2 million hits this month according to the web server logs:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
4218841

real    0m26.609s
user    0m31.657s
sys     0m2.551s
  • Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in 2018-10:
$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.33s user 0.07s system 2% cpu 17.167 total
0.27s user 0.04s system 1% cpu 16.643 total
0.24s user 0.09s system 1% cpu 17.764 total
0.25s user 0.06s system 1% cpu 15.947 total
  • I did some research on dedicated servers to potentially replace Linode for CGSpace stuff and it seems Hetzner is pretty good
    • This PX62-NVME system looks great an is half the price of our current Linode instance
    • It has 64GB of ECC RAM, six core Xeon processor from 2018, and 2x960GB NVMe storage
    • The alternative of staying with Linode and using dedicated CPU instances with added block storage gets expensive quickly if we want to keep more than 16GB of RAM (do we?)
      • Regarding RAM, our JVM heap is 8GB and we leave the rest of the system’s 32GB of RAM to PostgreSQL and Solr buffers
      • Seeing as we have 56GB of Solr data it might be better to have more RAM in order to keep more of it in memory
      • Also, I know that the Linode block storage is a major bottleneck for Solr indexing
  • Looking at the weird issue with shitloads of downloads on the CTA item again
  • The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
     42 196.43.180.134
    621 185.247.144.227
   8102 18.194.46.84
  14927 18.196.196.108
  20265 18.195.78.144
  • I will send a mail to CTA to ask if they know these IPs
  • I wonder if the Cocoon errors we had earlier this month were inadvertently related to the CPU steal issue… I see very low occurrences of the “Can not load requested doc” error in the Cocoon logs the past few days
  • Helping Perttu debug some issues with the REST API on DSpace Test
    • He was getting an HTTP 500 when working with a collection, and I see the following in the DSpace log:
2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
  • IWMI people emailed to ask why two items with the same DOI don’t have the same Altmetric score:
  • Only the second one has an Altmetric score (208)
  • I tweeted handles for both of them to see if Altmetric will pick it up
    • About twenty minutes later the Altmetric score for the second one had increased from 208 to 209, but the first still had a score of zero
    • Interestingly, if I look at the network requests during page load for the first item I see the following response payload for the Altmetric API request:
_altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
  • The response paylod for the second one is the same:
_altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
  • Very interesting to see this in the response:
"handles":["10568/89975","10568/89846"],
"handle":"10568/89975"