CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

July, 2018

2018-07-01

  • I want to upgrade DSpace Test to DSpace 5.8, so I took a backup of its current database just in case:
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
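  • For reference, a custom-format dump like that could be restored with pg_restore if the upgrade goes badly; a minimal sketch, assuming the dspace role already exists and the database can simply be dropped and recreated:
$ dropdb -U postgres dspace
$ createdb -U postgres -O dspace dspace
$ pg_restore -U postgres -d dspace --no-owner --role=dspace dspace-2018-07-01.backup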
  • During the mvn package stage on the 5.8 branch I kept getting issues with Java running out of memory:
There is insufficient memory for the Java Runtime Environment to continue.

  • As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
  • Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:
$ sudo su - postgres
$ psql dspace
...
dspace=# begin;
BEGIN
dspace=# \i Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
UPDATE 1
DELETE 1
dspace=# commit;
dspace=# \q
$ exit
$ dspace database migrate ignored
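  • Before starting Tomcat again it doesn't hurt to confirm that the migrations all applied; DSpace can report the Flyway migration state (just a sanity check, output omitted here):
$ dspace database info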
  • After that I started Tomcat 7 and DSpace seems to be working; now I need to ask our colleagues to test things and report any issues they find

2018-07-02

2018-07-03

  • Finally finished with the CIFOR Archive records (a total of 2448):
    • I mapped the 50 items that were duplicates from elsewhere in CGSpace into CIFOR Archive
    • I did one last check of the remaining 2398 items and found eight that had a cg.identifier.doi linking to some URL other than a DOI, so I moved those to cg.identifier.url or cg.identifier.googleurl as appropriate
    • Also, thirteen items had a DOI in their citation, but did not have a cg.identifier.doi field, so I added those
    • Then I imported those 2398 items in two batches (to deal with memory issues):
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
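  • For what it's worth, splitting a metadata CSV into two batches while keeping the header row on both halves is easy with head and sed; a rough sketch (the -all.csv source file name and the split point are hypothetical):
$ head -n 1 2018-06-27-New-CIFOR-Archive-all.csv > /tmp/2018-06-27-New-CIFOR-Archive.csv
$ head -n 1 2018-06-27-New-CIFOR-Archive-all.csv > /tmp/2018-06-27-New-CIFOR-Archive2.csv
$ sed -n '2,1200p' 2018-06-27-New-CIFOR-Archive-all.csv >> /tmp/2018-06-27-New-CIFOR-Archive.csv
$ sed -n '1201,$p' 2018-06-27-New-CIFOR-Archive-all.csv >> /tmp/2018-06-27-New-CIFOR-Archive2.csv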
  • I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some that are missing the scheme entirely:
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
 count
-------
   785
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
 count
-------
     4
  • I think I should fix that, as well as some other garbage values like “test” and “dspace.ilri.org”, etc.:
dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
UPDATE 785
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
UPDATE 4
dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
UPDATE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
DELETE 4
dspace=# commit;
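  • Re-running the first count afterwards should return zero if the cleanup worked (same query as above):
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';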
  • I am testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88), and for some reason I get autowire errors on Catalina startup with 8.5.32:
03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
 java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
	at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
	at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
  • Gotta check that out later…

2018-07-04

  • I verified that the autowire error indeed only occurs on Tomcat 8.5, but the application works fine on Tomcat 7
  • I have raised this in the DSpace 5.8 compatibility ticket on Atmire’s tracker
  • Abenet wants me to add “United Kingdom government” to the sponsors on CGSpace so I created a ticket to track it (#381)
  • Also, Udana wants me to add “Enhancing Sustainability Across Agricultural Systems” to the WLE Phase II research themes so I created a ticket to track that (#382)
  • I need to try to finish this DSpace 5.8 business first because I have too many branches with cherry-picks going on right now!

2018-07-06

  • CCAFS want me to add “PII-FP2_MSCCCAFS” to their Phase II project tags on CGSpace (#383)
  • I’ll do it in a batch with all the other metadata updates next week

2018-07-08

  • I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn’t being backed up to S3
  • I apparently noticed this—and fixed it!—in 2016-07, but it doesn’t look like the backup has been updated since then!
  • It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root’s crontab)
  • For now I have just initiated a manual S3 backup of the Solr data:
# s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
  • But I need to add this to cron!
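  • Something like this in root's crontab would do it (the schedule and s3cmd path below are placeholders, not what will actually be deployed):
# Sync the local Solr backup to S3 every night at 03:30
30 3 * * * /usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/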
  • I wonder if I should convert some of the cron jobs to systemd services / timers…
  • I sent a note to all our users on Yammer to ask them about possible maintenance on Sunday, July 14th
  • Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (#384)
  • I regenerated the list of names for all our ORCID iDs using my resolve-orcids.py script:
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
  • But after comparing to the existing list of names I didn’t see much change, so I just ignored it

2018-07-09

  • Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg
  • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:
Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
  • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:
2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
  • I see a strange error around that time in dspace.log.2018-07-08:
2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
  • But not sure what caused that…
  • I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT
  • Looking at the nginx logs, these are the top ten IP addresses active today:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1691 40.77.167.84
   1701 40.77.167.69
   1718 50.116.102.77
   1872 137.108.70.6
   2172 157.55.39.234
   2190 207.46.13.47
   2848 178.154.200.38
   4367 35.227.26.162
   4387 70.32.83.92
   4738 95.108.181.88
  • Of those, all except 70.32.83.92 and 50.116.102.77 are NOT re-using their Tomcat sessions; for example, from the XMLUI logs:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
4435
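  • The same check can be run for the other IPs with a quick loop over the list above (same grep as before):
$ for ip in 95.108.181.88 70.32.83.92 35.227.26.162 178.154.200.38 207.46.13.47 157.55.39.234 137.108.70.6 50.116.102.77 40.77.167.69 40.77.167.84; do echo -n "$ip: "; grep -c -E "session_id=[A-Z0-9]{32}:ip_addr=$ip" dspace.log.2018-07-09; done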
  • 95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match the pattern in Tomcat’s Crawler Session Manager Valve
  • 70.32.83.92 is on MediaTemple but I’m not sure who it is. They are mostly hitting REST so I guess that’s fine
  • 35.227.26.162 doesn’t declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
  • 178.154.200.38 is Yandex again
  • 207.46.13.47 is Bing
  • 157.55.39.234 is Bing
  • 137.108.70.6 is our old friend CORE bot
  • 50.116.102.77 doesn’t declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that’s fine
  • 40.77.167.84 is Bing again
  • Interestingly, the first time I saw 35.227.26.162 was on 2018-06-08
  • I’ve added 35.227.26.162 to the bot tagging logic in the nginx vhost
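  • One way to do that kind of tagging in nginx is a map keyed on the client address; a rough sketch only, as the variable names and values here are illustrative rather than the actual CGSpace vhost configuration:
# Override the (empty) user agent for known bot IPs so they get classified as bots downstream
map $remote_addr $ua {
    default         $http_user_agent;
    35.227.26.162   'GoogleCloudBot';
}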