---
title: "July, 2018"
date: 2018-07-01T12:56:54+03:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-07-01

- I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

```
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
```
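
- If I ever need to roll back, restoring the custom-format dump should be something like this (a sketch from memory; I would double check the flags against pg_restore's help before running it):

```
# sketch: restore the custom-format dump over the existing dspace database
$ pg_restore -U dspace -d dspace -O --clean dspace-2018-07-01.backup
```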

- During the `mvn package` stage on the 5.8 branch I kept getting issues with java running out of memory:

```
There is insufficient memory for the Java Runtime Environment to continue.
```

<!--more-->

- As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
```
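
- The Tomcat heap itself is just the `-Xmx` value in the service's `JAVA_OPTS`, roughly like this (a sketch; the file name and the other flags here are assumptions):

```
# /etc/default/tomcat7 (sketch; exact file and flags assumed)
JAVA_OPTS="-Djava.awt.headless=true -Xmx4096m -Xms4096m -Dfile.encoding=UTF-8"
```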

- Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:

```
$ sudo su - postgres
$ psql dspace
...
dspace=# begin;
BEGIN
dspace=# \i Atmire-DSpace-5.8-Schema-Migration.sql
DELETE 0
UPDATE 1
DELETE 1
dspace=# commit;
dspace=# \q
$ exit
$ dspace database migrate ignored
```
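
- For the record, the stop and update steps before that SQL were the usual ones, roughly like this (a sketch; the service name and source path are assumptions):

```
# sketch of the standard DSpace update procedure on this server
$ sudo systemctl stop tomcat7
$ cd ~/src/git/DSpace/dspace/target/dspace-installer
$ ant update
```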

- After that I started Tomcat 7 and DSpace seems to be working, so now I need to ask our colleagues to test things and report any issues they find

## 2018-07-02

- Discuss with AgriKnowledge about including our Handle identifier on the items they harvest from CGSpace
- They seem to be only interested in Gates-funded outputs, for example: https://www.agriknowledge.org/files/tm70mv21t

## 2018-07-03

- Finally finish with the CIFOR Archive records (a total of 2448):
  - I mapped the 50 items that were duplicates from elsewhere in CGSpace into [CIFOR Archive](https://cgspace.cgiar.org/handle/10568/16702)
  - I did one last check of the remaining 2398 items and found eight that have a `cg.identifier.doi` linking to some URL other than a DOI, so I moved those to `cg.identifier.url` and `cg.identifier.googleurl` as appropriate
  - Also, thirteen items had a DOI in their citation but did not have a `cg.identifier.doi` field, so I added those
- Then I imported those 2398 items in two batches (to deal with memory issues):

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
```

- I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some that are missing the protocol entirely:

```
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
 count
-------
   785
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
 count
-------
     4
```

- I think I should fix that, as well as some other garbage values like "test" and "dspace.ilri.org", etc.:

```
dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
UPDATE 785
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
UPDATE 4
dspace=# update metadatavalue set text_value='https://books.google.com/books?id=meF1CLdPSF4C' where resource_type_id=2 and metadata_field_id=222 and text_value='meF1CLdPSF4C';
UPDATE 1
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
DELETE 4
dspace=# commit;
```

- I am testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88), and for some reason I get autowire errors on Catalina startup with 8.5.32:

```
03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
    at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
    at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:629)
    at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1839)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
```

- Gotta check that out later...

## 2018-07-04

- I verified that the autowire error indeed only occurs on Tomcat 8.5, but the application works fine on Tomcat 7
- I have raised this in the [DSpace 5.8 compatibility ticket on Atmire's tracker](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560)
- Abenet wants me to add "United Kingdom government" to the sponsors on CGSpace so I created a ticket to track it ([#381](https://github.com/ilri/DSpace/issues/381))
- Also, Udana wants me to add "Enhancing Sustainability Across Agricultural Systems" to the WLE Phase II research themes so I created a ticket to track that ([#382](https://github.com/ilri/DSpace/issues/382))
- I need to try to finish this DSpace 5.8 business first because I have too many branches with cherry-picks going on right now!

## 2018-07-06

- CCAFS want me to add "PII-FP2_MSCCCAFS" to their Phase II project tags on CGSpace ([#383](https://github.com/ilri/DSpace/issues/383))
- I'll do it in a batch with all the other metadata updates next week

## 2018-07-08

- I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn't being backed up to S3
- I apparently noticed this—and fixed it!—in [2016-07]({{< relref "2016-07.md" >}}), but it doesn't look like the backup configuration has been updated since then!
- It looks like I added Solr to the `backup_to_s3.sh` script, but that script is not even being used (`s3cmd` is run directly from root's crontab)
- For now I have just initiated a manual S3 backup of the Solr data:

```
# s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
```

- But I need to add this to cron!
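- Something like this in root's crontab should do it (a sketch; the schedule and the s3cmd path are assumptions):

```
# sketch: sync Solr backups to S3 once per day
0 4 * * * /usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/ > /dev/null
```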

- I wonder if I should convert some of the cron jobs to systemd services / timers...
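- For example, the cron entry above would become a pair of units roughly like this (a sketch with assumed names, nothing I have deployed yet):

```
# /etc/systemd/system/solr-backup-s3.service (sketch)
[Unit]
Description=Sync Solr backups to S3

[Service]
Type=oneshot
ExecStart=/usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/

# /etc/systemd/system/solr-backup-s3.timer (sketch)
[Unit]
Description=Daily Solr backup sync to S3

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

- Then it would just be a matter of `systemctl enable --now solr-backup-s3.timer` instead of editing root's crontab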
- I sent a note to all our users on Yammer to ask them about possible maintenance on Sunday, July 14th
- Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it ([#384](https://github.com/ilri/DSpace/issues/384))
- I regenerated the list of names for all our ORCID iDs using my [resolve-orcids.py](https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b) script:

```
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
```

- But after comparing to the existing list of names I didn't see much change, so I just ignored it

## 2018-07-09

- Uptime Robot said that CGSpace was down for two minutes early this morning, but I don't see anything in the Tomcat logs or dmesg
- Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat's `catalina.out`:

```
Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
```

- I'm not sure if it's the same error, but I see this in DSpace's `solr.log`:

```
2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
```

- I see a strange error around that time in `dspace.log.2018-07-08`:

```
2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
```

- But not sure what caused that...
- I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours, around 8PM GMT
- Looking in the nginx logs I see the top ten IP addresses active today:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1691 40.77.167.84
   1701 40.77.167.69
   1718 50.116.102.77
   1872 137.108.70.6
   2172 157.55.39.234
   2190 207.46.13.47
   2848 178.154.200.38
   4367 35.227.26.162
   4387 70.32.83.92
   4738 95.108.181.88
```

- Of those, *all* except `70.32.83.92` and `50.116.102.77` are *NOT* re-using their Tomcat sessions, for example from the XMLUI logs:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
4435
```

- `95.108.181.88` appears to be Yandex, so I dunno why it's creating so many sessions, as its user agent should match Tomcat's Crawler Session Manager Valve
- `70.32.83.92` is on MediaTemple but I'm not sure who it is. They are mostly hitting REST so I guess that's fine
- `35.227.26.162` doesn't declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx
- `178.154.200.38` is Yandex again
- `207.46.13.47` is Bing
- `157.55.39.234` is Bing
- `137.108.70.6` is our old friend CORE bot
- `50.116.102.77` doesn't declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that's fine
- `40.77.167.84` is Bing again
- Interestingly, the first time that I see `35.227.26.162` was on 2018-06-08
- I've added `35.227.26.162` to the bot tagging logic in the nginx vhost
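
- The bot tagging is conceptually just a map from the client address to a bot flag, something like this (a sketch of the idea, not our actual vhost config):

```
# sketch: treat requests from this Google Cloud scraper as a bot (goes in the http context)
map $remote_addr $ua_is_bot {
    default          0;
    35.227.26.162    1;
}
```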

## 2018-07-10

- Add "United Kingdom government" to sponsors ([#381](https://github.com/ilri/DSpace/issues/381))
- Add "Enhancing Sustainability Across Agricultural Systems" to WLE Phase II Research Themes ([#382](https://github.com/ilri/DSpace/issues/382))
- Add "PII-FP2_MSCCCAFS" to CCAFS Phase II Project Tags ([#383](https://github.com/ilri/DSpace/issues/383))
- Add journal title (dc.source) to Discovery search filters ([#384](https://github.com/ilri/DSpace/issues/384))
- All were tested and merged to the `5_x-prod` branch and will be deployed on CGSpace this coming weekend when I do the Linode server upgrade
- I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire's 5.8 pull request ([#378](https://github.com/ilri/DSpace/pull/378))
- Linode sent an alert about CPU usage on CGSpace again, about 13:00 UTC
- These are the top ten users in the last two hours:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     81 193.95.22.113
     82 50.116.102.77
    112 40.77.167.90
    117 196.190.95.98
    120 178.154.200.38
    215 40.77.167.96
    243 41.204.190.40
    415 95.108.181.88
    695 35.227.26.162
    697 213.139.52.250
```

- Looks like `213.139.52.250` is Moayad testing his new CGSpace visualization thing:

```
213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
```

- He said there was a bug that caused his app to request a bunch of invalid URLs
- I'll have to keep an eye on this and see how their platform evolves

## 2018-07-11

- Skype meeting with Peter and the Addis CGSpace team
- We need to look at doing the `dc.rights` stuff again, which we last worked on in 2018-01 and 2018-02
- Abenet suggested that we do a controlled vocabulary for the authors, perhaps with the top 1,500 or so on CGSpace?
- Peter told Sisay to test this controlled vocabulary
- Discuss the meeting in Nairobi in October

## 2018-07-12

- Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
- Here are the top ten IPs from last night and this morning:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     48 66.249.64.91
     50 35.227.26.162
     57 157.55.39.234
     59 157.55.39.71
     62 147.99.27.190
     82 95.108.181.88
     92 40.77.167.90
     97 183.128.40.185
     97 240e:f0:44:fa53:745a:8afe:d221:1232
   3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     25 216.244.66.198
     38 40.77.167.185
     46 66.249.64.93
     56 157.55.39.71
     60 35.227.26.162
     65 157.55.39.234
     83 95.108.181.88
     87 66.249.64.91
     96 40.77.167.90
   7075 208.110.72.10
```

- We have never seen `208.110.72.10` before... so that's interesting!
- The user agent for these requests is: `Pcore-HTTP/v0.44.0`
- A brief Google search doesn't turn up any information about what this bot is, but there are lots of users complaining about it
- This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
  17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
```

- I think the problem is that, despite the bot requesting `robots.txt`, it almost exclusively requests dynamic pages from `/discover`:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
  13364 GET /discover
    993 GET /search-filter
    804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
```
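
- Those dynamic pages are disallowed in our `robots.txt`, with lines roughly like these (quoting DSpace's defaults from memory):

```
User-agent: *
Disallow: /discover
Disallow: /search-filter
```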

- So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting
- I'll also add it to Tomcat's Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers, just in case
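- Concretely that means extending the valve's `crawlerUserAgents` regular expression in Tomcat's `server.xml`, something like this (a sketch based on the valve's default pattern rather than our exact config):

```
<!-- sketch: add Pcore-HTTP to the crawler user agent pattern -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Pcore-HTTP.*" />
```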

- Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
COPY 4518
dspace=# \q
$ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
```

- We also need to discuss standardizing our countries and comparing our ORCID iDs

## 2018-07-13

- Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
COPY 4518
```

<!-- vim: set sw=2 ts=2: -->