mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-09-29 13:44:17 +02:00
---
title: "April, 2018"
date: 2018-04-01T16:13:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-04-01

- I tried to test something on DSpace Test but noticed that it has been down since god knows when
- Catalina logs at least show some memory errors yesterday:

<!--more-->

```
Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space

Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
```

- So this is getting super annoying
- I ran all system updates on DSpace Test and rebooted it
- For some reason Listings and Reports is not giving any results for any queries now...
- I posted a message on Yammer to ask if people are using the Duplicate Check step from the Metadata Quality Module
- Help Lili Szilagyi with a question about statistics on some CCAFS items

## 2018-04-04

- Peter noticed that there were still some old CRP names on CGSpace, because I hadn't forced the Discovery index to be updated after I fixed the others last week
- For completeness I re-ran the CRP corrections on CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
```

- Then started a full Discovery index:

```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 76m13.841s
user 8m22.960s
sys 2m2.498s
```

- Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme's items
- I used my [add-orcid-identifiers-csv.py](https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py) script:

```
$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
```

- The CSV format of `jtohme-2018-04-04.csv` was:

```csv
dc.contributor.author,cg.creator.id
"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
```
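
The quoting in this file matters because the author name itself contains a comma. As a minimal sketch (not the actual `add-orcid-identifiers-csv.py` script, whose internals are not shown here), Python's standard `csv` module would read such a file like this; the `read_orcid_rows` helper name is hypothetical:

```python
import csv
import io

# Sample data copied from the CSV shown above.
SAMPLE = '''dc.contributor.author,cg.creator.id
"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
'''


def read_orcid_rows(handle):
    """Return (author, creator_id) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [(row["dc.contributor.author"], row["cg.creator.id"]) for row in reader]


if __name__ == "__main__":
    for author, creator_id in read_orcid_rows(io.StringIO(SAMPLE)):
        print(f"{author} -> {creator_id}")
```

Without the double quotes around the author name, the row would split into three fields instead of two.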

- There was a quoting error in my CRP CSV and the replacements for `Forests, Trees and Agroforestry` got messed up
- So I fixed them and had to re-index again!
- I started preparing the git branch for the DSpace 5.5→5.8 upgrade:

```
$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
```

- I was prepared to skip some commits that I had cherry picked from the upstream `dspace-5_x` branch when we did the DSpace 5.5 upgrade (see notes on 2016-10-19 and 2017-12-17):
  - [DS-3246] Improve cleanup in recyclable components (upstream commit on dspace-5_x: 9f0f5940e7921765c6a22e85337331656b18a403)
  - [DS-3250] applying patch provided by Atmire (upstream commit on dspace-5_x: c6fda557f731dbc200d7d58b8b61563f86fe6d06)
  - bump up to latest minor pdfbox version (upstream commit on dspace-5_x: b5330b78153b2052ed3dc2fd65917ccdbfcc0439)
  - DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
- ... but somehow git knew, and didn't include them in my interactive rebase!
- I need to send this branch to Atmire and also arrange payment (see [ticket #560](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560) in their tracker)
- Fix Sisay's SSH access to the new DSpace Test server (linode19)

## 2018-04-05

- Fix Sisay's sudo access on the new DSpace Test server (linode19)
- The reindexing process on DSpace Test took _forever_ yesterday:

```
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 599m32.961s
user 9m3.947s
sys 2m52.585s
```
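
To put that number in perspective, here is a small sketch that converts the `real` values reported by `time` into seconds and compares this run with the 76-minute run on CGSpace from 2018-04-04. The `parse_real_time` helper is hypothetical, and the two runs were on different servers, so the ratio is only a rough indication:

```python
import re


def parse_real_time(value):
    """Convert a bash `time` duration such as '76m13.841s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


if __name__ == "__main__":
    cgspace = parse_real_time("76m13.841s")
    dspace_test = parse_real_time("599m32.961s")
    print(f"DSpace Test took {dspace_test / cgspace:.1f}x as long as CGSpace")
```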

- So we really should not use this Linode block storage for Solr
- Assetstore might be fine but would complicate things with configuration and deployment (ughhh)
- Better to use Linode block storage only for backup
- Help Peter with the GDPR compliance / reporting form for CGSpace
- DSpace Test crashed due to memory issues again:

```
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
```

- I ran all system updates on DSpace Test and rebooted it
- Proof some records on DSpace Test for Udana from IWMI
- He has improved on the small syntax and consistency issues, but there are larger concerns such as not linking to DOIs and copying titles incorrectly

## 2018-04-10

- I got a notice that CGSpace CPU usage was very high this morning
- Looking at the nginx logs, here are the top users today so far:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
298 66.249.66.153
322 207.46.13.114
780 104.196.152.243
3994 178.154.200.38
4295 70.32.83.92
4388 95.108.181.88
7653 45.5.186.2
```
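
The same tally can be sketched in Python with `collections.Counter`. This assumes the client IP is the first whitespace-separated field of each nginx log line, as the `awk '{print $1}'` above implies; the `top_clients` helper and the sample lines are hypothetical:

```python
from collections import Counter

# Hypothetical sample lines in the nginx combined log format.
SAMPLE_LINES = [
    '45.5.186.2 - - [10/Apr/2018:06:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '95.108.181.88 - - [10/Apr/2018:06:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '45.5.186.2 - - [10/Apr/2018:06:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '45.5.186.2 - - [09/Apr/2018:23:59:59 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
]


def top_clients(lines, date, limit=10):
    """Count requests per client IP for log lines containing the date string."""
    counts = Counter(line.split(maxsplit=1)[0] for line in lines if date in line)
    return counts.most_common(limit)


if __name__ == "__main__":
    for ip, hits in top_clients(SAMPLE_LINES, "10/Apr/2018"):
        print(hits, ip)
```

Unlike the `sort -n | tail` pipeline, `most_common` lists the busiest client first.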

- 45.5.186.2 is of course CIAT
- 95.108.181.88 appears to be Yandex:

```
95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
```

- And for some reason Yandex created a lot of Tomcat sessions today:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
```

- 70.32.83.92 appears to be some harvester we've seen before, but on a new IP
- They are not creating new Tomcat sessions so there is no problem there
- 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
```
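
Note that `grep -c` counts matching log lines rather than distinct sessions. Below is a sketch that counts distinct session IDs for one IP, assuming the `session_id=...:ip_addr=...` format implied by the pattern above; the `distinct_sessions` helper is hypothetical:

```python
import re

# Same shape as the grep pattern, but capturing the session ID and the IP.
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=([0-9.]+)")


def distinct_sessions(lines, ip):
    """Return the set of session IDs logged together with the given IP."""
    sessions = set()
    for line in lines:
        match = SESSION_RE.search(line)
        if match and match.group(2) == ip:
            sessions.add(match.group(1))
    return sessions


if __name__ == "__main__":
    with open("dspace.log.2018-04-10", encoding="utf-8", errors="replace") as log:
        print(len(distinct_sessions(log, "95.108.181.88")))
```

If the distinct count is much lower than the line count, the crawler is mostly reusing sessions rather than creating new ones.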

- I'm not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
- Let's try a manual request with and without their user agent:

```
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
```
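
The only difference between the two responses is the `Set-Cookie: JSESSIONID=...` header, which appears only for the non-crawler user agent. A small sketch of checking for that programmatically, given response headers as a list of name/value pairs (the `has_new_session` helper is hypothetical and makes no network request itself):

```python
def has_new_session(headers):
    """Return True if any Set-Cookie header sets a JSESSIONID cookie."""
    for name, value in headers:
        if name.lower() == "set-cookie" and value.strip().startswith("JSESSIONID="):
            return True
    return False


# Abbreviated headers taken from the two responses above.
CRAWLER_RESPONSE = [
    ("Server", "nginx"),
    ("Vary", "User-Agent"),
]
HTTPIE_RESPONSE = [
    ("Server", "nginx"),
    ("Set-Cookie", "JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly"),
    ("Vary", "User-Agent"),
]

if __name__ == "__main__":
    print("crawler got a new session:", has_new_session(CRAWLER_RESPONSE))
    print("HTTPie got a new session:", has_new_session(HTTPIE_RESPONSE))
```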

- So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve
- And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)
- Indeed the number of Tomcat sessions appears to be normal:

![Tomcat sessions week](/cgspace-notes/2018/04/jmx_dspace_sessions-week.png)

- In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:

```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
2266594

real 0m13.658s
user 0m16.533s
sys 0m1.087s
```

- In other other news, the database cleanup script has an issue again:

```
$ dspace cleanup -v
...
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
```

- The solution is, as always:

```
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
UPDATE 1
```
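
Since this comes up repeatedly, here is a sketch of pulling the offending bitstream IDs out of the cleanup error and building the corresponding statement. The `bitstream_ids` and `build_fix` helpers are hypothetical, and the generated SQL should be reviewed before running it with `psql`:

```python
import re

# Matches the "Detail:" line printed by the failed cleanup run.
DETAIL_RE = re.compile(r"Key \(bitstream_id\)=\((\d+)\) is still referenced")


def bitstream_ids(output):
    """Extract all bitstream IDs named in the cleanup error output."""
    return [int(found) for found in DETAIL_RE.findall(output)]


def build_fix(ids):
    """Build the UPDATE statement that clears the primary bitstream reference."""
    if not ids:
        raise ValueError("no bitstream IDs given")
    id_list = ", ".join(str(int(i)) for i in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


if __name__ == "__main__":
    error = 'Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".'
    print(build_fix(bitstream_ids(error)))
```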

- Looking at abandoned connections in Tomcat:

```
# zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
2115
```

- Apparently from these stacktraces we should be able to see which code is not closing connections properly
- Here's a pretty good overview of days where we had database issues recently:

```
# zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
1 Feb 18, 2018
1 Feb 19, 2018
1 Feb 20, 2018
1 Feb 24, 2018
2 Feb 13, 2018
3 Feb 17, 2018
5 Feb 16, 2018
5 Feb 23, 2018
5 Feb 27, 2018
6 Feb 25, 2018
40 Feb 14, 2018
63 Feb 28, 2018
154 Mar 19, 2018
202 Feb 21, 2018
264 Feb 26, 2018
268 Mar 21, 2018
524 Feb 22, 2018
570 Feb 15, 2018
```
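
A Python sketch of the same per-day tally, assuming each matching log line starts with a date like `Feb 18, 2018` (the first three whitespace-separated fields, as in the `awk` above). The `abandoned_per_day` helper is hypothetical:

```python
from collections import Counter

MARKER = "org.apache.tomcat.jdbc.pool.ConnectionPool abandon"


def abandoned_per_day(lines):
    """Count abandoned-connection warnings, keyed by the date prefix of the line."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            # First three fields, e.g. "Feb", "18,", "2018".
            counts[" ".join(line.split()[:3])] += 1
    return counts


if __name__ == "__main__":
    import gzip

    with gzip.open("/var/log/tomcat7/catalina.out.1.gz", "rt", errors="replace") as log:
        for day, count in sorted(abandoned_per_day(log).items(), key=lambda kv: kv[1]):
            print(count, day)
```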

- In Tomcat 8.5 the `removeAbandoned` property has been split into two: `removeAbandonedOnBorrow` and `removeAbandonedOnMaintenance`
- See: https://tomcat.apache.org/tomcat-8.5-doc/jndi-datasource-examples-howto.html#Database_Connection_Pool_(DBCP_2)_Configurations
- I assume we want `removeAbandonedOnBorrow`, so the Tomcat 8 templates in Ansible need updating
- After reading more documentation I see that Tomcat 8.5's default DBCP now seems to be Commons DBCP2 instead of Tomcat DBCP
- It can be overridden in Tomcat's _server.xml_ by setting `factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"` in the `<Resource>`
- I think we should use this default, so we'll need to remove some other settings that are specific to Tomcat's DBCP like `jdbcInterceptors` and `abandonWhenPercentageFull`
- Merge the changes adding ORCID identifier to advanced search and Atmire Listings and Reports ([#371](https://github.com/ilri/DSpace/pull/371))
- Fix one more issue of missing XMLUI strings (for CRP subject when clicking "view more" in the Discovery sidebar)
- I told Udana to fix the citation and abstract of the one item, and to correct the `dc.language.iso` for the five Spanish items in his Book Chapters collection
- Then we can import the records to CGSpace

## 2018-04-11

- DSpace Test (linode19) crashed again some time since yesterday:

```
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
168
```

- I ran all system updates and rebooted the server

## 2018-04-12

- I caught wind of an interesting XMLUI performance optimization coming in DSpace 6.3: https://jira.duraspace.org/browse/DS-3883
- I asked for it to be ported to DSpace 5.x

## 2018-04-13

- Add `PII-LAM_CSAGender` to CCAFS Phase II project tags in `input-forms.xml`