CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2018

2018-04-01

  • I tried to test something on DSpace Test but noticed that it has been down since god knows when
  • Catalina logs at least show some memory errors yesterday:

Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]] 
java.lang.OutOfMemoryError: Java heap space

Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
  • So this is getting super annoying
  • I ran all system updates on DSpace Test and rebooted it
  • For some reason Listings and Reports is not giving any results for any queries now…
  • I posted a message on Yammer to ask if people are using the Duplicate Check step from the Metadata Quality Module
  • Help Lili Szilagyi with a question about statistics on some CCAFS items
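
To see how often these heap errors recur, the Catalina log excerpt above could be tallied per day. This is only a rough sketch: it assumes every timestamped line starts with the "Mar 31, 2018 10:26:42 PM" style shown above, and `oom_lines_per_day` is a made-up helper name. Like `grep -c`, it counts matching lines rather than distinct crashes.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp prefix as it appears in the Catalina log excerpt,
# e.g. "Mar 31, 2018 10:26:42 PM org.apache...".
STAMP_RE = re.compile(r"^([A-Z][a-z]{2} \d{1,2}, \d{4}) \d{1,2}:\d{2}:\d{2} [AP]M ")
OOM_MARKER = "java.lang.OutOfMemoryError: Java heap space"


def oom_lines_per_day(lines):
    """Count lines mentioning a heap-space OOM, keyed by the last date seen."""
    counts = Counter()
    current_day = None  # stays None until the first timestamped line
    for line in lines:
        match = STAMP_RE.match(line)
        if match:
            # %b assumes an English locale for month abbreviations
            current_day = datetime.strptime(match.group(1), "%b %d, %Y").date()
        if OOM_MARKER in line:
            counts[current_day] += 1
    return counts


sample = [
    "Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run",
    "SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]",
    "java.lang.OutOfMemoryError: Java heap space",
    'Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space',
]
counts = oom_lines_per_day(sample)
for day, count in counts.most_common():
    print(day, count)
```

Both error lines in the excerpt are attributed to 2018-03-31 because the stack trace lines carry no timestamp of their own.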

2018-04-04

  • Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week
  • For completeness I re-ran the CRP corrections on CGSpace:
$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
  • Then started a full Discovery index:
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    76m13.841s
user    8m22.960s
sys     2m2.498s
  • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items
  • I used my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
  • The CSV format of jtohme-2018-04-04.csv was:
dc.contributor.author,cg.creator.id
"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
  • There was a quoting error in my CRP CSV and the replacements for Forests, Trees and Agroforestry got messed up
  • So I fixed them and had to re-index again!
  • I started preparing the git branch for the DSpace 5.5→5.8 upgrade:
$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
  • I was prepared to skip some commits that I had cherry-picked from the upstream dspace-5_x branch when we did the DSpace 5.5 upgrade (see notes on 2016-10-19 and 2017-12-17):
    • [DS-3246] Improve cleanup in recyclable components (upstream commit on dspace-5_x: 9f0f5940e7921765c6a22e85337331656b18a403)
    • [DS-3250] applying patch provided by Atmire (upstream commit on dspace-5_x: c6fda557f731dbc200d7d58b8b61563f86fe6d06)
    • bump up to latest minor pdfbox version (upstream commit on dspace-5_x: b5330b78153b2052ed3dc2fd65917ccdbfcc0439)
    • DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
  • … but somehow git knew, and didn’t include them in my interactive rebase!
  • I need to send this branch to Atmire and also arrange payment (see ticket #560 in their tracker)
  • Fix Sisay’s SSH access to the new DSpace Test server (linode19)
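
The quoting problem above comes down to values that contain commas, such as the author name "Tohme, Joseph M." or the CRP name Forests, Trees and Agroforestry. A minimal sketch of how a quoted row in the two-column ORCID format parses, and how an unquoted one goes wrong, using Python's csv module. This is not the actual add-orcid-identifiers-csv.py script; `parse_orcid_rows` and `ORCID_RE` are hypothetical names for illustration.

```python
import csv
import io
import re

# Hypothetical helper for illustration; not the real add-orcid-identifiers-csv.py.
# Matches the 16-character ORCID form, e.g. 0000-0003-2765-7101.
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def parse_orcid_rows(text):
    """Return (author, creator label, ORCID) tuples from the two-column CSV."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        creator = row["cg.creator.id"]
        match = ORCID_RE.search(creator)
        if match is None:
            # An unquoted comma in the author name shifts the columns
            raise ValueError(f"no ORCID identifier in {creator!r}")
        rows.append((row["dc.contributor.author"], creator, match.group(0)))
    return rows


header = "dc.contributor.author,cg.creator.id\n"
quoted = header + '"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101\n'
unquoted = header + "Tohme, Joseph M.,Joe Tohme: 0000-0003-2765-7101\n"

rows = parse_orcid_rows(quoted)
print(rows)

try:
    parse_orcid_rows(unquoted)
    unquoted_failed = False
except ValueError as error:
    unquoted_failed = True
    print("unquoted row rejected:", error)
```

Without the quotes the name is split at its comma, so the second column holds " Joseph M." instead of the ORCID label and the row is rejected.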

2018-04-05

  • Fix Sisay’s sudo access on the new DSpace Test server (linode19)
  • The reindexing process on DSpace Test took forever yesterday:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    599m32.961s
user    9m3.947s
sys     2m52.585s
  • So we really should not use this Linode block storage for Solr
  • Assetstore might be fine but would complicate things with configuration and deployment (ughhh)
  • Better to use Linode block storage only for backup
  • Help Peter with the GDPR compliance / reporting form for CGSpace
  • DSpace Test crashed due to memory issues again:
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
  • I ran all system updates on DSpace Test and rebooted it
  • Proofread some records on DSpace Test for Udana from IWMI
  • He has improved on the small syntax and consistency issues, but there are still larger concerns like not linking to DOIs, copying titles incorrectly, etc
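
To put the two indexing runs side by side, the `time` output can be converted to seconds. A rough sketch, assuming the 76-minute run on 2018-04-04 was on CGSpace and the 599-minute run was on DSpace Test with Solr on block storage; `parse_time_output` is a made-up helper. Note that `time` only accounts for the dspace command-line process, not for work done inside Solr under Tomcat, so the CPU share is indicative at best.

```python
import re

# Matches lines such as "real    599m32.961s" from bash's `time` builtin.
TIME_RE = re.compile(r"^(real|user|sys)\s+(\d+)m([\d.]+)s$")


def parse_time_output(text):
    """Parse bash `time` output into seconds keyed by real/user/sys."""
    result = {}
    for line in text.splitlines():
        match = TIME_RE.match(line.strip())
        if match:
            minutes, seconds = int(match.group(2)), float(match.group(3))
            result[match.group(1)] = minutes * 60 + seconds
    return result


cgspace = parse_time_output("real    76m13.841s\nuser    8m22.960s\nsys     2m2.498s")
dspace_test = parse_time_output("real    599m32.961s\nuser    9m3.947s\nsys     2m52.585s")

# How much longer the DSpace Test run took in wall-clock terms
slowdown = dspace_test["real"] / cgspace["real"]
# Fraction of wall-clock time the dspace process itself spent on the CPU
cpu_share = (dspace_test["user"] + dspace_test["sys"]) / dspace_test["real"]

print(f"wall-clock slowdown: {slowdown:.1f}x")
print(f"CPU share of wall-clock time on DSpace Test: {cpu_share:.1%}")
```

The CPU time is nearly the same in both runs while the wall-clock time is roughly eight times longer, which fits with the process mostly waiting rather than computing.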

2018-04-10

  • I got a notice that CGSpace CPU usage was very high this morning
  • Looking at the nginx logs, here are the top users today so far:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    282 207.46.13.112
    286 54.175.208.220
    287 207.46.13.113
    298 66.249.66.153
    322 207.46.13.114
    780 104.196.152.243
   3994 178.154.200.38
   4295 70.32.83.92
   4388 95.108.181.88
   7653 45.5.186.2
  • 45.5.186.2 is of course CIAT
  • 95.108.181.88 appears to be Yandex:
95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
  • And for some reason Yandex created a lot of Tomcat sessions today:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
  • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
  • They are not creating new Tomcat sessions so there is no problem there
  • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
  • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
  • Let’s try a manual request with and without their user agent:
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
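  • So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve: the response to the request with the Yandex user agent has no Set-Cookie header, while the one with HTTPie’s default user agent gets a new JSESSIONID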
  • And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)
  • Indeed the number of Tomcat sessions appears to be normal:

[Figure: Tomcat sessions week]
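
One caveat about the session counts above: `grep -c` counts matching log lines, not distinct sessions, so a crawler reusing one session still produces a large number. A sketch of counting distinct session IDs per IP and spotting sessions shared between IPs, assuming only the `session_id=...:ip_addr=...` fragment used in the grep patterns. The sample lines and session IDs below are made up, and both function names are hypothetical.

```python
import re
from collections import defaultdict

# Same fragment the grep patterns above rely on
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})")


def sessions_by_ip(lines):
    """Map each client IP to the set of distinct session IDs seen for it."""
    sessions = defaultdict(set)
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            sessions[match.group(2)].add(match.group(1))
    return sessions


def shared_sessions(sessions):
    """Return session IDs that appear under more than one IP."""
    owners = defaultdict(set)
    for ip, session_ids in sessions.items():
        for session_id in session_ids:
            owners[session_id].add(ip)
    return {sid: ips for sid, ips in owners.items() if len(ips) > 1}


# Made-up log fragments: two Yandex requests and one Google request on one
# session, plus one unrelated client with its own session.
crawler_sid = "A" * 32
other_sid = "B" * 32
sample = [
    f"session_id={crawler_sid}:ip_addr=95.108.181.88:view_bitstream",
    f"session_id={crawler_sid}:ip_addr=95.108.181.88:view_item",
    f"session_id={crawler_sid}:ip_addr=66.249.66.153:view_item",
    f"session_id={other_sid}:ip_addr=192.0.2.10:view_item",
]

sessions = sessions_by_ip(sample)
shared = shared_sessions(sessions)
for ip, session_ids in sorted(sessions.items()):
    print(ip, len(session_ids))
print("shared:", {sid: sorted(ips) for sid, ips in shared.items()})
```

In this sample the Yandex IP has two log lines but only one distinct session, and that session is shared with the Google IP, which is the pattern the valve is expected to produce.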

  • Looks like the total number of requests processed by nginx in March went down from the previous months:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
2266594

real    0m13.658s
user    0m16.533s
sys     0m1.087s
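
For reference, the shell pipelines used above on the nginx logs (top client IPs for a day, total requests for a month) can be sketched in Python as well. This assumes the client IP is the first whitespace-separated field, as in the log line quoted earlier; the sample lines are made up, reading the rotated and gzipped files is left out, and `top_clients` and `count_matching` are hypothetical names.

```python
import re
from collections import Counter


def top_clients(lines, date, n=10):
    """Return the n busiest client IPs among lines containing the date string."""
    # Equivalent in spirit to: grep DATE | awk '{print $1}' | sort | uniq -c
    counts = Counter(line.split(None, 1)[0] for line in lines if date in line)
    return counts.most_common(n)


def count_matching(lines, pattern):
    """Count lines matching a regular expression, like grep -cE."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


# Made-up access log lines in the same layout as the Yandex line above
sample = [
    '45.5.186.2 - - [10/Apr/2018:05:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
    '45.5.186.2 - - [10/Apr/2018:05:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
    '95.108.181.88 - - [10/Apr/2018:05:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
    '95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
]

top = top_clients(sample, "10/Apr/2018")
april_total = count_matching(sample, r"[0-9]{1,2}/Apr/2018")
for ip, hits in top:
    print(f"{hits:7d} {ip}")
print("April requests:", april_total)
```

Unlike the `sort -n | tail` version, `most_common` lists the busiest client first.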