---
title: "April, 2018"
date: 2018-04-01T16:13:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-04-01

- I tried to test something on DSpace Test but noticed that it has been down since god knows when
- Catalina logs at least show some memory errors yesterday:

```
Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space

Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
```

- So this is getting super annoying
- I ran all system updates on DSpace Test and rebooted it
- For some reason Listings and Reports is not giving any results for any queries now...
- I posted a message on Yammer to ask if people are using the Duplicate Check step from the Metadata Quality Module
- Help Lili Szilagyi with a question about statistics on some CCAFS items

## 2018-04-04

- Peter noticed that there were still some old CRP names on CGSpace, because I hadn't forced the Discovery index to be updated after I fixed the others last week
- For completeness I re-ran the CRP corrections on CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
```

- Then started a full Discovery index:

```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    76m13.841s
user    8m22.960s
sys     2m2.498s
```

- Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme's items
- I used my [add-orcid-identifiers-csv.py](https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py) script:

```
$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
```

- The CSV format of `jtohme-2018-04-04.csv` was:

```csv
dc.contributor.author,cg.creator.id
"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
```

- There was a quoting error in my CRP CSV and the replacements for `Forests, Trees and Agroforestry` got messed up
- So I fixed them and had to re-index again!
- I started preparing the git branch for the DSpace 5.5→5.8 upgrade:

```
$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
```

- I was prepared to skip some commits that I had cherry picked from the upstream `dspace-5_x` branch when we did the DSpace 5.5 upgrade (see notes on 2016-10-19 and 2017-12-17):
  - [DS-3246] Improve cleanup in recyclable components (upstream commit on dspace-5_x: 9f0f5940e7921765c6a22e85337331656b18a403)
  - [DS-3250] applying patch provided by Atmire (upstream commit on dspace-5_x: c6fda557f731dbc200d7d58b8b61563f86fe6d06)
  - bump up to latest minor pdfbox version (upstream commit on dspace-5_x: b5330b78153b2052ed3dc2fd65917ccdbfcc0439)
  - DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)
- ... but somehow git knew, and didn't include them in my interactive rebase!
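- The likely explanation is that `git rebase` omits commits that introduce the same textual changes as a commit already in the upstream, so cherry-picked patches drop out on their own
- A rough sketch for checking this ahead of time with `git cherry`, which marks such commits with `-`. The helper names `parse_git_cherry` and `cherry` are made up for illustration, and I'm assuming it is run inside the repository:

```python
import subprocess
from typing import List, Tuple


def parse_git_cherry(output: str) -> Tuple[List[str], List[str]]:
    """Split `git cherry` output into (only_local, already_upstream) commit IDs.

    `git cherry` prefixes a commit with "+" when no equivalent change exists
    upstream and with "-" when an equivalent change is already there.
    """
    only_local: List[str] = []
    already_upstream: List[str] = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        marker, commit_id = parts[0], parts[1]
        if marker == "+":
            only_local.append(commit_id)
        elif marker == "-":
            already_upstream.append(commit_id)
    return only_local, already_upstream


def cherry(upstream: str, head: str = "HEAD") -> Tuple[List[str], List[str]]:
    """Run `git cherry <upstream> <head>` in the current repository (hypothetical helper)."""
    result = subprocess.run(
        ["git", "cherry", upstream, head],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_cherry(result.stdout)
```

- Something like `cherry("dspace-5.8", "5_x-prod")` should then list the commits a rebase would keep and the ones it would silently drop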
- I need to send this branch to Atmire and also arrange payment (see [ticket #560](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560) in their tracker)
- Fix Sisay's SSH access to the new DSpace Test server (linode19)

## 2018-04-05

- Fix Sisay's sudo access on the new DSpace Test server (linode19)
- The reindexing process on DSpace Test took _forever_ yesterday:

```
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    599m32.961s
user    9m3.947s
sys     2m52.585s
```

- So we really should not use this Linode block storage for Solr
- Assetstore might be fine but would complicate things with configuration and deployment (ughhh)
- Better to use Linode block storage only for backup
- Help Peter with the GDPR compliance / reporting form for CGSpace
- DSpace Test crashed due to memory issues again:

```
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
```

- I ran all system updates on DSpace Test and rebooted it
- Proof some records on DSpace Test for Udana from IWMI
- He has done better with the small syntax and consistency issues but then there are larger concerns with not linking to DOIs, copying titles incorrectly, etc

## 2018-04-10

- I got a notice that CGSpace CPU usage was very high this morning
- Looking at the nginx logs, here are the top users today so far:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    282
    286
    287
    298
    322
    780
   3994
   4295
   4388
   7653
```

- is of course CIAT
- appears to be Yandex:

```
 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
```

- And for some reason Yandex created a lot of Tomcat sessions today:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=' dspace.log.2018-04-10
4363
```

- appears to be some harvester we've seen before, but on a new IP
- They are not creating new Tomcat sessions so there is no problem there
- also appears to be Yandex, and is also creating many Tomcat sessions:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=' dspace.log.2018-04-10
3982
```

- I'm not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
- Let's try a manual request with and without their user agent:

```
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block

$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
```

- So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve
- And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (
- Indeed the number of Tomcat sessions appears to be normal:

![Tomcat sessions week](/cgspace-notes/2018/04/jmx_dspace_sessions-week.png)

- Looks like the number of total requests processed by nginx in March went down from the previous months:

```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
2266594

real    0m13.658s
user    0m16.533s
sys     0m1.087s
```
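
- For future reference, a rough Python sketch of the `zcat | grep | awk | sort | uniq -c` pipeline I used for the top nginx clients earlier today. The function names `read_log` and `top_clients` are made up, and I'm assuming the client address is the first whitespace-separated field of each log line (as with `awk '{print $1}'`):

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_log(path: str) -> Iterator[str]:
    """Yield lines from a plain or gzip-compressed log, similar to `zcat --force`."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


def top_clients(lines: Iterable[str], day: str, n: int = 10) -> List[Tuple[str, int]]:
    """Count requests per client for lines containing `day`, e.g. "10/Apr/2018"."""
    counts: Counter = Counter()
    for line in lines:
        if day not in line:
            continue  # same filter as the grep on the date string
        fields = line.split()
        if fields:
            counts[fields[0]] += 1  # first field, like awk's $1
    # Busiest clients first (the shell pipeline prints them in ascending order)
    return counts.most_common(n)
```

- Usage would be something like `top_clients(read_log("/var/log/nginx/access.log"), "10/Apr/2018")`, looping over the rotated logs as needed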