<!DOCTYPE html>
< html lang = "en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2018" />
<meta property="og:description" content="2018-04-01
I tried to test something on DSpace Test but noticed that it’s down since god knows when
Catalina logs at least show some memory errors yesterday:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-04/" />
<meta property="article:published_time" content="2018-04-01T16:13:54+02:00" />
<meta property="article:modified_time" content="2018-04-04T17:01:08+03:00" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="April, 2018" />
<meta name="twitter:description" content="2018-04-01
I tried to test something on DSpace Test but noticed that it’s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.38.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-04/",
"wordCount": "1005",
"datePublished": "2018-04-01T16:13:54+02:00",
"dateModified": "2018-04-04T17:01:08+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-04/">
<title>April, 2018 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-CoMzlF7G4xk3ftqRr7leobnWP85AuISUJljMFjtTG/UHyP/+bBwWAvBlXkB4VQQk" crossorigin="anonymous">
</head>
< body >
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-04/">April, 2018</a></h2>
<p class="blog-post-meta"><time datetime="2018-04-01T16:13:54+02:00">Sun Apr 01, 2018</time> by Alan Orth in
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
</p>
</header>
<h2 id="2018-04-01">2018-04-01</h2>
<ul>
<li>I tried to test something on DSpace Test but noticed that it’s down since god knows when</li>
<li>Catalina logs at least show some memory errors yesterday:</li>
</ul>
<p></p>
<pre><code>Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>So this is getting super annoying</li>
<li>I ran all system updates on DSpace Test and rebooted it</li>
<li>For some reason Listings and Reports is not giving any results for any queries now…</li>
<li>I posted a message on Yammer to ask if people are using the Duplicate Check step from the Metadata Quality Module</li>
<li>Help Lili Szilagyi with a question about statistics on some CCAFS items</li>
</ul>
<h2 id="2018-04-04">2018-04-04</h2>
<ul>
<li>Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week</li>
<li>For completeness I re-ran the CRP corrections on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
</code></pre>
<ul>
<li>Then started a full Discovery index:</li>
</ul>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 76m13.841s
user 8m22.960s
sys 2m2.498s
</code></pre>
<ul>
<li>Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items</li>
<li>I used my <a href="https://gist.githubusercontent.com/alanorth/a49d85cd9c5dea89cddbe809813a7050/raw/f67b6e45a9a940732882ae4bb26897a9b245ef31/add-orcid-identifiers-csv.py">add-orcid-identifiers-csv.py</a> script:</li>
</ul>
<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>The CSV format of <code>jtohme-2018-04-04.csv</code> was:</li>
</ul>
<pre><code class="language-csv">dc.contributor.author,cg.creator.id
"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
</code></pre>
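<p>As a rough illustration of that CSV layout (not the actual logic of <code>add-orcid-identifiers-csv.py</code>), here is a minimal Python sketch that reads the two columns and pulls out the ORCID iD. The function name <code>read_orcid_rows</code> and the regular expression are assumptions made for this example:</p>

```python
import csv
import io
import re

# Assumed pattern: four hyphen-separated groups of four characters,
# where the final character may be a digit or "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def read_orcid_rows(handle):
    """Yield (author, orcid) pairs from a CSV with the two columns shown above."""
    for row in csv.DictReader(handle):
        match = ORCID_RE.search(row["cg.creator.id"])
        if match is None:
            continue  # skip rows without a recognizable ORCID iD
        yield row["dc.contributor.author"], match.group(0)


# In-memory copy of the CSV from the notes, so the sketch runs without files
sample = io.StringIO(
    "dc.contributor.author,cg.creator.id\n"
    '"Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101\n'
)
pairs = list(read_orcid_rows(sample))
print(pairs)
```

<p>The quoted author field matters here: without the quotes, the comma in the name would split it into two columns.</p>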
<ul>
<li>There was a quoting error in my CRP CSV and the replacements for <code>Forests, Trees and Agroforestry</code> got messed up</li>
<li>So I fixed them and had to re-index again!</li>
<li>I started preparing the git branch for the DSpace 5.5→5.8 upgrade:</li>
</ul>
<pre><code>$ git checkout -b 5_x-dspace-5.8 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.8
</code></pre>
<ul>
<li>I was prepared to skip some commits that I had cherry-picked from the upstream <code>dspace-5_x</code> branch when we did the DSpace 5.5 upgrade (see notes on 2016-10-19 and 2017-12-17):
<ul>
<li>[DS-3246] Improve cleanup in recyclable components (upstream commit on dspace-5_x: 9f0f5940e7921765c6a22e85337331656b18a403)</li>
<li>[DS-3250] applying patch provided by Atmire (upstream commit on dspace-5_x: c6fda557f731dbc200d7d58b8b61563f86fe6d06)</li>
<li>bump up to latest minor pdfbox version (upstream commit on dspace-5_x: b5330b78153b2052ed3dc2fd65917ccdbfcc0439)</li>
<li>DS-3583 Usage of correct Collection Array (#1731) (upstream commit on dspace-5_x: c8f62e6f496fa86846bfa6bcf2d16811087d9761)</li>
</ul></li>
<li>… but somehow git knew, and didn’t include them in my interactive rebase!</li>
<li>I need to send this branch to Atmire and also arrange payment (see <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560">ticket #560</a> in their tracker)</li>
<li>Fix Sisay’s SSH access to the new DSpace Test server (linode19)</li>
</ul>
<h2 id="2018-04-05">2018-04-05</h2>
<ul>
<li>Fix Sisay’s sudo access on the new DSpace Test server (linode19)</li>
<li>The reindexing process on DSpace Test took <em>forever</em> yesterday:</li>
</ul>
<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 599m32.961s
user 9m3.947s
sys 2m52.585s
</code></pre>
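<p>To put "forever" in numbers, here is a small Python sketch that converts the <code>real</code> values from the two <code>time</code> runs recorded in these notes (76m13.841s on 2018-04-04 and 599m32.961s here) to seconds and compares them. The helper name <code>parse_real</code> is made up for this example, and the two runs may well have been on different servers, so the ratio is only a rough indication:</p>

```python
import re


def parse_real(duration):
    """Convert a bash `time` value such as '76m13.841s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


earlier_run = parse_real("76m13.841s")   # wall-clock time noted on 2018-04-04
slow_run = parse_real("599m32.961s")     # wall-clock time noted on 2018-04-05
ratio = slow_run / earlier_run
print(f"{slow_run / 3600:.1f} hours vs {earlier_run / 3600:.1f} hours ({ratio:.1f}x)")
```

<p>The user and sys times are similar in both runs, so nearly all of the extra wall-clock time was spent waiting rather than computing, which is consistent with slow storage.</p>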
<ul>
<li>So we really should not use this Linode block storage for Solr</li>
<li>Assetstore might be fine but would complicate things with configuration and deployment (ughhh)</li>
<li>Better to use Linode block storage only for backup</li>
<li>Help Peter with the GDPR compliance / reporting form for CGSpace</li>
<li>DSpace Test crashed due to memory issues again:</li>
</ul>
<pre><code># grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
16
</code></pre>
<ul>
<li>I ran all system updates on DSpace Test and rebooted it</li>
<li>Proof some records on DSpace Test for Udana from IWMI</li>
<li>He has done better with the small syntax and consistency issues, but there are still larger concerns like not linking to DOIs, copying titles incorrectly, etc</li>
</ul>
<h2 id="2018-04-10">2018-04-10</h2>
<ul>
<li>I got a notice that CGSpace CPU usage was very high this morning</li>
<li>Looking at the nginx logs, here are the top users today so far:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
282 207.46.13.112
286 54.175.208.220
287 207.46.13.113
298 66.249.66.153
322 207.46.13.114
780 104.196.152.243
3994 178.154.200.38
4295 70.32.83.92
4388 95.108.181.88
7653 45.5.186.2
</code></pre>
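<p>The same tally can be sketched in Python, which is handy if the counts need further processing. This is only an approximation of the shell pipeline above: the function names <code>open_log</code> and <code>top_clients</code> are invented for this example, and the sample log lines are made up so the sketch runs without access to the server:</p>

```python
import gzip
from collections import Counter


def open_log(path):
    """Open a plain or gzip-compressed log as text, similar to zcat --force."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic bytes
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding="utf-8", errors="replace")


def top_clients(lines, date, n=10):
    """Count requests per client address (first field) on lines containing date."""
    counts = Counter()
    for line in lines:
        if date in line:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts.most_common(n)


# Made-up sample lines in nginx's combined log format
sample = [
    '45.5.186.2 - - [10/Apr/2018:05:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "-"',
    '45.5.186.2 - - [10/Apr/2018:05:00:01 +0000] "GET / HTTP/1.1" 200 1 "-" "-"',
    '95.108.181.88 - - [10/Apr/2018:05:00:02 +0000] "GET / HTTP/1.1" 200 1 "-" "-"',
    '95.108.181.88 - - [09/Apr/2018:05:00:03 +0000] "GET / HTTP/1.1" 200 1 "-" "-"',
]
result = top_clients(sample, "10/Apr/2018", n=2)
print(result)
```

<p>Note that <code>most_common</code> lists the busiest client first, the reverse of the <code>sort -n | tail</code> ordering in the shell output.</p>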
<ul>
<li>45.5.186.2 is of course CIAT</li>
<li>95.108.181.88 appears to be Yandex:</li>
</ul>
<pre><code>95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
</code></pre>
<ul>
<li>And for some reason Yandex created a lot of Tomcat sessions today:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
4363
</code></pre>
<ul>
<li>70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP</li>
<li>They are not creating new Tomcat sessions so there is no problem there</li>
<li>178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:</li>
</ul>
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
3982
</code></pre>
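<p>One caveat with <code>grep -c</code> is that it counts matching log lines, not distinct session IDs, so a crawler that reuses one session for many requests still produces a large number. A minimal Python sketch for counting unique sessions per IP, assuming the <code>session_id=…:ip_addr=…</code> log format matched by the grep patterns above (the function name <code>sessions_by_ip</code> and the sample session IDs are made up):</p>

```python
import re

# Same shape as the grep pattern above, but capturing the session ID and IP
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_by_ip(lines):
    """Map each client IP to the set of distinct session IDs seen for it."""
    sessions = {}
    for line in lines:
        for session_id, ip_addr in SESSION_RE.findall(line):
            sessions.setdefault(ip_addr, set()).add(session_id)
    return sessions


# Made-up session IDs: three log lines, but only two distinct sessions
sid_a = "A" * 32
sid_b = "B" * 32
sample = [
    f"session_id={sid_a}:ip_addr=95.108.181.88:view_item",
    f"session_id={sid_a}:ip_addr=95.108.181.88:view_item",
    f"session_id={sid_b}:ip_addr=95.108.181.88:view_item",
]
result = sessions_by_ip(sample)
print({ip: len(ids) for ip, ids in result.items()})
```

<p>Comparing the line count with the number of unique IDs shows whether an IP is really opening new sessions or just making many requests within a few of them.</p>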
<ul>
<li>I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve</li>
<li>Let’s try a manual request with and without their user agent:</li>
</ul>
<pre><code>$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:18:37 GMT
Expires: Tue, 10 Apr 2018 06:18:37 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
$ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg
GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cgspace.cgiar.org
User-Agent: HTTPie/0.9.9
HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2638
Content-Type: image/jpeg;charset=ISO-8859-1
Date: Tue, 10 Apr 2018 05:20:08 GMT
Expires: Tue, 10 Apr 2018 06:20:08 GMT
Last-Modified: Tue, 25 Apr 2017 07:05:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly
Strict-Transport-Security: max-age=15768000
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
</code></pre>
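<p>The difference between the two responses is the <code>Set-Cookie: JSESSIONID=…</code> header, which only appears when the request does not use the crawler's user agent. A hedged Python sketch of that check follows. The function names <code>fetch_headers</code> and <code>sets_session_cookie</code> are invented for this example, and the header lists are a subset copied from the output above so the check runs offline:</p>

```python
import urllib.request


def fetch_headers(url, user_agent=None):
    """Return response headers as (name, value) pairs. Performs a live request."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.getheaders()


def sets_session_cookie(headers):
    """True if any Set-Cookie header in the response issues a JSESSIONID cookie."""
    return any(
        name.lower() == "set-cookie" and value.lstrip().startswith("JSESSIONID=")
        for name, value in headers
    )


# Subset of the headers captured above, so no network access is needed
with_yandex_agent = [("Server", "nginx"), ("Vary", "User-Agent")]
without_agent = [
    ("Server", "nginx"),
    ("Set-Cookie", "JSESSIONID=31635DB42B66D6A4208CFCC96DD96875; Path=/; Secure; HttpOnly"),
    ("Vary", "User-Agent"),
]
print(sets_session_cookie(with_yandex_agent))
print(sets_session_cookie(without_agent))
```

<p>For a live comparison, <code>fetch_headers</code> could be called twice with the same URL, once with the YandexBot user agent string and once without, and both results passed to <code>sets_session_cookie</code>.</p>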
<ul>
<li>So it definitely looks like Yandex requests are getting assigned a session from the Crawler Session Manager valve</li>
<li>And if I look at the DSpace log I see its IP sharing a session with other crawlers like Google (66.249.66.153)</li>
<li>Indeed the number of Tomcat sessions appears to be normal:</li>
</ul>
<p><img src="/cgspace-notes/2018/04/jmx_dspace_sessions-week.png" alt="Tomcat sessions week" /></p>
<ul>
<li>Looks like the number of total requests processed by nginx in March went down from the previous months:</li>
</ul>
<pre><code># time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
2266594
real 0m13.658s
user 0m16.533s
sys 0m1.087s
</code></pre>
< / article >
< / div > <!-- /.blog - main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2018-04/">April, 2018</a></li>
<li><a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li><a href="/cgspace-notes/2018-02/">February, 2018</a></li>
<li><a href="/cgspace-notes/2018-01/">January, 2018</a></li>
<li><a href="/cgspace-notes/2017-12/">December, 2017</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
< / div > <!-- /.row -->
< / div > <!-- /.container -->
<footer class="blog-footer">
<p>
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
< / body >
< / html >