<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="April, 2019" />
<meta property="og:description" content="2019-04-01
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43+03:00" />
<meta property="article:modified_time" content="2021-08-18T15:29:31+03:00" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="April, 2019" />
<meta name="twitter:description" content="2019-04-01
Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
They asked if we had plans to enable RDF support in CGSpace
There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.92.2" />
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "BlogPosting",
    "headline": "April, 2019",
    "url": "https://alanorth.github.io/cgspace-notes/2019-04/",
    "wordCount": "6778",
    "datePublished": "2019-04-01T09:00:43+03:00",
    "dateModified": "2021-08-18T15:29:31+03:00",
    "author": {
        "@type": "Person",
        "name": "Alan Orth"
    },
    "keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-04/">
<title>April, 2019 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">

<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>

<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-04/">April, 2019</a></h2>
<p class="blog-post-meta">
<time datetime="2019-04-01T09:00:43+03:00">Mon Apr 01, 2019</time>
in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2019-04-01">2019-04-01</h2>
<ul>
<li>Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
<ul>
<li>They asked if we had plans to enable RDF support in CGSpace</li>
</ul>
</li>
<li>There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
<ul>
<li>I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
</code></pre><h2 id="2019-04-02">2019-04-02</h2>
<ul>
<li>CTA says the Amazon IPs are AWS gateways for real user traffic</li>
<li>I was trying to add Felix Shaw’s account back to the Administrators group on DSpace Test, but I couldn’t find his name in the user search of the groups page
<ul>
<li>If I searched for “Felix” or “Shaw” I saw other matches, including one for his personal email address!</li>
<li>I ended up finding him by searching for his email address</li>
</ul>
</li>
</ul>
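<ul>
<li>For next time, querying the database directly is a reliable fallback when the XMLUI group search misbehaves; a sketch (this assumes the stock DSpace 5 <code>eperson</code> table, and the email pattern here is illustrative):</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -c "SELECT eperson_id, email FROM eperson WHERE LOWER(email) LIKE '%shaw%';"
</code></pre>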
<h2 id="2019-04-03">2019-04-03</h2>
<ul>
<li>Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
<ul>
<li>First I need to extract the ones that are unique from their list compared to our existing one:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u &gt; /tmp/2019-04-03-orcid-ids.txt
</code></pre><ul>
<li>We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!</li>
<li>Next I will resolve all their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
</code></pre><ul>
<li>After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim</li>
<li>One user’s name has changed so I will update those using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
</code></pre><ul>
<li>I created a pull request and merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/417">#417</a>)</li>
<li>A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:</li>
</ul>
<pre tabindex="0"><code>2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:</li>
</ul>
<pre tabindex="0"><code>$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>I will have to keep an eye on it because nothing should be updating 2018 stats in 2019…</li>
</ul>
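<ul>
<li>An easy way to keep an eye on it is to count the “Updating” messages in the current day’s log; a sketch (hypothetical one-liner, reusing the same grep pattern and log path as above):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.$(date +%Y-%m-%d)
</code></pre>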
<h2 id="2019-04-05">2019-04-05</h2>
<ul>
<li>Uptime Robot reported that CGSpace (linode18) went down tonight</li>
<li>I see there are lots of PostgreSQL connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb
</code></pre><ul>
<li>I still see those weird messages about updating the statistics-2018 Solr core:</li>
</ul>
<pre tabindex="0"><code>2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
</code></pre><ul>
<li>Looking at <code>iostat 1 10</code> I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week.png" alt="CPU usage week"></p>
<ul>
<li>The other thing visible there is that the past few days the load has spiked to 500% and I don’t think it’s a coincidence that the Solr updating thing is happening…</li>
<li>I ran all system updates and rebooted the server
<ul>
<li>The load was lower on the server after reboot, but Solr didn’t come back up properly according to the Solr Admin UI:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
</code></pre><ul>
<li>I restarted it again and all the Solr cores came up properly…</li>
</ul>
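<ul>
<li>For reference, the steal figure is the <code>%steal</code> column of <code>iostat</code>’s CPU report; a quick sketch to watch just that column (assumes the standard sysstat <code>avg-cpu</code> layout, where steal is the fifth field on the line after the header):</li>
</ul>
<pre tabindex="0"><code>$ iostat -c 1 10 | awk 'prev ~ /avg-cpu/ {print $5} {prev=$0}'
</code></pre>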
<h2 id="2019-04-06">2019-04-06</h2>
<ul>
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278">10568/91278</a> didn’t have an Altmetric badge on CGSpace, but on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a> it does
<ul>
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
<li>Twenty minutes later I see the same Altmetric score (9) on CGSpace</li>
</ul>
</li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185
</code></pre><ul>
<li><code>45.5.184.72</code> is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:</li>
</ul>
<pre tabindex="0"><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
</code></pre><ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”</li>
<li>They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  22077 /handle/10568/72970/discover
</code></pre><ul>
<li>Yesterday they made 43,000 requests and we actually blocked most of them:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
    142 200
  43489 503
</code></pre><ul>
<li>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</li>
<li>Maria from Bioversity recommended that we use the phrase “AGROVOC subject” instead of “Subject” in Listings and Reports
<ul>
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
</ul>
</li>
</ul>
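<ul>
<li>The nginx badbots block is essentially a user agent match that returns HTTP 503, which lines up with the 43,489 blocked requests above; a sketch of the idea (illustrative only, not our exact configuration — the map variable name and pattern here are made up):</li>
</ul>
<pre tabindex="0"><code># in the http {} context: flag matching user agents as bad
map $http_user_agent $is_badbot {
    default        0;
    ~*GuzzleHttp   1;
}

# in a server {} or location {} context: refuse them
if ($is_badbot) {
    return 503;
}
</code></pre>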
<h2 id="2019-04-07">2019-04-07</h2>
<ul>
<li>Looking into the impact of harvesters like <code>45.5.184.72</code>, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands <em>per day</em></li>
<li>Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
<ul>
<li>I am trying to see if these are registered as downloads in Solr or not</li>
<li>I see 96,925 downloads from their AWS gateway IPs in 2019-03:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-03&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>Strangely I don’t see many hits in 2019-04:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&amp;fq=statistics_type%3Aview&amp;fq=bundleName%3AORIGINAL&amp;fq=dateYearMonth%3A2019-04&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>Making some tests on GET vs HEAD requests on the <a href="https://dspacetest.cgiar.org/handle/10568/100289">CTA Spore 192 item</a> on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>And from the server side, the nginx logs show:</li>
</ul>
<pre tabindex="0"><code>78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
</code></pre><ul>
<li>So definitely the <em>size</em> of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
<ul>
<li>After twenty minutes of waiting I still don’t see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
</code></pre><ul>
<li>So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
<ul>
<li>Strangely, the statistics Solr core says it hasn’t been modified in 24 hours, so I tried to start the “optimize” process from the Admin UI and I see this in the Solr log:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
</code></pre><ul>
<li>Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are <code>statistics_type:view</code>… very weird
<ul>
<li>I don’t even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)</li>
<li>I will try to re-deploy the <code>5_x-dev</code> branch and test again</li>
</ul>
</li>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics">DSpace 5.x Solr documentation</a> the default commit time is after 15 minutes or 10,000 documents (see <code>solrconfig.xml</code>)</li>
<li>I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they <em>do</em> register as downloads (even though they are internal):</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&amp;fq=statistics_type%3Aview&amp;fq=isInternal%3Atrue&amp;rows=0&amp;wt=json&amp;indent=true'
{
    "response": {
        "docs": [],
        "numFound": 909,
        "start": 0
    },
    "responseHeader": {
        "QTime": 0,
        "params": {
            "fq": [
                "statistics_type:view",
                "isInternal:true"
            ],
            "indent": "true",
            "q": "type:0 AND time:2019-04-07*",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
</code></pre><ul>
<li>I confirmed the same on CGSpace itself after making one HEAD request</li>
<li>So I’m pretty sure it’s something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
<ul>
<li>I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace</li>
</ul>
</li>
<li>Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
<ul>
<li>This leads me to believe there is something specifically wrong with DSpace Test (linode19)</li>
<li>The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC</li>
</ul>
</li>
<li>Holy shit, all this is actually because of the GeoIP1 deprecation and a missing <code>GeoLiteCity.dat</code>
<ul>
<li>For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!</li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3986">DS-3986</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-4020">DS-4020</a></li>
<li>See: <a href="https://jira.duraspace.org/browse/DS-3832">DS-3832</a></li>
<li>DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been <em>removed</em> from MaxMind’s server as of 2018-04-01</li>
<li>Now I made 100 requests and I see them in the Solr statistics… fuck my life for wasting five hours debugging this</li>
</ul>
</li>
<li>UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check <code>iostat 1 10</code> and I saw that CPU steal is around 10–30 percent right now…</li>
<li>The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:</li>
</ul>
<pre tabindex="0"><code>$ cat /proc/loadavg
10.70 9.17 8.85 18/633 4198
</code></pre><ul>
<li>According to the server logs there is actually not much going on right now:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    118 18.195.78.144
    128 207.46.13.219
    129 167.114.64.100
    159 207.46.13.129
    179 207.46.13.33
    188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
    195 66.249.79.59
    363 40.77.167.21
    740 2a01:4f8:140:3192::2
   4823 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      3 66.249.79.62
      3 66.249.83.196
      4 207.46.13.86
      5 82.145.222.150
      6 2a01:4f9:2b:1263::2
      6 41.204.190.40
      7 35.174.176.49
     10 40.77.167.21
     11 194.246.119.6
     11 66.249.79.59
</code></pre><ul>
<li><code>45.5.184.72</code> is CIAT, who I already blocked and am waiting to hear from</li>
<li><code>2a01:4f8:140:3192::2</code> is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve</li>
<li><code>2408:8214:7a00:868f:7c1e:e0f3:20c6:c142</code> is some stupid Chinese bot making malicious POST requests</li>
<li>There are free database connections in the pool:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     23 dspaceWeb
</code></pre><ul>
<li>It seems that the issue with CGSpace being “down” is actually because of CPU steal again!!!</li>
<li>I opened a ticket with support and asked them to migrate the VM to a less busy host</li>
</ul>
<h2 id="2019-04-08">2019-04-08</h2>
<ul>
<li>Start checking IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
<ul>
<li>Lots of problems with affiliations; I had to correct about sixty of them</li>
<li>I used lein to run reconcile-csv, hosting the latest CSV of our affiliations for OpenRefine to reconcile against:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
</code></pre><ul>
<li>After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
<ul>
<li>The matched values can be accessed with <code>cell.recon.match.name</code>, but some of the new values don’t appear, perhaps because I edited the original cell values?</li>
<li>I ended up using this GREL expression to copy all values to a new column:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>if(cell.recon.matched, cell.recon.match.name, value)
</code></pre><ul>
<li>See the <a href="https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon">OpenRefine variables documentation</a> for more notes about the <code>recon</code> object</li>
<li>I also noticed a handful of errors in our current list of affiliations so I corrected them:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
</code></pre><ul>
<li>We should create a new list of affiliations to update our controlled vocabulary again</li>
<li>I dumped a list of the top 1500 affiliations:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
COPY 1500
</code></pre><ul>
<li>Fix a few more messed up affiliations that have carriage return characters in them (use Ctrl-V Ctrl-M to re-create the control character):</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
</code></pre><ul>
<li>I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
COPY 60
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
COPY 20
</code></pre><ul>
<li>I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
</code></pre><ul>
<li>UptimeRobot said that CGSpace (linode18) went down tonight
<ul>
<li>The load average is at <code>9.42, 8.87, 7.87</code></li>
<li>I looked at PostgreSQL and see shitloads of connections there:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
    250 dspaceWeb
</code></pre><ul>
<li>On a related note I see connection pool errors in the DSpace log:</li>
</ul>
<pre tabindex="0"><code>2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>But still I see 10 to 30% CPU steal in <code>iostat</code> that is also reflected in the Munin graphs:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week2.png" alt="CPU usage week"></p>
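<p>For reference, the steal number is the <code>%steal</code> column of the <code>avg-cpu</code> report from <code>iostat</code>; a minimal sketch of pulling it out with awk (the sample line here is made up for illustration, not real CGSpace data):</p>

```shell
# avg-cpu columns are: %user %nice %system %iowait %steal %idle
# Parse a captured sample line so the pipeline is testable without a hypervisor
sample='12.10  0.00  2.30  0.50  28.40  56.70'
steal=$(echo "$sample" | awk '{print $5}')
echo "steal: ${steal}%"
```

<p>The same awk invocation works on the live report, e.g. <code>iostat -c 1 5</code> piped through it, filtering for the numeric data lines.</p>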
<ul>
<li>Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of <code>iostat 1 10</code> and asked them to move the VM to a less busy host</li>
<li>The web server logs are not very busy:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    124 40.77.167.135
    135 95.108.181.88
    139 157.55.39.206
    190 66.249.79.133
    202 45.5.186.2
    284 207.46.13.95
    359 18.196.196.108
    457 157.55.39.164
    457 40.77.167.132
   3822 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      5 129.0.79.206
      5 41.205.240.21
      7 207.46.13.95
      7 66.249.79.133
      7 66.249.79.135
      7 95.108.181.88
      8 40.77.167.111
     19 157.55.39.164
     20 40.77.167.132
    370 51.254.16.223
</code></pre><h2 id="2019-04-09">2019-04-09</h2>
<ul>
<li>Linode sent an alert that CGSpace (linode18) was at 440% CPU for the last two hours this morning</li>
<li>Here are the top IPs in the web server logs around that time:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     18 66.249.79.139
     21 157.55.39.160
     29 66.249.79.137
     38 66.249.79.135
     50 34.200.212.137
     54 66.249.79.133
    100 102.128.190.18
   1166 45.5.184.72
   4251 45.5.186.2
   4895 205.186.128.185
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    200 144.48.242.108
    202 207.46.13.185
    206 18.194.46.84
    239 66.249.79.139
    246 18.196.196.108
    274 31.6.77.23
    289 66.249.79.137
    312 157.55.39.160
    441 66.249.79.135
    856 66.249.79.133
</code></pre><ul>
<li><code>45.5.186.2</code> is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
</code></pre><ul>
<li>Database connection usage looks fine:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      7 dspaceCli
     11 dspaceWeb
</code></pre><ul>
<li>Ironically I do still see some 2 to 10% of CPU steal in <code>iostat 1 10</code></li>
<li>Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
<ul>
<li>I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items</li>
</ul>
</li>
<li>In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night</li>
</ul>
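<p>The enumeration flow I suggested can be sketched in a few lines of Python. The endpoint paths follow the DSpace 5 REST API (<code>/rest/collections/{id}/items</code> with <code>limit</code>/<code>offset</code> paging), but the collection ID and item count here are hypothetical; the real ID comes from resolving the collection Handle via <code>/rest/handle/&lt;handle&gt;</code> first:</p>

```python
def item_page_urls(base, collection_id, total, limit=100):
    """Build the paginated /items request URLs for one collection."""
    return [
        '%s/rest/collections/%d/items?limit=%d&offset=%d' % (base, collection_id, limit, offset)
        for offset in range(0, total, limit)
    ]

# Hypothetical collection ID and size, for illustration only
urls = item_page_urls('https://cgspace.cgiar.org', 1445, total=250, limit=100)
for url in urls:
    print(url)  # each page would be fetched with an HTTP GET returning JSON
```

<p>Fetching each URL and concatenating the JSON arrays gives the full item list, which avoids the harvester hammering the XMLUI pages one item at a time.</p>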
<h2 id="2019-04-10">2019-04-10</h2>
<ul>
<li>Abenet pointed out the possibility of validating funders against the <a href="https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API">CrossRef API</a></li>
<li>Note that if you use HTTPS and specify a contact address in the API request you are less likely to be blocked</li>
</ul>
<pre tabindex="0"><code>$ http 'https://api.crossref.org/funders?query=mercator&amp;mailto=me@cgiar.org'
</code></pre><ul>
<li>Otherwise, they provide the funder data in <a href="https://www.crossref.org/services/funder-registry/">CSV and RDF formats</a></li>
<li>I did a quick test of the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to do some manual checking and informed decision making…</li>
<li>If I want to write a script for this I could use the Python <a href="https://habanero.readthedocs.io/en/latest/modules/crossref.html">habanero library</a>:</li>
</ul>
<pre tabindex="0"><code>from habanero import Crossref
cr = Crossref(mailto="me@cgiar.org")
x = cr.funders(query="mercator")
</code></pre><h2 id="2019-04-11">2019-04-11</h2>
<ul>
<li>Continue proofing IITA’s last round of batch uploads from <a href="https://dspacetest.cgiar.org/handle/10568/100333">March on DSpace Test</a> (20193rd.xls)
<ul>
<li>One misspelled country</li>
<li>Three incorrect regions</li>
<li>Potential duplicates (same DOI, similar title, same authors):
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100580">10568/100580</a> and <a href="https://dspacetest.cgiar.org/handle/10568/100579">10568/100579</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/100444">10568/100444</a> and <a href="https://dspacetest.cgiar.org/handle/10568/100423">10568/100423</a></li>
</ul>
</li>
<li>Two DOIs with incorrect URL formatting</li>
<li>Two misspelled IITA subjects</li>
<li>Two authors with smart quotes</li>
<li>One misspelled “Peer review”</li>
<li>One invalid ISBN that I fixed by Googling the title</li>
<li>Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc.)</li>
<li>I validated all the AGROVOC subjects against our latest list with reconcile-csv
<ul>
<li>About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually</li>
</ul>
</li>
</ul>
</li>
<li>I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
</code></pre><ul>
<li>Answer more questions about DOIs and Altmetric scores from WLE</li>
<li>Answer more questions about DOIs and Altmetric scores from IWMI
<ul>
<li>They can’t seem to understand the Altmetric + Twitter flow for associating Handles and DOIs</li>
<li>To make things worse, many of their items DON’T have DOIs, so when Altmetric harvests them of course there is no link!</li>
<li>Then, a bunch of their items don’t have scores because they never tweeted them!</li>
<li>They added a DOI to this old item <a href="https://cgspace.cgiar.org/handle/10568/97087">10568/97087</a> this morning and wonder why Altmetric’s score hasn’t linked with the DOI magically</li>
<li>We should check in a week to see if Altmetric will make the association when they harvest again</li>
</ul>
</li>
</ul>
<h2 id="2019-04-13">2019-04-13</h2>
<ul>
<li>I copied the <code>statistics</code> and <code>statistics-2018</code> Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing.png" alt="Java GC during Solr indexing with CMS"></p>
<ul>
<li>It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear “sawtooth” pattern in the garbage collection</li>
<li>I am curious whether the GC pattern would be different if I switched from <code>-XX:+UseConcMarkSweepGC</code> to G1GC</li>
<li>I switched to G1GC and restarted Tomcat, but for some reason I couldn’t see the Tomcat PID in VisualVM…
<ul>
<li>Anyways, the indexing process took much longer, perhaps twice as long!</li>
</ul>
</li>
<li>I tried again with the GC tuning settings from the Solr 4.10.4 release:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png" alt="Java GC during Solr indexing with Solr 4.10.4 settings"></p>
< h2 id = "2019-04-14" > 2019-04-14< / h2 >
2019-04-14 15:59:47 +02:00
< ul >
2021-08-18 15:24:40 +02:00
< li > Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > GC_TUNE=" -XX:NewRatio=3 \
2019-11-28 16:30:45 +01:00
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"
< / code > < / pre > < ul >
< li > I need to remember to check the Munin JVM graphs in a few days< / li >
< li > It might be placebo, but the site < em > does< / em > feel snappier… < / li >
2019-04-14 15:59:47 +02:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2019-04-15" > 2019-04-15< / h2 >
2019-04-15 11:58:07 +02:00
< ul >
< li > Rework the dspace-statistics-api to use the vanilla Python requests library instead of Solr client
< ul >
< li > < a href = "https://github.com/ilri/dspace-statistics-api/releases/tag/v1.0.0" > Tag version 1.0.0< / a > and deploy it on DSpace Test< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
2020-01-27 15:20:44 +01:00
< li > Pretty annoying to see CGSpace (linode18) with 20– 50% CPU steal according to < code > iostat 1 10< / code > , though I haven’ t had any Linode alerts in a few days< / li >
< li > Abenet sent me a list of ILRI items that don’ t have CRPs added to them
2019-04-15 16:42:53 +02:00
< ul >
2020-01-27 15:20:44 +01:00
< li > The spreadsheet only had Handles (no IDs), so I’ m experimenting with using Python in OpenRefine to get the IDs< / li >
2019-11-28 16:30:45 +01:00
< li > I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:< / li >
< / ul >
< / li >
< / ul >
<pre tabindex="0"><code>import json
import re
import urllib
import urllib2

# This runs in OpenRefine's Jython (Python 2) environment, where "value"
# is the current cell and the top-level "return" sets the new cell value
handle = re.findall('[0-9]+/[0-9]+', value)

url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
req = urllib2.Request(url)
req.add_header('User-agent', 'Alan Python bot')
res = urllib2.urlopen(req)
data = json.load(res)
item_id = data['id']

return item_id
</code></pre><ul>
<li>Luckily none of the items already had CRPs, so I didn’t have to worry about them getting removed
<ul>
<li>It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs…</li>
</ul>
</li>
<li>I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    82m45.324s
user    7m33.446s
sys     2m13.463s
</code></pre><h2 id="2019-04-16">2019-04-16</h2>
<ul>
<li>Export IITA’s community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something</li>
</ul>
< h2 id = "2019-04-17" > 2019-04-17< / h2 >
2019-04-17 12:38:50 +02:00
< ul >
< li > Reading an interesting < a href = "https://teaspoon-consulting.com/articles/solr-cache-tuning.html" > blog post about Solr caching< / a > < / li >
< li > Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (< code > statistics< / code > and < code > statistics-2018< / code > ) and monitored the memory usage of Tomcat in VisualVM< / li >
< li > 4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.11s user 0.44s system 0% cpu 13:45.07 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.04 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.23s user 0.43s system 0% cpu 13:46.10 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.06 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.23s user 0.42s system 0% cpu 13:14.70 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.13 GiB< / li >
< li > < code > filterCache< / code > size: 482, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 167903, < code > cumulative_hitratio< / code > : 0.02< / li >
< li > queryResultCache size: 2< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.92s user 0.39s system 0% cpu 12:33.08 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.16 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.10s user 0.39s system 0% cpu 12:25.32 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.07 GiB< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.29s user 0.36s system 0% cpu 11:53.47 total< / li >
< li > Tomcat (not Solr) max JVM heap usage: 2.08 GiB< / li >
< li > < code > filterCache< / code > size: 951, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 254379, < code > cumulative_hitratio< / code > : 0.04< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.90s user 0.48s system 0% cpu 10:37.31 total< / li >
< li > Tomcat max JVM heap usage: 1.96 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 2354237, < code > cumulative_hits< / code > : 180111, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.97s user 0.39s system 0% cpu 10:40.06 total< / li >
< li > Tomcat max JVM heap usage: 2.09 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 4708473, < code > cumulative_hits< / code > : 360068, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.28s user 0.37s system 0% cpu 10:49.56 total< / li >
< li > Tomcat max JVM heap usage: 2.05 GiB< / li >
< li > < code > filterCache< / code > size: 1901, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 540020, < code > cumulative_hitratio< / code > : 0.08< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > 4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards
< ul >
2019-11-28 16:30:45 +01:00
< li > Run 1:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 2.88s user 0.35s system 0% cpu 8:29.55 total< / li >
< li > Tomcat max JVM heap usage: 2.15 GiB< / li >
< li > < code > filterCache< / code > size: 3770, < code > cumulative_lookups< / code > : 2354237, < code > cumulative_hits< / code > : 414512, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 2:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.01s user 0.38s system 0% cpu 9:15.65 total< / li >
< li > Tomcat max JVM heap usage: 2.17 GiB< / li >
< li > < code > filterCache< / code > size: 3945, < code > cumulative_lookups< / code > : 4708473, < code > cumulative_hits< / code > : 829093, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< li > Run 3:
< ul >
2019-04-17 12:38:50 +02:00
< li > Time: 3.01s user 0.40s system 0% cpu 9:01.31 total< / li >
< li > Tomcat max JVM heap usage: 2.07 GiB< / li >
< li > < code > filterCache< / code > size: 3770, < code > cumulative_lookups< / code > : 7062712, < code > cumulative_hits< / code > : 1243632, < code > cumulative_hitratio< / code > : 0.18< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > The biggest takeaway I have is that this workload benefits from a larger < code > filterCache< / code > (for Solr fq parameter), but barely uses the < code > queryResultCache< / code > (for Solr q parameter) at all
< ul >
2020-01-27 15:20:44 +01:00
< li > The number of hits goes up and the time taken decreases when we increase the < code > filterCache< / code > , and total JVM heap memory doesn’ t seem to increase much at all< / li >
< li > I guess the < code > queryResultCache< / code > size is always 2 because I’ m only doing two queries: < code > type:0< / code > and < code > type:2< / code > (downloads and views, respectively)< / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
2019-04-17 12:38:50 +02:00
< li > Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:< / li >
< / ul >
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-4096-filterCache.png" alt="VisualVM Tomcat 4096 filterCache"></p>
<ul>
<li>I ran one test with a <code>filterCache</code> of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it <em>drastically</em> increased the performance and lowered the memory usage of the dspace-statistics-api indexer</li>
<li>4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
<ul>
<li>Run 1:
<ul>
<li>Time: 2.85s user 0.42s system 2% cpu 2:28.92 total</li>
<li>Tomcat max JVM heap usage: 1.90 GiB</li>
<li><code>filterCache</code> size: 14851, <code>cumulative_lookups</code>: 2354237, <code>cumulative_hits</code>: 2331186, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
<li>Run 2:
<ul>
<li>Time: 2.90s user 0.37s system 2% cpu 2:23.50 total</li>
<li>Tomcat max JVM heap usage: 1.27 GiB</li>
<li><code>filterCache</code> size: 15834, <code>cumulative_lookups</code>: 4708476, <code>cumulative_hits</code>: 4664762, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
<li>Run 3:
<ul>
<li>Time: 2.93s user 0.39s system 2% cpu 2:26.17 total</li>
<li>Tomcat max JVM heap usage: 1.05 GiB</li>
<li><code>filterCache</code> size: 15248, <code>cumulative_lookups</code>: 7062715, <code>cumulative_hits</code>: 6998267, <code>cumulative_hitratio</code>: 0.99</li>
</ul>
</li>
</ul>
</li>
<li>The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!</li>
</ul>
<p><img src="/cgspace-notes/2019/04/visualvm-solr-indexing-16384-filterCache.png" alt="VisualVM Tomcat 16384 filterCache"></p>
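<p>Deploying the larger cache is a one-line change in each statistics core's <code>solrconfig.xml</code>; a sketch for Solr 4.x, where the cache class and the <code>initialSize</code>/<code>autowarmCount</code> values here are illustrative rather than the exact ones I deployed:</p>

```xml
<!-- Larger filterCache for the statistics cores; size matches the test above,
     initialSize and autowarmCount are illustrative and should be reviewed -->
<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="4096"/>
```

<p>The core needs a reload (or Tomcat restart) for the new cache size to take effect.</p>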
<ul>
<li>I will deploy this <code>filterCache</code> setting on DSpace Test (linode19)</li>
<li>Run all system updates on DSpace Test (linode19) and reboot it</li>
<li>Lots of CPU steal still going on on CGSpace (linode18):</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week3.png" alt="CPU usage week"></p>
<h2 id="2019-04-18">2019-04-18</h2>
<ul>
<li>I’ve been trying to copy the <code>statistics-2018</code> Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
<ul>
<li>I opened a support ticket to ask Linode to investigate</li>
<li>They asked me to send an <code>mtr</code> report from Fremont to Frankfurt and vice versa</li>
</ul>
</li>
<li>Deploy Tomcat 7.0.94 on DSpace Test (linode19)
<ul>
<li>Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of how Ansible formatted the options string</li>
<li>I needed to use the “folded” YAML variable format <code>&gt;-</code> (with the dash so it doesn’t add a newline at the end)</li>
</ul>
</li>
<li>UptimeRobot says that CGSpace went “down” this afternoon, but I looked at the CPU steal with <code>iostat 1 10</code> and it’s in the 50s and 60s
<ul>
<li>The Munin graph shows a lot of CPU steal (red) currently (and overall during the week):</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week4.png" alt="CPU usage week"></p>
<ul>
<li>I opened a ticket with Linode to migrate us somewhere
<ul>
<li>They agreed to migrate us to a quieter host</li>
</ul>
</li>
</ul>
<h2 id="2019-04-20">2019-04-20</h2>
<ul>
<li>Linode agreed to move CGSpace (linode18) to a new machine shortly after I filed my ticket about CPU steal two days ago and now the load is much more sane:</li>
</ul>
<p><img src="/cgspace-notes/2019/04/cpu-week5.png" alt="CPU usage week"></p>
2019-04-20 10:48:26 +02:00
< ul >
< li > For future reference, Linode mentioned that they consider CPU steal above 8% to be significant< / li >
2019-11-28 16:30:45 +01:00
< li > Regarding the other Linode issue about speed, I did a test with < code > iperf< / code > between linode18 and linode19:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # iperf -s
2019-04-20 11:20:12 +02:00
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 45.79.x.x port 5001 connected with 139.162.x.x port 51378
------------------------------------------------------------
Client connecting to 139.162.x.x, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 5] local 45.79.x.x port 36440 connected with 139.162.x.x port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.2 sec 172 MBytes 142 Mbits/sec
[ 4] 0.0-10.5 sec 202 MBytes 162 Mbits/sec
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Even with the software firewalls disabled the rsync speed was low, so it’s not a rate limiting issue< / li >
< li > I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
< ul >
< li > I updated the Linode issue with this information< / li >
< / ul >
< / li >
< li > I’m going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode’s latest x86_64
< ul >
< li > Nope, still 20KiB/sec< / li >
< / ul >
< / li >
< / ul >
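A quick way to sanity-check single-stream HTTPS throughput from the client side is curl's `speed_download` write-out variable (a sketch; the bitstream URL below is a placeholder, not a real file):

```shell
# Print the average download speed (bytes/sec) of a single HTTPS transfer;
# the URL is a placeholder standing in for any large CGSpace bitstream
curl -s -o /dev/null -w '%{speed_download}\n' https://cgspace.cgiar.org/bitstream/example.pdf
```

A result around 20480 bytes/sec would confirm the 20KiB/sec cap independently of rsync.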
< h2 id = "2019-04-21" > 2019-04-21< / h2 >
< ul >
< li > Deploy Solr 4.10.4 on CGSpace (linode18)< / li >
< li > Deploy Tomcat 7.0.94 on CGSpace< / li >
< li > Deploy dspace-statistics-api v1.0.0 on CGSpace< / li >
< li > Linode support replicated the results I had from the network speed testing and said they don’t know why it’s so slow
< ul >
< li > They offered to live migrate the instance to another host to see if that helps< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-04-22" > 2019-04-22< / h2 >
< ul >
< li > Abenet pointed out < a href = "https://hdl.handle.net/10568/97912" > an item< / a > that doesn’t have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
< ul >
< li > I tweeted the Handle to see if it will pick it up… < / li >
< li > Like clockwork, after fifteen minutes there was a donut showing on CGSpace< / li >
< / ul >
< / li >
< li > I want to get rid of this annoying warning that is constantly in our DSpace logs:< / li >
< / ul >
< pre tabindex = "0" > < code > 2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
< / code > < / pre > < ul >
< li > Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):< / li >
< / ul >
< pre tabindex = "0" > < code > $ grep -c 'Falling back to request address' dspace.log.2019-04-20
dspace.log.2019-04-20:1515
< / code > < / pre > < ul >
< li > I will fix it in < code > dspace/config/modules/oai.cfg< / code > < / li >
< li > Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host
< ul >
< li > I told them to migrate it at 04:00:00AM Frankfurt time, when nobody in East Africa, Europe, or South America should be using the server< / li >
< / ul >
< / li >
< / ul >
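The oai.cfg fix mentioned above is presumably just a matter of defining the missing property explicitly (a sketch; the `dspace.baseUrl` interpolation is assumed from the main DSpace configuration):

```
# dspace/config/modules/oai.cfg
dspace.oai.url = ${dspace.baseUrl}/oai
```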
< h2 id = "2019-04-23" > 2019-04-23< / h2 >
< ul >
< li > One blog post says that there is < a href = "https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/" > no overprovisioning in Azure< / a > :< / li >
< / ul >
< ul >
< li > Perhaps that’s why the < a href = "https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/" > Azure pricing< / a > is so expensive!< / li >
< li > Add a privacy page to CGSpace
< ul >
< li > The work was mostly similar to the About page at < code > /page/about< / code > , but in addition to adding i18n strings etc, I had to add the logic for the trail to < code > dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl< / code > < / li >
< / ul >
< / li >
< / ul >
< h2 id = "2019-04-24" > 2019-04-24< / h2 >
< ul >
< li > Linode migrated CGSpace (linode18) to a new host, but I am still getting poor performance when copying data to DSpace Test (linode19)
< ul >
< li > I asked them if we can migrate DSpace Test to a new host< / li >
< li > They migrated DSpace Test to a new host and the rsync speed from Frankfurt was still capped at 20KiB/sec… < / li >
< li > I booted DSpace Test to a rescue CD and tried the rsync from CGSpace there too, but it was still capped at 20KiB/sec… < / li >
< li > I copied the 18GB < code > statistics-2018< / code > Solr core from Frankfurt to a Linode in London at 15MiB/sec, then from the London one to DSpace Test in Fremont at 15MiB/sec… so WTF is up with Frankfurt→Fremont?!< / li >
< / ul >
< / li >
< li > Finally upload the 218 IITA items from March to CGSpace
< ul >
< li > Abenet and I had to do a little bit more work to correct the metadata of one item that appeared to be a duplicate, but really just had the wrong DOI< / li >
< / ul >
< / li >
< li > While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (< code > dc.identifier.uri< / code > )
< ul >
< li > According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn’t… < / li >
< li > I exported the IITA community as a CSV then used < code > csvcut< / code > to extract the two URI columns and identify and fix the records:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
< / code > < / pre > < ul >
< li > Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
< ul >
< li > I told him we never finished it, and that he should try to use the < code > /items/find-by-metadata-field< / code > endpoint, with the caveat that you need to match the language attribute exactly (ie “en”, “en_US”, null, etc)< / li >
< li > I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)< / li >
< li > He says he’s getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
curl: (22) The requested URL returned error: 401
< / code > < / pre > < ul >
< li > Note that curl only shows the HTTP 401 error if you use < code > -f< / code > (fail), and only then if you < em > don’t< / em > include < code > -s< / code >
< ul >
< li > I see there are about 1,000 items using CPWF subject “WATER MANAGEMENT” in the database, so there should definitely be results< / li >
< li > The breakdown of < code > text_lang< / code > fields used in those 942 items is:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
count
-------
376
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
count
-------
149
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
count
-------
417
(1 row)
< / code > < / pre > < ul >
< li > I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:< / li >
< / ul >
< pre tabindex = "0" > < code > 2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
< / code > < / pre > < ul >
< li > Nevertheless, if I request using the < code > null< / code > language I get 1020 results, plus 179 for a blank language attribute:< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
1020
$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
179
< / code > < / pre > < ul >
< li > This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
< pre tabindex = "0" > < code > dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
count
-------
942
(1 row)
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
count
-------
1156
(1 row)
< / code > < / pre > < ul >
< li > I sent a message to the dspace-tech mailing list to ask for help< / li >
2019-04-24 17:49:55 +02:00
< / ul >
< h2 id = "2019-04-25" > 2019-04-25< / h2 >
< ul >
< li > Peter pointed out that we need to remove Delicious and Google+ from our social sharing links
< ul >
< li > Also, it would be nice if we could include the item title in the shared link< / li >
< li > I created an issue on GitHub to track this (< a href = "https://github.com/ilri/DSpace/issues/419" > #419< / a > )< / li >
< / ul >
< / li >
< li > I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:< / li >
< / ul >
< pre tabindex = "0" > < code > $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
$ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
$ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
< / code > < / pre > < ul >
< li > I created a normal user for Carlos to try as an unprivileged user:< / li >
< / ul >
< pre tabindex = "0" > < code > $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
< / code > < / pre > < ul >
< li > But still I get the HTTP 401 and I have no idea which item is causing it< / li >
< li > I enabled more verbose logging in < code > ItemsResource.java< / code > and now I can at least see the item ID that causes the failure…
< ul >
< li > The item is not even in the archive, but somehow it is discoverable< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
74648 | 113 | f | f | 2016-03-30 09:00:52.131+00 | | t
(1 row)
< / code > < / pre > < ul >
< li > I tried to delete the item in the web interface, and it seemed to succeed, but I can still access the item in the admin interface, and nothing changed in PostgreSQL< / li >
< li > Meet with CodeObia to see progress on AReS version 2< / li >
< li > Marissa Van Epp asked me to add a few new metadata values to their Phase II Project Tags field (cg.identifier.ccafsprojectpii)
< ul >
< li > I created a < a href = "https://github.com/ilri/DSpace/pull/420" > pull request< / a > for it and will do it the next time I run updates on CGSpace< / li >
< / ul >
< / li >
< li > Communicate with Carlos Tejo from the Land Portal about the < code > /items/find-by-metadata-value< / code > endpoint< / li >
< li > Run all system updates on DSpace Test (linode19) and reboot it< / li >
< / ul >
< h2 id = "2019-04-26" > 2019-04-26< / h2 >
< ul >
< li > Export a list of authors for Peter to look through:< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
COPY 65752
< / code > < / pre > < h2 id = "2019-04-28" > 2019-04-28< / h2 >
< ul >
< li > Still trying to figure out the issue with the items that cause the REST API’s < code > /items/find-by-metadata-value< / code > endpoint to throw an exception
< ul >
< li > I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT * FROM item WHERE item_id=74648;
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
2019-11-28 16:30:45 +01:00
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
2019-04-28 18:07:51 +02:00
(1 row)
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > And I tried the < code > curl< / code > command from above again, but I still get the HTTP 401 and and the same error in the DSpace log:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > 2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-04-13 16:24:05 +02:00
< li > I even tried to “ expunge” the item using an < a href = "https://wiki.lyrasis.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems" > action in CSV< / a > , and it said “ EXPUNGED!” but the item is still there… < / li >
2019-04-28 18:07:51 +02:00
< / ul >
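For reference, the batch metadata editing CSV for that attempt looked roughly like this (a sketch; expunging also has to be enabled in the bulk edit module configuration, otherwise the action is ignored):

```
id,collection,action
74648,,expunge
```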
< h2 id = "2019-04-30" > 2019-04-30< / h2 >
< ul >
< li > Send mail to the dspace-tech mailing list to ask about the item expunge issue< / li >
< li > Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:< / li >
< / ul >
< pre tabindex = "0" > < code > $ podman rm -f dspacedb
$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
< / code > < / pre > < ul >
< li > Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV
< ul >
< li > In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
text_lang | count
-----------+---------
| 358647
* | 11
E. | 1
en | 1635
en_US | 602312
es | 12
es_ES | 2
ethnob | 1
fr | 2
spa | 2
| 1074345
(11 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
UPDATE 360295
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
UPDATE 1074345
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
UPDATE 14
< / code > < / pre > < ul >
< li > Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos< / li >
< li > In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields… we should check, for example:
< ul >
< li > issue dates< / li >
< li > items missing handles< / li >
< li > authorship types< / li >
< / ul >
< / li >
< / ul >
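Those checks could be scripted against the exported CSV; a minimal sketch, assuming the column names `dc.date.issued` and `dc.identifier.uri` from CGSpace's metadata schema (the sample rows below are hypothetical):

```python
import csv
import re
from io import StringIO

def check_rows(rows):
    """Return ids with suspect issue dates and ids with missing handles."""
    bad_dates, missing_handles = [], []
    # accept ISO-ish dates: YYYY, YYYY-MM, or YYYY-MM-DD
    date_re = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
    for row in rows:
        issued = row.get("dc.date.issued", "")
        if issued and not date_re.match(issued):
            bad_dates.append(row["id"])
        if not row.get("dc.identifier.uri"):
            missing_handles.append(row["id"])
    return bad_dates, missing_handles

# hypothetical sample rows standing in for the real repository export
sample = StringIO(
    "id,dc.date.issued,dc.identifier.uri\n"
    "1,2019-04-30,https://hdl.handle.net/10568/12345\n"
    "2,30/04/2019,https://hdl.handle.net/10568/12346\n"
    "3,2019,\n"
)
print(check_rows(csv.DictReader(sample)))  # (['2'], ['3'])
```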
< / article >
< / div > <!-- /.blog - main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2022-03/" > March, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-02/" > February, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-01/" > January, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2021-12/" > December, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-11/" > November, 2021< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >