mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-18 19:22:18 +01:00
1062 lines
52 KiB
Markdown
1062 lines
52 KiB
Markdown
---
|
||
title: "April, 2019"
|
||
date: 2019-04-01T09:00:43+03:00
|
||
author: "Alan Orth"
|
||
categories: ["Notes"]
|
||
---
|
||
|
||
## 2019-04-01
|
||
|
||
- Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
|
||
- They asked if we had plans to enable RDF support in CGSpace
|
||
- There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
|
||
- I suspected that some might not be successful, because the stats show less, but today they were all HTTP 200!
|
||
|
||
```
|
||
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
|
||
4432 200
|
||
```
|
||
|
||
- In the last two weeks there have been 47,000 downloads of this *same exact PDF* by these three IP addresses
|
||
- Apply country and region corrections and deletions on DSpace Test and CGSpace:
|
||
|
||
```
|
||
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
|
||
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
|
||
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
|
||
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
|
||
```
|
||
|
||
<!--more-->
|
||
|
||
## 2019-04-02
|
||
|
||
- CTA says the Amazon IPs are AWS gateways for real user traffic
|
||
- I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page
|
||
- If I searched for "Felix" or "Shaw" I saw other matches, included one for his personal email address!
|
||
- I ended up finding him via searching for his email address
|
||
|
||
## 2019-04-03
|
||
|
||
- Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
|
||
- First I need to extract the ones that are unique from their list compared to our existing one:
|
||
|
||
```
|
||
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
|
||
```
|
||
|
||
- We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
|
||
- Next I will resolve all their names using my `resolve-orcids.py` script:
|
||
|
||
```
|
||
$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
|
||
```
|
||
|
||
- After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
|
||
- One user's name has changed so I will update those using my `fix-metadata-values.py` script:
|
||
|
||
```
|
||
$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
|
||
```
|
||
|
||
- I created a pull request and merged the changes to the `5_x-prod` branch ([#417](https://github.com/ilri/DSpace/pull/417))
|
||
- A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it's still going:
|
||
|
||
```
|
||
2019-04-03 16:34:02,262 INFO org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
|
||
```
|
||
|
||
- Interestingly, there are 5666 occurences, and they are mostly for the 2018 core:
|
||
|
||
```
|
||
$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
|
||
1
|
||
3 http://localhost:8081/solr//statistics-2017
|
||
5662 http://localhost:8081/solr//statistics-2018
|
||
```
|
||
|
||
- I will have to keep an eye on it because nothing should be updating 2018 stats in 2019...
|
||
|
||
## 2019-04-05
|
||
|
||
- Uptime Robot reported that CGSpace (linode18) went down tonight
|
||
- I see there are lots of PostgreSQL connections:
|
||
|
||
```
|
||
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
||
5 dspaceApi
|
||
10 dspaceCli
|
||
250 dspaceWeb
|
||
```
|
||
|
||
- I still see those weird messages about updating the statistics-2018 Solr core:
|
||
|
||
```
|
||
2019-04-05 21:06:53,770 INFO org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
|
||
```
|
||
|
||
- Looking at `iostat 1 10` I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:
|
||
|
||
![CPU usage week](/cgspace-notes/2019/04/cpu-week.png)
|
||
|
||
- The other thing visible there is that the past few days the load has spiked to 500% and I don't think it's a coincidence that the Solr updating thing is happening...
|
||
- I ran all system updates and rebooted the server
|
||
- The load was lower on the server after reboot, but Solr didn't come back up properly according to the Solr Admin UI:
|
||
|
||
```
|
||
statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
|
||
```
|
||
|
||
- I restarted it again and all the Solr cores came up properly...
|
||
|
||
## 2019-04-06
|
||
|
||
- Udana asked why item [10568/91278](https://cgspace.cgiar.org/handle/10568/91278) didn't have an Altmetric badge on CGSpace, but on the [WLE website](https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity) it does
|
||
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
|
||
- I tweeted the item and I assume this will link the Handle with the DOI in the system
|
||
- Twenty minutes later I see the same Altmetric score (9) on CGSpace
|
||
- Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
|
||
|
||
```
|
||
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
222 18.195.78.144
|
||
245 207.46.13.58
|
||
303 207.46.13.194
|
||
328 66.249.79.33
|
||
564 207.46.13.210
|
||
566 66.249.79.62
|
||
575 40.77.167.66
|
||
1803 66.249.79.59
|
||
2834 2a01:4f8:140:3192::2
|
||
9623 45.5.184.72
|
||
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
31 66.249.79.62
|
||
41 207.46.13.210
|
||
42 40.77.167.66
|
||
54 42.113.50.219
|
||
132 66.249.79.59
|
||
785 2001:41d0:d:1990::
|
||
1164 45.5.184.72
|
||
2014 50.116.102.77
|
||
4267 45.5.186.2
|
||
4893 205.186.128.185
|
||
```
|
||
|
||
- `45.5.184.72` is in Colombia so it's probably CIAT, and I see they are indeed trying to get crawl the Discover pages on CIAT's datasets collection:
|
||
|
||
```
|
||
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
|
||
```
|
||
|
||
- Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1"
|
||
- They made 22,000 requests to Discover on this collection today alone (and it's only 11AM):
|
||
|
||
```
|
||
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
|
||
22077 /handle/10568/72970/discover
|
||
```
|
||
|
||
- Yesterday they made 43,000 requests and we actually blocked most of them:
|
||
|
||
```
|
||
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
|
||
43631 /handle/10568/72970/discover
|
||
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
|
||
142 200
|
||
43489 503
|
||
```
|
||
|
||
- I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
|
||
- Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports
|
||
- I made a pull request to update this and merged it to the `5_x-prod` branch ([#418](https://github.com/ilri/DSpace/pull/418))
|
||
|
||
## 2019-04-07
|
||
|
||
- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot so it definitely impacts the usage stats by some tens of thousands *per day*
|
||
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for bitstreams
|
||
- I am trying to see if these are registered as downloads in Solr or not
|
||
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:
|
||
|
||
```
|
||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
|
||
{
|
||
"response": {
|
||
"docs": [],
|
||
"numFound": 96925,
|
||
"start": 0
|
||
},
|
||
"responseHeader": {
|
||
"QTime": 1,
|
||
"params": {
|
||
"fq": [
|
||
"statistics_type:view",
|
||
"bundleName:ORIGINAL",
|
||
"dateYearMonth:2019-03"
|
||
],
|
||
"indent": "true",
|
||
"q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
|
||
"rows": "0",
|
||
"wt": "json"
|
||
},
|
||
"status": 0
|
||
}
|
||
}
|
||
```
|
||
|
||
- Strangely I don't see many hits in 2019-04:
|
||
|
||
```
|
||
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
|
||
{
|
||
"response": {
|
||
"docs": [],
|
||
"numFound": 38,
|
||
"start": 0
|
||
},
|
||
"responseHeader": {
|
||
"QTime": 1,
|
||
"params": {
|
||
"fq": [
|
||
"statistics_type:view",
|
||
"bundleName:ORIGINAL",
|
||
"dateYearMonth:2019-04"
|
||
],
|
||
"indent": "true",
|
||
"q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
|
||
"rows": "0",
|
||
"wt": "json"
|
||
},
|
||
"status": 0
|
||
}
|
||
}
|
||
```
|
||
|
||
- Making some tests on GET vs HEAD requests on the [CTA Spore 192 item](https://dspacetest.cgiar.org/handle/10568/100289) on DSpace Test:
|
||
|
||
```
|
||
$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
|
||
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
|
||
Accept: */*
|
||
Accept-Encoding: gzip, deflate
|
||
Connection: keep-alive
|
||
Host: dspacetest.cgiar.org
|
||
User-Agent: HTTPie/1.0.2
|
||
|
||
HTTP/1.1 200 OK
|
||
Connection: keep-alive
|
||
Content-Language: en-US
|
||
Content-Length: 2069158
|
||
Content-Type: application/pdf;charset=ISO-8859-1
|
||
Date: Sun, 07 Apr 2019 08:38:34 GMT
|
||
Expires: Sun, 07 Apr 2019 09:38:34 GMT
|
||
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
|
||
Server: nginx
|
||
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
|
||
Vary: User-Agent
|
||
X-Cocoon-Version: 2.2.0
|
||
X-Content-Type-Options: nosniff
|
||
X-Frame-Options: SAMEORIGIN
|
||
X-Robots-Tag: none
|
||
X-XSS-Protection: 1; mode=block
|
||
|
||
$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
|
||
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
|
||
Accept: */*
|
||
Accept-Encoding: gzip, deflate
|
||
Connection: keep-alive
|
||
Host: dspacetest.cgiar.org
|
||
User-Agent: HTTPie/1.0.2
|
||
|
||
HTTP/1.1 200 OK
|
||
Connection: keep-alive
|
||
Content-Language: en-US
|
||
Content-Length: 2069158
|
||
Content-Type: application/pdf;charset=ISO-8859-1
|
||
Date: Sun, 07 Apr 2019 08:39:01 GMT
|
||
Expires: Sun, 07 Apr 2019 09:39:01 GMT
|
||
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
|
||
Server: nginx
|
||
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
|
||
Vary: User-Agent
|
||
X-Cocoon-Version: 2.2.0
|
||
X-Content-Type-Options: nosniff
|
||
X-Frame-Options: SAMEORIGIN
|
||
X-Robots-Tag: none
|
||
X-XSS-Protection: 1; mode=block
|
||
```
|
||
|
||
- And from the server side, the nginx logs show:
|
||
|
||
```
|
||
78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
|
||
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
|
||
```
|
||
|
||
- So definitely the *size* of the transfer is more efficient with a HEAD, but I need to wait to see if these requests show up in Solr
|
||
- After twenty minutes of waiting I still don't see any new requests in the statistics core, but when I try the requests from the command line again I see the following in the DSpace log:
|
||
|
||
```
|
||
2019-04-07 02:05:30,966 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||
2019-04-07 02:05:39,265 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
|
||
```
|
||
|
||
- So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned
|
||
- Strangely, the statistics Solr core says it hasn't been modified in 24 hours, so I tried to start the "optimize" process from the Admin UI and I see this in the Solr log:
|
||
|
||
```
|
||
2019-04-07 02:08:44,186 INFO org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|
||
```
|
||
|
||
- Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are `statistics_type:view`... very weird
|
||
- I don't even see many hits for days after 2019-03-17, when I migrated the server to Ubuntu 18.04 and copied the statistics core from CGSpace (linode18)
|
||
- I will try to re-deploy the `5_x-dev` branch and test again
|
||
- According to the [DSpace 5.x Solr documentation](https://wiki.lyrasis.org/display/DSDOC5x/SOLR+Statistics) the default commit time is after 15 minutes or 10,000 documents (see `solrconfig.xml`)
|
||
- I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they *do* register as downloads (even though they are internal):
|
||
|
||
```
|
||
$ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
|
||
{
|
||
"response": {
|
||
"docs": [],
|
||
"numFound": 909,
|
||
"start": 0
|
||
},
|
||
"responseHeader": {
|
||
"QTime": 0,
|
||
"params": {
|
||
"fq": [
|
||
"statistics_type:view",
|
||
"isInternal:true"
|
||
],
|
||
"indent": "true",
|
||
"q": "type:0 AND time:2019-04-07*",
|
||
"rows": "0",
|
||
"wt": "json"
|
||
},
|
||
"status": 0
|
||
}
|
||
}
|
||
```
|
||
|
||
- I confirmed the same on CGSpace itself after making one HEAD request
|
||
- So I'm pretty sure it's something about DSpace Test using the CGSpace statistics core, and not that I deployed Solr 4.10.4 there last week
|
||
- I deployed Solr 4.10.4 locally and ran a bunch of requests for bitstreams and they do show up in the Solr statistics log, so the issue must be with re-using the existing Solr core from CGSpace
|
||
- Now this gets more frustrating: I did the same GET and HEAD tests on a local Ubuntu 16.04 VM with Solr 4.10.2 and 4.10.4 and the statistics are recorded
|
||
- This leads me to believe there is something specifically wrong with DSpace Test (linode19)
|
||
- The only thing I can think of is that the JVM is using G1GC instead of ConcMarkSweepGC
|
||
- Holy shit, all this is actually because of the GeoIP1 deprecation and a missing `GeoLiteCity.dat`
|
||
- For some reason the missing GeoIP data causes stats to not be recorded whatsoever and there is no error!
|
||
- See: [DS-3986](https://jira.duraspace.org/browse/DS-3986)
|
||
- See: [DS-4020](https://jira.duraspace.org/browse/DS-4020)
|
||
- See: [DS-3832](https://jira.duraspace.org/browse/DS-3832)
|
||
- DSpace 5.10 upgraded to use GeoIP2, but we are on 5.8 so I just copied the missing database file from another server because it has been *removed* from MaxMind's server as of 2018-04-01
|
||
- Now I made 100 requests and I see them in the Solr statistics... fuck my life for wasting five hours debugging this
|
||
- UptimeRobot said CGSpace went down and up a few times tonight, and my first instict was to check `iostat 1 10` and I saw that CPU steal is around 10–30 percent right now...
|
||
- The load average is super high right now, as I've noticed the last few times UptimeRobot said that CGSpace went down:
|
||
|
||
```
|
||
$ cat /proc/loadavg
|
||
10.70 9.17 8.85 18/633 4198
|
||
```
|
||
|
||
- According to the server logs there is actually not much going on right now:
|
||
|
||
```
|
||
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
118 18.195.78.144
|
||
128 207.46.13.219
|
||
129 167.114.64.100
|
||
159 207.46.13.129
|
||
179 207.46.13.33
|
||
188 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142
|
||
195 66.249.79.59
|
||
363 40.77.167.21
|
||
740 2a01:4f8:140:3192::2
|
||
4823 45.5.184.72
|
||
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
3 66.249.79.62
|
||
3 66.249.83.196
|
||
4 207.46.13.86
|
||
5 82.145.222.150
|
||
6 2a01:4f9:2b:1263::2
|
||
6 41.204.190.40
|
||
7 35.174.176.49
|
||
10 40.77.167.21
|
||
11 194.246.119.6
|
||
11 66.249.79.59
|
||
```
|
||
|
||
- `45.5.184.72` is CIAT, who I already blocked and am waiting to hear from
|
||
- `2a01:4f8:140:3192::2` is BLEXbot, which should be handled by the Tomcat Crawler Session Manager Valve
|
||
- `2408:8214:7a00:868f:7c1e:e0f3:20c6:c142` is some stupid Chinese bot making malicious POST requests
|
||
- There are free database connections in the pool:
|
||
|
||
```
|
||
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
||
5 dspaceApi
|
||
7 dspaceCli
|
||
23 dspaceWeb
|
||
```
|
||
|
||
- It seems that the issue with CGSpace being "down" is actually because of CPU steal again!!!
|
||
- I opened a ticket with support and asked them to migrate the VM to a less busy host
|
||
|
||
## 2019-04-08
|
||
|
||
- Start checking IITA's last round of batch uploads from [March on DSpace Test](https://dspacetest.cgiar.org/handle/10568/100333) (20193rd.xls)
|
||
- Lots of problems with affiliations, I had to correct about sixty of them
|
||
- I used lein to host the latest CSV of our affiliations for OpenRefine to reconcile against:
|
||
|
||
```
|
||
$ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
|
||
```
|
||
|
||
- After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
|
||
- The matched values can be accessed with `cell.recon.match.name`, but some of the new values don't appear, perhaps because I edited the original cell values?
|
||
- I ended up using this GREL expression to copy all values to a new column:
|
||
|
||
```
|
||
if(cell.recon.matched, cell.recon.match.name, value)
|
||
```
|
||
|
||
- See the [OpenRefine variables documentation](https://github.com/OpenRefine/OpenRefine/wiki/Variables#recon) for more notes about the `recon` object
|
||
- I also noticed a handful of errors in our current list of affiliations so I corrected them:
|
||
|
||
```
|
||
$ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
|
||
```
|
||
|
||
- We should create a new list of affiliations to update our controlled vocabulary again
|
||
- I dumped a list of the top 1500 affiliations:
|
||
|
||
```
|
||
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
|
||
COPY 1500
|
||
```
|
||
|
||
- Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):
|
||
|
||
```
|
||
dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
|
||
dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural and Livestock Research^M%';
|
||
```
|
||
|
||
- I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:
|
||
|
||
```
|
||
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
|
||
COPY 60
|
||
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
|
||
COPY 20
|
||
```
|
||
|
||
- I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:
|
||
|
||
```
|
||
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
|
||
$ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
|
||
```
|
||
|
||
- UptimeRobot said that CGSpace (linode18) went down tonight
|
||
- The load average is at `9.42, 8.87, 7.87`
|
||
- I looked at PostgreSQL and see shitloads of connections there:
|
||
|
||
```
|
||
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
||
5 dspaceApi
|
||
7 dspaceCli
|
||
250 dspaceWeb
|
||
```
|
||
|
||
- On a related note I see connection pool errors in the DSpace log:
|
||
|
||
```
|
||
2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
|
||
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
|
||
```
|
||
|
||
- But still I see 10 to 30% CPU steal in `iostat` that is also reflected in the Munin graphs:
|
||
|
||
![CPU usage week](/cgspace-notes/2019/04/cpu-week2.png)
|
||
|
||
- Linode Support still didn't respond to my ticket from yesterday, so I attached a new output of `iostat 1 10` and asked them to move the VM to a less busy host
|
||
- The web server logs are not very busy:
|
||
|
||
```
|
||
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
124 40.77.167.135
|
||
135 95.108.181.88
|
||
139 157.55.39.206
|
||
190 66.249.79.133
|
||
202 45.5.186.2
|
||
284 207.46.13.95
|
||
359 18.196.196.108
|
||
457 157.55.39.164
|
||
457 40.77.167.132
|
||
3822 45.5.184.72
|
||
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
5 129.0.79.206
|
||
5 41.205.240.21
|
||
7 207.46.13.95
|
||
7 66.249.79.133
|
||
7 66.249.79.135
|
||
7 95.108.181.88
|
||
8 40.77.167.111
|
||
19 157.55.39.164
|
||
20 40.77.167.132
|
||
370 51.254.16.223
|
||
```
|
||
|
||
## 2019-04-09
|
||
|
||
- Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning
|
||
- Here are the top IPs in the web server logs around that time:
|
||
|
||
```
|
||
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
18 66.249.79.139
|
||
21 157.55.39.160
|
||
29 66.249.79.137
|
||
38 66.249.79.135
|
||
50 34.200.212.137
|
||
54 66.249.79.133
|
||
100 102.128.190.18
|
||
1166 45.5.184.72
|
||
4251 45.5.186.2
|
||
4895 205.186.128.185
|
||
# zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
|
||
200 144.48.242.108
|
||
202 207.46.13.185
|
||
206 18.194.46.84
|
||
239 66.249.79.139
|
||
246 18.196.196.108
|
||
274 31.6.77.23
|
||
289 66.249.79.137
|
||
312 157.55.39.160
|
||
441 66.249.79.135
|
||
856 66.249.79.133
|
||
```
|
||
|
||
- `45.5.186.2` is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:
|
||
|
||
```
|
||
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
|
||
```
|
||
|
||
- Database connection usage looks fine:
|
||
|
||
```
|
||
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
|
||
5 dspaceApi
|
||
7 dspaceCli
|
||
11 dspaceWeb
|
||
```
|
||
|
||
- Ironically I do still see some 2 to 10% of CPU steal in `iostat 1 10`
|
||
- Leroy from CIAT contacted me to say he knows the team who is making all those requests to CGSpace
|
||
- I told them how to use the REST API to get the CIAT Datasets collection and enumerate its items
|
||
- In other news, Linode staff identified a noisy neighbor sharing our host and migrated it elsewhere last night
|
||
|
||
## 2019-04-10
|
||
|
||
- Abenet pointed out a possibility of validating funders against the [CrossRef API](https://support.crossref.org/hc/en-us/articles/215788143-Funder-data-via-the-API)
|
||
- Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked
|
||
|
||
```
|
||
$ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
|
||
```
|
||
|
||
- Otherwise, they provide the funder data in [CSV and RDF format](https://www.crossref.org/services/funder-registry/)
|
||
- I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn't match will need a human to go and do some manual checking and informed decision making...
|
||
- If I want to write a script for this I could use the Python [habanero library](https://habanero.readthedocs.io/en/latest/modules/crossref.html):
|
||
|
||
```
|
||
from habanero import Crossref
|
||
cr = Crossref(mailto="me@cgiar.org")
|
||
x = cr.funders(query = "mercator")
|
||
```
|
||
|
||
## 2019-04-11
|
||
|
||
- Continue proofing IITA's last round of batch uploads from [March on DSpace Test](https://dspacetest.cgiar.org/handle/10568/100333) (20193rd.xls)
|
||
- One misspelled country
|
||
- Three incorrect regions
|
||
- Potential duplicates (same DOI, similar title, same authors):
|
||
- [10568/100580](https://dspacetest.cgiar.org/handle/10568/100580) and [10568/100579](https://dspacetest.cgiar.org/handle/10568/100579)
|
||
- [10568/100444](https://dspacetest.cgiar.org/handle/10568/100444) and [10568/100423](https://dspacetest.cgiar.org/handle/10568/100423)
|
||
- Two DOIs with incorrect URL formatting
|
||
- Two misspelled IITA subjects
|
||
- Two authors with smart quotes
|
||
- Lots of issues with sponsors
|
||
- One misspelled "Peer review"
|
||
- One invalid ISBN that I fixed by Googling the title
|
||
- Lots of issues with sponsors (German Aid Agency, Swiss Aid Agency, Italian Aid Agency, Dutch Aid Agency, etc)
|
||
- I validated all the AGROVOC subjects against our latest list with reconcile-csv
|
||
- About 720 of the 900 terms were matched, then I checked and fixed or deleted the rest manually
|
||
- I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA's records, so I applied them to DSpace Test and CGSpace:
|
||
|
||
```
|
||
$ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
|
||
$ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
|
||
```
|
||
|
||
- Answer more questions about DOIs and Altmetric scores from WLE
|
||
- Answer more questions about DOIs and Altmetric scores from IWMI
|
||
- They can't seem to understand the Altmetric + Twitter flow for associating Handles and DOIs
|
||
- To make things worse, many of their items DON'T have DOIs, so when Altmetric harvests them of course there is no link! - Then, a bunch of their items don't have scores because they never tweeted them!
|
||
- They added a DOI to this old item [10567/97087](https://cgspace.cgiar.org/handle/10568/97087) this morning and wonder why Altmetric's score hasn't linked with the DOI magically
|
||
- We should check in a week to see if Altmetric will make the association after one week when they harvest again
|
||
|
||
## 2019-04-13
|
||
|
||
- I copied the `statistics` and `statistics-2018` Solr cores from CGSpace to my local machine and watched the Java process in VisualVM while indexing item views and downloads with my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api):
|
||
|
||
![Java GC during Solr indexing with CMS](/cgspace-notes/2019/04/visualvm-solr-indexing.png)
|
||
|
||
- It took about eight minutes to index 784 pages of item views and 268 of downloads, and you can see a clear "sawtooth" pattern in the garbage collection
|
||
- I am curious if the GC pattern would be different if I switched from the `-XX:+UseConcMarkSweepGC` to G1GC
|
||
- I switched to G1GC and restarted Tomcat but for some reason I couldn't see the Tomcat PID in VisualVM...
|
||
- Anyways, the indexing process took much longer, perhaps twice as long!
|
||
- I tried again with the GC tuning settings from the Solr 4.10.4 release:
|
||
|
||
![Java GC during Solr indexing Solr 4.10.4 settings](/cgspace-notes/2019/04/visualvm-solr-indexing-solr-settings.png)
|
||
|
||
## 2019-04-14
|
||
|
||
- Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:
|
||
|
||
```
|
||
GC_TUNE="-XX:NewRatio=3 \
|
||
-XX:SurvivorRatio=4 \
|
||
-XX:TargetSurvivorRatio=90 \
|
||
-XX:MaxTenuringThreshold=8 \
|
||
-XX:+UseConcMarkSweepGC \
|
||
-XX:+UseParNewGC \
|
||
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
|
||
-XX:+CMSScavengeBeforeRemark \
|
||
-XX:PretenureSizeThreshold=64m \
|
||
-XX:+UseCMSInitiatingOccupancyOnly \
|
||
-XX:CMSInitiatingOccupancyFraction=50 \
|
||
-XX:CMSMaxAbortablePrecleanTime=6000 \
|
||
-XX:+CMSParallelRemarkEnabled \
|
||
-XX:+ParallelRefProcEnabled"
|
||
```
|
||
|
||
- I need to remember to check the Munin JVM graphs in a few days
|
||
- It might be placebo, but the site *does* feel snappier...
|
||
|
||
## 2019-04-15
|
||
|
||
- Rework the dspace-statistics-api to use the vanilla Python requests library instead of Solr client
|
||
- [Tag version 1.0.0](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.0.0) and deploy it on DSpace Test
|
||
- Pretty annoying to see CGSpace (linode18) with 20–50% CPU steal according to `iostat 1 10`, though I haven't had any Linode alerts in a few days
|
||
- Abenet sent me a list of ILRI items that don't have CRPs added to them
|
||
- The spreadsheet only had Handles (no IDs), so I'm experimenting with using Python in OpenRefine to get the IDs
|
||
- I cloned the handle column and then did a transform to get the IDs from the CGSpace REST API:
|
||
|
||
```
|
||
import json
|
||
import re
|
||
import urllib
|
||
import urllib2
|
||
|
||
handle = re.findall('[0-9]+/[0-9]+', value)
|
||
|
||
url = 'https://cgspace.cgiar.org/rest/handle/' + handle[0]
|
||
req = urllib2.Request(url)
|
||
req.add_header('User-agent', 'Alan Python bot')
|
||
res = urllib2.urlopen(req)
|
||
data = json.load(res)
|
||
item_id = data['id']
|
||
|
||
return item_id
|
||
```
|
||
|
||
- Luckily none of the items already had CRPs, so I didn't have to worry about them getting removed
|
||
- It would have been much trickier if I had to get the CRPs for the items first, then add the CRPs...
|
||
- I ran a full Discovery indexing on CGSpace because I didn't do it after all the metadata updates last week:
|
||
|
||
```
|
||
$ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
||
|
||
real 82m45.324s
|
||
user 7m33.446s
|
||
sys 2m13.463s
|
||
```
|
||
|
||
## 2019-04-16
|
||
|
||
- Export IITA's community from CGSpace because they want to experiment with importing it into their internal DSpace for some testing or something
|
||
|
||
## 2019-04-17
|
||
|
||
- Reading an interesting [blog post about Solr caching](https://teaspoon-consulting.com/articles/solr-cache-tuning.html)
|
||
- Did some tests of the dspace-statistics-api on my local DSpace instance with 28 million documents in a sharded statistics core (`statistics` and `statistics-2018`) and monitored the memory usage of Tomcat in VisualVM
|
||
- 4GB heap, CMS GC, 512 filter cache, 512 query cache, with 28 million documents in two shards
|
||
- Run 1:
|
||
- Time: 3.11s user 0.44s system 0% cpu 13:45.07 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.04 GiB
|
||
- Run 2:
|
||
- Time: 3.23s user 0.43s system 0% cpu 13:46.10 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.06 GiB
|
||
- Run 3:
|
||
- Time: 3.23s user 0.42s system 0% cpu 13:14.70 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.13 GiB
|
||
- `filterCache` size: 482, `cumulative_lookups`: 7062712, `cumulative_hits`: 167903, `cumulative_hitratio`: 0.02
|
||
- queryResultCache size: 2
|
||
- 4GB heap, CMS GC, 1024 filter cache, 512 query cache, with 28 million documents in two shards
|
||
- Run 1:
|
||
- Time: 2.92s user 0.39s system 0% cpu 12:33.08 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.16 GiB
|
||
- Run 2:
|
||
- Time: 3.10s user 0.39s system 0% cpu 12:25.32 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.07 GiB
|
||
- Run 3:
|
||
- Time: 3.29s user 0.36s system 0% cpu 11:53.47 total
|
||
- Tomcat (not Solr) max JVM heap usage: 2.08 GiB
|
||
- `filterCache` size: 951, `cumulative_lookups`: 7062712, `cumulative_hits`: 254379, `cumulative_hitratio`: 0.04
|
||
- 4GB heap, CMS GC, 2048 filter cache, 512 query cache, with 28 million documents in two shards
|
||
- Run 1:
|
||
- Time: 2.90s user 0.48s system 0% cpu 10:37.31 total
|
||
- Tomcat max JVM heap usage: 1.96 GiB
|
||
- `filterCache` size: 1901, `cumulative_lookups`: 2354237, `cumulative_hits`: 180111, `cumulative_hitratio`: 0.08
|
||
- Run 2:
|
||
- Time: 2.97s user 0.39s system 0% cpu 10:40.06 total
|
||
- Tomcat max JVM heap usage: 2.09 GiB
|
||
- `filterCache` size: 1901, `cumulative_lookups`: 4708473, `cumulative_hits`: 360068, `cumulative_hitratio`: 0.08
|
||
- Run 3:
|
||
- Time: 3.28s user 0.37s system 0% cpu 10:49.56 total
|
||
- Tomcat max JVM heap usage: 2.05 GiB
|
||
- `filterCache` size: 1901, `cumulative_lookups`: 7062712, `cumulative_hits`: 540020, `cumulative_hitratio`: 0.08
|
||
- 4GB heap, CMS GC, 4096 filter cache, 512 query cache, with 28 million documents in two shards
|
||
- Run 1:
|
||
- Time: 2.88s user 0.35s system 0% cpu 8:29.55 total
|
||
- Tomcat max JVM heap usage: 2.15 GiB
|
||
- `filterCache` size: 3770, `cumulative_lookups`: 2354237, `cumulative_hits`: 414512, `cumulative_hitratio`: 0.18
|
||
- Run 2:
|
||
- Time: 3.01s user 0.38s system 0% cpu 9:15.65 total
|
||
- Tomcat max JVM heap usage: 2.17 GiB
|
||
- `filterCache` size: 3945, `cumulative_lookups`: 4708473, `cumulative_hits`: 829093, `cumulative_hitratio`: 0.18
|
||
- Run 3:
|
||
- Time: 3.01s user 0.40s system 0% cpu 9:01.31 total
|
||
- Tomcat max JVM heap usage: 2.07 GiB
|
||
- `filterCache` size: 3770, `cumulative_lookups`: 7062712, `cumulative_hits`: 1243632, `cumulative_hitratio`: 0.18
|
||
- The biggest takeaway I have is that this workload benefits from a larger `filterCache` (for Solr fq parameter), but barely uses the `queryResultCache` (for Solr q parameter) at all
|
||
- The number of hits goes up and the time taken decreases when we increase the `filterCache`, and total JVM heap memory doesn't seem to increase much at all
|
||
- I guess the `queryResultCache` size is always 2 because I'm only doing two queries: `type:0` and `type:2` (downloads and views, respectively)
|
||
- Here is the general pattern of running three sequential indexing runs as seen in VisualVM while monitoring the Tomcat process:
|
||
|
||
![VisualVM Tomcat 4096 filterCache](/cgspace-notes/2019/04/visualvm-solr-indexing-4096-filterCache.png)
|
||
|
||
- I ran one test with a `filterCache` of 16384 to try to see if I could make the Tomcat JVM memory balloon, but actually it *drastically* increased the performance and memory usage of the dspace-statistics-api indexer
|
||
- 4GB heap, CMS GC, 16384 filter cache, 512 query cache, with 28 million documents in two shards
|
||
- Run 1:
|
||
- Time: 2.85s user 0.42s system 2% cpu 2:28.92 total
|
||
- Tomcat max JVM heap usage: 1.90 GiB
|
||
- `filterCache` size: 14851, `cumulative_lookups`: 2354237, `cumulative_hits`: 2331186, `cumulative_hitratio`: 0.99
|
||
- Run 2:
|
||
- Time: 2.90s user 0.37s system 2% cpu 2:23.50 total
|
||
- Tomcat max JVM heap usage: 1.27 GiB
|
||
- `filterCache` size: 15834, `cumulative_lookups`: 4708476, `cumulative_hits`: 4664762, `cumulative_hitratio`: 0.99
|
||
- Run 3:
|
||
- Time: 2.93s user 0.39s system 2% cpu 2:26.17 total
|
||
- Tomcat max JVM heap usage: 1.05 GiB
|
||
- `filterCache` size: 15248, `cumulative_lookups`: 7062715, `cumulative_hits`: 6998267, `cumulative_hitratio`: 0.99
|
||
- The JVM garbage collection graph is MUCH flatter, and memory usage is much lower (not to mention a drop in GC-related CPU usage)!
|
||
|
||
![VisualVM Tomcat 16384 filterCache](/cgspace-notes/2019/04/visualvm-solr-indexing-16384-filterCache.png)
|
||
|
||
- I will deploy this `filterCache` setting on DSpace Test (linode19)
|
||
- Run all system updates on DSpace Test (linode19) and reboot it
|
||
- Lots of CPU steal going on still on CGSpace (linode18):
|
||
|
||
![CPU usage week](/cgspace-notes/2019/04/cpu-week3.png)
|
||
|
||
## 2019-04-18
|
||
|
||
- I've been trying to copy the `statistics-2018` Solr core from CGSpace to DSpace Test since yesterday, but the network speed is like 20KiB/sec
|
||
- I opened a support ticket to ask Linode to investigate
|
||
- They asked me to send an `mtr` report from Fremont to Frankfurt and vice versa
|
||
- Deploy Tomcat 7.0.94 on DSpace Test (linode19)
|
||
- Also, I realized that the CMS GC changes I deployed a few days ago were ignored by Tomcat because of something with how Ansible formatted the options string
|
||
- I needed to use the "folded" YAML variable format `>-` (with the dash so it doesn't add a return at the end)
|
||
- UptimeRobot says that CGSpace went "down" this afternoon, but I looked at the CPU steal with `iostat 1 10` and it's in the 50s and 60s
|
||
- The munin graph shows a lot of CPU steal (red) currently (and over all during the week):
|
||
|
||
![CPU usage week](/cgspace-notes/2019/04/cpu-week4.png)
|
||
|
||
- I opened a ticket with Linode to migrate us somewhere
|
||
- They agreed to migrate us to a quieter host
|
||
|
||
## 2019-04-20
|
||
|
||
- Linode agreed to move CGSpace (linode18) to a new machine shortly after I filed my ticket about CPU steal two days ago and now the load is much more sane:
|
||
|
||
![CPU usage week](/cgspace-notes/2019/04/cpu-week5.png)
|
||
|
||
- For future reference, Linode mentioned that they consider CPU steal above 8% to be significant
|
||
- Regarding the other Linode issue about speed, I did a test with `iperf` between linode18 and linode19:
|
||
|
||
```
|
||
# iperf -s
|
||
------------------------------------------------------------
|
||
Server listening on TCP port 5001
|
||
TCP window size: 85.3 KByte (default)
|
||
------------------------------------------------------------
|
||
[ 4] local 45.79.x.x port 5001 connected with 139.162.x.x port 51378
|
||
------------------------------------------------------------
|
||
Client connecting to 139.162.x.x, TCP port 5001
|
||
TCP window size: 85.0 KByte (default)
|
||
------------------------------------------------------------
|
||
[ 5] local 45.79.x.x port 36440 connected with 139.162.x.x port 5001
|
||
[ ID] Interval Transfer Bandwidth
|
||
[ 5] 0.0-10.2 sec 172 MBytes 142 Mbits/sec
|
||
[ 4] 0.0-10.5 sec 202 MBytes 162 Mbits/sec
|
||
```
|
||
|
||
- Even with the software firewalls disabled the rsync speed was low, so it's not a rate limiting issue
|
||
- I also tried to download a file over HTTPS from CGSpace to DSpace Test, but it was capped at 20KiB/sec
|
||
- I updated the Linode issue with this information
|
||
- I'm going to try to switch the kernel to the latest upstream (5.0.8) instead of Linode's latest x86_64
|
||
- Nope, still 20KiB/sec
|
||
|
||
## 2019-04-21
|
||
|
||
- Deploy Solr 4.10.4 on CGSpace (linode18)
|
||
- Deploy Tomcat 7.0.94 on CGSpace
|
||
- Deploy dspace-statistics-api v1.0.0 on CGSpace
|
||
- Linode support replicated the results I had from the network speed testing and said they don't know why it's so slow
|
||
- They offered to live migrate the instance to another host to see if that helps
|
||
|
||
## 2019-04-22
|
||
|
||
- Abenet pointed out [an item](https://hdl.handle.net/10568/97912) that doesn't have an Altmetric score on CGSpace, but has a score of 343 in the CGSpace Altmetric dashboard
|
||
- I tweeted the Handle to see if it will pick it up...
|
||
- Like clockwork, after fifteen minutes there was a donut showing on CGSpace
|
||
- I want to get rid of this annoying warning that is constantly in our DSpace logs:
|
||
|
||
```
|
||
2019-04-08 19:02:31,770 WARN org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
|
||
```
|
||
|
||
- Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):
|
||
|
||
```
|
||
$ grep -c 'Falling back to request address' dspace.log.2019-04-20
|
||
dspace.log.2019-04-20:1515
|
||
```
|
||
|
||
- I will fix it in `dspace/config/modules/oai.cfg`
|
||
- Linode says that it is likely that the host CGSpace (linode18) is on is showing signs of hardware failure and they recommended that I migrate the VM to a new host
|
||
- I told them to migrate it at 04:00:00AM Frankfurt time, when nobody in East Africa, Europe, or South America should be using the server
|
||
|
||
## 2019-04-23
|
||
|
||
- One blog post says that there is [no overprovisioning in Azure](https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/):
|
||
|
||
<blockquote cite="https://kvaes.wordpress.com/2017/07/01/what-azure-virtual-machine-size-should-i-pick/">In Azure, with one exception being the A0, there is no overprovisioning… Each physical cores is only supporting one virtual core.</blockquote>
|
||
|
||
- Perhaps that's why the [Azure pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) is so expensive!
|
||
- Add a privacy page to CGSpace
|
||
- The work was mostly similar to the About page at `/page/about`, but in addition to adding i18n strings etc, I had to add the logic for the trail to `dspace-xmlui-mirage2/src/main/webapp/xsl/preprocess/general.xsl`
|
||
|
||
## 2019-04-24
|
||
|
||
- Linode migrated CGSpace (linode18) to a new host, but I am still getting poor performance when copying data to DSpace Test (linode19)
|
||
- I asked them if we can migrate DSpace Test to a new host
|
||
- They migrated DSpace Test to a new host and the rsync speed from Frankfurt was still capped at 20KiB/sec...
|
||
- I booted DSpace Test to a rescue CD and tried the rsync from CGSpace there too, but it was still capped at 20KiB/sec...
|
||
- I copied the 18GB `statistics-2018` Solr core from Frankfurt to a Linode in London at 15MiB/sec, then from the London one to DSpace Test in Fremont at 15MiB/sec... so WTF us up with Frankfurt→Fremont?!
|
||
- Finally upload the 218 IITA items from March to CGSpace
|
||
- Abenet and I had to do a little bit more work to correct the metadata of one item that appeared to be a duplicate, but really just had the wrong DOI
|
||
- While I was uploading the IITA records I noticed that twenty of the records Sisay uploaded in 2018-09 had double Handles (`dc.identifier.uri`)
|
||
- According to my notes in 2018-09 I had noticed this when he uploaded the records and told him to remove them, but he didn't...
|
||
- I exported the IITA community as a CSV then used `csvcut` to extract the two URI columns and identify and fix the records:
|
||
|
||
```
|
||
$ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
|
||
```
|
||
|
||
- Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
|
||
- I told him we never finished it, and that he should try to use the `/items/find-by-metadata-field` endpoint, with the caveat that you need to match the language attribute exactly (ie "en", "en_US", null, etc)
|
||
- I asked him how many terms they are interested in, as we could probably make it easier by normalizing the language attributes of these fields (it would help us anyways)
|
||
- He says he's getting HTTP 401 errors when trying to search for CPWF subject terms, which I can reproduce:
|
||
|
||
```
|
||
$ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
|
||
curl: (22) The requested URL returned error: 401
|
||
```
|
||
|
||
- Note that curl only shows the HTTP 401 error if you use `-f` (fail), and only then if you *don't* include `-s`
|
||
- I see there are about 1,000 items using CPWF subject "WATER MANAGEMENT" in the database, so there should definitely be results
|
||
- The breakdown of `text_lang` fields used in those items is 942:
|
||
|
||
```
|
||
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
|
||
count
|
||
-------
|
||
376
|
||
(1 row)
|
||
|
||
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='';
|
||
count
|
||
-------
|
||
149
|
||
(1 row)
|
||
|
||
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang IS NULL;
|
||
count
|
||
-------
|
||
417
|
||
(1 row)
|
||
```
|
||
|
||
- I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn't have permission to access... from the DSpace log:
|
||
|
||
```
|
||
2019-04-24 08:11:51,129 INFO org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
|
||
2019-04-24 08:11:51,231 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
|
||
2019-04-24 08:11:51,238 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
|
||
2019-04-24 08:11:51,243 INFO org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
|
||
2019-04-24 08:11:51,252 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item!
|
||
```
|
||
|
||
- Nevertheless, if I request using the `null` language I get 1020 results, plus 179 for a blank language attribute:
|
||
|
||
```
|
||
$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
|
||
1020
|
||
$ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
|
||
179
|
||
```
|
||
|
||
- This is weird because I see 942–1156 items with "WATER MANAGEMENT" (depending on wildcard matching for errors in subject spelling):
|
||
|
||
```
|
||
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
|
||
count
|
||
-------
|
||
942
|
||
(1 row)
|
||
|
||
dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value LIKE '%WATER MANAGEMENT%';
|
||
count
|
||
-------
|
||
1156
|
||
(1 row)
|
||
```
|
||
|
||
- I sent a message to the dspace-tech mailing list to ask for help
|
||
|
||
## 2019-04-25
|
||
|
||
- Peter pointed out that we need to remove Delicious and Google+ from our social sharing links
|
||
- Also, it would be nice if we could include the item title in the shared link
|
||
- I created an issue on GitHub to track this ([#419](https://github.com/ilri/DSpace/issues/419))
|
||
- I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:
|
||
|
||
```
|
||
$ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
|
||
$ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
|
||
$ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
|
||
```
|
||
|
||
- I created a normal user for Carlos to try as an unprivileged user:
|
||
|
||
```
|
||
$ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
|
||
```
|
||
|
||
- But still I get the HTTP 401 and I have no idea which item is causing it
|
||
- I enabled more verbose logging in `ItemsResource.java` and now I can at least see the item ID that causes the failure...
|
||
- The item is not even in the archive, but somehow it is discoverable
|
||
|
||
```
|
||
dspace=# SELECT * FROM item WHERE item_id=74648;
|
||
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
|
||
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
|
||
74648 | 113 | f | f | 2016-03-30 09:00:52.131+00 | | t
|
||
(1 row)
|
||
```
|
||
|
||
- I tried to delete the item in the web interface, and it seems successful, but I can still access the item in the admin interface, and nothing changes in PostgreSQL
|
||
- Meet with CodeObia to see progress on AReS version 2
|
||
- Marissa Van Epp asked me to add a few new metadata values to their Phase II Project Tags field (cg.identifier.ccafsprojectpii)
|
||
- I created a [pull request](https://github.com/ilri/DSpace/pull/420) for it and will do it the next time I run updates on CGSpace
|
||
- Communicate with Carlos Tejo from the Land Portal about the `/items/find-by-metadata-value` endpoint
|
||
- Run all system updates on DSpace Test (linode19) and reboot it
|
||
|
||
## 2019-04-26
|
||
|
||
- Export a list of authors for Peter to look through:
|
||
|
||
```
|
||
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
|
||
COPY 65752
|
||
```
|
||
|
||
## 2019-04-28
|
||
|
||
- Still trying to figure out the issue with the items that cause the REST API's `/items/find-by-metadata-value` endpoint to throw an exception
|
||
- I made the item private in the UI and then I see in the UI and PostgreSQL that it is no longer discoverable:
|
||
|
||
```
|
||
dspace=# SELECT * FROM item WHERE item_id=74648;
|
||
item_id | submitter_id | in_archive | withdrawn | last_modified | owning_collection | discoverable
|
||
---------+--------------+------------+-----------+----------------------------+-------------------+--------------
|
||
74648 | 113 | f | f | 2019-04-28 08:48:52.114-07 | | f
|
||
(1 row)
|
||
```
|
||
|
||
- And I tried the `curl` command from above again, but I still get the HTTP 401 and and the same error in the DSpace log:
|
||
|
||
```
|
||
2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
|
||
```
|
||
|
||
- I even tried to "expunge" the item using an [action in CSV](https://wiki.lyrasis.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-Performing'actions'onitems), and it said "EXPUNGED!" but the item is still there...
|
||
|
||
## 2019-04-30
|
||
|
||
- Send mail to the dspace-tech mailing list to ask about the item expunge issue
|
||
- Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:
|
||
|
||
```
|
||
$ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
|
||
```
|
||
|
||
- Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I'll try to do a CSV
|
||
- In order to make it easier for him to understand the CSV I will normalize the text languages (minus the provenance field) on my local development instance before exporting:
|
||
|
||
```
|
||
dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
|
||
text_lang | count
|
||
-----------+---------
|
||
| 358647
|
||
* | 11
|
||
E. | 1
|
||
en | 1635
|
||
en_US | 602312
|
||
es | 12
|
||
es_ES | 2
|
||
ethnob | 1
|
||
fr | 2
|
||
spa | 2
|
||
| 1074345
|
||
(11 rows)
|
||
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
|
||
UPDATE 360295
|
||
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
|
||
UPDATE 1074345
|
||
dspace=# UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
|
||
UPDATE 14
|
||
```
|
||
|
||
- Then I exported the whole repository as CSV, imported it into OpenRefine, removed a few unneeded columns, exported it, zipped it down to 36MB, and emailed a link to Carlos
|
||
- In other news, while I was looking through the CSV in OpenRefine I saw lots of weird values in some fields... we should check, for example:
|
||
- issue dates
|
||
- items missing handles
|
||
- authorship types
|
||
|
||
<!-- vim: set sw=2 ts=2: -->
|