---
title: "April, 2019"
date: 2019-04-01T09:00:43+03:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2019-04-01

- Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc.
- They asked if we had plans to enable RDF support in CGSpace
- There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
- I suspected that some might not be successful because the stats showed fewer, but today they were all HTTP 200!

```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
```

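That awk tally can also be sketched in Python; this is a minimal, illustrative version (the log lines and the `tally_statuses` helper below are hypothetical stand-ins, following nginx's combined log format):

```python
from collections import Counter

# Hypothetical nginx combined-format log lines standing in for the real access logs
log_lines = [
    '18.196.196.108 - - [01/Apr/2019:10:00:00 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 2069158 "-" "Mozilla/5.0"',
    '18.195.78.144 - - [01/Apr/2019:10:00:01 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 2069158 "-" "Mozilla/5.0"',
    '18.195.218.6 - - [01/Apr/2019:10:00:02 +0200] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 503 0 "-" "Mozilla/5.0"',
]

amazon_ips = {'18.196.196.108', '18.195.78.144', '18.195.218.6'}

def tally_statuses(lines, ips, path_fragment='Spore-192-EN-web.pdf'):
    """Count HTTP status codes for requests to one file from a set of IPs,
    like: grep | grep -E | awk '{print $9}' | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        ip, status = fields[0], fields[8]  # awk's $1 and $9
        if ip in ips and path_fragment in line:
            counts[status] += 1
    return counts

print(tally_statuses(log_lines, amazon_ips))  # Counter({'200': 2, '503': 1})
```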
- In the last two weeks there have been 47,000 downloads of this *same exact PDF* by these three IP addresses
- Apply country and region corrections and deletions on DSpace Test and CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
```

<!--more-->

## 2019-04-02

- CTA says the Amazon IPs are AWS gateways for real user traffic
- I was trying to add Felix Shaw's account back to the Administrators group on DSpace Test, but I couldn't find his name in the user search of the groups page
- If I searched for "Felix" or "Shaw" I saw other matches, including one for his personal email address!
- I ended up finding him by searching for his email address

## 2019-04-03

- Maria from Bioversity emailed me a list of new ORCID identifiers for their researchers so I will add them to our controlled vocabulary
- First I need to extract the ones that are unique from their list compared to our existing one:

```
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
```

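The same extraction can be sketched in Python with the identical regex; the two sample strings below are made-up stand-ins for the XML vocabulary file and Bioversity's list:

```python
import re

# Hypothetical stand-ins for cg-creator-id.xml and /tmp/bioversity.txt
existing_vocabulary = '<node label="First Author: 0000-0001-2345-6789"/>'
bioversity_list = 'New Person: 0000-0002-9876-5432\nFirst Author: 0000-0001-2345-6789'

# Same pattern as the grep -oE above
ORCID_RE = re.compile(r'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}')

def unique_orcids(*texts):
    """Extract the union of ORCID iDs from several texts, like
    cat ... | grep -oE '...' | sort -u."""
    ids = set()
    for text in texts:
        ids.update(ORCID_RE.findall(text))
    return sorted(ids)

print(unique_orcids(existing_vocabulary, bioversity_list))
```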
- We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
- Next I will resolve all their names using my `resolve-orcids.py` script:

```
$ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
```

- After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
- One user's name has changed so I will update those using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
```

- I created a pull request and merged the changes to the 5_x-prod branch ([#417](https://github.com/ilri/DSpace/pull/417))
- A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it's still going:

```
2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
```

- Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:

```
$ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
      1
      3 http://localhost:8081/solr//statistics-2017
   5662 http://localhost:8081/solr//statistics-2018
```

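That grep/awk pipeline can be expressed in Python as well; a sketch with made-up log lines in the same SolrLogger format (the `count_cores` helper is hypothetical):

```python
from collections import Counter

# Hypothetical dspace.log lines resembling the SolrLogger output above
log_lines = [
    '2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018',
    '2019-04-03 16:35:10,100 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754600/21701 docs in http://localhost:8081/solr//statistics-2018',
    '2019-04-03 16:36:00,000 INFO  org.dspace.statistics.SolrLogger @ Updating : 100/200 docs in http://localhost:8081/solr//statistics-2017',
]

def count_cores(lines):
    """Count which Solr core each 'Updating' line targets: the core URL is
    the last whitespace-separated field ($11 in the awk version)."""
    return Counter(line.split()[-1] for line in lines if 'SolrLogger @ Updating' in line)

print(count_cores(log_lines))  # statistics-2018 appears twice, statistics-2017 once
```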
- I will have to keep an eye on it because nothing should be updating 2018 stats in 2019...

## 2019-04-05

- Uptime Robot reported that CGSpace (linode18) went down tonight
- I see there are lots of PostgreSQL connections:

```
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
     10 dspaceCli
    250 dspaceWeb
```

- I still see those weird messages about updating the statistics-2018 Solr core:

```
2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
```

- Looking at `iostat 1 10` I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:

![CPU usage week](/cgspace-notes/2019/04/cpu-week.png)

- The other thing visible there is that the load has spiked to 500% over the past few days, and I don't think it's a coincidence that the Solr updating is happening at the same time...
- I ran all system updates and rebooted the server
- The load was lower on the server after reboot, but Solr didn't come back up properly according to the Solr Admin UI:

```
statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
```

- I restarted it again and all the Solr cores came up properly...

## 2019-04-06

- Udana asked why item [10568/91278](https://cgspace.cgiar.org/handle/10568/91278) didn't have an Altmetric badge on CGSpace, but on the [WLE website](https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity) it does
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
- I tweeted the item and I assume this will link the Handle with the DOI in the system
- Twenty minutes later I see the same Altmetric score (9) on CGSpace
- Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:

```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    222 18.195.78.144
    245 207.46.13.58
    303 207.46.13.194
    328 66.249.79.33
    564 207.46.13.210
    566 66.249.79.62
    575 40.77.167.66
   1803 66.249.79.59
   2834 2a01:4f8:140:3192::2
   9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     31 66.249.79.62
     41 207.46.13.210
     42 40.77.167.66
     54 42.113.50.219
    132 66.249.79.59
    785 2001:41d0:d:1990::
   1164 45.5.184.72
   2014 50.116.102.77
   4267 45.5.186.2
   4893 205.186.128.185
```

- `45.5.184.72` is in Colombia so it's probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT's datasets collection:

```
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
```

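Decoding that query string with Python's stdlib shows exactly which Discover facets the harvester is iterating over (the request path is copied from the log line above):

```python
from urllib.parse import urlparse, parse_qs

# The exact request path from the access log above
request_path = ('/handle/10568/72970/discover?filtertype_0=type&filtertype_1=author'
                '&filter_relational_operator_1=contains&filter_relational_operator_0=equals'
                '&filter_1=&filter_0=Dataset&filtertype=dateIssued'
                '&filter_relational_operator=equals&filter=2014')

parsed = urlparse(request_path)
params = parse_qs(parsed.query, keep_blank_values=True)

# The crawler is faceting by type=Dataset and then stepping through dateIssued values
print(parsed.path)         # /handle/10568/72970/discover
print(params['filter_0'])  # ['Dataset']
print(params['filter'])    # ['2014']
```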
- Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1"
- They made 22,000 requests to Discover on this collection today alone (and it's only 11AM):

```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  22077 /handle/10568/72970/discover
```

- Yesterday they made 43,000 requests and we actually blocked most of them:

```
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
  43631 /handle/10568/72970/discover
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c
    142 200
  43489 503
```

- I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
- Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports
- I made a pull request to update this and merged it to the `5_x-prod` branch ([#418](https://github.com/ilri/DSpace/pull/418))

## 2019-04-07

- Looking into the impact of harvesters like `45.5.184.72`, I see in Solr that this user is not categorized as a bot, so it definitely inflates the usage stats by some tens of thousands of hits *per day*
- Last week CTA switched their frontend code to use HEAD requests instead of GET requests for PDF bitstreams
- I am trying to see if these are registered as downloads in Solr or not
- I see 96,925 downloads from their AWS gateway IPs in 2019-03:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 96925,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-03"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
```

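As a side note, that long Solr URL is easier to build programmatically than by hand; a stdlib sketch using the same endpoint and parameters as the request above (a list of tuples lets the `fq` parameter repeat, which Solr expects):

```python
from urllib.parse import urlencode

# Same query and filters as the http request above
params = [
    ('q', 'type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)'),
    ('fq', 'statistics_type:view'),
    ('fq', 'bundleName:ORIGINAL'),
    ('fq', 'dateYearMonth:2019-03'),
    ('rows', '0'),
    ('wt', 'json'),
    ('indent', 'true'),
]

# urlencode handles the percent-encoding of ':', '(' and ')' for us
url = 'http://localhost:8081/solr/statistics/select?' + urlencode(params)
print(url)
```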
- Strangely I don't see many hits in 2019-04:

```
$ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
{
    "response": {
        "docs": [],
        "numFound": 38,
        "start": 0
    },
    "responseHeader": {
        "QTime": 1,
        "params": {
            "fq": [
                "statistics_type:view",
                "bundleName:ORIGINAL",
                "dateYearMonth:2019-04"
            ],
            "indent": "true",
            "q": "type:0 AND (ip:18.196.196.108 OR ip:18.195.78.144 OR ip:18.195.218.6)",
            "rows": "0",
            "wt": "json"
        },
        "status": 0
    }
}
```

- Making some tests on GET vs HEAD requests on the [CTA Spore 192 item](https://dspacetest.cgiar.org/handle/10568/100289) on DSpace Test:

```
$ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:38:34 GMT
Expires: Sun, 07 Apr 2019 09:38:34 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=21A492CC31CA8845278DFA078BD2D9ED; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block

$ http --print Hh HEAD https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: dspacetest.cgiar.org
User-Agent: HTTPie/1.0.2

HTTP/1.1 200 OK
Connection: keep-alive
Content-Language: en-US
Content-Length: 2069158
Content-Type: application/pdf;charset=ISO-8859-1
Date: Sun, 07 Apr 2019 08:39:01 GMT
Expires: Sun, 07 Apr 2019 09:39:01 GMT
Last-Modified: Thu, 14 Mar 2019 11:20:05 GMT
Server: nginx
Set-Cookie: JSESSIONID=36C8502257CC6C72FD3BC9EBF91C4A0E; Path=/; Secure; HttpOnly
Vary: User-Agent
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Robots-Tag: none
X-XSS-Protection: 1; mode=block
```

- And from the server side, the nginx logs show:

```
78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
```

- So the *size* of the transfer is definitely more efficient with a HEAD request, but I need to wait to see if these requests show up in Solr

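Parsing those two log lines makes the difference concrete: nginx's body-bytes field (the tenth) is 68078 for the GET but 0 for the HEAD. A small sketch, with the lines copied from the log above and a hypothetical `body_bytes` helper:

```python
def body_bytes(log_line):
    """Return (method, body bytes sent) from an nginx combined-log line."""
    fields = log_line.split()
    method = fields[5].lstrip('"')  # e.g. '"GET' -> 'GET'
    return method, int(fields[9])   # awk's $10: body bytes sent

# The two log lines from above, with the anonymized client IP
get_line = '78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"'
head_line = '78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"'

print(body_bytes(get_line))   # ('GET', 68078)
print(body_bytes(head_line))  # ('HEAD', 0)
```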
<!-- vim: set sw=2 ts=2: -->