---
title: "July, 2020"
date: 2020-07-01T10:53:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-07-01

- A few users noticed that CGSpace wasn't loading items today; item pages appear blank
- I looked at the PostgreSQL locks, but they don't seem unusual
- I guess this is the same "blank item page" issue that we had a few times in 2019 and never solved
- I restarted Tomcat and PostgreSQL and the issue was gone
- Since I was restarting Tomcat anyway, I decided to redeploy the latest changes from the `5_x-prod` branch, and I added a note about COVID-19 items to the CGSpace frontpage at Peter's request
<!--more-->

- Also, Linode is alerting that we had a high outbound traffic rate early this morning around midnight AND high CPU load later in the morning
- First, looking at the traffic in the morning:

```
# cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
2986 10.38% 1 0.08% 17.39 MiB 199.47.87.144
2286 7.94% 1 0.08% 13.04 MiB 199.47.87.142
```

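As a quick cross-check without goaccess, the top client IPs for that window can be pulled straight from the logs with a generic pipeline (a sketch using the same log paths as above):

```shell
# count requests per client IP in the early-morning window and list the top talkers
cat /var/log/nginx/*.log.1 /var/log/nginx/*.log 2>/dev/null \
  | grep -E "01/Jul/2020:(00|01|02|03|04)" \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head
```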
- 64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
```

- I will purge hits from that IP from Solr
- The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots; we have over 40,000 hits from them in the 2020 statistics alone:

```
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
numFound="41694"
```

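Purging those hits boils down to a Solr delete-by-query on the same field; a minimal sketch (the commented curl assumes the same statistics core on localhost:8081 as above, and the usual purge tooling may differ):

```shell
# delete-by-query payload matching the Turnitin user agents
payload='<delete><query>userAgent:/Turnitin.*/</query></delete>'
# POST it to the statistics core with a commit (uncomment to actually run):
# curl -s 'http://localhost:8081/solr/statistics/update?commit=true' \
#      -H 'Content-Type: text/xml' -d "$payload"
echo "$payload"
```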
- They used to be "TurnitinBot"... hmmm, it seems they use both: https://turnitin.com/robot/crawlerinfo.html
- I will add Turnitin to the DSpace bot user agent list, but I see they are requesting `robots.txt` and only requesting item pages, so that's impressive! I don't need to add them to the "bad bot" rate limit list in nginx
- While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a few requests each with this user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
```

- The IPs all belong to HostRoyale:

```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
185.152.250.105
185.152.250.107
185.152.250.111
185.152.250.115
185.152.250.119
185.152.250.121
185.152.250.123
185.152.250.125
185.152.250.129
185.152.250.13
185.152.250.131
185.152.250.133
185.152.250.135
185.152.250.137
185.152.250.141
185.152.250.145
185.152.250.149
185.152.250.153
185.152.250.155
185.152.250.157
185.152.250.159
185.152.250.161
185.152.250.163
185.152.250.165
185.152.250.167
185.152.250.17
185.152.250.171
185.152.250.183
185.152.250.189
185.152.250.191
185.152.250.197
185.152.250.201
185.152.250.205
185.152.250.209
185.152.250.21
185.152.250.213
185.152.250.217
185.152.250.219
185.152.250.221
185.152.250.223
185.152.250.225
185.152.250.227
185.152.250.229
185.152.250.231
185.152.250.233
185.152.250.235
185.152.250.239
185.152.250.243
185.152.250.247
185.152.250.249
185.152.250.25
185.152.250.251
185.152.250.253
185.152.250.255
185.152.250.27
185.152.250.29
185.152.250.3
185.152.250.31
185.152.250.39
185.152.250.41
185.152.250.47
185.152.250.5
185.152.250.59
185.152.250.63
185.152.250.65
185.152.250.67
185.152.250.7
185.152.250.71
185.152.250.73
185.152.250.77
185.152.250.81
185.152.250.85
185.152.250.89
185.152.250.9
185.152.250.93
185.152.250.95
185.152.250.97
185.152.250.99
```

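Purging all eighty-one of those IPs from Solr in one go can be sketched by folding the list into a single boolean delete query (the file path is hypothetical, the commented curl assumes the same localhost core, and the real purge tooling may differ):

```shell
# two sample IPs stand in for the full list, one per line in a text file
printf '185.152.250.1\n185.152.250.101\n' > /tmp/ips.txt
# join the lines with " OR " to form one Solr boolean query
query=$(awk 'NR>1{printf " OR "}{printf "%s",$0}' /tmp/ips.txt)
payload="<delete><query>ip:($query)</query></delete>"
echo "$payload"
# → <delete><query>ip:(185.152.250.1 OR 185.152.250.101)</query></delete>
# curl -s 'http://localhost:8081/solr/statistics/update?commit=true' \
#      -H 'Content-Type: text/xml' -d "$payload"
```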
- It's only a few hundred requests each, but I am very suspicious, so I will record it here and purge their IPs from Solr
- Then I see 185.187.30.14 and 185.187.30.13 making requests too, with several different "normal" user agents
- They are both apparently in France, belonging to Scalair FR hosting
- I will purge their requests from Solr too
- Now I see some other new bots I hadn't noticed before:
- `Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com`
- `Consilio (WebHare Platform 4.28.2-dev); LinkChecker)`, which appears to be a [university CMS](https://www.utwente.nl/en/websites/webhare/)
- I will add `LinkCheck`, `Consilio`, and `WebHare` to the list of DSpace bot agents and purge them from Solr stats
- The COUNTER-Robots list already has `link.?check`, but for some reason DSpace didn't match that and I see hits for some of these...
- Maybe I should add `[Ll]ink.?[Cc]heck.?` to a custom list for now?
- For now I added `Turnitin` to the [new bots pull request on COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/34)
- I purged 20,000 hits from IPs and 45,000 hits from user agents
- I will revert the default "example" agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven't merged yet:

```
$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
Jersey\/\d
MarcEdit
OgScrper
okhttp
^Pattern\/\d
ReactorNetty\/\d
sqlmap
Typhoeus
7siters
```

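The `link.?check` mismatch is easy to reproduce with grep: the COUNTER-Robots pattern is all lowercase, so a case-sensitive match misses `LinkCheck`, while the bracketed variant catches it (using the Siteimprove user agent from the logs above):

```shell
ua='Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com'
# the upstream lowercase pattern does not match case-sensitively
echo "$ua" | grep -qE 'link.?check' || echo 'lowercase pattern: no match'
# the bracketed variant covers both capitalizations
echo "$ua" | grep -qE '[Ll]ink.?[Cc]heck.?' && echo 'bracketed pattern: match'
```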
- Just a note that I *still* can't deploy the `6_x-dev-atmire-modules` branch, as it fails at ant update:

```
[java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
```

- I had told Atmire about this several weeks ago... but I reminded them again in the ticket
- Atmire says they are able to build fine, so I tried again and noticed that I had been building with `-Denv=dspacetest.cgiar.org`, which is not necessary for DSpace 6 of course
- Once I removed that it builds fine
- I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!
- Run all system updates on DSpace Test (linode26), deploy latest `6_x-dev-atmire-modules` branch, and reboot it

## 2020-07-02

- I need to export some Solr statistics data from CGSpace to test Salem's modifications to the dspace-statistics-api
- He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities
- Because we have so many records I want to use solr-import-export-json to export several months at a time with a date range, but it seems there are issues with curl first (we need to disable globbing with `-g` and URL encode the range)
- For reference, the [Solr 4.10.x DateField docs](https://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/schema/DateField.html)
- This range works in the Solr UI: `[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]`
- As well as in curl:

```
$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "indent":"true",
      "fq":"time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]",
      "rows":"0",
      "wt":"json"}},
  "response":{"numFound":7784285,"start":0,"docs":[]
  }}
```

- But not in solr-import-export-json... hmmm... it seems we need to URL encode *only* the date range itself, but not the brackets:

```
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
```

- Then import it on my local dev environment:

```
$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
```

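The encoding needed for that filter can be generated on the command line; a sketch that percent-encodes only the range between the brackets, as described above:

```shell
# encode the date range but leave the field name and brackets alone
python3 -c 'from urllib.parse import quote; print("time:[" + quote("2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z", safe="") + "]")'
# → time:[2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]
```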
<!-- vim: set sw=2 ts=2: -->