2021-11-02
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
- The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist, so you have to create it on the file system first
- I restarted the server after the import was done to see if the cores would come back up OK
- I remember that the last time I tried this the manually created statistics cores didn’t come back up after I rebooted, but this time they did
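Since the same steps repeat for every year, here is a minimal dry-run sketch that only prints the commands for one year so they can be reviewed and run by hand. The helper name `shard_year_commands` and the two variables are hypothetical; the URL and paths are copied from the commands above and are assumptions for any other host. Creating the core stays a manual step in the Solr admin UI.

```shell
# Dry-run sketch: print (do not execute) the commands for moving one year
# of statistics into its own core. shard_year_commands is a hypothetical
# helper, not part of DSpace or Solr.
SOLR_URL="http://localhost:8081/solr"          # taken from the commands above
SOLR_HOME="/home/dspacetest.cgiar.org/solr"    # taken from the mkdir above

shard_year_commands() (
    year="$1"
    core="statistics-${year}"
    dump="${core}.json"

    # Same order as the manual steps: export, compress, make the data
    # directory, create the core by hand, delete from the main core, import.
    cat <<EOF
./run.sh -s ${SOLR_URL}/statistics -f 'time:${year}-*' -a export -o ${dump} -k uid
zstd ${dump}
mkdir -p ${SOLR_HOME}/${core}/data
# create core ${core} in Solr admin before continuing
curl -s "${SOLR_URL}/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:${year}-*</query></delete>"
./run.sh -s ${SOLR_URL}/${core} -a import -o ${dump} -k uid
EOF
)
```

Calling `shard_year_commands 2018` prints six lines for the 2018 shard. Printing instead of executing keeps the delete step from running before the export has been checked.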
2021-11-03
- While inspecting the stats for the new statistics-2019 shard on DSpace Test I noticed that I can’t find any stats via the DSpace Statistics API for an item that should have some
- I checked on CGSpace and I can’t find them there either, but I see them in Solr when I query in the admin UI
- I need to debug that, but it doesn’t seem to be related to the sharding…
2021-11-04
- I spent a little bit of time debugging the Solr bug with the statistics-2019 shard but couldn’t reproduce it for the few items I tested
- So that’s good, it seems the sharding worked
- Linode alerted me to high CPU usage on CGSpace (linode18) yesterday
- Looking at the Solr hits from yesterday I see 91.213.50.11 making 2,300 requests
- According to AbuseIPDB.com this is owned by Registrarus LLC (registrarus.ru) and it has been reported for malicious activity by several users
- The ASN is 50340 (SELECTEL-MSK, RU)
- They are attempting SQL injection:
91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
- Another IP, 222.129.53.160, is in China, and they grabbed about 1,200 PDFs from the REST API in under an hour:
# zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
- I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
- Today I did all 2018 stats…
- I want to see if there is a noticeable change in JVM memory, Solr response time, etc
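The per-IP counts above were checked one address at a time with `zgrep`. As a rough sketch, the busiest clients in a log can be listed in one pass. It assumes the client IP is the first whitespace-separated field, as in the nginx log line quoted above; `top_ips` is a hypothetical helper name.

```shell
# top_ips: read access log lines on stdin and print "<requests> <ip>"
# for the busiest clients, highest count first. Assumes the client IP is
# the first field. The optional first argument caps the number of rows
# (default 10).
top_ips() {
    awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' |
        sort -k1,1nr -k2,2 |
        head -n "${1:-10}"
}

# Possible usage against the rotated, compressed log mentioned above:
# zcat /var/log/nginx/rest.log.2.gz | top_ips 5
```

Counting in `awk` rather than with `sort | uniq -c` avoids the padded count column, which makes the output easier to feed into other tools.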
2021-11-07
- Update all Docker containers on AReS and rebuild OpenRXV:
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
- Then restart the server and start a fresh harvest
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017 today)
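A note on the image update command above: because it turns every run of spaces into a colon and keeps the first two fields, untagged images come out as `<none>:<none>`, and a repository name that already contains a colon (a registry with a port) would lose its tag. A minimal alternative sketch that reads the same `docker images` columns with `awk` follows; `image_refs` is a hypothetical helper name, not a Docker command.

```shell
# image_refs: read `docker images` output on stdin and print one
# "repository:tag" per line, skipping the header row and untagged images.
image_refs() {
    awk 'NR > 1 && $1 != "<none>" && $2 != "<none>" { print $1 ":" $2 }'
}

# Possible usage (pulls every tagged image again):
# docker images | image_refs | xargs -L1 docker pull
```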