- Then on DSpace Test I created a `statistics-2019` core with the same instance dir as the main `statistics` core (as [illustrated in the DSpace docs](https://wiki.lyrasis.org/display/DSDOC6x/Testing+Solr+Shards)):
```console
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
```
- The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist so you have to do that first in the file system
- I restarted the server after the import was done to see if the cores would come back up OK
- I remember last time I tried this the manually created statistics cores didn't come back up after I rebooted, but this time they did
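
The ordering above (data directory first, then the core) could also be scripted instead of clicking through the admin UI. This is only a minimal sketch, assuming Solr's CoreAdmin API is reachable on the same port as the commands above; `SOLR_HOME` is a placeholder rather than the real path on the server, and the helper name `core_create_url` is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: SOLR_HOME is a placeholder path, not taken from the server.

# Print the CoreAdmin CREATE URL for a new statistics shard that reuses the
# instance dir of the main statistics core but keeps its own data directory.
core_create_url() {
    local base="$1" name="$2" solr_home="$3"
    printf '%s/admin/cores?action=CREATE&name=%s&instanceDir=%s&dataDir=%s\n' \
        "$base" "$name" "$solr_home/statistics" "$solr_home/$name/data"
}

# Usage (not run here):
#   mkdir -p "$SOLR_HOME/statistics-2019/data"   # the data dir must exist first
#   curl -s "$(core_create_url http://localhost:8081/solr statistics-2019 "$SOLR_HOME")"
```

Keeping the URL construction in a function makes it easy to eyeball the `instanceDir` and `dataDir` values before anything is sent to Solr.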
## 2021-11-03
- While inspecting the stats for the new statistics-2019 shard on DSpace Test I noticed that I can't find any stats via the DSpace Statistics API for an item that _should_ have some
- I checked on CGSpace and I can't find them there either, but I see them in Solr when I query in the admin UI
- I need to debug that, but it doesn't seem to be related to the sharding...
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
- Several users wrote to me last week to say that workflow emails haven't been working since 2021-10-21 or so
- I did a test on CGSpace and it's indeed broken:
```console
$ dspace test-email
About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
Please see the DSpace documentation for assistance.
```
- I sent a message to ILRI ICT to ask them to check the account/password
- I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
```console
# tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
```
- I migrated the 2013, 2012, and 2011 statistics to yearly shards on DSpace Test's Solr to continue my testing of memory / latency impact
- I found out why the CI jobs for the DSpace Statistics API had been failing the past few weeks
- When I reverted to using the original falcon-swagger-ui project after they apparently merged my Falcon 3 changes, it seems that they actually only merged the Swagger UI changes, not the Falcon 3 fix!
- I switched back to using my own fork and now it's working
- Unfortunately now I'm getting an error installing my dependencies with Poetry:
```console
RuntimeError
Unable to find installation candidates for regex (2021.11.9)
at /usr/lib/python3.9/site-packages/poetry/installation/chooser.py:72 in choose_for
68│
69│ links.append(link)
70│
71│ if not links:
→ 72│ raise RuntimeError(
73│ "Unable to find installation candidates for {}".format(package)
74│ )
75│
76│ # Get the best link
```
- So that's super annoying... I'm going to try using Pipenv again...
## 2021-11-10
- 93.158.91.62 is scraping us again
- That's an IP in Sweden that is clearly a bot, but pretending to use a normal user agent
- I added them to the "bot" list in nginx so the requests will share a common DSpace session with other bots and not create Solr hits, but still they are causing high outbound traffic
- I modified the nginx configuration to send them an HTTP 403 and tell them to use a bot user agent
## 2021-11-14
- I decided to update AReS to the latest OpenRXV version with Elasticsearch 7.13
- First I took backups of the Elasticsearch volume and OpenRXV backend data:
```console
$ docker-compose down
$ sudo tar czf openrxv_esData_7-2021-11-14.tar.xz /var/lib/docker/volumes/openrxv_esData_7
$ cp -a backend/data backend/data.2021-11-14
```
- Then I checked out the latest git commit, updated all images, and rebuilt the project
- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
- Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently...
- I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than "Microsoft Internet Explorer" for their scraper
- He said he will change the user agent
## 2021-11-24
- I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
- Other than a few that I ruled out that *may* be humans, these are all making requests within one month or with no user agent, which is highly suspicious:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
```
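
The per-IP check above boils down to asking Solr how many documents match an IP without fetching any rows. A rough sketch of that idea follows; it is not the actual `check-spider-ip-hits.sh`, and the function names, the default Solr URL, and the sed-based parsing of `numFound` are all assumptions for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: not the real check-spider-ip-hits.sh.

# Extract numFound from a Solr JSON response read on stdin.
parse_num_found() {
    sed -n 's/.*"numFound":[ ]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Count statistics hits for each IP listed in a file, one IP per line.
# IPv6 addresses would need their colons escaped in the query; this sketch
# only handles IPv4.
count_ip_hits() {
    local ip_file="$1" solr_url="${2:-http://localhost:8081/solr/statistics}"
    local ip hits
    while IFS= read -r ip; do
        [ -z "$ip" ] && continue
        # rows=0 returns only the match count, not the documents themselves
        hits=$(curl -s "$solr_url/select?q=ip:$ip&rows=0&wt=json" | parse_num_found)
        echo "Found ${hits:-0} hits from $ip in statistics"
    done < "$ip_file"
}
```

Calling `count_ip_hits /tmp/ips.txt` would print one line per IP in the same shape as the output above, assuming the Solr core is reachable at the default URL.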
- Peter sent me corrections for the authors that I had sent him back in 2021-09
- I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran them through my csv-metadata-quality script
- Then I imported them into my local instance as a test: