- I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive
- I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in [September, 2023]({{< relref "2023-09.md" >}}),
## 2023-11-08
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log
- I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine
- Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first
- Then I started a new harvest on AReS to make sure the mappings are applied
## 2023-11-09
- Ryan asked me for help uploading a large PDF to CGSpace
- I tried my usual GhostScript prepress invocation and found that the size decreased significantly, but some minor artifacts appeared in the images
- Interestingly, the [GhostScript docs](https://ghostscript.com/docs/9.54.0/VectorDevices.htm) mention that `prepress` doesn't give the best results:
> Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
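
To compare the `prepress` preset against the defaults that the docs recommend, a minimal sketch like the one below could build both command lines. The notes don't show the exact invocation, so the flags here are an assumption (standard `pdfwrite` options), and `build_gs_command` and the file names are hypothetical:

```python
import shlex
from typing import List, Optional


def build_gs_command(src: str, dst: str, preset: Optional[str] = "prepress") -> List[str]:
    """Build a Ghostscript pdfwrite command line (hypothetical helper).

    Passing preset=None omits -dPDFSETTINGS entirely, which the Ghostscript
    docs describe as the setting closest to the original input.
    """
    cmd = ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dQUIET"]
    if preset is not None:
        # Known presets include screen, ebook, printer, prepress and default
        cmd.append(f"-dPDFSETTINGS=/{preset}")
    cmd.append(f"-sOutputFile={dst}")
    cmd.append(src)
    return cmd


if __name__ == "__main__":
    # Print both variants so the resulting file sizes and quality can be compared
    print(shlex.join(build_gs_command("input.pdf", "output-prepress.pdf")))
    print(shlex.join(build_gs_command("input.pdf", "output-default.pdf", preset=None)))
```

The list returned by the helper can be passed to `subprocess.run()` as-is if Ghostscript is installed.
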
- Also, I found [a question on StackOverflow discussing some further techniques for PDFs with images](https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality):
- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
- And the numbers match those in my dspace-statistics-api *exactly*!
- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
- We can look at using this to draw stats on the community, collection, and item pages
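
A rough sketch of how the per-object reports might be requested and summed, based on my reading of the REST contract linked above. The `/server/api` base path, the `points` / `values` / `views` field names, and both helper functions are assumptions that should be checked against a real response:

```python
from typing import Any, Dict

# Assumption: the REST API is served under /server/api (the DSpace 7 default);
# only the UI hostname appears in these notes.
API_BASE = "https://dspace7test.ilri.org/server/api"


def usage_report_url(uuid: str, report: str = "TotalVisits", api_base: str = API_BASE) -> str:
    """Build the URL for one usage report of one DSpace object.

    The report ID is appended to the object's UUID, for example
    <uuid>_TotalVisits or <uuid>_TotalDownloads.
    """
    return f"{api_base}/statistics/usagereports/{uuid}_{report}"


def total_views(report: Dict[str, Any]) -> int:
    """Sum the 'views' value over every point in a usage report (assumed shape)."""
    return sum(
        int(point.get("values", {}).get("views", 0))
        for point in report.get("points", [])
    )


if __name__ == "__main__":
    uuid = "3f1b9605-f5ff-4bbb-8c89-d6fe4157f748"
    print(usage_report_url(uuid))
    print(usage_report_url(uuid, "TotalDownloads"))
    # Hand-written sample in the assumed response shape, not real data
    sample = {"report-type": "TotalVisits", "points": [{"values": {"views": 3}}]}
    print(total_views(sample))
```

This only covers one object at a time, so it does not solve the problem of iterating over all items.
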
## 2023-11-23
- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
```console
localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
count
───────
47818
(1 row)
```
- It's been some time since I looked at our Solr statistics to find new bots
- I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
  - GuzzleHttp/7
  - Owler@ows.eu/1
  - newspaperjs
- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
```console
Purging 40 hits from AdobeUxTechC4-Async in statistics
Purging 21 hits from ZaloPC-win32-24v473 in statistics
Purging 21 hits from nbertaupete95 in statistics
Purging 52 hits from Scoop\.it in statistics
Purging 16 hits from WebAPIClient in statistics
Purging 241 hits from RStudio in statistics
Purging 1255 hits from ^MEL in statistics
Purging 47850 hits from GuzzleHttp in statistics
Purging 8714 hits from Owler in statistics
Purging 1083 hits from newspaperjs in statistics
Purging 369 hits from ^Chrome$ in statistics
Purging 1474 hits from curl in statistics
Total number of bot hits purged: 61504
```
- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
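
A minimal sketch of how such agents could be flagged in an exported list of user agents. This is not how the Solr purge works; `looks_like_lowercased_browser` is a hypothetical helper, and restricting the check to the `mozilla/` prefix is my assumption so that legitimately lowercase agents such as `curl` are not caught by this check:

```python
def looks_like_lowercased_browser(user_agent: str) -> bool:
    """Return True for browser-style agents that contain no uppercase letters.

    Real browsers send "Mozilla/5.0 (Windows NT ...)" with mixed case, so a
    "mozilla/" agent that is entirely lowercase is suspicious.
    """
    # str.islower() is True only if there is at least one cased character
    # and none of the cased characters are uppercase
    return user_agent.startswith("mozilla/") and user_agent.islower()


if __name__ == "__main__":
    agents = [
        "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
        "curl/7.68.0",
    ]
    for agent in agents:
        print(looks_like_lowercased_browser(agent), agent)
```
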
- I'm gonna add those to our overrides and purge them: