diff --git a/content/posts/2023-11.md b/content/posts/2023-11.md
index 4b6f5dd68..a3b744a98 100644
--- a/content/posts/2023-11.md
+++ b/content/posts/2023-11.md
@@ -142,4 +142,69 @@ $ du -sh images-*
 - Export CGSpace to check for missing Initiative collection mappings
 - Start a harvest on AReS
+## 2023-11-22
+
+- I was checking out the [DSpace 7 statistics](https://github.com/DSpace/RestContract/blob/main/statistics-reports.md) again and found that we have total visits and total downloads for each DSpace object, for example [this item](https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748):
+  - TotalVisits: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits
+  - TotalDownloads: https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads
+- And the numbers match those in my dspace-statistics-api *exactly*!
+- This can be useful to get an individual DSpace object's stats, but there is no way to iterate over all objects like all items...
+  - We can look at using this to draw stats on the community, collection, and item pages
+
+## 2023-11-23
+
+- Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:
+
+```console
+localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
+ count 
+───────
+ 47818
+(1 row)
+```
+
+- It's been some time since I looked at our Solr statistics to find new bots
+  - I found a few new ones that I [submitted to COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/60) and added to our local bot list:
+    - GuzzleHttp/7
+    - Owler@ows.eu/1
+    - newspaperjs
+- I ran my old `check-spider-hits.sh` script with a list of bots from our local overrides to purge hits from Solr:
+
+```console
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 30 hits from ubermetrics in statistics
+Purging 59 hits from curb in statistics
+Purging 36 hits from bitdiscovery in statistics
+Purging 87 hits from omgili in statistics
+Purging 47 hits from Vizzit in statistics
+Purging 109 hits from Java\/17-ea in statistics
+Purging 40 hits from AdobeUxTechC4-Async in statistics
+Purging 21 hits from ZaloPC-win32-24v473 in statistics
+Purging 21 hits from nbertaupete95 in statistics
+Purging 52 hits from Scoop\.it in statistics
+Purging 16 hits from WebAPIClient in statistics
+Purging 241 hits from RStudio in statistics
+Purging 1255 hits from ^MEL in statistics
+Purging 47850 hits from GuzzleHttp in statistics
+Purging 8714 hits from Owler in statistics
+Purging 1083 hits from newspaperjs in statistics
+Purging 369 hits from ^Chrome$ in statistics
+Purging 1474 hits from curl in statistics
+
+Total number of bot hits purged: 61504
+```
+
+- I also noticed 35,000 requests over the past few years from lowercase user agents, which is [definitely weird](https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case), for example:
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36`
+  - `mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36`
+- I'm gonna add those to our overrides and purge them:
+
+```console
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 35816 hits from ^mozilla in statistics
+
+Total number of bot hits purged: 35816
+```
+
+
diff --git a/docs/2023-11/index.html b/docs/2023-11/index.html
index 8c58e8ad2..8ccdba81f 100644
--- a/docs/2023-11/index.html
+++ b/docs/2023-11/index.html
@@ -23,7 +23,7 @@ Start a harvest on AReS
-
+
@@ -52,9 +52,9 @@ Start a harvest on AReS
   "@type": "BlogPosting",
   "headline": "November, 2023",
   "url": "https://alanorth.github.io/cgspace-notes/2023-11/",
-  "wordCount": "889",
+  "wordCount": "1278",
   "datePublished": "2023-11-02T12:59:36+03:00",
-  "dateModified": "2023-11-16T17:25:15+03:00",
+  "dateModified": "2023-11-19T14:29:52+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -284,7 +284,79 @@ tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
 Export CGSpace to check for missing Initiative collection mappings
 Start a harvest on AReS
-
+2023-11-22
+
+2023-11-23
+
+localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
+ count 
+───────
+ 47818
+(1 row)
+
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 30 hits from ubermetrics in statistics
+Purging 59 hits from curb in statistics
+Purging 36 hits from bitdiscovery in statistics
+Purging 87 hits from omgili in statistics
+Purging 47 hits from Vizzit in statistics
+Purging 109 hits from Java\/17-ea in statistics
+Purging 40 hits from AdobeUxTechC4-Async in statistics
+Purging 21 hits from ZaloPC-win32-24v473 in statistics
+Purging 21 hits from nbertaupete95 in statistics
+Purging 52 hits from Scoop\.it in statistics
+Purging 16 hits from WebAPIClient in statistics
+Purging 241 hits from RStudio in statistics
+Purging 1255 hits from ^MEL in statistics
+Purging 47850 hits from GuzzleHttp in statistics
+Purging 8714 hits from Owler in statistics
+Purging 1083 hits from newspaperjs in statistics
+Purging 369 hits from ^Chrome$ in statistics
+Purging 1474 hits from curl in statistics
+
+Total number of bot hits purged: 61504
+
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 35816 hits from ^mozilla in statistics
+
+Total number of bot hits purged: 35816
+
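A note on the `^mozilla` override purged above: it only works as intended if the patterns are matched case-sensitively, since an ordinary browser sends `Mozilla/5.0 ...` with a capital M. The sketch below illustrates that with Python's `re` module. It is not how `check-spider-hits.sh` is implemented; the `matching_patterns` helper and the `normal_ua` string are hypothetical, and the case-sensitivity of the real Solr query is an assumption.

```python
import re

# Patterns copied from the purge output in these notes. Treating each one as a
# case-sensitive regular expression is an assumption for this illustration.
PATTERNS = ["^mozilla", "^Chrome$", "GuzzleHttp", "Owler", "newspaperjs"]


def matching_patterns(user_agent: str, patterns=PATTERNS) -> list:
    """Return the patterns (hypothetical helper) that match a user agent."""
    return [p for p in patterns if re.search(p, user_agent)]


# One of the all-lowercase user agents quoted in the notes.
lowercase_ua = (
    "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 "
    "(khtml, like gecko) chrome/89.0.4389.90 safari/537.36"
)
# The same string with conventional capitalization (made up for comparison).
normal_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
)

print(matching_patterns(lowercase_ua))  # ['^mozilla']
print(matching_patterns(normal_ua))     # []
```

If the real matching were case-insensitive, `^mozilla` would also catch nearly every legitimate browser, so that is worth confirming before adding similar lowercase patterns.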
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 08339cf11..e433475e9 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 453443709..b02eeacb1 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index bc3739c84..1c150b021 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 28097342f..652a52d8e 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 02e09ecd2..59107560d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index aa6a2039d..645f671ae 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index f95619b32..2c8ca3a63 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html
index 01b794510..bd1722279 100644
--- a/docs/categories/notes/page/7/index.html
+++ b/docs/categories/notes/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/8/index.html b/docs/categories/notes/page/8/index.html
index a0c000a2e..9f5fa7117 100644
--- a/docs/categories/notes/page/8/index.html
+++ b/docs/categories/notes/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index bac7d0e97..3c42ae62c 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/10/index.html b/docs/page/10/index.html
index 4e8202d24..0df999ff2 100644
--- a/docs/page/10/index.html
+++ b/docs/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index dd78e0d82..816a40511 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index a02ef1994..86b2712fb 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index fe4775ad6..da92e62fc 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index c6f291c0a..4a1cd36e7 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 0cd3744f7..c709280fb 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index f949cb9f8..b80b94cdc 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index e900caa92..9d4a4e586 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index a51eb15ed..8d8505535 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 424d6642d..c491885fa 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/10/index.html b/docs/posts/page/10/index.html
index 35bef706f..7ca3ec467 100644
--- a/docs/posts/page/10/index.html
+++ b/docs/posts/page/10/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index e495d6880..73da1c122 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 8104a5699..19e0a7fc5 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index fe407e679..e4926533e 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index c6fb8eff6..d2204611a 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 790c52f23..d5093f5c6 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index b49dfbfb4..843e0c2f4 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 77e3e5250..028daa51d 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index 985e658a9..52f77b5b9 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 22b9a907e..28daee1cd 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml">
 https://alanorth.github.io/cgspace-notes/categories/
-2023-11-16T17:25:15+03:00
+2023-11-19T14:29:52+03:00
 https://alanorth.github.io/cgspace-notes/
-2023-11-16T17:25:15+03:00
+2023-11-19T14:29:52+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2023-11-16T17:25:15+03:00
+2023-11-19T14:29:52+03:00
 https://alanorth.github.io/cgspace-notes/2023-11/
-2023-11-16T17:25:15+03:00
+2023-11-19T14:29:52+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2023-11-16T17:25:15+03:00
+2023-11-19T14:29:52+03:00
 https://alanorth.github.io/cgspace-notes/2023-10/
 2023-11-02T20:58:43+03:00
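
The 2023-11-22 notes in this diff point at per-object usage report URLs of the form `<uuid>_TotalVisits` and `<uuid>_TotalDownloads`. A minimal sketch of reading one such report with the Python standard library follows. The base URL and UUID come from the notes; the helper names are hypothetical, and the response layout (a `points` list whose entries carry a `values` mapping) is an assumption that should be checked against the DSpace REST contract linked there.

```python
import json
from urllib.request import urlopen

# Base URL taken from the notes. The response layout handled below is assumed.
BASE_URL = "https://dspace7test.ilri.org/server/api/statistics/usagereports"


def usage_report_url(uuid: str, report: str) -> str:
    """Build a report URL such as .../<uuid>_TotalVisits (hypothetical helper)."""
    return f"{BASE_URL}/{uuid}_{report}"


def total_from_report(report: dict) -> int:
    """Sum every number under points[].values in a usage report.

    Assumes a payload shaped like {"points": [{"values": {"views": 3}}]}.
    """
    return sum(
        int(value)
        for point in report.get("points", [])
        for value in point.get("values", {}).values()
    )


def fetch_total(uuid: str, report: str) -> int:
    """Fetch one report over HTTP and return its total (needs network access)."""
    with urlopen(usage_report_url(uuid, report), timeout=30) as response:
        return total_from_report(json.load(response))


# Offline check against a made-up payload in the assumed layout.
sample = {"points": [{"values": {"views": 3}}]}
print(usage_report_url("3f1b9605-f5ff-4bbb-8c89-d6fe4157f748", "TotalVisits"))
print(total_from_report(sample))  # 3
```

As the notes say, this only covers one object per request; getting stats for every item would still mean collecting the UUIDs from somewhere else first.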