diff --git a/content/posts/2020-11.md b/content/posts/2020-11.md
index f7305f21c..1d9e73844 100644
--- a/content/posts/2020-11.md
+++ b/content/posts/2020-11.md
@@ -119,4 +119,40 @@ dspace=# COMMIT;
 - I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01... hmmmmm
 - I added a [lowercase formatter to OpenRXV](https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee) so that we can lowercase AGROVOC subjects during harvesting
+## 2020-11-11
+
+- Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
+  - I told them to proceed, as it's within our budget of credits
+  - They will write a processor for DSpace 6 to remove the duplicates
+- I did some tests to add a usage statistics chart to the item views on DSpace Test
+  - It is inspired by Salem's work on WorldFish's repository, and it hits the dspace-statistics-api for the current item and displays a graph
+  - I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js is HTML5 canvas and doesn't allow theming via CSS (so our Bootstrap brand colors for each theme won't work)
+  - I think I'll pursue this after the DSpace 6 upgrade...
+
+## 2020-11-12
+
+- I was looking at Solr again trying to find a way to get community and collection stats by faceting on `owningComm` and `owningColl` and it seems to work actually
+  - The duplicated values in the multi-value fields don't seem to affect the counts, as I had thought previously (though we should still get rid of them)
+  - One major difference between the raw numbers I was looking at and Atmire's numbers is that Atmire's code filters "Internal" IP addresses...
+  - Also, instead of doing `isBot:false` I think I should do `-isBot:true` because it's not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true
+- First we get the total number of communities with stats (using calcdistinct):
+
+```
+facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
+```
+
+- Then get stats themselves, iterating 100 items at a time with limit and offset:
+
+```
+facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
+```
+
+- I was surprised to see 10,000,000 docs with `isBot:true` when I was testing on DSpace Test...
+  - This has got to be a mistake of some kind, as I see 4 million in 2014 that are from `dns:localhost.`, perhaps that's when we didn't have useProxies set up correctly?
+  - I don't see the same thing on CGSpace... I wonder what happened?
+  - Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with `isBot:true` in the future and not automatically purge these!!!
+- I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc...
+  - I hadn't seen monit before, but the others are already in DSpace's spider agents lists for some time so probably only appear in older stats cores
+  - The issue with purging these using `check-spider-hits.sh` is that it can't do case-insensitive regexes and some metacharacters like `\s` don't work so I added case-sensitive patterns to a local agents file and purged them with the script
+
diff --git a/docs/2020-11/index.html b/docs/2020-11/index.html
index 74206555b..0b739ed46 100644
--- a/docs/2020-11/index.html
+++ b/docs/2020-11/index.html
@@ -17,7 +17,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat
-
+
@@ -39,9 +39,9 @@ So far we’ve spent at least fifty hours to process the statistics and stat
 "@type": "BlogPosting",
 "headline": "November, 2020",
 "url": "https://alanorth.github.io/cgspace-notes/2020-11/",
- "wordCount": "817",
+ "wordCount": "1266",
 "datePublished": "2020-11-01T13:11:54+02:00",
- "dateModified": "2020-11-08T15:03:02+02:00",
+ "dateModified": "2020-11-10T17:00:02+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -246,6 +246,53 @@ dspace=# COMMIT;
  • I added a lowercase formatter to OpenRXV so that we can lowercase AGROVOC subjects during harvesting
  • +

    2020-11-11

    + +

    2020-11-12

    + +
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
    +
    +
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
    +
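The two Solr requests described in the 2020-11-12 notes (count the distinct communities first, then page through the facet 100 at a time) could be sketched roughly as below. This is only an illustration: the helper names (`shard_list`, `count_params`, `page_params`, `page_offsets`) are hypothetical, and the `q=*:*`, `rows=0`, and `fq=-isBot:true` parameters are assumptions added for a complete request, since the notes only show the facet, stats, and shards parameters. The code builds query strings and does not contact Solr.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8081/solr"  # host and port as written in the notes


def shard_list(base=SOLR_BASE, first_year=2010, last_year=2019):
    """Comma-separated shard URLs: the current core, then one core per year."""
    cores = ["statistics"] + [
        f"statistics-{year}" for year in range(last_year, first_year - 1, -1)
    ]
    return ",".join(f"{base}/{core}" for core in cores)


def _common_params(field, shards):
    return {
        "q": "*:*",           # assumption: match everything
        "fq": "-isBot:true",  # assumption: exclude bots, per the isBot note
        "rows": 0,            # assumption: only facet counts are wanted
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "shards": shards,
    }


def count_params(field, shards):
    """First request: ask the stats component for the distinct value count."""
    params = _common_params(field, shards)
    params.update({
        "facet.limit": 1,
        "facet.offset": 0,
        "stats": "true",
        "stats.field": field,
        "stats.calcdistinct": "true",
    })
    return params


def page_params(field, shards, offset, limit=100):
    """Later requests: one page of facet values at the given offset."""
    params = _common_params(field, shards)
    params.update({"facet.limit": limit, "facet.offset": offset})
    return params


def page_offsets(total, limit=100):
    """Offsets needed to cover `total` facet values, `limit` at a time."""
    return range(0, total, limit)


if __name__ == "__main__":
    shards = shard_list()
    print(urlencode(count_params("owningComm", shards)))
    # 250 is a made-up total; in practice it would come from the stats response.
    for offset in page_offsets(250):
        print(urlencode(page_params("owningComm", shards, offset)))
```

Swapping `owningComm` for `owningColl` would give the collection variant mentioned in the notes.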
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 466a72626..5f0b67863 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 64ccf5837..717996bee 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index b7fb5b53d..fdb267c95 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index d3f7cf7f7..e7ab6f81a 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index f3ef37359..f18ec8ec0 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index aa3753b03..4e3cb818e 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 5b73c1fd2..184160d27 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 51c3488af..14dbc9027 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 8fa58cf87..e31f8fe2c 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index f700d26ed..e93688575 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 573a339f9..f26af7ea6 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index dcfaf493a..e6f1e64e2 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index f2f3037c8..201f0ce61 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index f0d77b85c..e0d56313b 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 332c3cab1..af1632184 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index a0ad29bab..38d0e92e4 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index d53a1987d..c4bd7702e 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 2d8ad7560..89c0d123b 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 75ff0d4f8..110193aaa 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index d367b8a5f..e592f0970 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
- 2020-11-08T15:03:02+02:00
+ 2020-11-10T17:00:02+02:00
 https://alanorth.github.io/cgspace-notes/
- 2020-11-08T15:03:02+02:00
+ 2020-11-10T17:00:02+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-11-08T15:03:02+02:00
+ 2020-11-10T17:00:02+02:00
 https://alanorth.github.io/cgspace-notes/2020-11/
- 2020-11-08T15:03:02+02:00
+ 2020-11-10T17:00:02+02:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2020-11-08T15:03:02+02:00
+ 2020-11-10T17:00:02+02:00