diff --git a/content/posts/2020-11.md b/content/posts/2020-11.md
index f7305f21c..1d9e73844 100644
--- a/content/posts/2020-11.md
+++ b/content/posts/2020-11.md
@@ -119,4 +119,40 @@ dspace=# COMMIT;
 - I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01... hmmmmm
 - I added a [lowercase formatter to OpenRXV](https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee) so that we can lowercase AGROVOC subjects during harvesting
+## 2020-11-11
+
+- Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
+  - I told them to proceed, as it's within our budget of credits
+  - They will write a processor for DSpace 6 to remove the duplicates
+- I did some tests to add a usage statistics chart to the item views on DSpace Test
+  - It is inspired by Salem's work on WorldFish's repository, and it hits the dspace-statistics-api for the current item and displays a graph
+  - I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js is HTML5 canvas and doesn't allow theming via CSS (so our Bootstrap brand colors for each theme won't work)
+  - I think I'll pursue this after the DSpace 6 upgrade...
+
+## 2020-11-12
+
+- I was looking at Solr again trying to find a way to get community and collection stats by faceting on `owningComm` and `owningColl` and it seems to work actually
+  - The duplicated values in the multi-value fields don't seem to affect the counts, as I had thought previously (though we should still get rid of them)
+  - One major difference between the raw numbers I was looking at and Atmire's numbers is that Atmire's code filters "Internal" IP addresses...
+  - Also, instead of doing `isBot:false` I think I should do `-isBot:true` because it's not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true
+- First we get the total number of communities with stats (using calcdistinct):
+
+```
+facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
+```
+
+- Then get stats themselves, iterating 100 items at a time with limit and offset:
+
+```
+facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
+```
+
+- I was surprised to see 10,000,000 docs with `isBot:true` when I was testing on DSpace Test...
+  - This has got to be a mistake of some kind, as I see 4 million in 2014 that are from `dns:localhost.`, perhaps that's when we didn't have useProxies set up correctly?
+  - I don't see the same thing on CGSpace... I wonder what happened?
+  - Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with `isBot:true` in the future and not automatically purge these!!!
+- I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc...
+  - I hadn't seen monit before, but the others are already in DSpace's spider agents lists for some time so probably only appear in older stats cores
+  - The issue with purging these using `check-spider-hits.sh` is that it can't do case-insensitive regexes and some metacharacters like `\s` don't work so I added case-sensitive patterns to a local agents file and purged them with the script
+
diff --git a/docs/2020-11/index.html b/docs/2020-11/index.html
index 74206555b..0b739ed46 100644
--- a/docs/2020-11/index.html
+++ b/docs/2020-11/index.html
@@ -17,7 +17,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat
-
+
@@ -39,9 +39,9 @@ So far we’ve spent at least fifty hours to process the statistics and stat
 "@type": "BlogPosting",
 "headline": "November, 2020",
 "url": "https://alanorth.github.io/cgspace-notes/2020-11/",
- "wordCount": "817",
+ "wordCount": "1266",
 "datePublished": "2020-11-01T13:11:54+02:00",
- "dateModified": "2020-11-08T15:03:02+02:00",
+ "dateModified": "2020-11-10T17:00:02+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -246,6 +246,53 @@ dspace=# COMMIT;
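
The two-step Solr approach from the 2020-11-12 notes (count the distinct `owningComm` values first, then page through the facets 100 at a time while excluding `isBot:true`) could be sketched roughly like this in Python. This is only a sketch under assumptions: the helper names are hypothetical, and the `q=*:*`, `rows=0`, and `fq=-isBot:true` parameters are my additions, since the raw queries in the notes only list the facet, stats, and shards parameters. The shard URLs are copied from those queries.

```python
from urllib.parse import urlencode

# Shard list copied from the queries in the notes: the main statistics core
# plus the yearly cores from 2019 back to 2010.
SOLR_BASE = "http://localhost:8081/solr"
SHARDS = ",".join(
    [f"{SOLR_BASE}/statistics"]
    + [f"{SOLR_BASE}/statistics-{year}" for year in range(2019, 2009, -1)]
)


def base_params(field):
    """Parameters shared by both queries (hypothetical helper)."""
    return {
        "q": "*:*",           # assumption: match all documents
        "fq": "-isBot:true",  # exclude known bots rather than require isBot:false
        "rows": 0,            # assumption: only facets and stats are needed
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "shards": SHARDS,
    }


def distinct_count_params(field="owningComm"):
    """Step 1: ask the stats component for the number of distinct values."""
    params = base_params(field)
    params.update({
        "facet.limit": 1,
        "facet.offset": 0,
        "stats": "true",
        "stats.field": field,
        "stats.calcdistinct": "true",
    })
    return params


def facet_page_params(field="owningComm", offset=0, limit=100):
    """Step 2: fetch one page of facet counts."""
    params = base_params(field)
    params.update({"facet.limit": limit, "facet.offset": offset})
    return params


def page_offsets(total, page_size=100):
    """Offsets needed to walk through `total` facet values."""
    return range(0, total, page_size)


def parse_flat_facets(flat):
    """Turn Solr's default flat facet list [value, count, ...] into a dict."""
    return dict(zip(flat[::2], flat[1::2]))


if __name__ == "__main__":
    # Print the query strings; sending them to Solr is left out on purpose.
    print(urlencode(distinct_count_params()))
    for offset in page_offsets(250):
        print(urlencode(facet_page_params(offset=offset)))
```

The total for `page_offsets` would come from the `countDistinct` value in the stats section of the first response. Using `-isBot:true` as a filter query keeps documents that have no `isBot` field at all, which is the point made in the notes.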