Add notes for 2020-11-12

2020-11-13 09:22:18 +02:00
parent b99114d8e4
commit 61474b6663
22 changed files with 110 additions and 27 deletions

@@ -119,4 +119,40 @@ dspace=# COMMIT;
- I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01... hmmmmm
- I added a [lowercase formatter to OpenRXV](https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee) so that we can lowercase AGROVOC subjects during harvesting
## 2020-11-11
- Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
- I told them to proceed, as it's within our budget of credits
- They will write a processor for DSpace 6 to remove the duplicates
- I did some tests to add a usage statistics chart to the item views on DSpace Test
- It is inspired by Salem's work on WorldFish's repository, and it hits the dspace-statistics-api for the current item and displays a graph
- I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js renders to an HTML5 canvas, which can't be themed via CSS (so our Bootstrap brand colors for each theme won't work)
- I think I'll pursue this after the DSpace 6 upgrade...
## 2020-11-12
- I was looking at Solr again, trying to find a way to get community and collection stats by faceting on `owningComm` and `owningColl`, and it actually seems to work
- The duplicated values in the multi-value fields don't seem to affect the counts, contrary to what I had thought previously (though we should still get rid of them)
- One major difference between the raw numbers I was looking at and Atmire's numbers is that Atmire's code filters "Internal" IP addresses...
- Also, instead of doing `isBot:false` I think I should do `-isBot:true`, because not every document is guaranteed to have this field set to false, but we can definitely exclude the ones that have it set to true
- First we get the total number of communities with stats (using calcdistinct):
```
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
```
- Then we get the stats themselves, iterating 100 facet values at a time with `facet.limit` and `facet.offset`:
```
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
```
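- The two queries above could be scripted roughly as follows. This is only a sketch: `facet_params`, `iter_facets`, and `fetch` are hypothetical helper names, and the `q`, `fq`, `rows`, and `wt` parameters are assumptions, as they are not part of the query strings above (the `fq=-isBot:true` follows the idea noted earlier):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8081/solr"
# The current core plus the yearly cores from 2019 down to 2010
CORES = ["statistics"] + [f"statistics-{year}" for year in range(2019, 2009, -1)]


def shards_param(base=SOLR, cores=CORES):
    """Build the comma-separated shards value used in both queries."""
    return ",".join(f"{base}/{core}" for core in cores)


def facet_params(field, limit, offset, distinct=False):
    """Build the query parameters for one page of facet values."""
    params = {
        "q": "*:*",            # assumption: match all documents
        "fq": "-isBot:true",   # assumption: exclude only documents flagged as bots
        "rows": 0,             # assumption: we only want facets, not documents
        "wt": "json",          # assumption: JSON response format
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": limit,
        "facet.offset": offset,
        "shards": shards_param(),
    }
    if distinct:
        # Ask Solr to count the distinct values of the field as well
        params.update(
            {"stats": "true", "stats.field": field, "stats.calcdistinct": "true"}
        )
    return params


def fetch(params, url=f"{SOLR}/statistics/select"):
    """Send one query to Solr and return the parsed JSON response."""
    with urlopen(f"{url}?{urlencode(params)}") as response:
        return json.load(response)


def iter_facets(fetch, field, total, page_size=100):
    """Yield (value, count) pairs, paging with facet.limit and facet.offset."""
    for offset in range(0, total, page_size):
        response = fetch(facet_params(field, page_size, offset))
        # Assumes the default flat JSON facet format: [value, count, value, count, ...]
        flat = response["facet_counts"]["facet_fields"][field]
        yield from zip(flat[0::2], flat[1::2])
```

- Here `total` would be read from `stats.stats_fields.owningComm.countDistinct` in the response to the first query (`facet_params("owningComm", 1, 0, distinct=True)`), and the same approach should apply to `owningColl`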
- I was surprised to see 10,000,000 docs with `isBot:true` when I was testing on DSpace Test...
- This has got to be a mistake of some kind, as I see 4 million in 2014 that are from `dns:localhost.`; perhaps that's from when we didn't have useProxies set up correctly?
- I don't see the same thing on CGSpace... I wonder what happened?
- Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with `isBot:true` in the future and not automatically purge these!!!
- I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc...
- I hadn't seen monit before, but the others have been in DSpace's spider agents lists for some time, so they probably only appear in older stats cores
- The issue with purging these using `check-spider-hits.sh` is that it can't do case-insensitive regexes, and some metacharacters like `\s` don't work, so I added case-sensitive patterns to a local agents file and purged them with the script
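- One way to write such case-sensitive patterns without enumerating the variants by hand is to expand each letter into a two-character class. A minimal sketch, assuming the regex dialect in use accepts bracket character classes; `expand_case` is a hypothetical helper and not part of `check-spider-hits.sh`, and the set of escaped metacharacters is an assumption:

```python
import re

# Assumption: a small set of regex metacharacters that should be escaped
SPECIAL = set(".?+*|{}[]()\\")


def expand_case(literal: str) -> str:
    """Expand a literal into a case-sensitive pattern that matches any casing."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            # "m" becomes "[Mm]", so no case-insensitive flag is needed
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        elif ch == " ":
            # A literal space in a class, instead of relying on \s
            parts.append("[ ]")
        elif ch in SPECIAL:
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


if __name__ == "__main__":
    for agent in ("monit", "FeedBurner", "Blackboard Safeassign"):
        pattern = expand_case(agent)
        # Python's re stands in for the real regex engine in this sanity check
        assert re.fullmatch(pattern, agent.upper())
        print(pattern)
```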
<!-- vim: set sw=2 ts=2: -->