Add notes for 2020-11-12

Alan Orth 2020-11-13 09:22:18 +02:00
parent b99114d8e4
commit 61474b6663
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
22 changed files with 110 additions and 27 deletions

View File

@ -119,4 +119,40 @@ dspace=# COMMIT;
- I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01... hmmmmm
- I added a [lowercase formatter to OpenRXV](https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee) so that we can lowercase AGROVOC subjects during harvesting
## 2020-11-11
- Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
- I told them to proceed, as it's within our budget of credits
- They will write a processor for DSpace 6 to remove the duplicates
- I did some tests to add a usage statistics chart to the item views on DSpace Test
- It is inspired by Salem's work on WorldFish's repository, and it hits the dspace-statistics-api for the current item and displays a graph
- I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js renders to an HTML5 canvas and doesn't allow theming via CSS (so our Bootstrap brand colors for each theme won't work)
- I think I'll pursue this after the DSpace 6 upgrade...
## 2020-11-12
- I was looking at Solr again trying to find a way to get community and collection stats by faceting on `owningComm` and `owningColl` and it seems to work actually
- The duplicated values in the multi-value fields don't seem to affect the counts, contrary to what I had thought previously (though we should still get rid of them)
- One major difference between the raw numbers I was looking at and Atmire's numbers is that Atmire's code filters "Internal" IP addresses...
- Also, instead of doing `isBot:false` I think I should do `-isBot:true` because it's not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true
- First we get the total number of communities with stats (using calcdistinct):
```
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
```
- Then get stats themselves, iterating 100 items at a time with limit and offset:
```
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
```
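The two queries above can be sketched in Python as a parameter builder plus the paging arithmetic. This is only a rough sketch under some assumptions: `q=*:*`, `rows=0`, and `wt=json` are not shown in the notes, the helper names are made up for illustration, and the facet parsing assumes Solr's default flat JSON layout of alternating values and counts.

```python
from urllib.parse import urlencode

# Base URL taken from the shards list in the queries above
SOLR_BASE = "http://localhost:8081/solr"
# The main statistics core plus the yearly cores from 2019 down to 2010
SHARDS = ",".join(
    [f"{SOLR_BASE}/statistics"]
    + [f"{SOLR_BASE}/statistics-{year}" for year in range(2019, 2009, -1)]
)


def facet_params(field, limit, offset, count_distinct=False):
    """Build the query parameters for one facet request (hypothetical helper)."""
    params = {
        "q": "*:*",    # assumption: match everything
        "rows": 0,     # assumption: we only want facets, not documents
        "wt": "json",  # assumption: JSON response
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": limit,
        "facet.offset": offset,
        "shards": SHARDS,
    }
    if count_distinct:
        # First request only: ask Solr for the number of distinct values
        params.update(
            {"stats": "true", "stats.field": field, "stats.calcdistinct": "true"}
        )
    return params


def page_offsets(total, page_size=100):
    """Offsets needed to walk `total` facet values, `page_size` at a time."""
    return list(range(0, total, page_size))


def pair_up(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Query string for the first request (distinct count of communities)
first_query = urlencode(facet_params("owningComm", 1, 0, count_distinct=True))
```

With the distinct count from the first response in hand, `page_offsets` gives the `facet.offset` value for each follow-up request, and `pair_up` converts each page of facet results into a mapping of community to hit count.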
- I was surprised to see 10,000,000 docs with `isBot:true` when I was testing on DSpace Test...
- This has got to be a mistake of some kind, as I see 4 million in 2014 that are from `dns:localhost.`; perhaps that was when we didn't have useProxies set up correctly?
- I don't see the same thing on CGSpace... I wonder what happened?
- Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with `isBot:true` in the future and not automatically purge these!!!
- I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc...
- I hadn't seen monit before, but the others have been in DSpace's spider agents lists for some time, so they probably only appear in older stats cores
- The issue with purging these using `check-spider-hits.sh` is that it can't do case-insensitive regexes and some metacharacters like `\s` don't work, so I added case-sensitive patterns to a local agents file and purged them with the script
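
The difference between `isBot:false` and `-isBot:true` noted above can be illustrated without Solr at all. The following is only a toy model in Python with hypothetical sample documents and made-up function names; it mimics the filter semantics rather than querying a real core.

```python
# Hypothetical sample documents; real statistics documents have many more fields
docs = [
    {"id": 1, "isBot": True},
    {"id": 2, "isBot": False},
    {"id": 3},  # no isBot field at all
]


def is_bot_false(docs):
    """Mimic isBot:false: keep only docs where the field exists and is false."""
    return [d["id"] for d in docs if d.get("isBot") is False]


def not_is_bot_true(docs):
    """Mimic -isBot:true: drop docs where the field is true, keep the rest."""
    return [d["id"] for d in docs if d.get("isBot") is not True]


print(is_bot_false(docs))     # document 3 is silently lost
print(not_is_bot_true(docs))  # document 3 is kept
```

The negative form is the safer default for counting, because a document that was never tagged either way still counts as a real hit.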
<!-- vim: set sw=2 ts=2: -->

View File

@ -17,7 +17,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-11/" />
<meta property="article:published_time" content="2020-11-01T13:11:54+02:00" />
<meta property="article:modified_time" content="2020-11-08T15:03:02+02:00" />
<meta property="article:modified_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2020"/>
@ -39,9 +39,9 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"@type": "BlogPosting",
"headline": "November, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-11/",
"wordCount": "817",
"wordCount": "1266",
"datePublished": "2020-11-01T13:11:54+02:00",
"dateModified": "2020-11-08T15:03:02+02:00",
"dateModified": "2020-11-10T17:00:02+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -246,6 +246,53 @@ dspace=# COMMIT;
</li>
<li>I added a <a href="https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee">lowercase formatter to OpenRXV</a> so that we can lowercase AGROVOC subjects during harvesting</li>
</ul>
<h2 id="2020-11-11">2020-11-11</h2>
<ul>
<li>Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
<ul>
<li>I told them to proceed, as it&rsquo;s within our budget of credits</li>
<li>They will write a processor for DSpace 6 to remove the duplicates</li>
</ul>
</li>
<li>I did some tests to add a usage statistics chart to the item views on DSpace Test
<ul>
<li>It is inspired by Salem&rsquo;s work on WorldFish&rsquo;s repository, and it hits the dspace-statistics-api for the current item and displays a graph</li>
<li>I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js renders to an HTML5 canvas and doesn&rsquo;t allow theming via CSS (so our Bootstrap brand colors for each theme won&rsquo;t work)</li>
<li>I think I&rsquo;ll pursue this after the DSpace 6 upgrade&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-11-12">2020-11-12</h2>
<ul>
<li>I was looking at Solr again trying to find a way to get community and collection stats by faceting on <code>owningComm</code> and <code>owningColl</code> and it seems to work actually
<ul>
<li>The duplicated values in the multi-value fields don&rsquo;t seem to affect the counts, contrary to what I had thought previously (though we should still get rid of them)</li>
<li>One major difference between the raw numbers I was looking at and Atmire&rsquo;s numbers is that Atmire&rsquo;s code filters &ldquo;Internal&rdquo; IP addresses&hellip;</li>
<li>Also, instead of doing <code>isBot:false</code> I think I should do <code>-isBot:true</code> because it&rsquo;s not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true</li>
</ul>
</li>
<li>First we get the total number of communities with stats (using calcdistinct):</li>
</ul>
<pre><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=owningComm&amp;stats.calcdistinct=true&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>Then get stats themselves, iterating 100 items at a time with limit and offset:</li>
</ul>
<pre><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=100&amp;facet.offset=0&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>I was surprised to see 10,000,000 docs with <code>isBot:true</code> when I was testing on DSpace Test&hellip;
<ul>
<li>This has got to be a mistake of some kind, as I see 4 million in 2014 that are from <code>dns:localhost.</code>; perhaps that was when we didn&rsquo;t have useProxies set up correctly?</li>
<li>I don&rsquo;t see the same thing on CGSpace&hellip; I wonder what happened?</li>
<li>Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with <code>isBot:true</code> in the future and not automatically purge these!!!</li>
</ul>
</li>
<li>I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc&hellip;
<ul>
<li>I hadn&rsquo;t seen monit before, but the others have been in DSpace&rsquo;s spider agents lists for some time, so they probably only appear in older stats cores</li>
<li>The issue with purging these using <code>check-spider-hits.sh</code> is that it can&rsquo;t do case-insensitive regexes and some metacharacters like <code>\s</code> don&rsquo;t work, so I added case-sensitive patterns to a local agents file and purged them with the script</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-11-08T15:03:02+02:00" />
<meta property="og:updated_time" content="2020-11-10T17:00:02+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-11-08T15:03:02+02:00</lastmod>
<lastmod>2020-11-10T17:00:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-11-08T15:03:02+02:00</lastmod>
<lastmod>2020-11-10T17:00:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-11-08T15:03:02+02:00</lastmod>
<lastmod>2020-11-10T17:00:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-11/</loc>
<lastmod>2020-11-08T15:03:02+02:00</lastmod>
<lastmod>2020-11-10T17:00:02+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-11-08T15:03:02+02:00</lastmod>
<lastmod>2020-11-10T17:00:02+02:00</lastmod>
</url>
<url>