Update notes for 2018-09-25

Alan Orth 2018-09-25 19:05:02 +03:00
parent 2634aff405
commit f4c053ef76
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 50 additions and 8 deletions


@@ -468,5 +468,24 @@ $ psql -h localhost -U postgres dspacestatistics
dspacestatistics=> CREATE TABLE IF NOT EXISTS items
dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
```
## 2018-09-25
- I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views
- I'm not even sure how that's possible, as we only have 74,000 items!
- I need to inspect the `id` values that are returned for views and cross-check them with the `owningItem` values for bitstream downloads...
- Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr `id` field doesn't correspond with *actual* DSpace items?)
- I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don't give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
- CGSpace's Solr core has 150,000,000 documents in it... and it's still pretty fast to query, but it's really a maintenance and backup burden
- DSpace Test currently has about 2,000,000 documents with `isBot:true` in its Solr statistics core, and the size on disk is 2GB (it's not much, but I have to test this somewhere!)
- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance) I can use `dspace stats-util -f`, so let's try it:
```
$ dspace stats-util -f
```
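- A minimal sketch of how the `isBot:true` count could be checked from a script before and after running that command; the base URL (host, port, and path) is an assumption rather than something taken from these notes, and the sample response is made up to mirror the JSON shape that Solr returns for `wt=json`:

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL: host, port, and path are assumptions, not from these notes
SOLR_STATISTICS_URL = "http://localhost:8081/solr/statistics"


def bot_count_url(base_url=SOLR_STATISTICS_URL):
    """Build a select URL that matches isBot:true but asks for zero rows."""
    params = {"q": "isBot:true", "rows": 0, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def num_found(response_body):
    """Extract the match count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Made-up response body, only to show where the count lives
sample = '{"response": {"numFound": 201, "start": 0, "docs": []}}'
print(bot_count_url())
print(num_found(sample))  # 201
```

- Asking for `rows=0` keeps the response small, because only the `numFound` total is needed and not the documents themselves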
- The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with `isBot:true`
- While I was writing a message to the dspace-tech mailing list I decided to check the number of bot view events on DSpace Test again, and now it's 201 instead of 2,000,000, and the statistics core is only 30MB!
- I will set the `logBots = false` property in `dspace/config/modules/usage-statistics.cfg` on DSpace Test and check if the number of `isBot:true` events goes up any more...
- I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)... I will check again tomorrow
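- A rough sketch of the ID cross-check mentioned earlier, assuming the `id` values from Solr and the IDs from the items table have already been exported as two plain lists; the function name and the sample IDs are hypothetical and only illustrate the set comparison:

```python
def split_by_item_table(solr_ids, item_ids):
    """Split IDs seen in Solr into those present in the items table and those not."""
    known = set(item_ids)
    seen = set(solr_ids)  # duplicates collapse, one entry per distinct ID
    return sorted(seen & known), sorted(seen - known)


# Made-up IDs for illustration only
solr_view_ids = [10, 11, 11, 99, 100]
items_table_ids = [10, 11, 12]

matched, unmatched = split_by_item_table(solr_view_ids, items_table_ids)
print(matched)    # [10, 11]
print(unmatched)  # [99, 100]
```

- A large `unmatched` list would support the suspicion that the Solr `id` field doesn't always correspond to actual items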
<!-- vim: set sw=2 ts=2: -->


@@ -18,7 +18,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54&#43;03:00"/>
<meta property="article:modified_time" content="2018-09-25T02:24:43&#43;03:00"/>
<meta property="article:modified_time" content="2018-09-25T11:33:05&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2018"/>
<meta name="twitter:description" content="2018-09-02
@@ -41,9 +41,9 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
"@type": "BlogPosting",
"headline": "September, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-09/",
"wordCount": "3379",
"wordCount": "3702",
"datePublished": "2018-09-02T09:55:54&#43;03:00",
"dateModified": "2018-09-25T02:24:43&#43;03:00",
"dateModified": "2018-09-25T11:33:05&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -649,6 +649,29 @@ dspacestatistics=&gt; CREATE TABLE IF NOT EXISTS items
dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
</code></pre>
<h2 id="2018-09-25">2018-09-25</h2>
<ul>
<li>I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views</li>
<li>I&rsquo;m not even sure how that&rsquo;s possible, as we only have 74,000 items!</li>
<li>I need to inspect the <code>id</code> values that are returned for views and cross-check them with the <code>owningItem</code> values for bitstream downloads&hellip;</li>
<li>Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr <code>id</code> field doesn&rsquo;t correspond with <em>actual</em> DSpace items?)</li>
<li>I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don&rsquo;t give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)</li>
<li>CGSpace&rsquo;s Solr core has 150,000,000 documents in it&hellip; and it&rsquo;s still pretty fast to query, but it&rsquo;s really a maintenance and backup burden</li>
<li>DSpace Test currently has about 2,000,000 documents with <code>isBot:true</code> in its Solr statistics core, and the size on disk is 2GB (it&rsquo;s not much, but I have to test this somewhere!)</li>
<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance">DSpace 5.x Solr documentation</a> I can use <code>dspace stats-util -f</code>, so let&rsquo;s try it:</li>
</ul>
<pre><code>$ dspace stats-util -f
</code></pre>
<ul>
<li>The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with <code>isBot:true</code></li>
<li>While I was writing a message to the dspace-tech mailing list I decided to check the number of bot view events on DSpace Test again, and now it&rsquo;s 201 instead of 2,000,000, and the statistics core is only 30MB!</li>
<li>I will set the <code>logBots = false</code> property in <code>dspace/config/modules/usage-statistics.cfg</code> on DSpace Test and check if the number of <code>isBot:true</code> events goes up any more&hellip;</li>
<li>I restarted the server with <code>logBots = false</code> and after it came back up I see 266 events with <code>isBot:true</code> (maybe they were buffered)&hellip; I will check again tomorrow</li>
</ul>
<!-- vim: set sw=2 ts=2: -->


@@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-09/</loc>
<lastmod>2018-09-25T02:24:43+03:00</lastmod>
<lastmod>2018-09-25T11:33:05+03:00</lastmod>
</url>
<url>
@@ -184,7 +184,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-09-25T02:24:43+03:00</lastmod>
<lastmod>2018-09-25T11:33:05+03:00</lastmod>
<priority>0</priority>
</url>
@@ -195,7 +195,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-09-25T02:24:43+03:00</lastmod>
<lastmod>2018-09-25T11:33:05+03:00</lastmod>
<priority>0</priority>
</url>
@@ -207,13 +207,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-09-25T02:24:43+03:00</lastmod>
<lastmod>2018-09-25T11:33:05+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-09-25T02:24:43+03:00</lastmod>
<lastmod>2018-09-25T11:33:05+03:00</lastmod>
<priority>0</priority>
</url>