Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-11-22 14:45:03 +01:00)
Update notes for 2018-09-25
parent 2634aff405
commit f4c053ef76
@@ -468,5 +468,24 @@ $ psql -h localhost -U postgres dspacestatistics
 dspacestatistics=> CREATE TABLE IF NOT EXISTS items
 dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
 ```
+## 2018-09-25
+
+- I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views
+- I'm not even sure how that's possible, as we only have 74,000 items!
+- I need to inspect the `id` values that are returned for views and cross check them with the `owningItem` values for bitstream downloads...
+- Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr `id` field doesn't correspond with *actual* DSpace items?)
+- I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don't give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)
+- CGSpace's Solr core has 150,000,000 documents in it... and it's still pretty fast to query, but it's really a maintenance and backup burden
+- DSpace Test currently has about 2,000,000 documents with `isBot:true` in its Solr statistics core, and the size on disk is 2GB (it's not much, but I have to test this somewhere!)
+- According to the [DSpace 5.x Solr documentation](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance) I can use `dspace stats-util -f`, so let's try it:
+
+```
+$ dspace stats-util -f
+```
+
+- The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with `isBot:true`
+- I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it's 201 instead of 2,000,000, and statistics core is only 30MB now!
+- I will set the `logBots = false` property in `dspace/config/modules/usage-statistics.cfg` on DSpace Test and check if the number of `isBot:true` events goes up any more...
+- I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)... I will check again tomorrow
 
 <!-- vim: set sw=2 ts=2: -->
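The notes above repeatedly check how many `isBot:true` documents remain in the statistics core. A minimal sketch of that check, assuming the core is reachable over HTTP at a hypothetical `http://localhost:8080/solr/statistics` (the real host, port, and core path are not given in the notes) and using Solr's standard `select` handler with `rows=0` so that only the match count comes back:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the Solr statistics core; adjust for your deployment.
SOLR_STATS_URL = "http://localhost:8080/solr/statistics"


def build_count_url(base_url, query="isBot:true"):
    """Build a select URL that returns only the match count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{base_url}/select?{params}"


def parse_num_found(body):
    """Extract response.numFound from a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]


def count_bot_events(base_url=SOLR_STATS_URL):
    """Query Solr and return the number of documents matching isBot:true."""
    with urlopen(build_count_url(base_url)) as resp:
        return parse_num_found(resp.read().decode("utf-8"))
```

Running `count_bot_events()` before and after `dspace stats-util -f`, and again the next day with `logBots = false`, would give comparable numbers without paging through any documents.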
@@ -18,7 +18,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
 " />
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54+03:00"/>
-<meta property="article:modified_time" content="2018-09-25T02:24:43+03:00"/>
+<meta property="article:modified_time" content="2018-09-25T11:33:05+03:00"/>
 <meta name="twitter:card" content="summary"/>
 <meta name="twitter:title" content="September, 2018"/>
 <meta name="twitter:description" content="2018-09-02
@@ -41,9 +41,9 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
 "@type": "BlogPosting",
 "headline": "September, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-09/",
-"wordCount": "3379",
+"wordCount": "3702",
 "datePublished": "2018-09-02T09:55:54+03:00",
-"dateModified": "2018-09-25T02:24:43+03:00",
+"dateModified": "2018-09-25T11:33:05+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -649,6 +649,29 @@ dspacestatistics=> CREATE TABLE IF NOT EXISTS items
 dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
 </code></pre>
 
+<h2 id="2018-09-25">2018-09-25</h2>
+
+<ul>
+<li>I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views</li>
+<li>I’m not even sure how that’s possible, as we only have 74,000 items!</li>
+<li>I need to inspect the <code>id</code> values that are returned for views and cross check them with the <code>owningItem</code> values for bitstream downloads…</li>
+<li>Also, I could try to check all IDs against the items table to see if they are actually items (perhaps the Solr <code>id</code> field doesn’t correspond with <em>actual</em> DSpace items?)</li>
+<li>I want to purge the bot hits from the Solr statistics core, as I am now realizing that I don’t give a shit about tens of millions of hits by Google and Bing indexing my shit every day (at least not in Solr!)</li>
+<li>CGSpace’s Solr core has 150,000,000 documents in it… and it’s still pretty fast to query, but it’s really a maintenance and backup burden</li>
+<li>DSpace Test currently has about 2,000,000 documents with <code>isBot:true</code> in its Solr statistics core, and the size on disk is 2GB (it’s not much, but I have to test this somewhere!)</li>
+<li>According to the <a href="https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance">DSpace 5.x Solr documentation</a> I can use <code>dspace stats-util -f</code>, so let’s try it:</li>
+</ul>
+
+<pre><code>$ dspace stats-util -f
+</code></pre>
+
+<ul>
+<li>The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with <code>isBot:true</code></li>
+<li>I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and statistics core is only 30MB now!</li>
+<li>I will set the <code>logBots = false</code> property in <code>dspace/config/modules/usage-statistics.cfg</code> on DSpace Test and check if the number of <code>isBot:true</code> events goes up any more…</li>
+<li>I restarted the server with <code>logBots = false</code> and after it came back up I see 266 events with <code>isBot:true</code> (maybe they were buffered)… I will check again tomorrow</li>
+</ul>
+
 <!-- vim: set sw=2 ts=2: -->
 
 
@@ -4,7 +4,7 @@
 
 <url>
 <loc>https://alanorth.github.io/cgspace-notes/2018-09/</loc>
-<lastmod>2018-09-25T02:24:43+03:00</lastmod>
+<lastmod>2018-09-25T11:33:05+03:00</lastmod>
 </url>
 
 <url>
@@ -184,7 +184,7 @@
 
 <url>
 <loc>https://alanorth.github.io/cgspace-notes/</loc>
-<lastmod>2018-09-25T02:24:43+03:00</lastmod>
+<lastmod>2018-09-25T11:33:05+03:00</lastmod>
 <priority>0</priority>
 </url>
 
@@ -195,7 +195,7 @@
 
 <url>
 <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-<lastmod>2018-09-25T02:24:43+03:00</lastmod>
+<lastmod>2018-09-25T11:33:05+03:00</lastmod>
 <priority>0</priority>
 </url>
 
@@ -207,13 +207,13 @@
 
 <url>
 <loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
-<lastmod>2018-09-25T02:24:43+03:00</lastmod>
+<lastmod>2018-09-25T11:33:05+03:00</lastmod>
 <priority>0</priority>
 </url>
 
 <url>
 <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-<lastmod>2018-09-25T02:24:43+03:00</lastmod>
+<lastmod>2018-09-25T11:33:05+03:00</lastmod>
 <priority>0</priority>
 </url>
 