mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 14:45:03 +01:00
Update notes for 2018-09-25
parent
a55bf52bc1
commit
fd2cca6fd5
@ -489,8 +489,26 @@ $ dspace stats-util -f
|
|||||||
- I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)... I will check again tomorrow
|
- I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)... I will check again tomorrow
|
||||||
- After a few hours I see there are still only 266 view events with `isBot:true` on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
|
- After a few hours I see there are still only 266 view events with `isBot:true` on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
|
||||||
- Also, CGSpace currently has 60,089,394 view events with `isBot:true` in its Solr statistics core and it is 124GB!
|
- Also, CGSpace currently has 60,089,394 view events with `isBot:true` in its Solr statistics core and it is 124GB!
|
||||||
- Amazing! After running `dspace stats-util -f` on CGSpace the Solr statistics core went from 124GB to 84GB, and there are only 700 events with `isBot:true` so I should really disable logging of bot events!
|
- Amazing! After running `dspace stats-util -f` on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with `isBot:true` so I should really disable logging of bot events!
|
||||||
- I'm super curious to see how the JVM heap usage changes...
|
- I'm super curious to see how the JVM heap usage changes...
|
||||||
- I made (and merged) a pull request to disable bot logging on the `5_x-prod` branch ([#387](https://github.com/ilri/DSpace/pull/387))
|
- I made (and merged) a pull request to disable bot logging on the `5_x-prod` branch ([#387](https://github.com/ilri/DSpace/pull/387))
|
||||||
|
- Now I'm wondering if there are other bot requests that aren't classified as bots because the IP lists or user agents are outdated
|
||||||
|
- DSpace ships a list of spider IPs, for example: `config/spiders/iplists.com-google.txt`
|
||||||
|
- I checked the list against all the IPs we've seen using the "Googlebot" useragent on CGSpace's nginx access logs
|
||||||
|
- The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be "Googlebot"...
|
||||||
|
- According to the [Googlebot FAQ](https://support.google.com/webmasters/answer/80553) the domain name in the reverse DNS lookup should contain either `googlebot.com` or `google.com`
|
||||||
|
- In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
|
||||||
|
|
||||||
|
```
|
||||||
|
*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
|
||||||
|
```
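The reverse DNS rule from the Googlebot FAQ can be sketched in Python as a rough aid for checking the IPs from the nginx logs. The function names here are hypothetical and not part of DSpace or Solr. Note that the suffix check is stricter than the `dns:*googlebot.com.` wildcard above, which would also match a name like `fakegooglebot.com.`:

```python
import socket


def is_google_hostname(hostname: str) -> bool:
    """Return True if a reverse DNS name is under googlebot.com or google.com."""
    # Reverse DNS names are often stored fully qualified, with a trailing dot.
    name = hostname.rstrip(".").lower()
    domains = ("googlebot.com", "google.com")
    return any(name == d or name.endswith("." + d) for d in domains)


def reverse_dns(ip: str):
    """Look up the PTR name for an IP address, or return None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # Covers socket.herror and socket.gaierror (no PTR record, bad input).
        return None


def claims_checked(ip: str) -> bool:
    """True only if the IP's reverse DNS name looks like a real Google host."""
    name = reverse_dns(ip)
    return name is not None and is_google_hostname(name)
```

A stricter check would also resolve the returned name forward and confirm it maps back to the same IP; that step is left out of this sketch.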
|
||||||
|
|
||||||
|
- I translate that into a delete command using the `/update` handler:
|
||||||
|
|
||||||
|
```
|
||||||
|
http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
|
||||||
|
```
|
||||||
|
|
||||||
|
- And magically all those 81,000 documents are gone!
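
The delete URL above can also be built programmatically so that the query is percent-encoded consistently. This is only a sketch: `build_delete_url` is a hypothetical helper, and the host, port, and core name are copied from the URL above:

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Endpoint and query copied from the notes above.
SOLR_UPDATE = "http://localhost:8081/solr/statistics/update"
QUERY = "*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false"


def build_delete_url(query: str, base: str = SOLR_UPDATE) -> str:
    """Build a delete-by-query URL for Solr's /update handler."""
    # XML-escape the query so characters like & or < cannot break the body.
    body = f"<delete><query>{escape(query)}</query></delete>"
    # urlencode percent-encodes the body and turns spaces into '+'.
    return base + "?" + urlencode({"commit": "true", "stream.body": body})


if __name__ == "__main__":
    # Print the URL for review; nothing is sent to Solr here.
    print(build_delete_url(QUERY))
```

Running the same query against the `select` handler first is a cheap way to confirm the document count before deleting anything.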
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@ -18,7 +18,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
|
|||||||
" />
|
" />
|
||||||
<meta property="og:type" content="article" />
|
<meta property="og:type" content="article" />
|
||||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54+03:00"/>
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-09/" /><meta property="article:published_time" content="2018-09-02T09:55:54+03:00"/>
|
||||||
<meta property="article:modified_time" content="2018-09-25T21:45:14+03:00"/>
|
<meta property="article:modified_time" content="2018-09-25T22:06:05+03:00"/>
|
||||||
<meta name="twitter:card" content="summary"/>
|
<meta name="twitter:card" content="summary"/>
|
||||||
<meta name="twitter:title" content="September, 2018"/>
|
<meta name="twitter:title" content="September, 2018"/>
|
||||||
<meta name="twitter:description" content="2018-09-02
|
<meta name="twitter:description" content="2018-09-02
|
||||||
@ -41,9 +41,9 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
|
|||||||
"@type": "BlogPosting",
|
"@type": "BlogPosting",
|
||||||
"headline": "September, 2018",
|
"headline": "September, 2018",
|
||||||
"url": "https://alanorth.github.io/cgspace-notes/2018-09/",
|
"url": "https://alanorth.github.io/cgspace-notes/2018-09/",
|
||||||
"wordCount": "3812",
|
"wordCount": "3955",
|
||||||
"datePublished": "2018-09-02T09:55:54+03:00",
|
"datePublished": "2018-09-02T09:55:54+03:00",
|
||||||
"dateModified": "2018-09-25T21:45:14+03:00",
|
"dateModified": "2018-09-25T22:06:05+03:00",
|
||||||
"author": {
|
"author": {
|
||||||
"@type": "Person",
|
"@type": "Person",
|
||||||
"name": "Alan Orth"
|
"name": "Alan Orth"
|
||||||
@ -672,9 +672,29 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
|
|||||||
<li>I restarted the server with <code>logBots = false</code> and after it came back up I see 266 events with <code>isBot:true</code> (maybe they were buffered)… I will check again tomorrow</li>
|
<li>I restarted the server with <code>logBots = false</code> and after it came back up I see 266 events with <code>isBot:true</code> (maybe they were buffered)… I will check again tomorrow</li>
|
||||||
<li>After a few hours I see there are still only 266 view events with <code>isBot:true</code> on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon</li>
|
<li>After a few hours I see there are still only 266 view events with <code>isBot:true</code> on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon</li>
|
||||||
<li>Also, CGSpace currently has 60,089,394 view events with <code>isBot:true</code> in its Solr statistics core and it is 124GB!</li>
|
<li>Also, CGSpace currently has 60,089,394 view events with <code>isBot:true</code> in its Solr statistics core and it is 124GB!</li>
|
||||||
<li>Amazing! After running <code>dspace stats-util -f</code> on CGSpace the Solr statistics core went from 124GB to 84GB, and there are only 700 events with <code>isBot:true</code> so I should really disable logging of bot events!</li>
|
<li>Amazing! After running <code>dspace stats-util -f</code> on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with <code>isBot:true</code> so I should really disable logging of bot events!</li>
|
||||||
<li>I’m super curious to see how the JVM heap usage changes…</li>
|
<li>I’m super curious to see how the JVM heap usage changes…</li>
|
||||||
<li>I made (and merged) a pull request to disable bot logging on the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/387">#387</a>)</li>
|
<li>I made (and merged) a pull request to disable bot logging on the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/387">#387</a>)</li>
|
||||||
|
<li>Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated</li>
|
||||||
|
<li>DSpace ships a list of spider IPs, for example: <code>config/spiders/iplists.com-google.txt</code></li>
|
||||||
|
<li>I checked the list against all the IPs we’ve seen using the “Googlebot” useragent on CGSpace’s nginx access logs</li>
|
||||||
|
<li>The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…</li>
|
||||||
|
<li>According to the <a href="https://support.google.com/webmasters/answer/80553">Googlebot FAQ</a> the domain name in the reverse DNS lookup should contain either <code>googlebot.com</code> or <code>google.com</code></li>
|
||||||
|
<li>In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
|
||||||
|
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>I translate that into a delete command using the <code>/update</code> handler:</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
|
||||||
|
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>And magically all those 81,000 documents are gone!</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2018-09/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2018-09/</loc>
|
||||||
<lastmod>2018-09-25T21:45:14+03:00</lastmod>
|
<lastmod>2018-09-25T22:06:05+03:00</lastmod>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
@ -184,7 +184,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||||
<lastmod>2018-09-25T21:45:14+03:00</lastmod>
|
<lastmod>2018-09-25T22:06:05+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@ -195,7 +195,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||||
<lastmod>2018-09-25T21:45:14+03:00</lastmod>
|
<lastmod>2018-09-25T22:06:05+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@ -207,13 +207,13 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||||
<lastmod>2018-09-25T21:45:14+03:00</lastmod>
|
<lastmod>2018-09-25T22:06:05+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
||||||
<lastmod>2018-09-25T21:45:14+03:00</lastmod>
|
<lastmod>2018-09-25T22:06:05+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
|