diff --git a/content/posts/2018-09.md b/content/posts/2018-09.md
index 06f24a5f0..373e112a2 100644
--- a/content/posts/2018-09.md
+++ b/content/posts/2018-09.md
@@ -489,8 +489,26 @@ $ dspace stats-util -f
 - I restarted the server with `logBots = false` and after it came back up I see 266 events with `isBot:true` (maybe they were buffered)... I will check again tomorrow
 - After a few hours I see there are still only 266 view events with `isBot:true` on DSpace Test's Solr statistics core, so I'm definitely going to deploy this on CGSpace soon
 - Also, CGSpace currently has 60,089,394 view events with `isBot:true` in its Solr statistics core and it is 124GB!
-- Amazing! After running `dspace stats-util -f` on CGSpace the Solr statistics core went from 124GB to 84GB, and there are only 700 events with `isBot:true` so I should really disable logging of bot events!
+- Amazing! After running `dspace stats-util -f` on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with `isBot:true` so I should really disable logging of bot events!
 - I'm super curious to see how the JVM heap usage changes...
 - I made (and merged) a pull request to disable bot logging on the `5_x-prod` branch ([#387](https://github.com/ilri/DSpace/pull/387))
+- Now I'm wondering if there are other bot requests that aren't classified as bots because the IP lists or user agents are outdated
+- DSpace ships a list of spider IPs, for example: `config/spiders/iplists.com-google.txt`
+- I checked the list against all the IPs we've seen using the "Googlebot" user agent in CGSpace's nginx access logs
+- The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be "Googlebot"...
+- According to the [Googlebot FAQ](https://support.google.com/webmasters/answer/80553) the domain name in the reverse DNS lookup should contain either `googlebot.com` or `google.com`
+- In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
+
+```
+*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
+```
+
+- I translate that into a delete command using the `/update` handler:
+
+```
+http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
+```
+
+- And magically all those 81,000 documents are gone!
diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
index aa6046c08..2249c19f4 100644
--- a/docs/2018-09/index.html
+++ b/docs/2018-09/index.html
@@ -18,7 +18,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
 " />
-
+
  • I restarted the server with logBots = false and after it came back up I see 266 events with isBot:true (maybe they were buffered)… I will check again tomorrow
  • After a few hours I see there are still only 266 view events with isBot:true on DSpace Test’s Solr statistics core, so I’m definitely going to deploy this on CGSpace soon
  • Also, CGSpace currently has 60,089,394 view events with isBot:true in its Solr statistics core and it is 124GB!
  • -
  • Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 84GB, and there are only 700 events with isBot:true so I should really disable logging of bot events!
  • +
  • Amazing! After running dspace stats-util -f on CGSpace the Solr statistics core went from 124GB to 60GB, and now there are only 700 events with isBot:true so I should really disable logging of bot events!
  • I’m super curious to see how the JVM heap usage changes…
  • I made (and merged) a pull request to disable bot logging on the 5_x-prod branch (#387)
  • +
  • Now I’m wondering if there are other bot requests that aren’t classified as bots because the IP lists or user agents are outdated
  • +
  • DSpace ships a list of spider IPs, for example: config/spiders/iplists.com-google.txt
  • +
  • I checked the list against all the IPs we’ve seen using the “Googlebot” user agent in CGSpace’s nginx access logs
  • +
  • The first thing I learned is that shit tons of IPs in Russia, Ukraine, Ireland, Brazil, Portugal, the US, Canada, etc are pretending to be “Googlebot”…
  • +
  • According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com
  • +
  • In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
  • + + +
    *:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
    +
    + + + +
    http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
    +
    +
    +
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index c71d029f8..b7e93182a 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-09/
- 2018-09-25T21:45:14+03:00
+ 2018-09-25T22:06:05+03:00
@@ -184,7 +184,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2018-09-25T21:45:14+03:00
+ 2018-09-25T22:06:05+03:00
 0
@@ -195,7 +195,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2018-09-25T21:45:14+03:00
+ 2018-09-25T22:06:05+03:00
 0
@@ -207,13 +207,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
- 2018-09-25T21:45:14+03:00
+ 2018-09-25T22:06:05+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2018-09-25T21:45:14+03:00
+ 2018-09-25T22:06:05+03:00
 0
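
Reviewer note: the Googlebot check and the Solr delete from the 2018-09 notes could be scripted roughly as below. This is only a sketch and not what was actually run. The function names are hypothetical. The suffix match (a leading dot plus `endswith`) is stricter than the "contains" wording quoted from the FAQ. The forward-confirmation step is an extra assumption that the notes do not mention. The Solr base URL is the one from the notes.

```python
import socket
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Domains that a genuine Googlebot reverse DNS name should fall under,
# per the Googlebot FAQ cited in the notes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname: str) -> bool:
    """Return True if a reverse DNS name is under googlebot.com or google.com."""
    # Names stored in Solr's dns field carry a trailing dot; drop it first.
    name = hostname.rstrip(".").lower()
    return name.endswith(GOOGLE_SUFFIXES)


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-confirm the name.

    The resolvers are injectable so the logic can be tested offline.
    Note that gethostbyname_ex only handles IPv4 addresses.
    """
    try:
        hostname = reverse(ip)[0]
        if not is_google_hostname(hostname):
            return False
        # Assumption: the name must also resolve back to the same IP.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def build_delete_url(query,
                     base="http://localhost:8081/solr/statistics/update"):
    """Build a delete-by-query URL for Solr's /update handler."""
    # Escape &, < and > so the query cannot break the XML wrapper.
    body = f"<delete><query>{escape(query)}</query></delete>"
    return base + "?" + urlencode({"commit": "true", "stream.body": body})


if __name__ == "__main__":
    q = "*:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false"
    print(build_delete_url(q))
```

The generated URL is percent-encoded, so it looks different from the hand-written one in the notes, but it carries the same `commit` and `stream.body` parameters. Running the same query against the select handler first, to confirm the document count, seems prudent before issuing the delete.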