<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "May, 2022" / >
< meta property = "og:description" content = "2022-05-04
I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
18.207.136.176
185.189.36.248
50.118.223.78
52.70.76.123
3.236.10.11
Looking at the Solr statistics for 2022-04
52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. I see a handful of IPs that made 41,000 requests
I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2022-05/" / >
< meta property = "article:published_time" content = "2022-05-04T09:13:39+03:00" / >
< meta property = "article:modified_time" content = "2022-05-04T11:09:45+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "May, 2022" / >
< meta name = "twitter:description" content = "2022-05-04
I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
18.207.136.176
185.189.36.248
50.118.223.78
52.70.76.123
3.236.10.11
Looking at the Solr statistics for 2022-04
52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. I see a handful of IPs that made 41,000 requests
I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
"/>
< meta name = "generator" content = "Hugo 0.98.0" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-05/",
"wordCount": "300",
"datePublished": "2022-05-04T09:13:39+03:00",
"dateModified": "2022-05-04T11:09:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2022-05/" >
< title > May, 2022 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel = "stylesheet" integrity = "sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2022-05/" > May, 2022< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2022-05-04T09:13:39+03:00" > Wed May 04, 2022< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/cgspace-notes/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2022-05-04" > 2022-05-04< / h2 >
< ul >
< li > I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
< ul >
< li > 18.207.136.176< / li >
< li > 185.189.36.248< / li >
< li > 50.118.223.78< / li >
< li > 52.70.76.123< / li >
< li > 3.236.10.11< / li >
< / ul >
< / li >
< li > Looking at the Solr statistics for 2022-04
< ul >
< li > 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests< / li >
< li > 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc< / li >
< li > 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt< / li >
< li > 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr< / li >
< li > 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests< / li >
< li > 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests< / li >
< li > 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr< / li >
< li > If I query Solr for < code > time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.< / code > I see a handful of IPs that made 41,000 requests< / li >
< / ul >
< / li >
< li > I purged 93,974 hits from these IPs using my < code > check-spider-ip-hits.sh< / code > script< / li >
< / ul >
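< p > A minimal sketch of how a per-IP hit count for a query like the one above could be requested from Solr, using a facet on the IP field. The Solr URL, the < code > statistics< / code > core name, the < code > ip< / code > field name, and the < code > build_facet_url< / code > helper are assumptions for illustration and are not taken from these notes; only the query string itself is. < / p >

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "statistics".
SOLR_SELECT_URL = "http://localhost:8983/solr/statistics/select"


def build_facet_url(query, facet_field="ip", limit=10):
    """Build a Solr select URL that returns no documents, only hit
    counts grouped by facet_field (assumed here to be the "ip" field)."""
    params = {
        "q": query,
        "rows": 0,              # no documents, only the facet counts
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": limit,   # top N values by hit count
        "facet.mincount": 1,    # skip values with zero hits
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_facet_url("time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.")
print(url)
```

< p > The printed URL can be fetched with any HTTP client; in Solr's JSON response the counts appear under < code > facet_counts< / code > . < / p >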
< ul >
< li > Now looking at the Solr statistics by user agent I see:
< ul >
< li > < code > SomeRandomText< / code > < / li >
< li > < code > RestSharp/106.11.7.0< / code > < / li >
< li > < code > MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)< / code > < / li >
< li > < code > wp_is_mobile< / code > < / li >
< li > < code > Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1" < / code > < / li >
< li > < code > insomnia/2022.2.1< / code > < / li >
< li > < code > ZoteroTranslationServer< / code > < / li >
< li > < code > omgili/0.5 +http://omgili.com< / code > < / li >
< li > < code > curb< / code > < / li >
< li > < code > Sprout Social (Link Attachment)< / code > < / li >
< / ul >
< / li >
< li > I purged 2,900 hits from these user agents from Solr using my < code > check-spider-hits.sh< / code > script< / li >
< li > I made a < a href = "https://github.com/atmire/COUNTER-Robots/pull/54" > pull request to COUNTER-Robots< / a > for some of these agents
< ul >
< li > In the meantime I will add them to our local overrides in DSpace< / li >
< / ul >
< / li >
< li > Run all system updates on AReS server, update all Docker containers, and restart the server
< ul >
< li > Start a harvest on AReS< / li >
< / ul >
< / li >
< / ul >
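< p > A rough sketch of how user agents like the ones above could be checked against a list of patterns before adding them to an override file. The pattern list, the case-insensitive matching, and the < code > is_bot< / code > helper are assumptions for illustration; this is not the logic of < code > check-spider-hits.sh< / code > or of DSpace itself. < / p >

```python
import re

# Hypothetical pattern list, one regular expression per entry, based on
# some of the user agents listed above. The anchoring of "curb" is an
# assumption to avoid matching longer strings that merely contain it.
PATTERNS = [
    r"RestSharp",
    r"MetaInspector",
    r"wp_is_mobile",
    r"insomnia",
    r"ZoteroTranslationServer",
    r"omgili",
    r"^curb$",
    r"Sprout Social",
]

# Assumption: patterns are matched case-insensitively.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def is_bot(user_agent):
    """Return True if any pattern matches anywhere in the user agent."""
    return any(pattern.search(user_agent) for pattern in COMPILED)


for agent in ["RestSharp/106.11.7.0", "curb", "Mozilla/5.0 (X11; Linux x86_64) Firefox/99.0"]:
    print(agent, is_bot(agent))
```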
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2022-05/" > May, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-04/" > April, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-03/" > March, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-02/" > February, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-01/" > January, 2022< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >