<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="July, 2020" /> <meta property="og:description" content="2020-07-01 A few users noticed that CGSpace wasn’t loading items today, item pages seem blank I looked at the PostgreSQL locks but they don’t seem unusual I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved I restarted Tomcat and PostgreSQL and the issue was gone Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-07/" /> <meta property="article:published_time" content="2020-07-01T10:53:54+03:00" /> <meta property="article:modified_time" content="2020-07-01T15:37:20+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="July, 2020"/> <meta name="twitter:description" content="2020-07-01 A few users noticed that CGSpace wasn’t loading items today, item pages seem blank I looked at the PostgreSQL locks but they don’t seem unusual I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved I restarted Tomcat and PostgreSQL and the issue was gone Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request "/> <meta name="generator" content="Hugo 0.73.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "July, 2020", "url": "https://alanorth.github.io/cgspace-notes/2020-07/", "wordCount": "844", "datePublished": "2020-07-01T10:53:54+03:00", "dateModified": 
"2020-07-01T15:37:20+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-07/"> <title>July, 2020 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-07/">July, 2020</a></h2> <p class="blog-post-meta"><time datetime="2020-07-01T10:53:54+03:00">Wed Jul 01, 2020</time> by Alan Orth in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2020-07-01">2020-07-01</h2> <ul> <li>A few users noticed that CGSpace wasn’t loading items today, item pages seem blank <ul> <li>I looked at the PostgreSQL locks 
but they don’t seem unusual</li> <li>I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved</li> <li>I restarted Tomcat and PostgreSQL and the issue was gone</li> </ul> </li> <li>Since I was restarting Tomcat anyway, I decided to redeploy the latest changes from the <code>5_x-prod</code> branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request</li> </ul> <ul> <li>Also, Linode is alerting that we had a high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li> <li>First looking at the traffic in the morning:</li> </ul> <pre><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
2986 10.38% 1 0.08% 17.39 MiB 199.47.87.144
2286 7.94% 1 0.08% 13.04 MiB 199.47.87.142
</code></pre><ul> <li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li> </ul> <pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul> <li>I will purge hits from that IP from Solr</li> <li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have over 40,000 hits from them in 2020 statistics alone:</li> </ul> <pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
numFound="41694"
</code></pre><ul> <li>They used to be “TurnitinBot”… hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li> <li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that’s impressive! 
I don’t need to add them to the “bad bot” rate limit list in nginx</li> <li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x each making a small number of requests with this user agent:</li> </ul> <pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre><ul> <li>The IPs all belong to HostRoyale:</li> </ul> <pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
185.152.250.105
185.152.250.107
185.152.250.111
185.152.250.115
185.152.250.119
185.152.250.121
185.152.250.123
185.152.250.125
185.152.250.129
185.152.250.13
185.152.250.131
185.152.250.133
185.152.250.135
185.152.250.137
185.152.250.141
185.152.250.145
185.152.250.149
185.152.250.153
185.152.250.155
185.152.250.157
185.152.250.159
185.152.250.161
185.152.250.163
185.152.250.165
185.152.250.167
185.152.250.17
185.152.250.171
185.152.250.183
185.152.250.189
185.152.250.191
185.152.250.197
185.152.250.201
185.152.250.205
185.152.250.209
185.152.250.21
185.152.250.213
185.152.250.217
185.152.250.219
185.152.250.221
185.152.250.223
185.152.250.225
185.152.250.227
185.152.250.229
185.152.250.231
185.152.250.233
185.152.250.235
185.152.250.239
185.152.250.243
185.152.250.247
185.152.250.249
185.152.250.25
185.152.250.251
185.152.250.253
185.152.250.255
185.152.250.27
185.152.250.29
185.152.250.3
185.152.250.31
185.152.250.39
185.152.250.41
185.152.250.47
185.152.250.5
185.152.250.59
185.152.250.63
185.152.250.65
185.152.250.67
185.152.250.7
185.152.250.71
185.152.250.73
185.152.250.77
185.152.250.81
185.152.250.85
185.152.250.89
185.152.250.9
185.152.250.93
185.152.250.95
185.152.250.97
185.152.250.99
</code></pre><ul> <li>It’s only a few hundred requests each, but I am very suspicious so I 
will record it here and purge their IPs from Solr</li> <li>Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different “normal” user agents <ul> <li>They are both apparently in France, belonging to Scalair FR hosting</li> <li>I will purge their requests from Solr too</li> </ul> </li> <li>Now I see some other new bots I hadn’t noticed before: <ul> <li><code>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com</code></li> <li><code>Consilio (WebHare Platform 4.28.2-dev); LinkChecker)</code>, which appears to be a <a href="https://www.utwente.nl/en/websites/webhare/">university CMS</a></li> <li>I will add <code>LinkCheck</code>, <code>Consilio</code>, and <code>WebHare</code> to the list of DSpace bot agents and purge them from Solr stats</li> <li>COUNTER-Robots list already has <code>link.?check</code> but for some reason DSpace didn’t match that and I see hits for some of these…</li> <li>Maybe I should add <code>[Ll]ink.?[Cc]heck.?</code> to a custom list for now?</li> <li>For now I added <code>Turnitin</code> to the <a href="https://github.com/atmire/COUNTER-Robots/pull/34">new bots pull request on COUNTER-Robots</a></li> </ul> </li> <li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li> <li>I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:</li> </ul> <pre><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
Jersey\/\d
MarcEdit
OgScrper
okhttp
^Pattern\/\d
ReactorNetty\/\d
sqlmap
Typhoeus
7siters
</code></pre><ul> <li>Just a note that I <em>still</em> can’t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li> </ul> <pre><code> [java] 
java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1 </code></pre><ul> <li>I had told Atmire about this several weeks ago… but I reminded them again in the ticket <ul> <li>Atmire says they are able to build fine, so I tried again and noticed that I had been building with <code>-Denv=dspacetest.cgiar.org</code>, which is not necessary for DSpace 6 of course</li> <li>Once I removed that it builds fine</li> </ul> </li> <li>I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!</li> </ul> <!-- raw HTML omitted --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol 
class="list-unstyled"> <li><a href="/cgspace-notes/2020-07/">July, 2020</a></li> <li><a href="/cgspace-notes/2020-06/">June, 2020</a></li> <li><a href="/cgspace-notes/2020-05/">May, 2020</a></li> <li><a href="/cgspace-notes/2020-04/">April, 2020</a></li> <li><a href="/cgspace-notes/2020-03/">March, 2020</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>