mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-23 13:34:32 +01:00
430 lines
17 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
<meta property="og:title" content="July, 2020" />
|
|
<meta property="og:description" content="2020-07-01
A few users noticed that CGSpace wasn’t loading items today, item pages seem blank
I looked at the PostgreSQL locks but they don’t seem unusual
I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved
I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-07/" />
|
|
<meta property="article:published_time" content="2020-07-01T10:53:54+03:00" />
|
|
<meta property="article:modified_time" content="2020-07-05T10:50:09+03:00" />
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="July, 2020"/>
|
|
<meta name="twitter:description" content="2020-07-01
A few users noticed that CGSpace wasn’t loading items today, item pages seem blank
I looked at the PostgreSQL locks but they don’t seem unusual
I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved
I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request
"/>
|
|
<meta name="generator" content="Hugo 0.73.0" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "July, 2020",
  "url": "https://alanorth.github.io/cgspace-notes/2020-07/",
  "wordCount": "1256",
  "datePublished": "2020-07-01T10:53:54+03:00",
  "dateModified": "2020-07-05T10:50:09+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-07/">
|
|
|
|
<title>July, 2020 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-07/">July, 2020</a></h2>
|
|
<p class="blog-post-meta"><time datetime="2020-07-01T10:53:54+03:00">Wed Jul 01, 2020</time> by Alan Orth in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2020-07-01">2020-07-01</h2>
|
|
<ul>
|
|
<li>A few users noticed that CGSpace wasn’t loading items today, item pages seem blank
|
|
<ul>
|
|
<li>I looked at the PostgreSQL locks but they don’t seem unusual</li>
|
|
<li>I guess this is the same “blank item page” issue that we had a few times in 2019 that we never solved</li>
|
|
<li>I restarted Tomcat and PostgreSQL and the issue was gone</li>
|
|
</ul>
|
|
</li>
|
|
<li>Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the <code>5_x-prod</code> branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request</li>
|
|
</ul>
|
|
<ul>
|
|
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
|
|
<li>First looking at the traffic in the morning:</li>
|
|
</ul>
|
|
<pre><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
2986 10.38% 1 0.08% 17.39 MiB 199.47.87.144
2286 7.94% 1 0.08% 13.04 MiB 199.47.87.142
</code></pre><ul>
|
|
<li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li>
|
|
</ul>
|
|
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul>
|
|
<li>I will purge hits from that IP from Solr</li>
|
|
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
|
|
</ul>
|
|
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
numFound="41694"
</code></pre><ul>
|
|
<li>They used to be “TurnitinBot”… hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
|
|
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that’s impressive! I don’t need to add them to the “bad bot” rate limit list in nginx</li>
|
|
<li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests with this user agent:</li>
|
|
</ul>
|
|
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre><ul>
|
|
<li>The IPs all belong to HostRoyale:</li>
|
|
</ul>
|
|
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
185.152.250.105
185.152.250.107
185.152.250.111
185.152.250.115
185.152.250.119
185.152.250.121
185.152.250.123
185.152.250.125
185.152.250.129
185.152.250.13
185.152.250.131
185.152.250.133
185.152.250.135
185.152.250.137
185.152.250.141
185.152.250.145
185.152.250.149
185.152.250.153
185.152.250.155
185.152.250.157
185.152.250.159
185.152.250.161
185.152.250.163
185.152.250.165
185.152.250.167
185.152.250.17
185.152.250.171
185.152.250.183
185.152.250.189
185.152.250.191
185.152.250.197
185.152.250.201
185.152.250.205
185.152.250.209
185.152.250.21
185.152.250.213
185.152.250.217
185.152.250.219
185.152.250.221
185.152.250.223
185.152.250.225
185.152.250.227
185.152.250.229
185.152.250.231
185.152.250.233
185.152.250.235
185.152.250.239
185.152.250.243
185.152.250.247
185.152.250.249
185.152.250.25
185.152.250.251
185.152.250.253
185.152.250.255
185.152.250.27
185.152.250.29
185.152.250.3
185.152.250.31
185.152.250.39
185.152.250.41
185.152.250.47
185.152.250.5
185.152.250.59
185.152.250.63
185.152.250.65
185.152.250.67
185.152.250.7
185.152.250.71
185.152.250.73
185.152.250.77
185.152.250.81
185.152.250.85
185.152.250.89
185.152.250.9
185.152.250.93
185.152.250.95
185.152.250.97
185.152.250.99
</code></pre><ul>
|
|
<li>It’s only a few hundred requests each, but I am very suspicious so I will record it here and purge their IPs from Solr</li>
|
|
<li>Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different “normal” user agents
|
|
<ul>
|
|
<li>They are both apparently in France, belonging to Scalair FR hosting</li>
|
|
<li>I will purge their requests from Solr too</li>
|
|
</ul>
|
|
</li>
|
|
<li>Now I see some other new bots I hadn’t noticed before:
|
|
<ul>
|
|
<li><code>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com</code></li>
|
|
<li><code>Consilio (WebHare Platform 4.28.2-dev); LinkChecker)</code>, which appears to be a <a href="https://www.utwente.nl/en/websites/webhare/">university CMS</a></li>
|
|
<li>I will add <code>LinkCheck</code>, <code>Consilio</code>, and <code>WebHare</code> to the list of DSpace bot agents and purge them from Solr stats</li>
|
|
<li>COUNTER-Robots list already has <code>link.?check</code> but for some reason DSpace didn’t match that and I see hits for some of these…</li>
|
|
<li>Maybe I should add <code>[Ll]ink.?[Cc]heck.?</code> to a custom list for now?</li>
|
|
<li>For now I added <code>Turnitin</code> to the <a href="https://github.com/atmire/COUNTER-Robots/pull/34">new bots pull request on COUNTER-Robots</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
|
|
<li>I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:</li>
|
|
</ul>
|
|
<pre><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
Jersey\/\d
MarcEdit
OgScrper
okhttp
^Pattern\/\d
ReactorNetty\/\d
sqlmap
Typhoeus
7siters
</code></pre><ul>
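<li>On the <code>link.?check</code> pattern mentioned above: one possible explanation, which is only an assumption and not something I verified against DSpace’s matching code, is that the pattern is applied case-sensitively and therefore never matches <code>LinkCheck</code>. A minimal sketch with <code>grep</code>, where <code>match_count</code> is a hypothetical helper and the user agent is the Siteimprove one from above:</li>
</ul>
<pre><code># User agent string copied from the logs above
ua='Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com'

# match_count is a hypothetical helper: prints 1 if the extended regex in $1
# matches the user agent and 0 otherwise; $2 is an optional grep flag (e.g. -i)
match_count() {
    printf '%s\n' "$ua" | grep -c -E ${2:+"$2"} -e "$1" || true
}

match_count 'link.?check'           # pattern as in COUNTER-Robots, case-sensitive
match_count '[Ll]ink.?[Cc]heck.?'   # custom pattern considered above
match_count 'link.?check' -i        # same COUNTER-Robots pattern, case-insensitive
</code></pre><ul>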
|
|
<li>Just a note that I <em>still</em> can’t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li>
|
|
</ul>
|
|
<pre><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre><ul>
|
|
<li>I had told Atmire about this several weeks ago… but I reminded them again in the ticket
|
|
<ul>
|
|
<li>Atmire says they are able to build fine, so I tried again and noticed that I had been building with <code>-Denv=dspacetest.cgiar.org</code>, which is not necessary for DSpace 6 of course</li>
|
|
<li>Once I removed that it builds fine</li>
|
|
</ul>
|
|
</li>
|
|
<li>I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!</li>
|
|
<li>Run all system updates on DSpace Test (linode26), deploy latest <code>6_x-dev-atmire-modules</code> branch, and reboot it</li>
|
|
</ul>
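<ul>
<li>A note on the unique IP count for 185.152.250.x above: <code>grep 185.152.250.</code> treats each dot as “any character”, so in principle it could match other strings too. A stricter sketch using a literal prefix match in awk, assuming the client IP is the first field of each log line as in the commands above; <code>count_unique_ips</code> is a hypothetical helper:</li>
</ul>
<pre><code># count_unique_ips is a hypothetical helper, not an existing tool
# $1 is a literal IP prefix; access log lines are read from stdin
count_unique_ips() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 { seen[$1] = 1 }   # field 1 starts with the prefix
        END { n = 0; for (ip in seen) n++; print n }
    '
}

# Example usage (same log files as above):
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | count_unique_ips 185.152.250.
</code></pre>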
|
|
<h2 id="2020-07-02">2020-07-02</h2>
|
|
<ul>
|
|
<li>I need to export some Solr statistics data from CGSpace to test Salem’s modifications to the dspace-statistics-api
|
|
<ul>
|
|
<li>He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities</li>
|
|
<li>Because we have so many records I want to use solr-import-export-json to get several months at a time with a date range, but it seems there are issues with curl first (need to disable globbing with <code>-g</code> and URL encode the range)</li>
|
|
<li>For reference, the <a href="https://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/schema/DateField.html">Solr 4.10.x DateField docs</a></li>
|
|
<li>This range works in Solr UI: <code>[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]</code></li>
|
|
<li>It works in curl as well:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "indent":"true",
      "fq":"time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]",
      "rows":"0",
      "wt":"json"}},
  "response":{"numFound":7784285,"start":0,"docs":[]
  }}
</code></pre><ul>
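<li>The percent-encoded range in the curl command above does not have to be typed by hand. A minimal sketch of a POSIX shell helper that encodes everything except unreserved characters; <code>urlencode</code> is a hypothetical name, not part of curl or Solr, and it assumes plain ASCII input:</li>
</ul>
<pre><code># urlencode is a hypothetical helper; the body runs in a subshell so its
# variables do not leak into the calling shell
urlencode() (
    LC_ALL=C
    s="$1"
    out=""
    while [ -n "$s" ]; do
        c="${s%"${s#?}"}"   # first character of s
        s="${s#?}"          # remainder of s
        case "$c" in
            [A-Za-z0-9._~-]) out="$out$c" ;;                 # unreserved: keep as-is
            *) out="$out$(printf '%%%02X' "'$c")" ;;         # everything else: %XX
        esac
    done
    printf '%s\n' "$out"
)

# Encodes the same range that works in the Solr UI
urlencode '[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]'
</code></pre><ul>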
|
|
<li>But not in solr-import-export-json… hmmm… seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
|
|
</ul>
|
|
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
</code></pre><ul>
|
|
<li>Then import it on my local dev environment:</li>
|
|
</ul>
|
|
<pre><code>$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
</code></pre><h2 id="2020-07-05">2020-07-05</h2>
|
|
<ul>
|
|
<li>Import twelve items into the <a href="https://hdl.handle.net/10568/97076">CRP Livestock multimedia</a> collection for Peter Ballantyne
|
|
<ul>
|
|
<li>I ran the data through csv-metadata-quality first to validate and fix some common mistakes</li>
|
|
<li>Interesting to check the data with <code>csvstat</code> to see if there are any duplicates</li>
|
|
</ul>
|
|
</li>
|
|
<li>Peter recently asked me to add Target audience (<code>cg.targetaudience</code>) to the CGSpace sidebar facets and AReS filters
|
|
<ul>
|
|
<li>I added it on my local DSpace test instance, but I’m waiting for him to tell me what he wants the header to be: “Audiences”, “Target audience”, etc…</li>
|
|
</ul>
|
|
</li>
|
|
<li>Peter also asked me to increase the size of links in the CGSpace “Welcome” text
|
|
<ul>
|
|
<li>I suggested using the CSS <code>font-size: larger</code> property to just bump it up one relative to what it already is</li>
|
|
<li>He said it looks good, but that the links now seem OK (I told him to refresh, as I had made them bold a few days ago), so we don’t need to adjust it after all</li>
|
|
</ul>
|
|
</li>
|
|
<li>Mohammed Salem modified my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to query Solr directly so I started writing a script to benchmark it today
|
|
<ul>
|
|
<li>I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04</li>
|
|
<li>I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
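<ul>
<li>A rough sketch of what the timing part of such a benchmark script could look like, using curl’s <code>-w '%{time_total}'</code> write-out variable. This is not the actual script; the function names, the port, and the example URL are all assumptions for illustration:</li>
</ul>
<pre><code># average_seconds: hypothetical helper, reads one timing per line on stdin
# and prints the mean with three decimals
average_seconds() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# time_requests: hypothetical helper, requests URL $1 a total of $2 times and
# prints curl's total time in seconds for each request
time_requests() (
    url="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        curl -s -o /dev/null -w '%{time_total}\n' "$url"
        i=$((i + 1))
    done
)

# Example usage (the host, port, and path here are assumed, adjust as needed):
# time_requests 'http://localhost:5000/items?limit=100' 10 | average_seconds
</code></pre>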
|
|
<!-- raw HTML omitted -->
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2020-07/">July, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-06/">June, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-05/">May, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-04/">April, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-03/">March, 2020</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|