<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="July, 2020" />
<meta property="og:description" content="2020-07-01
A few users noticed that CGSpace wasn&rsquo;t loading items today, item pages seem blank
I looked at the PostgreSQL locks but they don&rsquo;t seem unusual
I guess this is the same &ldquo;blank item page&rdquo; issue that we had a few times in 2019 that we never solved
I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-07/" />
<meta property="article:published_time" content="2020-07-01T10:53:54+03:00" />
<meta property="article:modified_time" content="2020-07-06T15:14:09+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2020"/>
<meta name="twitter:description" content="2020-07-01
A few users noticed that CGSpace wasn&rsquo;t loading items today, item pages seem blank
I looked at the PostgreSQL locks but they don&rsquo;t seem unusual
I guess this is the same &ldquo;blank item page&rdquo; issue that we had a few times in 2019 that we never solved
I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.73.0" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "July, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-07/",
"wordCount": "1547",
"datePublished": "2020-07-01T10:53:54+03:00",
"dateModified": "2020-07-06T15:14:09+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-07/">
<title>July, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-07/">July, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-07-01T10:53:54+03:00">Wed Jul 01, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-07-01">2020-07-01</h2>
<ul>
<li>A few users noticed that CGSpace wasn&rsquo;t loading items today, item pages seem blank
<ul>
<li>I looked at the PostgreSQL locks but they don&rsquo;t seem unusual</li>
<li>I guess this is the same &ldquo;blank item page&rdquo; issue that we had a few times in 2019 that we never solved</li>
<li>I restarted Tomcat and PostgreSQL and the issue was gone</li>
</ul>
</li>
<li>Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the <code>5_x-prod</code> branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request</li>
</ul>
<ul>
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
<li>First looking at the traffic in the morning:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E &quot;01/Jul/2020:(00|01|02|03|04)&quot; | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
2986 10.38% 1 0.08% 17.39 MiB 199.47.87.144
2286 7.94% 1 0.08% 13.04 MiB 199.47.87.142
</code></pre><ul>
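<li>A minimal Python sketch of the same per-IP tally, for cases where goaccess is not at hand. It assumes nginx&rsquo;s combined log format (client IP in the first field, bracketed timestamp in the fourth); the function name and the log path in the usage comment are hypothetical:</li>
</ul>
<pre><code>from collections import Counter


def top_clients(lines, day, hours, limit=4):
    '''Count requests per client IP for the given day and hours.'''
    # Timestamps look like [01/Jul/2020:00:10:01 so match on that prefix
    prefixes = tuple(f'[{day}:{hour}:' for hour in hours)
    hits = Counter()
    for line in lines:
        # Only the first four fields matter, the rest stays in one piece
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # not a complete log line
        if parts[3].startswith(prefixes):
            hits[parts[0]] += 1
    return hits.most_common(limit)


# Hypothetical usage with the same hours as the grep above:
# with open('/var/log/nginx/access.log') as log:
#     print(top_clients(log, '01/Jul/2020', ['00', '01', '02', '03', '04']))
</code></pre><ul>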
<li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul>
<li>I will purge hits from that IP from Solr</li>
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics/select&quot; -d &quot;q=userAgent:/Turnitin.*/&amp;rows=0&quot; | grep -oE 'numFound=&quot;[0-9]+&quot;'
numFound=&quot;41694&quot;
</code></pre><ul>
<li>They used to be &ldquo;TurnitinBot&rdquo;&hellip; hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that&rsquo;s impressive! I don&rsquo;t need to add them to the &ldquo;bad bot&rdquo; rate limit list in nginx</li>
<li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests with this user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre><ul>
<li>The IPs all belong to HostRoyale:</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
185.152.250.101
185.152.250.103
185.152.250.105
185.152.250.107
185.152.250.111
185.152.250.115
185.152.250.119
185.152.250.121
185.152.250.123
185.152.250.125
185.152.250.129
185.152.250.13
185.152.250.131
185.152.250.133
185.152.250.135
185.152.250.137
185.152.250.141
185.152.250.145
185.152.250.149
185.152.250.153
185.152.250.155
185.152.250.157
185.152.250.159
185.152.250.161
185.152.250.163
185.152.250.165
185.152.250.167
185.152.250.17
185.152.250.171
185.152.250.183
185.152.250.189
185.152.250.191
185.152.250.197
185.152.250.201
185.152.250.205
185.152.250.209
185.152.250.21
185.152.250.213
185.152.250.217
185.152.250.219
185.152.250.221
185.152.250.223
185.152.250.225
185.152.250.227
185.152.250.229
185.152.250.231
185.152.250.233
185.152.250.235
185.152.250.239
185.152.250.243
185.152.250.247
185.152.250.249
185.152.250.25
185.152.250.251
185.152.250.253
185.152.250.255
185.152.250.27
185.152.250.29
185.152.250.3
185.152.250.31
185.152.250.39
185.152.250.41
185.152.250.47
185.152.250.5
185.152.250.59
185.152.250.63
185.152.250.65
185.152.250.67
185.152.250.7
185.152.250.71
185.152.250.73
185.152.250.77
185.152.250.81
185.152.250.85
185.152.250.89
185.152.250.9
185.152.250.93
185.152.250.95
185.152.250.97
185.152.250.99
</code></pre><ul>
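<li>A rough Python sketch of the same distinct-IP listing, assuming &ldquo;185.152.250.x&rdquo; means the whole 185.152.250.0/24 network. Unlike the <code>grep</code> pattern above, where each unescaped dot matches any character, it tests real network membership and sorts the addresses numerically; the function name is hypothetical:</li>
</ul>
<pre><code>import ipaddress


def clients_in_network(addresses, cidr):
    '''Return the distinct addresses inside cidr, sorted numerically.'''
    network = ipaddress.ip_network(cidr)
    found = set()
    for text in addresses:
        try:
            address = ipaddress.ip_address(text)
        except ValueError:
            continue  # skip anything that is not an IP address
        if address in network:
            found.add(address)
    return [str(address) for address in sorted(found)]


seen = ['185.152.250.101', '185.152.250.1', '185.152.250.13', '185.152.250.1']
print(clients_in_network(seen, '185.152.250.0/24'))
# ['185.152.250.1', '185.152.250.13', '185.152.250.101']
</code></pre><ul>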
<li>It&rsquo;s only a few hundred requests each, but I am very suspicious so I will record it here and purge their IPs from Solr</li>
<li>Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different &ldquo;normal&rdquo; user agents
<ul>
<li>They are both apparently in France, belonging to Scalair FR hosting</li>
<li>I will purge their requests from Solr too</li>
</ul>
</li>
<li>Now I see some other new bots I hadn&rsquo;t noticed before:
<ul>
<li><code>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com</code></li>
<li><code>Consilio (WebHare Platform 4.28.2-dev); LinkChecker)</code>, which appears to be a <a href="https://www.utwente.nl/en/websites/webhare/">university CMS</a></li>
<li>I will add <code>LinkCheck</code>, <code>Consilio</code>, and <code>WebHare</code> to the list of DSpace bot agents and purge them from Solr stats</li>
<li>COUNTER-Robots list already has <code>link.?check</code> but for some reason DSpace didn&rsquo;t match that and I see hits for some of these&hellip;</li>
<li>Maybe I should add <code>[Ll]ink.?[Cc]heck.?</code> to a custom list for now?</li>
<li>For now I added <code>Turnitin</code> to the <a href="https://github.com/atmire/COUNTER-Robots/pull/34">new bots pull request on COUNTER-Robots</a></li>
</ul>
</li>
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
<li>I will revert the default &ldquo;example&rdquo; agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven&rsquo;t merged yet:</li>
</ul>
<pre><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
Jersey\/\d
MarcEdit
OgScrper
okhttp
^Pattern\/\d
ReactorNetty\/\d
sqlmap
Typhoeus
7siters
</code></pre><ul>
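<li>Coming back to the <code>link.?check</code> question above: one possible explanation (an assumption, not verified against the DSpace source) is that the patterns are matched case-sensitively. A small Python sketch of that behaviour, using the Siteimprove user agent from the logs; the helper name is hypothetical:</li>
</ul>
<pre><code>import re

AGENT = 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com'


def matches(pattern, agent, ignore_case=False):
    '''Return True if the pattern is found anywhere in the user agent.'''
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, agent, flags) is not None


print(matches('link.?check', AGENT))                    # False
print(matches('link.?check', AGENT, ignore_case=True))  # True
print(matches('[Ll]ink.?[Cc]heck.?', AGENT))            # True
</code></pre><ul>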
<li>Just a note that I <em>still</em> can&rsquo;t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li>
</ul>
<pre><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre><ul>
<li>I had told Atmire about this several weeks ago&hellip; but I reminded them again in the ticket
<ul>
<li>Atmire says they are able to build fine, so I tried again and noticed that I had been building with <code>-Denv=dspacetest.cgiar.org</code>, which is not necessary for DSpace 6 of course</li>
<li>Once I removed that it builds fine</li>
</ul>
</li>
<li>I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!</li>
<li>Run all system updates on DSpace Test (linode26), deploy latest <code>6_x-dev-atmire-modules</code> branch, and reboot it</li>
</ul>
<h2 id="2020-07-02">2020-07-02</h2>
<ul>
<li>I need to export some Solr statistics data from CGSpace to test Salem&rsquo;s modifications to the dspace-statistics-api
<ul>
<li>He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities</li>
<li>Because we have so many records I want to use solr-import-export-json to get several months at a time with a date range, but it seems there are issues with curl first (need to disable globbing with <code>-g</code> and URL encode the range)</li>
<li>For reference, the <a href="https://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/schema/DateField.html">Solr 4.10.x DateField docs</a></li>
<li>This range works in Solr UI: <code>[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]</code></li>
<li>As well in curl:</li>
</ul>
</li>
</ul>
<pre><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&amp;fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&amp;rows=0&amp;wt=json&amp;indent=true'
{
&quot;responseHeader&quot;:{
&quot;status&quot;:0,
&quot;QTime&quot;:0,
&quot;params&quot;:{
&quot;q&quot;:&quot;*:*&quot;,
&quot;indent&quot;:&quot;true&quot;,
&quot;fq&quot;:&quot;time:[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]&quot;,
&quot;rows&quot;:&quot;0&quot;,
&quot;wt&quot;:&quot;json&quot;}},
&quot;response&quot;:{&quot;numFound&quot;:7784285,&quot;start&quot;:0,&quot;docs&quot;:[]
}}
</code></pre><ul>
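<li>The percent-encoded <code>fq</code> value in the curl call above does not have to be typed by hand. A minimal sketch using Python&rsquo;s standard <code>urllib.parse.quote</code>; the function name is hypothetical, and it only builds the filter string, it does not talk to Solr:</li>
</ul>
<pre><code>from urllib.parse import quote


def encode_time_filter(start, end, field='time'):
    '''Build a percent-encoded Solr range filter such as time:[start TO end].'''
    date_range = f'[{start} TO {end}]'
    # safe='' also encodes the colons inside the timestamps
    return f'{field}:' + quote(date_range, safe='')


print(encode_time_filter('2019-01-01T00:00:00Z', '2019-06-30T23:59:59Z'))
# time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D
</code></pre><ul>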
<li>But not in solr-import-export-json&hellip; hmmm&hellip; seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
</code></pre><ul>
<li>Then import it on my local dev environment:</li>
</ul>
<pre><code>$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
</code></pre><h2 id="2020-07-05">2020-07-05</h2>
<ul>
<li>Import twelve items into the <a href="https://hdl.handle.net/10568/97076">CRP Livestock multimedia</a> collection for Peter Ballantyne
<ul>
<li>I ran the data through csv-metadata-quality first to validate and fix some common mistakes</li>
<li>Interesting to check the data with <code>csvstat</code> to see if there are any duplicates</li>
</ul>
</li>
<li>Peter recently asked me to add Target audience (<code>cg.targetaudience</code>) to the CGSpace sidebar facets and AReS filters
<ul>
<li>I added it on my local DSpace test instance, but I&rsquo;m waiting for him to tell me what he wants the header to be: &ldquo;Audiences&rdquo;, &ldquo;Target audience&rdquo;, etc&hellip;</li>
</ul>
</li>
<li>Peter also asked me to increase the size of links in the CGSpace &ldquo;Welcome&rdquo; text
<ul>
<li>I suggested using the CSS <code>font-size: larger</code> property to just bump it up one relative to what it already is</li>
<li>He said it looks good, but that the links now seem OK anyway (I told him to refresh, as I had made them bold a few days ago), so we don&rsquo;t need to adjust the size after all</li>
</ul>
</li>
<li>Mohammed Salem modified my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to query Solr directly so I started writing a script to benchmark it today
<ul>
<li>I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04</li>
<li>I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime</li>
</ul>
</li>
<li>I noticed that we have 20,000 distinct values for <code>dc.subject</code>, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:</li>
</ul>
<pre><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>Note the use of the POSIX character class :)</li>
<li>I suggest that we generate a list of the top 5,000 values that don&rsquo;t match AGROVOC so that Sisay can correct them
<ul>
<li>Start by getting the top 6,500 subjects (assuming that the top ~1,500 are valid from our previous work):</li>
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 &gt; 2020-07-05-cgspace-subjects.txt
</code></pre><ul>
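<li>A small Python sketch of the <code>csvcut -c1 | head</code> step above, assuming the export has no header row (the <code>\COPY ... WITH CSV</code> above does not ask for one). The function name is hypothetical and the sample rows are made up for illustration:</li>
</ul>
<pre><code>import csv
import io


def top_values(csv_file, limit):
    '''Return the first column of the first limit rows of a CSV file.'''
    values = []
    for row in csv.reader(csv_file):
        if len(values) == limit:
            break
        if row:  # ignore blank lines
            values.append(row[0])
    return values


# Made-up sample in the same shape as the export: subject, count
sample = io.StringIO('MAIZE,120\nGENDER,95\nLIVESTOCK,80\n')
print(top_values(sample, 2))  # ['MAIZE', 'GENDER']
</code></pre><ul>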
<li>Then start looking them up using <code>agrovoc-lookup.py</code>:</li>
</ul>
<pre><code>$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
</code></pre><h2 id="2020-07-06">2020-07-06</h2>
<ul>
<li>I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script
<ul>
<li>Mostly to make more efficient usage of the requests cache and to use parameterized requests instead of building the request URL by concatenating the URL with query parameters</li>
</ul>
</li>
<li>I modified the <code>agrovoc-lookup.py</code> script to save its results as a single CSV with the subject, language, type of match (preferred or alternate), and total number of matches, rather than saving two separate files
<ul>
<li>Note that I see <code>prefLabel</code>, <code>matchedPrefLabel</code>, and <code>altLabel</code> in the REST API responses and I&rsquo;m not sure what the second one means</li>
<li>I emailed FAO&rsquo;s AGROVOC contact to ask them</li>
</ul>
</li>
</ul>
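<ul>
<li>To illustrate the point about parameterized requests: a minimal sketch using only the Python standard library, not the actual code from those scripts. The endpoint and the helper name are hypothetical. With the <code>requests</code> library the same effect comes from passing a dict as <code>params</code> instead of concatenating strings:</li>
</ul>
<pre><code>from urllib.parse import urlencode, urlsplit, urlunsplit


def build_url(base_url, params):
    '''Attach params to base_url, replacing any existing query string.'''
    scheme, netloc, path, _query, fragment = urlsplit(base_url)
    # urlencode percent-encodes every value, so spaces and other
    # reserved characters in a subject cannot break the URL
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))


# Hypothetical endpoint, only here to show the encoding
print(build_url('https://example.org/rest/search', {'query': 'SOIL FERTILITY', 'lang': 'en'}))
</code></pre>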
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-07/">July, 2020</a></li>
<li><a href="/cgspace-notes/2020-06/">June, 2020</a></li>
<li><a href="/cgspace-notes/2020-05/">May, 2020</a></li>
<li><a href="/cgspace-notes/2020-04/">April, 2020</a></li>
<li><a href="/cgspace-notes/2020-03/">March, 2020</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>