
<!DOCTYPE html>
<html lang="en" >
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="December, 2021" />
<meta property="og:description" content="2021-12-01
Atmire merged some changes I had submitted to the COUNTER-Robots project
I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
Total number of bot hits purged: 3679
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-12/" />
<meta property="article:published_time" content="2021-12-01T16:07:07+02:00" />
<meta property="article:modified_time" content="2021-12-06T16:40:50+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2021"/>
<meta name="twitter:description" content="2021-12-01
Atmire merged some changes I had submitted to the COUNTER-Robots project
I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
Total number of bot hits purged: 3679
"/>
<meta name="generator" content="Hugo 0.89.4" />
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "December, 2021",
  "url": "https://alanorth.github.io/cgspace-notes/2021-12/",
  "wordCount": "968",
  "datePublished": "2021-12-01T16:07:07+02:00",
  "dateModified": "2021-12-06T16:40:50+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-12/">
<title>December, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-12/">December, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-12-01T16:07:07+02:00">Wed Dec 01, 2021</time>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
<h2 id="2021-12-01">2021-12-01</h2>
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
</code></pre></div><h2 id="2021-12-02">2021-12-02</h2>
<li>Francesca from Alliance asked me for help with approving a submission that was getting stuck
<li>I looked at the PostgreSQL activity and the locks are back up like they were earlier this week</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#34;SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | sort | uniq -c | sort -n
1 ------------------
1 (1437 rows)
1 application_name
9 psql
1428 dspaceWeb
</code></pre></div>
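<p>The shell pipeline above also counts psql&rsquo;s header, separator, and row-count footer as if they were application names. A minimal Python sketch of the same tally that skips those lines, assuming psql&rsquo;s default aligned output format (the function name and the sample output are made up for illustration):</p>

```python
from collections import Counter

def count_locks_by_application(psql_output):
    """Count rows per application_name in default-format psql output."""
    counts = Counter()
    # The first two lines are the column header and the dashed separator
    for line in psql_output.splitlines()[2:]:
        name = line.strip()
        # Skip the "(N rows)" footer and blank lines; note that rows with a
        # NULL application_name also print as blank and are skipped here
        if not name or name.endswith(" rows)") or name.endswith(" row)"):
            continue
        counts[name] += 1
    return counts

# Made-up sample in the shape psql prints by default
sample = """ application_name
------------------
 dspaceWeb
 dspaceWeb
 psql
 dspaceWeb
(4 rows)
"""

locks = count_locks_by_application(sample)
for name, total in locks.most_common():
    print(total, name)
```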
<li>Munin shows the same:</li>
<p><img src="/cgspace-notes/2021/12/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
<li>Last month I enabled <code>log_lock_waits</code> in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -E <span style="color:#e6db74">&#39;^2021-(11-29|11-30|12-01|12-02)&#39;</span> /var/log/postgresql/postgresql-10-main.log | grep -c <span style="color:#e6db74">&#39;still waiting for&#39;</span>
</code></pre></div>
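<p>The grep above gives a single total for all four days. A rough Python sketch that breaks the same count down per day, assuming each log line starts with an ISO date as the grep pattern implies (the function name and the sample log lines are invented for illustration, not taken from the real log):</p>

```python
import re
from collections import Counter

# Matches a leading ISO date such as "2021-12-01" at the start of a log line
DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})")

def count_lock_waits(lines, days):
    """Count 'still waiting for' log lines per day, for the given days only."""
    counts = Counter({day: 0 for day in days})
    for line in lines:
        match = DATE_PREFIX.match(line)
        if match and match.group(1) in counts and "still waiting for" in line:
            counts[match.group(1)] += 1
    return counts

days = ["2021-11-29", "2021-11-30", "2021-12-01", "2021-12-02"]

# Invented sample lines, only shaped like PostgreSQL lock wait messages
sample_log = [
    "2021-11-29 08:15:02.123 UTC [1234] LOG:  process 1234 still waiting for ShareLock on transaction 5678 after 1000.112 ms",
    "2021-11-29 08:15:03.456 UTC [1234] LOG:  process 1234 acquired ShareLock on transaction 5678 after 2333.001 ms",
    "2021-12-02 09:00:00.000 UTC [4321] LOG:  process 4321 still waiting for ShareLock on transaction 8765 after 1000.050 ms",
    "2021-11-20 10:00:00.000 UTC [1111] LOG:  process 1111 still waiting for ShareLock on transaction 9999 after 1000.001 ms",
]

waits = count_lock_waits(sample_log, days)
for day in days:
    print(day, waits[day])
```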
<li>I think you could analyze the locks for the <code>dspaceWeb</code> user (XMLUI) and find out what queries were locking&hellip; but it&rsquo;s so much information and I don&rsquo;t know where to start
<li>For now I just restarted PostgreSQL&hellip;</li>
<li>Francesca was able to do her submission immediately&hellip;</li>
<li>On a related note, I want to enable the <code>pg_stat_statements</code> feature to see which queries get run the most, so I created the extension on the CGSpace database</li>
<li>I was doing some research on PostgreSQL locks and found some interesting things to consider
<li>The default <code>lock_timeout</code> is 0, aka disabled</li>
<li>The default <code>statement_timeout</code> is 0, aka disabled</li>
<li>The usual recommendation seems to be to set <code>statement_timeout</code> first, with a rule of thumb of <a href="https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211">ten times longer than your longest query</a></li>
<li>Mark Wood mentioned the <code>checker</code> cron job that apparently runs in one transaction and might be an issue
<li>I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks</li>
<li>Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
<li>After some troubleshooting it turns out that the emails from CGSpace were going into her Junk folder!</li>
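<p>The <code>statement_timeout</code> rule of thumb from the lock research notes above is simple arithmetic. A tiny sketch, where the function name and the thirty-second longest query are hypothetical (I have not measured the real figure):</p>

```python
def suggested_statement_timeout_ms(longest_query_ms, factor=10):
    """Return a statement_timeout in milliseconds: factor times the longest query."""
    if longest_query_ms <= 0:
        raise ValueError("longest_query_ms must be positive")
    return longest_query_ms * factor

# Hypothetical: the longest legitimate query takes thirty seconds
timeout_ms = suggested_statement_timeout_ms(30_000)

# PostgreSQL reads a unitless statement_timeout as milliseconds
print(f"statement_timeout = {timeout_ms}")
```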
<h2 id="2021-12-03">2021-12-03</h2>
<li>I see GARDIAN is now using a &ldquo;GARDIAN&rdquo; user agent finally
<li>I will add them to our local spider agent override in DSpace so that the hits don&rsquo;t get counted in Solr</li>
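<p>For reference, a minimal sketch of how a user agent can be tested against a list of spider patterns. This assumes each pattern is a regular expression matched case-sensitively anywhere in the user agent string; DSpace&rsquo;s own matching may differ, and <code>ExampleBot</code> is a made-up entry:</p>

```python
import re

def is_spider(user_agent, patterns):
    """Return True if any pattern matches anywhere in the user agent string."""
    return any(re.search(pattern, user_agent) for pattern in patterns)

# Hypothetical override list: "GARDIAN" is from the notes, "ExampleBot" is invented
local_patterns = ["GARDIAN", r"ExampleBot\/\d+"]

print(is_spider("GARDIAN", local_patterns))
print(is_spider("Mozilla/5.0 (X11; Linux x86_64)", local_patterns))
```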
<h2 id="2021-12-05">2021-12-05</h2>
<li>Proofed fifty records Abenet sent me from Africa Rice Center (&ldquo;AfricaRice 1st batch Import&rdquo;)
<li>Fixed forty-six incorrect collections</li>
<li>Cleaned up and normalized affiliations</li>
<li>Cleaned up dates (extra <code>*</code> character in all?)</li>
<li>Cleaned up citation format</li>
<li>Fixed some encoding issues in abstracts</li>
<li>Removed empty columns</li>
<li>Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna</li>
<li>Added volume and issue metadata by extracting it from the citations</li>
<li>All PDFs hosted on davidpublishing.com are dead&hellip;</li>
<li>All DOIs linking to African Journal of Agricultural Research are dead&hellip;</li>
<li>Fixed a handful of items marked as &ldquo;Open Access&rdquo; that are actually closed</li>
<li>Added many missing ISSNs</li>
<li>Added many missing countries/regions</li>
<li>Fixed invalid AGROVOC terms and added some more based on article subjects</li>
<li>I also made some minor changes to the <a href="https://github.com/ilri/csv-metadata-quality">CSV Metadata Quality Checker</a>
<li>Added the ability to check if the item&rsquo;s title exists in the citation</li>
<li>Updated to only run the mojibake check if we&rsquo;re not running in unsafe mode (so we don&rsquo;t print the same warning during both the check and fix steps)</li>
<li>I ran the re-harvesting on AReS</li>
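<p>The title-in-citation check mentioned above can be sketched roughly like this. It is not the actual code in CSV Metadata Quality Checker; the function name, the normalization (case folding and whitespace collapsing), and the sample values are my assumptions:</p>

```python
def title_in_citation(title, citation):
    """Return True if the title appears in the citation, ignoring case and extra whitespace."""
    def normalize(text):
        return " ".join(text.split()).casefold()

    normalized_title = normalize(title)
    # An empty title would trivially be "in" any string, so treat it as a miss
    if not normalized_title:
        return False
    return normalized_title in normalize(citation)

# Made-up record for illustration
title = "An example  title about rice"
citation = "Author, A. 2021. An Example Title About Rice. Journal Name 1(2): 3-4."
print(title_in_citation(title, citation))
```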
<h2 id="2021-12-06">2021-12-06</h2>
<li>Some minor work on the <code>check-duplicates.py</code> script I wrote last month
<li>I found some corner cases where there were items that matched in the database, but they were <code>in_archive=f</code> and/or <code>withdrawn=t</code>, so I now check that before trying to resolve the handles of potential duplicates</li>
<li>More work on the Africa Rice Center 1st batch import
<li>I merged the metadata for three duplicates in Africa Rice&rsquo;s items and mapped them on CGSpace</li>
<li>I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc. and then uploaded the forty-six items to CGSpace</li>
<li>I started looking at the seventy CAS records that Abenet has been working on for the past few months</li>
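<p>The <code>in_archive</code> and <code>withdrawn</code> check described above for <code>check-duplicates.py</code> amounts to a simple filter. A sketch under the assumption that each database match is available as a dictionary with those two columns as booleans (the function name and the handles are placeholders, not the script&rsquo;s real code or data):</p>

```python
def is_live_item(item):
    """An item is worth resolving only if it is archived and not withdrawn."""
    return bool(item.get("in_archive")) and not item.get("withdrawn")

# Placeholder matches; the handles are invented
candidates = [
    {"handle": "example/1", "in_archive": True, "withdrawn": False},
    {"handle": "example/2", "in_archive": False, "withdrawn": False},
    {"handle": "example/3", "in_archive": True, "withdrawn": True},
]

live_handles = [item["handle"] for item in candidates if is_live_item(item)]
print(live_handles)
```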
<h2 id="2021-12-07">2021-12-07</h2>
<li>I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
<li>Also, I ran the <code>check-duplicates.py</code> script on them and found that they might ALL be duplicates!!!</li>
<li>I tweaked the script a bit more to use the issue dates as a third criterion and now there are fewer duplicates, but it&rsquo;s still at least twenty or so&hellip;</li>
<li>The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)</li>
<li>For example, many items like &ldquo;Annual Report 2020&rdquo; can have similar title and type to previous annual reports, but are not duplicates</li>
<li>I noticed a strange user agent in the XMLUI logs on CGSpace:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> - - [07/Dec/2021:11:51:24 +0100] &#34;GET /handle/10568/33203 HTTP/1.1&#34; 200 6328 &#34;-&#34; &#34;python-requests/2.25.1&#34;
 - - [07/Dec/2021:11:51:27 +0100] &#34;GET /handle/10568/33203 HTTP/2.0&#34; 200 6315 &#34;-&#34; &#34;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36&#34;
</code></pre></div>
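<p>To see how common a user agent like this is, the log can be tallied by its last quoted field. A small sketch, assuming the user agent is the final double-quoted field on each line as in the excerpt above (the function name is mine, and the sample lines use a documentation-range IP address as a placeholder):</p>

```python
import re
from collections import Counter

# The user agent is the last double-quoted field on the line
USER_AGENT = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Tally requests per user agent string."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines shaped like the excerpt above; 192.0.2.1 is a placeholder address
sample_lines = [
    '192.0.2.1 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"',
    '192.0.2.1 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"',
]

agents = count_user_agents(sample_lines)
for agent, total in agents.most_common():
    print(total, agent)
```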
<li>I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
<li>It could be someone on Azure?</li>
<li>I opened <a href="https://github.com/atmire/COUNTER-Robots/pull/49">a pull request to COUNTER-Robots</a> and I&rsquo;ll add this user agent to our local override until they decide to include it or not</li>
<li>I purged more than 34,000 hits from this user agent in our Solr statistics:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 34458 hits from HeadlessChrome in statistics

Total number of bot hits purged: 34458
</code></pre></div>
<li>Meeting with partners about repositories in the One CGIAR</li>
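<p>The issue date criterion I added to <code>check-duplicates.py</code> earlier today boils down to a date difference. A minimal sketch, assuming both dates are full ISO 8601 dates (YYYY-MM-DD); the function and parameter names here are illustrative rather than the script&rsquo;s own, and partial dates like YYYY or YYYY-MM would need extra handling:</p>

```python
from datetime import date

def issued_within(csv_issued, db_issued, days_threshold=365):
    """Return True if two ISO 8601 issue dates are less than days_threshold days apart."""
    delta = date.fromisoformat(csv_issued) - date.fromisoformat(db_issued)
    return abs(delta.days) < days_threshold

# Two annual reports with similar titles, issued two years apart: not duplicates
print(issued_within("2021-12-01", "2019-12-01"))

# Issued less than a year apart: still a potential duplicate
print(issued_within("2021-03-01", "2020-03-15"))
```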
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-12/">December, 2021</a></li>
<li><a href="/cgspace-notes/2021-11/">November, 2021</a></li>
<li><a href="/cgspace-notes/2021-10/">October, 2021</a></li>
<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<section class="sidebar-module">
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
<a href="#">Back to top</a>