cgspace-notes/docs/2021-08/index.html

406 lines
18 KiB
HTML
Raw Normal View History

2021-08-01 15:19:05 +02:00
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2021" />
<meta property="og:description" content="2021-08-01
Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed &#39;s/ \&#43;/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-08/" />
<meta property="article:published_time" content="2021-08-01T09:01:07+03:00" />
2021-08-09 16:10:45 +02:00
<meta property="article:modified_time" content="2021-08-09T08:38:44+03:00" />
2021-08-01 15:19:05 +02:00
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2021"/>
<meta name="twitter:description" content="2021-08-01
Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed &#39;s/ \&#43;/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
2021-08-06 08:08:15 +02:00
<meta name="generator" content="Hugo 0.87.0" />
2021-08-01 15:19:05 +02:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "August, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-08/",
2021-08-09 16:10:45 +02:00
"wordCount": "1537",
2021-08-01 15:19:05 +02:00
"datePublished": "2021-08-01T09:01:07+03:00",
2021-08-09 16:10:45 +02:00
"dateModified": "2021-08-09T08:38:44+03:00",
2021-08-01 15:19:05 +02:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-08/">
<title>August, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-08/">August, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-08-01T09:01:07+03:00">Sun Aug 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-08-01">2021-08-01</h2>
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt update &amp;&amp; apt dist-upgrade
# apt autoremove &amp;&amp; apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l &gt; 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
<li>&hellip; but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove &amp;&amp; apt autoclean
</code></pre><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
<li>Start a fresh harvesting on AReS (linode20)</li>
</ul>
2021-08-02 15:00:42 +02:00
<h2 id="2021-08-02">2021-08-02</h2>
<ul>
<li>Help Udana with OAI validation on CGSpace
<ul>
<li>He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2</li>
<li>Now it seems to be verified (all green): <a href="https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85">https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85</a></li>
<li>We are listed in the OpenArchives list of databases conforming to OAI 2.0</li>
</ul>
</li>
</ul>
2021-08-06 08:08:15 +02:00
<h2 id="2021-08-03">2021-08-03</h2>
<ul>
<li>Run fresh re-harvest on AReS</li>
</ul>
<h2 id="2021-08-05">2021-08-05</h2>
<ul>
<li>Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall&rsquo;s List)
<ul>
<li>We agreed to unmap it from RTB&rsquo;s collection for now, and I asked for advice from Peter and Abenet for what to do in the future</li>
</ul>
</li>
<li>A developer from the Alliance asked for access to the CGSpace database so they can make some integration with PowerBI
<ul>
<li>I told them we don&rsquo;t allow direct database access, and that it would be tricky anyways (that&rsquo;s what APIs are for!)</li>
</ul>
</li>
<li>I&rsquo;m curious if there are still any requests coming in to CGSpace from the abusive Russian networks
<ul>
<li>I extracted all the unique IPs that nginx processed in the last week:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, now I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
2021-08-09 07:38:44 +02:00
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
<ul>
<li>I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:</li>
<li>WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific</li>
<li>WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea</li>
</ul>
</li>
<li>Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
<ul>
<li>35.174.144.154 made 11,000 requests last month with the following user agent:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don&rsquo;t see them logging in at all, only scraping&hellip; so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
<ul>
<li>Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so</li>
<li>Same deal with 93.158.90.91, also in Sweden</li>
</ul>
</li>
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&quot;
</code></pre><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots</li>
</ul>
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more but that&rsquo;s most of them over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
Total number of bot hits purged: 90485
</code></pre><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
Total number of hits from bots: 4492
</code></pre><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata in dc.subject (57) should be in dcterms.subject (187)</li>
<li>Twelve metadata in cg.title.journal (202) should be in cg.journal (251)</li>
<li>Three dc.identifier.isbn (20) should be in cg.isbn (252)</li>
<li>Three dc.identifier.issn (21) should be in cg.issn (253)</li>
<li>I re-ran the <code>migrate-fields.sh</code> script on CGSpace</li>
</ul>
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so so I had to split the CSV with <code>xsv split</code> in order to process it</li>
<li>I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes&hellip;</li>
2021-08-09 16:10:45 +02:00
<li>In total it was 1,195 changes to ISSN and ISBN metadata fields</li>
</ul>
</li>
</ul>
<h2 id="2021-08-09">2021-08-09</h2>
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,&quot;same&quot;,&quot;different&quot;)
</code></pre><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
</ul>
</li>
<li>Convert my <code>generate-thumbnails.py</code> script to use libvips instead of Graphicsmagick
<ul>
<li>It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs</li>
<li>One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can&rsquo;t work in CMYK</li>
<li>I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I&rsquo;m not sure&hellip;</li>
2021-08-09 07:38:44 +02:00
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
2021-08-01 15:19:05 +02:00
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>