<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="August, 2021" /> <meta property="og:description" content="2021-08-01 Update Docker images on AReS server (linode20) and reboot the server: # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull I decided to upgrade linode20 from Ubuntu 18.04 to 20.04 " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-08/" /> <meta property="article:published_time" content="2021-08-01T09:01:07+03:00" /> <meta property="article:modified_time" content="2021-09-02T17:06:28+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="August, 2021"/> <meta name="twitter:description" content="2021-08-01 Update Docker images on AReS server (linode20) and reboot the server: # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull I decided to upgrade linode20 from Ubuntu 18.04 to 20.04 "/> <meta name="generator" content="Hugo 0.121.1"> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "August, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-08/", "wordCount": "3195", "datePublished": "2021-08-01T09:01:07+03:00", "dateModified": "2021-09-02T17:06:28+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-08/"> <title>August, 2021 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous"> <!-- 
minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-08/">August, 2021</a></h2> <p class="blog-post-meta"> <time datetime="2021-08-01T09:01:07+03:00">Sun Aug 01, 2021</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2021-08-01">2021-08-01</h2> <ul> <li>Update Docker images on AReS server (linode20) and reboot the server:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull </span></span></code></pre></div><ul> <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li> </ul> <ul> <li>First running all existing updates, taking some backups, checking for broken packages, and then 
rebooting:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt update <span style="color:#f92672">&&</span> apt dist-upgrade </span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&&</span> apt autoclean </span></span><span style="display:flex;"><span># check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge </span></span><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span> </span></span><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span> | xargs dpkg -P </span></span><span style="display:flex;"><span># dpkg -C </span></span><span style="display:flex;"><span># dpkg -l > 2021-08-01-linode20-dpkg.txt </span></span><span style="display:flex;"><span># tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc </span></span><span style="display:flex;"><span># reboot </span></span><span style="display:flex;"><span># sed -i <span style="color:#e6db74">'s/bionic/focal/'</span> /etc/apt/sources.list.d/*.list </span></span><span style="display:flex;"><span># <span style="color:#66d9ef">do</span>-release-upgrade </span></span></code></pre></div><ul> <li>… but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li> <li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt install -f </span></span><span 
style="display:flex;"><span># apt dist-upgrade </span></span><span style="display:flex;"><span># reboot </span></span></code></pre></div><ul> <li>After rebooting I purged all packages with residual configs and cleaned up again:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span> | xargs dpkg -P </span></span><span style="display:flex;"><span># apt autoremove <span style="color:#f92672">&&</span> apt autoclean </span></span></code></pre></div><ul> <li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li> <li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li> <li>Advise Peter and Abenet on expected CGSpace budget for 2022</li> <li>Start a fresh harvesting on AReS (linode20)</li> </ul> <h2 id="2021-08-02">2021-08-02</h2> <ul> <li>Help Udana with OAI validation on CGSpace <ul> <li>He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2</li> <li>Now it seems to be verified (all green): <a href="https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85">https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85</a></li> <li>We are listed in the OpenArchives list of databases conforming to OAI 2.0</li> </ul> </li> </ul> <h2 id="2021-08-03">2021-08-03</h2> <ul> <li>Run fresh re-harvest on AReS</li> </ul> <h2 id="2021-08-05">2021-08-05</h2> <ul> <li>Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall’s List) <ul> <li>We agreed to unmap it from RTB’s collection for now, and I asked 
for advice from Peter and Abenet for what to do in the future</li> </ul> </li> <li>A developer from the Alliance asked for access to the CGSpace database so they can make some integration with PowerBI <ul> <li>I told them we don’t allow direct database access, and that it would be tricky anyways (that’s what APIs are for!)</li> </ul> </li> <li>I’m curious if there are still any requests coming in to CGSpace from the abusive Russian networks <ul> <li>I extracted all the unique IPs that nginx processed in the last week:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">" (200|499) "</span> | grep -v -E <span style="color:#e6db74">"(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)"</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq > /tmp/2021-08-05-all-ips.txt </span></span><span style="display:flex;"><span># wc -l /tmp/2021-08-05-all-ips.txt </span></span><span style="display:flex;"><span>43428 /tmp/2021-08-05-all-ips.txt </span></span></code></pre></div><ul> <li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!) 
<ul> <li>Indeed, I see that no IPs from those networks are coming in now:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv </span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">'^(49453|46844|206485|62282|36352|35913|35624|8100)$'</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv </span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv </span></span><span style="display:flex;"><span>0 /tmp/2021-08-05-all-ips-to-purge.csv </span></span></code></pre></div><h2 id="2021-08-08">2021-08-08</h2> <ul> <li>Advise IWMI colleagues on best practices for thumbnails</li> <li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest <ul> <li>I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:</li> <li>WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific</li> <li>WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea</li> </ul> </li> <li>Look through the July Solr statistics for the top IPs and user agents, checking for anything strange <ul> <li>35.174.144.154 made 11,000 requests last month with the following user agent:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Macintosh; Intel Mac OS X 
10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36 </span></span></code></pre></div><ul> <li>That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP</li> <li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too <ul> <li>Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so</li> <li>Same deal with 93.158.90.91, also in Sweden</li> </ul> </li> <li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li> <li>61.143.40.50 is in China and uses this hilarious user agent:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}" </span></span></code></pre></div><ul> <li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li> <li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li> <li>95.87.154.12 seems to be a new bot with the following user agent:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/ </span></span></code></pre></div><ul> <li>They have a legitimate EU-funded project to enrich data 
for under-resourced languages in the EU <ul> <li>I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots</li> </ul> </li> <li>I see a new bot using this user agent:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>nettle (+https://www.nettle.sk) </span></span></code></pre></div><ul> <li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li> <li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li> <li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li> <li>There are probably more but that’s most of them over 1,000 hits last month, so I will purge them:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p </span></span><span style="display:flex;"><span>Purging 10796 hits from 35.174.144.154 in statistics </span></span><span style="display:flex;"><span>Purging 9993 hits from 93.158.90.30 in statistics </span></span><span style="display:flex;"><span>Purging 6092 hits from 130.255.162.173 in statistics </span></span><span style="display:flex;"><span>Purging 24863 hits from 3.225.28.105 in statistics </span></span><span style="display:flex;"><span>Purging 2988 hits from 93.158.90.91 in statistics </span></span><span style="display:flex;"><span>Purging 2497 hits from 61.143.40.50 in statistics </span></span><span style="display:flex;"><span>Purging 13866 hits from 
159.138.131.15 in statistics </span></span><span style="display:flex;"><span>Purging 2721 hits from 95.87.154.12 in statistics </span></span><span style="display:flex;"><span>Purging 2786 hits from 47.252.80.214 in statistics </span></span><span style="display:flex;"><span>Purging 1485 hits from 129.0.211.251 in statistics </span></span><span style="display:flex;"><span>Purging 8952 hits from 217.182.21.193 in statistics </span></span><span style="display:flex;"><span>Purging 3446 hits from 103.135.104.139 in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485 </span></span></code></pre></div><ul> <li>Then I purged a few thousand more by user agent:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri </span></span><span style="display:flex;"><span>Found 2707 hits from MaCoCu in statistics </span></span><span style="display:flex;"><span>Found 1785 hits from nettle in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492 </span></span></code></pre></div><ul> <li>I found some CGSpace metadata in the wrong fields <ul> <li>Seven metadata in dc.subject (57) should be in dcterms.subject (187)</li> <li>Twelve metadata in cg.title.journal (202) should be in cg.journal (251)</li> <li>Three dc.identifier.isbn (20) should be in cg.isbn (252)</li> <li>Three dc.identifier.issn (21) should be in cg.issn (253)</li> <li>I 
re-ran the <code>migrate-fields.sh</code> script on CGSpace</li> </ul> </li> <li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]'</span> /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv </span></span></code></pre></div><ul> <li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves <ul> <li>It was a few thousand metadata entries in total, so I had to split the CSV with <code>xsv split</code> in order to process it</li> <li>I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…</li> <li>In total it was 1,195 changes to ISSN and ISBN metadata fields</li> </ul> </li> </ul> <h2 id="2021-08-09">2021-08-09</h2> <ul> <li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'cg.issn[en_US]'</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">'^[0-9]{4}'</span> | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt </span></span><span style="display:flex;"><span>$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o 
/tmp/2021-08-09-journals-sherpa-romeo.csv </span></span><span style="display:flex;"><span>$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv </span></span></code></pre></div><ul> <li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -i <span style="color:#e6db74">'1s/journal title/sherpa romeo journal title/'</span> /tmp/2021-08-09-journals-sherpa-romeo.csv </span></span><span style="display:flex;"><span>$ sed -i <span style="color:#e6db74">'1s/journal title/crossref journal title/'</span> /tmp/2021-08-09-journals-crossref.csv </span></span><span style="display:flex;"><span>$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv </span></span></code></pre></div><ul> <li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL expression:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different") </span></span></code></pre></div><ul> <li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections <ul> <li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li> </ul> </li> <li>Convert my <code>generate-thumbnails.py</code> script to use libvips instead of GraphicsMagick <ul> <li>It is faster 
and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs</li> <li>One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can’t work in CMYK</li> <li>I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick’s), so I’m not sure…</li> <li>Perhaps this is not a problem after all; see this PR from 2019: <a href="https://github.com/libvips/libvips/pull/1196">https://github.com/libvips/libvips/pull/1196</a></li> </ul> </li> <li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">'%s-vips.jpg[Q=85,optimize_coding,strip]'</span> </span></span><span style="display:flex;"><span>39004:0.08 </span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg </span></span><span style="display:flex;"><span>40932:0.53 </span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg </span></span><span style="display:flex;"><span>41724:0.59 </span></span><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 
<span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg </span></span><span style="display:flex;"><span>24736:0.04 </span></span></code></pre></div><ul> <li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail) <ul> <li>libvips does use less time and memory… I should do more tests!</li> <li>I wonder if I can try to use these <a href="https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java">unofficial Java bindings</a> in DSpace</li> <li>The authors of the JVips project wrote a nice blog post about libvips performance: <a href="https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55">https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55</a></li> <li>Ouch, JVips is Java 8 only as far as I can tell… that works now, but it’s a non-starter going forward</li> </ul> </li> </ul> <h2 id="2021-08-11">2021-08-11</h2> <ul> <li>Peter got back to me about the journal title cleanup <ul> <li>From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s</li> <li>Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref, as well as Peter’s selections for the titles where Sherpa Romeo and Crossref differed:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt </span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'sherpa romeo journal title'</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt 
</span></span><span style="display:flex;"><span>$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l </span></span><span style="display:flex;"><span>1911 </span></span></code></pre></div><ul> <li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li> <li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV; </span></span><span style="display:flex;"><span>COPY 3245 </span></span></code></pre></div><ul> <li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them… <ul> <li>I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li> <li>Or instead of doing it via SQL I could use CSV and parse the values there…</li> </ul> </li> <li>A few more issues: <ul> <li>Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)</li> <li>Some titles are different across all three datasets, for example ISSN 0003-1305: <ul> <li><a href="https://portal.issn.org/resource/ISSN/0003-1305">According to ISSN.org</a> this is “The American statistician”</li> <li><a href="https://v2.sherpa.ac.uk/id/publication/20807">According to Sherpa Romeo</a> this is “American Statistician”</li> <li><a 
href="https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician">According to Crossref</a> this is “The American Statistician”</li> </ul> </li> </ul> </li> <li>I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals <ul> <li>Now I went back and merged the previous with the new, and manually removed duplicates (sigh)</li> <li>I requested access to the issn.org OAI-PMH API so I can use their registry…</li> </ul> </li> </ul> <h2 id="2021-08-12">2021-08-12</h2> <ul> <li>I sent an email to Sherpa Romeo’s help contact to ask about missing ISSNs <ul> <li>They pointed me to their <a href="https://v2.sherpa.ac.uk/romeo/about.html">inclusion criteria</a> and said that missing journals should submit their open access policies to be included</li> </ul> </li> <li>The contact from issn.org got back to me and said I should pay 1,000/year EUR for 100,000 requests to their API… no thanks</li> <li>Submit a pull request to COUNTER-Robots for the httpx bot (<a href="https://github.com/atmire/COUNTER-Robots/pull/45">#45</a>) <ul> <li>In the mean time I added it to our local ILRI overrides</li> </ul> </li> </ul> <h2 id="2021-08-15">2021-08-15</h2> <ul> <li>Start a fresh reindex on AReS</li> </ul> <h2 id="2021-08-16">2021-08-16</h2> <ul> <li>Meeting with Abenet and Peter about CGSpace actions and future <ul> <li>We agreed to move three top-level Feed the Future projects into one community, so I created one and moved them:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600 </span></span><span 
style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730 </span></span><span style="display:flex;"><span>$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451 </span></span></code></pre></div><ul> <li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman <ul> <li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li> </ul> </li> </ul> <h2 id="2021-08-17">2021-08-17</h2> <ul> <li>I made an initial attempt at the policy statements page on DSpace Test <ul> <li>It is modeled on Sherpa Romeo’s OpenDOAR policy statements advice</li> </ul> </li> <li>Sit with Moayad and discuss the future of AReS <ul> <li>We specifically discussed formalizing the API and documenting its use so that it can serve as an alternative to harvesting directly from CGSpace</li> <li>We also discussed allowing linking to search results to enable something like “Explore this collection” links on CGSpace collection pages</li> </ul> </li> <li>Lowercase all AGROVOC metadata, as I had noticed a few in sentence case:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]'; </span></span><span style="display:flex;"><span>UPDATE 484 </span></span></code></pre></div><ul> <li>Also update some DOIs using the <code>dx.doi.org</code> format, just to keep things uniform:</li> </ul> <div class="highlight"><pre tabindex="0" 
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%'; </span></span><span style="display:flex;"><span>UPDATE 469 </span></span></code></pre></div><ul> <li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>real 322m16.917s </span></span><span style="display:flex;"><span>user 226m43.121s </span></span><span style="display:flex;"><span>sys 3m17.469s </span></span></code></pre></div><ul> <li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -X POST <span style="color:#e6db74">'https://cgspace.cgiar.org/explorer/api/search?scroll=1d'</span> <span style="color:#ae81ff">\ </span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> -H 'Content-Type: 
application/json' \ </span></span><span style="display:flex;"><span> -d '{ </span></span><span style="display:flex;"><span> "size": 10, </span></span><span style="display:flex;"><span> "query": { </span></span><span style="display:flex;"><span> "bool": { </span></span><span style="display:flex;"><span> "filter": { </span></span><span style="display:flex;"><span> "term": { </span></span><span style="display:flex;"><span> "repo.keyword": "CGSpace" </span></span><span style="display:flex;"><span> } </span></span><span style="display:flex;"><span> } </span></span><span style="display:flex;"><span> } </span></span><span style="display:flex;"><span> } </span></span><span style="display:flex;"><span>}' </span></span><span style="display:flex;"><span>$ curl -X POST <span style="color:#e6db74">'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='</span> </span></span></code></pre></div><ul> <li>This uses the Elasticsearch scroll ID to page through results <ul> <li>The second query doesn’t need the request body because it is saved for 1 day as part of the first request</li> </ul> </li> <li>Attempt to re-do my tests with VisualVM from 2019-04 <ul> <li>I found that I can’t connect to the Tomcat JMX port using SSH forwarding (visualvm gives an error about localhost already being monitored)</li> <li>Instead, I had to create a SOCKS proxy with SSH (ssh -D 8096), then set that up as a proxy in the VisualVM network settings, and then add the JMX connection</li> <li>See: <a href="https://dzone.com/articles/visualvm-monitoring-remote-jvm">https://dzone.com/articles/visualvm-monitoring-remote-jvm</a></li> <li>I have to spend more time on this…</li> </ul> </li> <li>I fixed a bug in the Altmetric donuts on OpenRXV <ul> <li>We now <a href="https://github.com/ilri/OpenRXV/pull/113">try to show the donut for the DOI first if it exists, then fall back to the Handle</a></li> <li>This is working on my local test, but not on the 
live site… sigh</li> <li>I started a fresh harvest, maybe it’s something to do with the metadata in Elasticsearch</li> </ul> </li> <li>I improved the quality of the “no thumbnail” placeholder image on AReS: <a href="https://github.com/ilri/OpenRXV/pull/114">https://github.com/ilri/OpenRXV/pull/114</a></li> <li>I sent some feedback to some ILRI and CCAFS colleagues about how to use better thumbnails for publications</li> </ul> <h2 id="2021-08-24">2021-08-24</h2> <ul> <li>In the last few days I did a lot of work on OpenRXV <ul> <li>I started exploring the Angular 9.0 to 9.1 update</li> <li>I tested some updates to dependencies for Angular 9 that we somehow missed, like @tinymce/tinymce-angular, @nicky-lenaers/ngx-scroll-to, and @ng-select/ng-select</li> <li>I changed the default target from ES5 to ES2015 because ES5 was released in 2009 and the only thing we lose by moving to ES2015 is IE11 support</li> <li>I fixed a handful of issues in the Docker build and deployment process</li> <li>I started exploring changing the Docker configuration from using volumes to <code>COPY</code> instructions in the <code>Dockerfile</code> because we are having sporadic issues with permissions in containers caused by copying the host’s frontend/backend directories and not being able to write to them</li> <li>I tested moving from node-sass to sass, as it has been <a href="https://blog.ninja-squad.com/2019/05/29/angular-cli-8.0/">supported since Angular 8 apparently</a> and will allow us to avoid stupid node-gyp issues</li> </ul> </li> </ul> <h2 id="2021-08-25">2021-08-25</h2> <ul> <li>I did a bunch of tests of the OpenRXV Angular 9.1 update and merged it to master (<a href="https://github.com/ilri/OpenRXV/pull/115">#115</a>)</li> <li>Last week Maria Garruccio sent me a handful of new ORCID identifiers for Bioversity staff <ul> <li>We currently have 1320 unique identifiers, so this adds eleven new ones:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" 
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort | uniq > /tmp/2021-08-25-combined-orcids.txt </span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-08-25-combined-orcids.txt </span></span><span style="display:flex;"><span>1331 </span></span></code></pre></div><ul> <li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt </span></span></code></pre></div><ul> <li>Tag existing items from the Alliance’s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2021-08-25-add-orcids.csv </span></span><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier </span></span><span style="display:flex;"><span>"Chege, Christine G. 
Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279" </span></span><span style="display:flex;"><span>"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279" </span></span><span style="display:flex;"><span>"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279" </span></span><span style="display:flex;"><span>"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833" </span></span><span style="display:flex;"><span>"Rahn, E.","Eric Rahn: 0000-0001-6280-7430" </span></span><span style="display:flex;"><span>"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430" </span></span><span style="display:flex;"><span>"Jager M.","Matthias Jager: 0000-0003-1059-3949" </span></span><span style="display:flex;"><span>"Jager, M.","Matthias Jager: 0000-0003-1059-3949" </span></span><span style="display:flex;"><span>"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949" </span></span><span style="display:flex;"><span>"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215" </span></span><span style="display:flex;"><span>"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215" </span></span><span style="display:flex;"><span>"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873" </span></span><span style="display:flex;"><span>"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854" </span></span><span style="display:flex;"><span>"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483" </span></span><span style="display:flex;"><span>"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483" </span></span><span style="display:flex;"><span>"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389" </span></span><span style="display:flex;"><span>"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389" </span></span><span style="display:flex;"><span>"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389" </span></span><span style="display:flex;"><span>"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170" </span></span><span style="display:flex;"><span>"Groot, J.C.J.","Groot, 
J.C.J.: 0000-0001-6516-5170" </span></span><span style="display:flex;"><span>"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170" </span></span><span style="display:flex;"><span>"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170" </span></span><span style="display:flex;"><span>"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223" </span></span><span style="display:flex;"><span>"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223" </span></span><span style="display:flex;"><span>"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304" </span></span><span style="display:flex;"><span>"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304" </span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> </span></span></code></pre></div><h2 id="2021-08-29">2021-08-29</h2> <ul> <li>Run a full harvest on AReS</li> <li>Also did more work on OpenRXV over the past few days <ul> <li>I switched the backend target from ES2017 to ES2019</li> <li>I did a proof of concept of multi-stage builds and a simplified Docker configuration</li> </ul> </li> <li>Update the list of ORCID identifiers on CGSpace</li> <li>Run system updates and reboot CGSpace (linode18)</li> </ul> <h2 id="2021-08-31">2021-08-31</h2> <ul> <li>Yesterday I finished the work to make OpenRXV use a new multi-stage Docker build system and smarter <code>COPY</code> instructions instead of runtime volumes <ul> <li>Today I merged the changes to the master branch and re-deployed AReS on linode20</li> <li>Because the <code>docker-compose.yml</code> moved to the root, the Docker volume prefix changed from <code>docker_</code> to <code>openrxv_</code>, so I had to stop the containers and rsync the data from the old volume to the new one in <code>/var/lib/docker</code></li> </ul> </li> </ul> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3
ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2023-12/">December, 2023</a></li> <li><a href="/cgspace-notes/2023-11/">November, 2023</a></li> <li><a href="/cgspace-notes/2023-10/">October, 2023</a></li> <li><a href="/cgspace-notes/2023-09/">September, 2023</a></li> <li><a href="/cgspace-notes/2023-08/">August, 2023</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>