<!DOCTYPE html>
<html lang="en" >

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <meta property="og:title" content="August, 2021" />
    <meta property="og:description" content="2021-08-01

Update Docker images on AReS server (linode20) and reboot the server:

# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull

I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
" />
    <meta property="og:type" content="article" />
    <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-08/" />
    <meta property="article:published_time" content="2021-08-01T09:01:07+03:00" />
    <meta property="article:modified_time" content="2021-08-17T10:59:14+03:00" />

    <meta name="twitter:card" content="summary"/>
    <meta name="twitter:title" content="August, 2021"/>
    <meta name="twitter:description" content="2021-08-01

Update Docker images on AReS server (linode20) and reboot the server:

# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull

I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
    <meta name="generator" content="Hugo 0.87.0" />

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "BlogPosting",
      "headline": "August, 2021",
      "url": "https://alanorth.github.io/cgspace-notes/2021-08/",
      "wordCount": "2512",
      "datePublished": "2021-08-01T09:01:07+03:00",
      "dateModified": "2021-08-17T10:59:14+03:00",
      "author": {
        "@type": "Person",
        "name": "Alan Orth"
      },
      "keywords": "Notes"
    }
    </script>

    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-08/">

    <title>August, 2021 | CGSpace Notes</title>

    <!-- combined, minified CSS -->
    <link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">

    <!-- minified Font Awesome for SVG icons -->
    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk+S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->

</head>
<body>

    <div class="blog-masthead">
        <div class="container">
            <nav class="nav blog-nav">
                <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
            </nav>
        </div>
    </div>

    <header class="blog-header">
        <div class="container">
            <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
            <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
        </div>
    </header>

    <div class="container">
        <div class="row">
            <div class="col-sm-8 blog-main">

                <article class="blog-post">
                    <header>
                        <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-08/">August, 2021</a></h2>
                        <p class="blog-post-meta">
                            <time datetime="2021-08-01T09:01:07+03:00">Sun Aug 01, 2021</time>
                            in
                            <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
                        </p>
                    </header>
<h2 id="2021-08-01">2021-08-01</h2>
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt update && apt dist-upgrade
# apt autoremove && apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
<li>… but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually (a rough sketch of the copy step follows)</li>
</ul>
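<ul>
<li>Roughly, that meant copying the library from another 20.04 machine and refreshing the linker cache before resuming the upgrade (a sketch only; the source hostname is a placeholder and the exact library path may differ):</li>
</ul>
<pre><code class="language-console" data-lang="console"># copy libcrypt from any working Ubuntu 20.04 host ("ubuntu2004-host" is a placeholder)
# scp ubuntu2004-host:/usr/lib/x86_64-linux-gnu/libcrypt.so.1.1.0 /usr/lib/x86_64-linux-gnu/
# ldconfig
</code></pre>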
<pre><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove && apt autoclean
</code></pre><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a> (see the sketch after this list)</li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
<li>Start a fresh harvesting on AReS (linode20)</li>
</ul>
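<ul>
<li>For reference, clearing the fact cache and re-running the playbooks looks roughly like this (a sketch only; the cache path, playbook name, and host limit are assumptions, not the exact invocation):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ rm -rf ~/.ansible/cache/facts    # wherever fact_caching_connection points in ansible.cfg (path is an assumption)
$ ansible-playbook nodes.yml --flush-cache --limit linode20    # playbook and host names are placeholders
</code></pre>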
<h2 id="2021-08-02">2021-08-02</h2>
<ul>
<li>Help Udana with OAI validation on CGSpace
<ul>
<li>He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2</li>
<li>Now it seems to be verified (all green): <a href="https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85">https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85</a></li>
<li>We are listed in the OpenArchives list of databases conforming to OAI 2.0</li>
</ul>
</li>
</ul>
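<ul>
<li>For reference, the validator mostly exercises the standard OAI-PMH verbs, which can be spot-checked by hand with curl (a sketch; the <code>/oai/request</code> path is the DSpace default and is assumed here):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=Identify'
$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc' | head -n 20
</code></pre>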
<h2 id="2021-08-03">2021-08-03</h2>
<ul>
<li>Run fresh re-harvest on AReS</li>
</ul>
<h2 id="2021-08-05">2021-08-05</h2>
<ul>
<li>Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall’s List)
<ul>
<li>We agreed to unmap it from RTB’s collection for now, and I asked for advice from Peter and Abenet for what to do in the future</li>
</ul>
</li>
<li>A developer from the Alliance asked for access to the CGSpace database so they can build an integration with PowerBI
<ul>
<li>I told them we don’t allow direct database access, and that it would be tricky anyways (that’s what APIs are for!)</li>
</ul>
</li>
<li>I’m curious if there are still any requests coming in to CGSpace from the abusive Russian networks
<ul>
<li>I extracted all the unique IPs that nginx processed in the last week:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
<ul>
<li>I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:</li>
<li>WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific</li>
<li>WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea</li>
</ul>
</li>
<li>Looking at the July Solr statistics to find the top IPs and user agents, checking for anything strange (see the example facet query at the end of this section)
<ul>
<li>35.174.144.154 made 11,000 requests last month with the following user agent:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
<ul>
<li>Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so</li>
<li>Same deal with 93.158.90.91, also in Sweden</li>
</ul>
</li>
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
</code></pre><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloaded 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots</li>
</ul>
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more, but those are most of the ones with over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics

Total number of bot hits purged: 90485
</code></pre><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics

Total number of hits from bots: 4492
</code></pre><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata values in dc.subject (57) should be in dcterms.subject (187)</li>
<li>Twelve metadata values in cg.title.journal (202) should be in cg.journal (251)</li>
<li>Three dc.identifier.isbn (20) should be in cg.isbn (252)</li>
<li>Three dc.identifier.issn (21) should be in cg.issn (253)</li>
<li>I re-ran the <code>migrate-fields.sh</code> script on CGSpace</li>
</ul>
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
<li>I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…</li>
<li>In total it was 1,195 changes to ISSN and ISBN metadata fields</li>
</ul>
</li>
</ul>
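<ul>
<li>For reference, the “top IPs” query against the Solr statistics core is just a facet query like the following (a sketch; the Solr port and core name are assumptions and may differ on CGSpace):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -G 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=time:[2021-07-01T00:00:00Z TO 2021-07-31T23:59:59Z]' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'facet.limit=20' \
    --data-urlencode 'wt=json'
</code></pre><ul>
<li>Swapping <code>facet.field=ip</code> for <code>facet.field=userAgent</code> gives the top user agents the same way</li>
</ul>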
<h2 id="2021-08-09">2021-08-09</h2>
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
</code></pre><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
</ul>
</li>
<li>Convert my <code>generate-thumbnails.py</code> script to use libvips instead of GraphicsMagick
<ul>
<li>It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs</li>
<li>One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can’t work in CMYK</li>
<li>I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…</li>
<li>Perhaps this is not a problem after all, see this PR from 2019: <a href="https://github.com/libvips/libvips/pull/1196">https://github.com/libvips/libvips/pull/1196</a></li>
</ul>
</li>
<li>I did some tests of the memory used (max RSS, in KB) and time elapsed (in seconds) with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre><ul>
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory… I should do more tests!</li>
<li>I wonder if I can try to use these <a href="https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java">unofficial Java bindings</a> in DSpace</li>
<li>The authors of the JVips project wrote a nice blog post about libvips performance: <a href="https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55">https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55</a></li>
<li>Ouch, JVips is Java 8 only as far as I can tell… that works now, but it’s a non-starter going forward</li>
</ul>
</li>
</ul>
<h2 id="2021-08-11">2021-08-11</h2>
<ul>
<li>Peter got back to me about the journal title cleanup
<ul>
<li>From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s</li>
<li>Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter’s selections for where Sherpa Romeo and Crossref differed:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine (an example invocation is sketched at the end of this section), but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
<ul>
<li>I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
<li>Or instead of doing it via SQL I could use CSV and parse the values there…</li>
</ul>
</li>
<li>A few more issues:
<ul>
<li>Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)</li>
<li>Some titles are different across all three datasets, for example ISSN 0003-1305:
<ul>
<li><a href="https://portal.issn.org/resource/ISSN/0003-1305">According to ISSN.org</a> this is “The American statistician”</li>
<li><a href="https://v2.sherpa.ac.uk/id/publication/20807">According to Sherpa Romeo</a> this is “American Statistician”</li>
<li><a href="https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician">According to Crossref</a> this is “The American Statistician”</li>
</ul>
</li>
</ul>
</li>
<li>I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
<ul>
<li>Now I went back and merged the previous with the new, and manually removed duplicates (sigh)</li>
<li>I requested access to the issn.org OAI-PMH API so I can use their registry…</li>
</ul>
</li>
</ul>
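<ul>
<li>For reference, serving a controlled vocabulary CSV to OpenRefine with reconcile-csv looks roughly like this (a sketch; the jar version, CSV file name, and column names are assumptions):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ java -Xmx512m -jar reconcile-csv-0.1.2.jar journal-titles.csv title id
# then add http://localhost:8000/reconcile as a reconciliation service in OpenRefine
</code></pre>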
<h2 id="2021-08-12">2021-08-12</h2>
<ul>
<li>I sent an email to Sherpa Romeo’s help contact to ask about missing ISSNs
<ul>
<li>They pointed me to their <a href="https://v2.sherpa.ac.uk/romeo/about.html">inclusion criteria</a> and said that missing journals should submit their open access policies to be included</li>
</ul>
</li>
<li>The contact from issn.org got back to me and said I should pay 1,000 EUR per year for 100,000 requests to their API… no thanks</li>
<li>Submit a pull request to COUNTER-Robots for the httpx bot (<a href="https://github.com/atmire/COUNTER-Robots/pull/45">#45</a>)
<ul>
<li>In the meantime I added it to our local ILRI overrides (roughly as sketched below)</li>
</ul>
</li>
</ul>
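<ul>
<li>Adding the agent to the local overrides file and purging its hits would look something like this (a sketch; the exact pattern I added may differ, and this assumes <code>check-spider-hits.sh</code> takes the same <code>-p</code> purge flag as the IP version of the script):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ echo 'httpx' >> dspace/config/spiders/agents/ilri
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
</code></pre>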
<h2 id="2021-08-15">2021-08-15</h2>
<ul>
<li>Start a fresh reindex on AReS</li>
</ul>
<h2 id="2021-08-16">2021-08-16</h2>
<ul>
<li>Meeting with Abenet and Peter about CGSpace actions and future
<ul>
<li>We agreed to move three top-level Feed the Future projects into one community, so I created one and moved them:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</code></pre><ul>
<li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman (see the example below)
<ul>
<li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li>
</ul>
</li>
</ul>
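<ul>
<li>The difference is easy to see with podman’s short-name resolution (illustrative only; the image tag is an arbitrary example, not one from OpenRXV):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ podman pull redis:alpine                     # short name: podman has to guess or prompt for a registry
$ podman pull docker.io/library/redis:alpine   # fully qualified: unambiguous on both Docker and podman
</code></pre>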
<h2 id="2021-08-17">2021-08-17</h2>
<ul>
<li>I made an initial attempt on the policy statements page on DSpace Test
<ul>
<li>It is modeled on Sherpa Romeo’s OpenDOAR policy statements advice</li>
</ul>
</li>
<li>Sit with Moayad and discuss the future of AReS
<ul>
<li>We specifically discussed formalizing the API and documenting its use so it can be offered as an alternative to harvesting directly from CGSpace</li>
<li>We also discussed allowing linking to search results to enable something like “Explore this collection” links on CGSpace collection pages</li>
</ul>
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
</code></pre><ul>
<li>Also update some DOIs using the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
</code></pre><ul>
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    322m16.917s
user    226m43.121s
sys     3m17.469s
</code></pre><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
    -H 'Content-Type: application/json' \
    -d '{
    "size": 10,
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "repo.keyword": "CGSpace"
                }
            }
        }
    }
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
</code></pre><ul>
<li>This uses the Elasticsearch scroll ID to page through results (the one-liner below shows one way to capture it)
<ul>
<li>The second query doesn’t need the request body because the search context is saved for 1 day (the <code>scroll=1d</code> parameter) as part of the first request</li>
</ul>
</li>
</ul>
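<ul>
<li>One way to capture the scroll ID from the first response and feed it to the follow-up request (a sketch, assuming the wrapper returns Elasticsearch’s usual <code>_scroll_id</code> field):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ scroll_id=$(curl -s -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
    -H 'Content-Type: application/json' \
    -d '{"size": 10, "query": {"bool": {"filter": {"term": {"repo.keyword": "CGSpace"}}}}}' | jq -r '._scroll_id')
$ curl -s -X POST "https://cgspace.cgiar.org/explorer/api/search/scroll/$scroll_id"
</code></pre>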
<!-- raw HTML omitted -->

                </article>

            </div> <!-- /.blog-main -->

            <aside class="col-sm-3 ml-auto blog-sidebar">

                <section class="sidebar-module">
                    <h4>Recent Posts</h4>
                    <ol class="list-unstyled">
                        <li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
                        <li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
                        <li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
                        <li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
                        <li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
                    </ol>
                </section>

                <section class="sidebar-module">
                    <h4>Links</h4>
                    <ol class="list-unstyled">
                        <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
                        <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
                        <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
                    </ol>
                </section>

            </aside>

        </div> <!-- /.row -->
    </div> <!-- /.container -->

    <footer class="blog-footer">
        <p dir="auto">
            Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
        </p>
        <p>
            <a href="#">Back to top</a>
        </p>
    </footer>

</body>

</html>