Add notes for 2021-11-08

This commit is contained in:
2021-11-09 06:29:52 +02:00
parent b3df4ff58f
commit 9afe5c13f9
110 changed files with 1827 additions and 1737 deletions

@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update &amp;&amp; apt dist-upgrade
# apt autoremove &amp;&amp; apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&amp;&amp;</span> apt dist-upgrade
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
# check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span>
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# dpkg -C
# dpkg -l &gt; 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
# sed -i <span style="color:#e6db74">&#39;s/bionic/focal/&#39;</span> /etc/apt/sources.list.d/*.list
# <span style="color:#66d9ef">do</span>-release-upgrade
</code></pre></div><ul>
<li>&hellip; but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt install -f
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
</code></pre></div><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove &amp;&amp; apt autoclean
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</code></pre></div><ul>
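<li>For reference, a minimal Python sketch of the residual-config filter in the <code>dpkg -l | grep -E '^rc' | awk '{print $2}'</code> pipeline above. It only approximates the shell version (it compares the whole status column instead of matching a line prefix), and the sample text and package names are made up for illustration:

```python
def residual_config_packages(dpkg_output: str) -> list:
    """Return package names whose dpkg -l status is "rc" (removed, config files remain)."""
    names = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # The first column of dpkg -l is the status; the second is the package name
        if len(fields) >= 2 and fields[0] == "rc":
            names.append(fields[1])
    return names


# Made-up sample in the column layout that dpkg -l prints (hypothetical packages)
sample = """\
ii  bash          5.0-6ubuntu1.1  amd64  GNU Bourne Again SHell
rc  libfoo1       1.2-3           amd64  hypothetical removed library
rc  old-daemon    0.9-1           all    hypothetical removed service
"""
print(residual_config_packages(sample))
```

The returned names are what the original pipeline hands to <code>xargs dpkg -P</code>; the sketch stops short of purging anything.</li>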
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
@ -190,21 +190,21 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
</code></pre></div><ul>
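<li>The same filtering can be sketched in Python, assuming the nginx access log has the client IP as its first whitespace-separated field. The two patterns are copied from the <code>grep</code> calls above; the log lines are made up and use documentation-range addresses:

```python
import re

# Same patterns as the grep -E and grep -v -E calls in the command above
STATUS_PATTERN = re.compile(r" (200|499) ")
BOT_PATTERN = re.compile(r"(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)")


def unique_client_ips(lines):
    """Return the sorted unique first fields of log lines that pass both filters."""
    ips = set()
    for line in lines:
        if not STATUS_PATTERN.search(line):
            continue  # keep only HTTP 200 and 499 responses
        if BOT_PATTERN.search(line):
            continue  # drop known bots and our own hostname
        fields = line.split()
        if fields:
            ips.add(fields[0])
    return sorted(ips)


# Made-up log lines for illustration only
sample = [
    '192.0.2.10 - - [05/Aug/2021:10:00:00 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [05/Aug/2021:10:00:05 +0200] "GET /handle/10568/2 HTTP/1.1" 200 734 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [05/Aug/2021:10:01:00 +0200] "GET / HTTP/1.1" 200 100 "-" "Googlebot/2.1"',
    '203.0.113.5 - - [05/Aug/2021:10:02:00 +0200] "GET /missing HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]
print(unique_client_ips(sample))
```

Counting the result with <code>len()</code> corresponds to the <code>wc -l</code> step.</li>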
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
</code></pre></div><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre></div><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don&rsquo;t see them logging in at all, only scraping&hellip; so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
<ul>
@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&#34;
</code></pre></div><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre></div><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots</li>
@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre></div><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more, but those are most of the ones with over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
@ -267,17 +267,17 @@ Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
Total number of bot hits purged: 90485
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485
</code></pre></div><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
Total number of hits from bots: 4492
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492
</code></pre></div><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata in dc.subject (57) should be in dcterms.subject (187)</li>
@ -289,8 +289,8 @@ Total number of hits from bots: 4492
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]&#39;</span> /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre></div><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
@ -303,20 +303,20 @@ Total number of hits from bots: 4492
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;cg.issn[en_US]&#39;</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">&#39;^[0-9]{4}&#39;</span> | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
</code></pre></div><ul>
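<li>The <code>^[0-9]{4}</code> pattern above only checks that a value starts with four digits. A stricter check could be sketched like this, assuming the standard ISSN layout of <code>NNNN-NNNC</code> with a modulus 11 check digit; the function name is hypothetical and not part of the lookup scripts:

```python
import re

ISSN_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{3}[0-9X]$")


def is_valid_issn(issn: str) -> bool:
    """Check the NNNN-NNNC layout and the modulus 11 check digit."""
    issn = issn.strip().upper()
    if not ISSN_PATTERN.match(issn):
        return False
    digits = issn.replace("-", "")
    # Weights 8 down to 2 apply to the first seven digits
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


for candidate in ["0378-5955", "2434-561X", "0378-5954", "1234"]:
    print(candidate, is_valid_issn(candidate))
```

Filtering the list this way before the lookups would keep obviously malformed values out of the Sherpa Romeo and Crossref requests.</li>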
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sed -i <span style="color:#e6db74">&#39;1s/journal title/sherpa romeo journal title/&#39;</span> /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i <span style="color:#e6db74">&#39;1s/journal title/crossref journal title/&#39;</span> /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
</code></pre></div><ul>
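<li>A rough Python equivalent of the join above, assuming both files share an <code>issn</code> column and that an inner join is wanted (which I believe is what <code>csvjoin</code> does by default). The journal rows are made up for illustration:

```python
import csv
import io


def join_on_issn(sherpa_csv: str, crossref_csv: str) -> list:
    """Inner join two CSV texts on their "issn" column."""
    crossref_rows = {}
    for row in csv.DictReader(io.StringIO(crossref_csv)):
        crossref_rows.setdefault(row["issn"], []).append(row)

    joined = []
    for row in csv.DictReader(io.StringIO(sherpa_csv)):
        for match in crossref_rows.get(row["issn"], []):
            merged = dict(row)
            # Copy every crossref column except the shared join key
            merged.update({k: v for k, v in match.items() if k != "issn"})
            joined.append(merged)
    return joined


# Made-up rows; the headers mirror the renamed columns from the sed commands
sherpa = "issn,sherpa romeo journal title\n0378-5955,Example Journal A\n2049-3630,Example Journal B\n"
crossref = "issn,crossref journal title\n0378-5955,Example Journal A\n2434-561X,Example Journal C\n"

for row in join_on_issn(sherpa, crossref):
    print(row)
```

Only ISSNs present in both inputs survive, so journals that one service does not know about drop out of the joined file.</li>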
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,&quot;same&quot;,&quot;different&quot;)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(cells[&#39;sherpa romeo journal title&#39;].value == cells[&#39;crossref journal title&#39;].value,&#34;same&#34;,&#34;different&#34;)
</code></pre></div><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s-vips.jpg[Q=85,optimize_coding,strip]&#39;</span>
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality <span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre><ul>
</code></pre></div><ul>
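<li>To compare the numbers above side by side, here is a small sketch that parses the <code>%M:%e</code> output, which as I understand GNU time is the maximum resident set size in kilobytes followed by the elapsed seconds. The ImageMagick route takes two <code>convert</code> runs, so I am assuming it is fair to add their times and take the larger of the two memory peaks:

```python
def parse_time_output(text: str):
    """Split a "%M:%e" string into (max RSS in KB, elapsed seconds)."""
    rss, elapsed = text.split(":")
    return int(rss), float(elapsed)


# Measurements copied from the console output above
results = {
    "libvips": ["39004:0.08"],
    "GraphicsMagick": ["40932:0.53"],
    # ImageMagick needs two convert runs, so it has two measurements
    "ImageMagick": ["41724:0.59", "24736:0.04"],
}


def summarize(measurements):
    """Return (peak memory in KB, total elapsed seconds) across all runs."""
    parsed = [parse_time_output(m) for m in measurements]
    peak_kb = max(rss for rss, _ in parsed)
    total_seconds = round(sum(sec for _, sec in parsed), 2)
    return peak_kb, total_seconds


for tool, measurements in results.items():
    peak_kb, total_seconds = summarize(measurements)
    print(f"{tool}: peak {peak_kb} KB, {total_seconds} s")
```

This is a single run per tool on one PDF, so the comparison is only indicative.</li>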
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory&hellip; I should do more tests!</li>
@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c <span style="color:#e6db74">&#39;sherpa romeo journal title&#39;</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre><ul>
</code></pre></div><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
</code></pre></div><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don&rsquo;t match, so I&rsquo;d have to go check many of them manually before selecting a match or fixing them&hellip;
<ul>
<li>I think it&rsquo;s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
@ -421,10 +421,10 @@ COPY 3245
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451
</code></pre></div><ul>
<li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman
<ul>
<li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li>
@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 484
</code></pre><ul>
</code></pre></div><ul>
<li>Also update some DOIs using the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE 469
</code></pre><ul>
</code></pre></div><ul>
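<li>The rewrite in the SQL above can be sketched as a plain prefix replacement in Python. This is an illustration of the intended transformation, not what PostgreSQL runs, and the DOI in the example is made up:

```python
OLD_PREFIX = "https://dx.doi.org"
NEW_PREFIX = "https://doi.org"


def normalize_doi_url(value: str) -> str:
    """Rewrite a leading https://dx.doi.org to https://doi.org, leaving other values alone."""
    if value.startswith(OLD_PREFIX):
        return NEW_PREFIX + value[len(OLD_PREFIX):]
    return value


# Made-up DOI for illustration only
print(normalize_doi_url("https://dx.doi.org/10.1000/example"))
```

Like the <code>LIKE 'https://dx.doi.org%'</code> condition, this only touches values that start with the old HTTPS prefix, so <code>http://</code> variants are left as they are.</li>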
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 322m16.917s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 322m16.917s
user 226m43.121s
sys 3m17.469s
</code></pre><ul>
</code></pre></div><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
-H 'Content-Type: application/json' \
-d '{
&quot;size&quot;: 10,
&quot;query&quot;: {
&quot;bool&quot;: {
&quot;filter&quot;: {
&quot;term&quot;: {
&quot;repo.keyword&quot;: &quot;CGSpace&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search?scroll=1d&#39;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> -H &#39;Content-Type: application/json&#39; \
-d &#39;{
&#34;size&#34;: 10,
&#34;query&#34;: {
&#34;bool&#34;: {
&#34;filter&#34;: {
&#34;term&#34;: {
&#34;repo.keyword&#34;: &#34;CGSpace&#34;
}
}
}
}
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
</code></pre><ul>
}&#39;
$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ==&#39;</span>
</code></pre></div><ul>
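<li>A minimal Python sketch of the two <code>curl</code> calls above, using only the standard library. The helper names are hypothetical, and the sketch only builds the requests; it does not send them or read the scroll ID from a response, since the response format is not shown in these notes:

```python
import json
import urllib.request

API_URL = "https://cgspace.cgiar.org/explorer/api/search"


def first_page_request(repo: str, size: int = 10, scroll: str = "1d") -> urllib.request.Request:
    """Build the initial search request, which also asks for a scroll context."""
    body = {
        "size": size,
        "query": {"bool": {"filter": {"term": {"repo.keyword": repo}}}},
    }
    return urllib.request.Request(
        f"{API_URL}?scroll={scroll}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def next_page_request(scroll_id: str) -> urllib.request.Request:
    """Build a follow-up request that carries only the scroll ID and no body."""
    return urllib.request.Request(f"{API_URL}/scroll/{scroll_id}", method="POST")


request = first_page_request("CGSpace")
print(request.get_method(), request.full_url)
```

Passing either request to <code>urllib.request.urlopen()</code> would perform the call.</li>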
<li>This uses the Elasticsearch scroll ID to page through results
<ul>
<li>The second query doesn&rsquo;t need the request body because it is saved for 1 day as part of the first request</li>
@ -525,46 +525,46 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
</code></pre><ul>
</code></pre></div><ul>
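<li>The extraction above can be sketched in Python with the same loose pattern as the <code>grep -oE</code> call. The XML snippet is a made-up stand-in for the controlled vocabulary file, whose real layout is not shown here; the two identifiers are taken from the CSV further down in these notes:

```python
import re

# Same loose pattern as the grep -oE above: four groups of four characters
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts: str) -> list:
    """Return the sorted unique ORCID-like identifiers found in the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Hypothetical inputs standing in for the XML vocabulary and the text file
vocabulary_xml = '<node id="Eric Rahn: 0000-0001-6280-7430" label="Eric Rahn: 0000-0001-6280-7430"/>'
extra_list = "Boaz Waswa: 0000-0002-0066-0215\nEric Rahn: 0000-0001-6280-7430\n"

print(unique_orcids(vocabulary_xml, extra_list))
```

The pattern accepts any upper-case letter, which is more permissive than real ORCID iDs, so it matches the shell command but is not a validator.</li>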
<li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre></div><ul>
<li>Tag existing items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Chege, Christine G. Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Chege, Christine Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Kiria, C.&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Kinyua, Ivy&quot;,&quot;Ivy Kinyua :0000-0002-1978-8833&quot;
&quot;Rahn, E.&quot;,&quot;Eric Rahn: 0000-0001-6280-7430&quot;
&quot;Rahn, Eric&quot;,&quot;Eric Rahn: 0000-0001-6280-7430&quot;
&quot;Jager M.&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Jager, M.&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Jager, Matthias&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Waswa, Boaz&quot;,&quot;Boaz Waswa: 0000-0002-0066-0215&quot;
&quot;Waswa, Boaz S.&quot;,&quot;Boaz Waswa: 0000-0002-0066-0215&quot;
&quot;Rivera, Tatiana&quot;,&quot;Tatiana Rivera: 0000-0003-4876-5873&quot;
&quot;Andrade, Robert&quot;,&quot;Robert Andrade: 0000-0002-5764-3854&quot;
&quot;Ceccarelli, Viviana&quot;,&quot;Viviana Ceccarelli: 0000-0003-2160-9483&quot;
&quot;Ceccarellia, Viviana&quot;,&quot;Viviana Ceccarelli: 0000-0003-2160-9483&quot;
&quot;Nyawira, Sylvia&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Nyawira, Sylvia S.&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Nyawira, Sylvia Sarah&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Groot, J.C.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, J.C.J.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, Jeroen C.J.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, Jeroen CJ&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Abera, W.&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Abera, Wuletawu&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Kanyenga Lubobo, Antoine&quot;,&quot;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&quot;
&quot;Lubobo Antoine, Kanyenga&quot;,&quot;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-08-29">2021-08-29</h2>
&#34;Chege, Christine G. Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Chege, Christine Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kiria, C.&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kinyua, Ivy&#34;,&#34;Ivy Kinyua :0000-0002-1978-8833&#34;
&#34;Rahn, E.&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Rahn, Eric&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Jager M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, Matthias&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Waswa, Boaz&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Waswa, Boaz S.&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Rivera, Tatiana&#34;,&#34;Tatiana Rivera: 0000-0003-4876-5873&#34;
&#34;Andrade, Robert&#34;,&#34;Robert Andrade: 0000-0002-5764-3854&#34;
&#34;Ceccarelli, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Ceccarellia, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Nyawira, Sylvia&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia S.&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia Sarah&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Groot, J.C.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, J.C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen CJ&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Abera, W.&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Kanyenga Lubobo, Antoine&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
&#34;Lubobo Antoine, Kanyenga&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-08-29">2021-08-29</h2>
<ul>
<li>Run a full harvest on AReS</li>
<li>Also do more work the past few days on OpenRXV