mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-11-08
@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update && apt dist-upgrade
# apt autoremove && apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&&</span> apt dist-upgrade
# apt autoremove <span style="color:#f92672">&&</span> apt autoclean
# check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge
# dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span>
# dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span> | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
# sed -i <span style="color:#e6db74">'s/bionic/focal/'</span> /etc/apt/sources.list.d/*.list
# <span style="color:#66d9ef">do</span>-release-upgrade
</code></pre></div><ul>
<li>… but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt install -f
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
</code></pre></div><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove && apt autoclean
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># dpkg -l | grep -E <span style="color:#e6db74">'^rc'</span> | awk <span style="color:#e6db74">'{print $2}'</span> | xargs dpkg -P
# apt autoremove <span style="color:#f92672">&&</span> apt autoclean
</code></pre></div><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
@@ -190,21 +190,21 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">" (200|499) "</span> | grep -v -E <span style="color:#e6db74">"(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)"</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">'^(49453|46844|206485|62282|36352|35913|35624|8100)$'</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
</code></pre></div><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre></div><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
<ul>
@@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
</code></pre></div><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre></div><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots</li>
@@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre></div><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more, but those are most of the ones with over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
@@ -267,17 +267,17 @@ Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
Total number of bot hits purged: 90485
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485
</code></pre></div><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
Total number of hits from bots: 4492
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492
</code></pre></div><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata values in dc.subject (57) should be in dcterms.subject (187)</li>
@@ -289,8 +289,8 @@ Total number of hits from bots: 4492
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]'</span> /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
</code></pre></div><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
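The OpenRefine cleanup described above (removing spaces, fixing invalid multi-value separators, and keeping only the identifiers themselves) could be sketched as a small Python helper. This is an illustration only, with regexes and a function name of my own choosing, not part of the original workflow:

```python
import re

# ISSN: four digits, hyphen, three digits, and a check digit that may be X.
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')
# Rough ISBN-10/13 shape; hyphens optional, X check digit allowed for ISBN-10.
ISBN_RE = re.compile(r'\b(?:97[89][-\d]{10,14}|\d[-\d]{8,11}[\dXx])\b')

def extract_identifiers(value: str) -> list[str]:
    """Keep only ISSN/ISBN-looking tokens from a messy metadata value
    that may use invalid multi-value separators like ';' or '|'."""
    # Normalize common bad separators to the DSpace '||' separator.
    normalized = re.sub(r'\s*[;|,]+\s*', '||', value.strip())
    identifiers = []
    for token in normalized.split('||'):
        token = token.replace(' ', '')
        match = ISSN_RE.search(token) or ISBN_RE.search(token)
        if match and match.group(0) not in identifiers:
            identifiers.append(match.group(0))
    return identifiers
```

Running such a helper over each `en_US` column would approximate the manual facet-and-transform steps in OpenRefine.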
@@ -303,20 +303,20 @@ Total number of hits from bots: 4492
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'cg.issn[en_US]'</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">'^[0-9]{4}'</span> | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
</code></pre></div><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sed -i <span style="color:#e6db74">'1s/journal title/sherpa romeo journal title/'</span> /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i <span style="color:#e6db74">'1s/journal title/crossref journal title/'</span> /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
</code></pre></div><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
</code></pre></div><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
@@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">'%s-vips.jpg[Q=85,optimize_coding,strip]'</span>
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality <span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre><ul>
</code></pre></div><ul>
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory… I should do more tests!</li>
@@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c <span style="color:#e6db74">'sherpa romeo journal title'</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre><ul>
</code></pre></div><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
</code></pre></div><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
<ul>
<li>I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
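A script along the lines planned above could start from the Crossref REST API, which returns a work's journal title and ISSNs inside its `message` object. This is a sketch under my own assumptions (function names and structure are illustrative, not the actual ilri script):

```python
import json
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"

def issns_from_work(message: dict) -> tuple[str, list[str]]:
    """Pull the journal title and ISSN list out of a Crossref work message."""
    journal = (message.get("container-title") or [""])[0]
    return journal, message.get("ISSN", [])

def lookup_doi(doi: str) -> tuple[str, list[str]]:
    """Fetch one DOI from the Crossref REST API and extract its ISSNs."""
    with urllib.request.urlopen(CROSSREF_API + doi) as response:
        work = json.load(response)["message"]
    return issns_from_work(work)
```

The lookup results could then be written back per item with a CSV import, avoiding the manual reconcile-csv matching.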
@@ -421,10 +421,10 @@ COPY 3245
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451
</code></pre></div><ul>
<li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman
<ul>
<li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li>
@@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
</code></pre><ul>
</code></pre></div><ul>
<li>Also update some DOIs using the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
</code></pre><ul>
</code></pre></div><ul>
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real    322m16.917s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real    322m16.917s
user    226m43.121s
sys     3m17.469s
</code></pre><ul>
</code></pre></div><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
-H 'Content-Type: application/json' \
-d '{
"size": 10,
"query": {
"bool": {
"filter": {
"term": {
"repo.keyword": "CGSpace"
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X POST <span style="color:#e6db74">'https://cgspace.cgiar.org/explorer/api/search?scroll=1d'</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> -H 'Content-Type: application/json' \
-d '{
"size": 10,
"query": {
"bool": {
"filter": {
"term": {
"repo.keyword": "CGSpace"
}
}
}
}
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
</code></pre><ul>
}'
$ curl -X POST <span style="color:#e6db74">'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='</span>
</code></pre></div><ul>
<li>This uses the Elasticsearch scroll ID to page through results
<ul>
<li>The second query doesn’t need the request body because it is saved for 1 day as part of the first request</li>
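The scroll flow described above can be sketched as a simple paging loop. This is a generic illustration of the standard Elasticsearch scroll pattern (`_scroll_id` plus `hits.hits` in each response), not code from OpenRXV; `fetch_scroll` stands in for a POST to the scroll endpoint shown in the curl example:

```python
def scroll_pages(first_page: dict, fetch_scroll):
    """Yield batches of hits from an Elasticsearch-style scroll response,
    following the scroll ID until a page comes back empty."""
    page = first_page
    while page["hits"]["hits"]:
        yield page["hits"]["hits"]
        # fetch_scroll would POST to .../api/search/scroll/<scroll_id>
        page = fetch_scroll(page["_scroll_id"])
```

Because the scroll context is kept server-side (here for `1d`), only the scroll ID needs to be sent on subsequent requests.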
@@ -525,46 +525,46 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
</code></pre><ul>
</code></pre></div><ul>
<li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre></div><ul>
<li>Tag existing items from the Alliance’s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
"Jager M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-08-29">2021-08-29</h2>
"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
"Jager M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span>
</code></pre></div><h2 id="2021-08-29">2021-08-29</h2>
<ul>
<li>Run a full harvest on AReS</li>
<li>Also do more work on OpenRXV over the past few days