<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="August, 2021" />
<meta property="og:description" content="2021-08-01

Update Docker images on AReS server (linode20) and reboot the server:

# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull

I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-08/" />
<meta property="article:published_time" content="2021-08-01T09:01:07+03:00" />
<meta property="article:modified_time" content="2021-08-09T17:34:13+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2021"/>
<meta name="twitter:description" content="2021-08-01

Update Docker images on AReS server (linode20) and reboot the server:

# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull

I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.87.0" />
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "August, 2021",
  "url": "https://alanorth.github.io/cgspace-notes/2021-08/",
  "wordCount": "2059",
  "datePublished": "2021-08-01T09:01:07+03:00",
  "dateModified": "2021-08-09T17:34:13+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-08/">
<title>August, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk+S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-08/">August, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-08-01T09:01:07+03:00">Sun Aug 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-08-01">2021-08-01</h2>
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First I ran all pending updates, took some backups, checked for broken packages, and rebooted, then switched the apt sources in <code>/etc/apt/sources.list.d</code> from bionic to focal and started the release upgrade:</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt update && apt dist-upgrade
# apt autoremove && apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
<li>… but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove && apt autoclean
</code></pre><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
<li>Start a fresh harvesting on AReS (linode20)</li>
</ul>
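<ul>
<li>The residual-config purge used before and after the upgrade could be wrapped in a small helper. This is only a sketch: the name <code>residual_packages</code> is made up for illustration, and it assumes the usual <code>dpkg -l</code> layout where the first column is the package state and the second is the package name:</li>
</ul>
<pre><code class="language-bash" data-lang="bash"># Hypothetical helper (not part of dpkg): read `dpkg -l` output on stdin and
# print the names of packages in state "rc" (removed, config files remain).
residual_packages() {
    awk '$1 == "rc" { print $2 }'
}

# Usage sketch, left commented out because it needs root and a Debian system.
# The -r flag (a GNU xargs extension) skips dpkg -P when nothing matches.
# dpkg -l | residual_packages | xargs -r dpkg -P
</code></pre>
<ul>
<li>Matching the first field exactly in awk replaces the separate <code>grep</code>, and <code>xargs -r</code> avoids calling <code>dpkg -P</code> with an empty package list when there is nothing left to purge</li>
</ul>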
<h2 id="2021-08-02">2021-08-02</h2>
<ul>
<li>Help Udana with OAI validation on CGSpace
<ul>
<li>He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2</li>
<li>Now it seems to be verified (all green): <a href="https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85">https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85</a></li>
<li>We are listed in the OpenArchives list of databases conforming to OAI 2.0</li>
</ul>
</li>
</ul>
<h2 id="2021-08-03">2021-08-03</h2>
<ul>
<li>Run fresh re-harvest on AReS</li>
</ul>
<h2 id="2021-08-05">2021-08-05</h2>
<ul>
<li>Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall’s List)
<ul>
<li>We agreed to unmap it from RTB’s collection for now, and I asked Peter and Abenet for advice on what to do in the future</li>
</ul>
</li>
<li>A developer from the Alliance asked for access to the CGSpace database so they can build an integration with PowerBI
<ul>
<li>I told them we don’t allow direct database access, and that it would be tricky anyways (that’s what APIs are for!)</li>
</ul>
</li>
<li>I’m curious if there are still any requests coming in to CGSpace from the abusive Russian networks
<ul>
<li>I extracted all the unique IPs that nginx processed in the last week:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, I see that no IPs from those networks are coming in now:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
<ul>
<li>I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:</li>
<li>WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific</li>
<li>WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea</li>
</ul>
</li>
<li>Looking at the July Solr statistics to find the top IPs and user agents, looking for anything strange
<ul>
<li>35.174.144.154 made 11,000 requests last month with the following user agent:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages; I will purge their hits too
<ul>
<li>Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so</li>
<li>Same deal with 93.158.90.91, also in Sweden</li>
</ul>
</li>
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
</code></pre><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots</li>
</ul>
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more but that’s most of them over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
Purging 24863 hits from 3.225.28.105 in statistics
Purging 2988 hits from 93.158.90.91 in statistics
Purging 2497 hits from 61.143.40.50 in statistics
Purging 13866 hits from 159.138.131.15 in statistics
Purging 2721 hits from 95.87.154.12 in statistics
Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics

Total number of bot hits purged: 90485
</code></pre><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics

Total number of hits from bots: 4492
</code></pre><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata values in dc.subject (57) should be in dcterms.subject (187)</li>
<li>Twelve metadata values in cg.title.journal (202) should be in cg.journal (251)</li>
<li>Three dc.identifier.isbn (20) should be in cg.isbn (252)</li>
<li>Three dc.identifier.issn (21) should be in cg.issn (253)</li>
<li>I re-ran the <code>migrate-fields.sh</code> script on CGSpace</li>
</ul>
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
<li>I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…</li>
<li>In total it was 1,195 changes to ISSN and ISBN metadata fields</li>
</ul>
</li>
</ul>
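<ul>
<li>For the ISSN part of the cleanup above, the check digit offers a quick sanity test before importing changes. A minimal sketch in POSIX shell, assuming the standard ISSN rule (weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10); <code>valid_issn</code> is a hypothetical helper, not one of the scripts used above, and it does not cover ISBNs:</li>
</ul>
<pre><code class="language-bash" data-lang="bash"># Hypothetical helper: succeed if $1 is an ISSN with a correct check digit.
# Spaces and hyphens are ignored, and a lowercase x is accepted as X.
valid_issn() {
    issn=$(printf '%s' "$1" | tr -d ' -' | tr x X)
    # Exactly seven digits followed by a digit or X
    case "$issn" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9X]) ;;
        *) return 1 ;;
    esac
    sum=0
    weight=8
    digits=${issn%?}                  # the first seven digits
    while [ -n "$digits" ]; do
        d=${digits%"${digits#?}"}     # first remaining character
        digits=${digits#?}
        sum=$((sum + d * weight))
        weight=$((weight - 1))
    done
    check=$(( (11 - sum % 11) % 11 ))
    [ "$check" -eq 10 ] && check=X
    # Compare the computed check digit with the eighth character
    [ "${issn#???????}" = "$check" ]
}

valid_issn 0003-1305 && echo "0003-1305 has a valid check digit"
</code></pre>
<ul>
<li>This only checks the format and the check digit, not whether the ISSN is actually registered anywhere</li>
</ul>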
<h2 id="2021-08-09">2021-08-09</h2>
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
</code></pre><ul>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
</ul>
</li>
<li>Convert my <code>generate-thumbnails.py</code> script to use libvips instead of GraphicsMagick
<ul>
<li>It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs</li>
<li>One drawback is that libvips uses Poppler instead of GraphicsMagick, which apparently means that it can’t work in CMYK</li>
<li>I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…</li>
<li>Perhaps this is not a problem after all, see this PR from 2019: <a href="https://github.com/libvips/libvips/pull/1196">https://github.com/libvips/libvips/pull/1196</a></li>
</ul>
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre><ul>
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory… I should do more tests!</li>
<li>I wonder if I can try to use these <a href="https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java">unofficial Java bindings</a> in DSpace</li>
<li>The authors of the JVips project wrote a nice blog post about libvips performance: <a href="https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55">https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55</a></li>
<li>Ouch, JVips is Java 8 only as far as I can tell… that works now, but it’s a non-starter going forward</li>
</ul>
</li>
</ul>
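<ul>
<li>To put the measurements above side by side, here is a rough sketch that expresses each tool’s elapsed time and peak memory relative to libvips. The function name <code>relative_cost</code> and the input format (a label followed by the <code>%M:%e</code> value) are assumptions for illustration; for ImageMagick the two steps’ times are added (0.59 + 0.04 = 0.63 s) and the larger of the two peak memory figures (41724 KB) is used:</li>
</ul>
<pre><code class="language-bash" data-lang="bash"># Hypothetical helper: read "label maxrss_kb:elapsed_s" lines on stdin and
# print each line's time and memory as a multiple of the first line's values.
relative_cost() {
    awk '{
        split($2, m, ":")                       # m[1] = peak KB, m[2] = seconds
        if (NR == 1) { base_kb = m[1]; base_s = m[2] }
        printf "%s %.1fx time, %.2fx memory\n", $1, m[2] / base_s, m[1] / base_kb
    }'
}

printf '%s\n' \
    'libvips 39004:0.08' \
    'GraphicsMagick 40932:0.53' \
    'ImageMagick 41724:0.63' | relative_cost
</code></pre>
<ul>
<li>On this single PDF the time difference is large (GraphicsMagick and ImageMagick take roughly 6.6 and 7.9 times as long as libvips), while the peak memory difference is only about 5 to 7 percent, so timings on more files would be needed before drawing conclusions</li>
</ul>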
<h2 id="2021-08-11">2021-08-11</h2>
<ul>
<li>Peter got back to me about the journal title cleanup
<ul>
<li>From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref’s</li>
<li>Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter’s selections for where Sherpa Romeo and Crossref differed:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
<ul>
<li>I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
<li>Or instead of doing it via SQL I could use CSV and parse the values there…</li>
</ul>
</li>
<li>A few more issues:
<ul>
<li>Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org’s web search (their API is invite only)</li>
<li>Some titles are different across all three datasets, for example ISSN 0003-1305:
<ul>
<li><a href="https://portal.issn.org/resource/ISSN/0003-1305">According to ISSN.org</a> this is “The American statistician”</li>
<li><a href="https://v2.sherpa.ac.uk/id/publication/20807">According to Sherpa Romeo</a> this is “American Statistician”</li>
<li><a href="https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician">According to Crossref</a> this is “The American Statistician”</li>
</ul>
</li>
</ul>
</li>
<li>I also realized that our previous controlled vocabulary came from CGSpace’s top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
<ul>
<li>Now I went back and merged the previous with the new, and manually removed duplicates (sigh)</li>
<li>I requested access to the issn.org OAI-PMH API so I can use their registry…</li>
</ul>
</li>
</ul>
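<ul>
<li>Since some titles differ only in case or a leading article, as with ISSN 0003-1305 above, one possible way to reduce the manual checking is to compare titles on a normalized key. This is a hypothetical approach rather than what was done here, and <code>title_key</code> is an invented name:</li>
</ul>
<pre><code class="language-bash" data-lang="bash"># Hypothetical helper: lowercase a journal title, squeeze runs of spaces,
# and drop a leading "the " so that near-identical titles compare equal.
title_key() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' | sed 's/^the //'
}

title_key "The American statistician"   # prints: american statistician
title_key "American Statistician"       # prints: american statistician
</code></pre>
<ul>
<li>The key is only useful for matching; the preferred display form of each title still has to be chosen by a person, for example from Peter’s selections</li>
</ul>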
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>