mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-22 19:43:24 +01:00
<!DOCTYPE html>
<html lang="en" >

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="April, 2021" />
<meta property="og:description" content="2021-04-01

I wrote a script to query Sherpa’s API for our ISSNs: sherpa-issn-lookup.py

I’m curious to see how the results compare with the results from Crossref yesterday


AReS Explorer was down since this morning, I didn’t see anything in the systemd journal

I simply took everything down with docker-compose and then back up, and then it was OK
Perhaps one of the containers crashed, I should have looked closer but I was in a hurry


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-04/" />
<meta property="article:published_time" content="2021-04-01T09:50:54+03:00" />
<meta property="article:modified_time" content="2021-04-01T09:50:54+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2021"/>
<meta name="twitter:description" content="2021-04-01

I wrote a script to query Sherpa’s API for our ISSNs: sherpa-issn-lookup.py

I’m curious to see how the results compare with the results from Crossref yesterday


AReS Explorer was down since this morning, I didn’t see anything in the systemd journal

I simply took everything down with docker-compose and then back up, and then it was OK
Perhaps one of the containers crashed, I should have looked closer but I was in a hurry


"/>
<meta name="generator" content="Hugo 0.82.0" />

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "April, 2021",
  "url": "https://alanorth.github.io/cgspace-notes/2021-04/",
  "wordCount": "823",
  "datePublished": "2021-04-01T09:50:54+03:00",
  "dateModified": "2021-04-01T09:50:54+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-04/">

<title>April, 2021 | CGSpace Notes</title>

<!-- combined, minified CSS -->

<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">

<!-- minified Font Awesome for SVG icons -->

<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk+S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>

<!-- RSS 2.0 feed -->

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-04/">April, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-04-01T09:50:54+03:00">Thu Apr 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>

</p>
</header>
<h2 id="2021-04-01">2021-04-01</h2>
<ul>
<li>I wrote a script to query Sherpa’s API for our ISSNs: <code>sherpa-issn-lookup.py</code>
<ul>
<li>I’m curious to see how the results compare with the results from Crossref yesterday</li>
</ul>
</li>
<li>AReS Explorer was down since this morning, I didn’t see anything in the systemd journal
<ul>
<li>I simply took everything down with docker-compose and then back up, and then it was OK</li>
<li>Perhaps one of the containers crashed, I should have looked closer but I was in a hurry</li>
</ul>
</li>
</ul>
<h2 id="2021-04-03">2021-04-03</h2>
<ul>
<li>Biruk from ICT contacted me to say that some CGSpace users still can’t log in
<ul>
<li>I guess the CGSpace LDAP bind account is really still locked after last week’s reset</li>
<li>He fixed the account and then I was finally able to bind and query:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
</code></pre><h2 id="2021-04-04">2021-04-04</h2>
<ul>
<li>Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</code></pre><ul>
<li>Then set the <code>openrxv-items-final</code> index to read-only so we can make a backup:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
</code></pre><ul>
<li>Then start a harvest on AReS Explorer</li>
<li>Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
<ul>
<li>He was hitting <a href="https://github.com/ilri/OpenRXV/issues/66">a bug on AReS</a>, and he only needed stats for 2020, whereas AReS currently only gives all-time stats</li>
</ul>
</li>
<li>I cleaned up about 230 ISSNs on CGSpace in OpenRefine
<ul>
<li>I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: <code>isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))</code></li>
<li>Then I applied them on CGSpace with the <code>fix-metadata-values.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
</code></pre><ul>
<li>For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc, but there are still lots of invalid ones which need more manual work:
<ul>
<li>Too few characters</li>
<li>Too many characters</li>
<li>ISBNs</li>
</ul>
</li>
<li>Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) project and assign all workflow steps</li>
</ul>
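<ul>
<li>A minimal Python sketch of how the remaining invalid ISSNs could be sorted into the buckets listed above. This is an illustration, not the OpenRefine workflow used here: the function names are hypothetical, the ISBN guess is only a length heuristic, and the check digit test (the standard ISSN modulus 11 scheme) goes further than the format-only GREL expression:</li>
</ul>

```python
import re

# Four digits, a hyphen, three digits, then a digit or X as check character.
ISSN_RE = re.compile(r"^[0-9]{4}-[0-9]{3}[0-9X]$")


def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    # Weights run from 8 down to 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    return "X" if remainder == 10 else str(remainder)


def classify_issn(value: str) -> str:
    """Return a rough label for a cg.issn value (labels are hypothetical)."""
    candidate = value.strip().upper()
    if ISSN_RE.match(candidate):
        digits = candidate.replace("-", "")
        if issn_check_digit(digits[:7]) == digits[7]:
            return "valid"
        return "bad check digit"
    # Keep only characters that can appear in an ISSN or ISBN.
    compact = re.sub(r"[^0-9X]", "", candidate)
    if len(compact) in (10, 13):
        return "possible ISBN"  # heuristic: ISBN-10 or ISBN-13 length
    if len(compact) < 8:
        return "too few characters"
    if len(compact) > 8:
        return "too many characters"
    return "malformed"  # right number of characters, wrong layout
```

<ul>
<li>For example, <code>classify_issn("1234-567")</code> returns "too few characters", while the placeholder “1234-5678” passes the format test but fails the check digit</li>
</ul>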

<h2 id="2021-04-04-1">2021-04-05</h2>
<ul>
<li>The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...
}
</code></pre><ul>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>, not <code>openrxv-items-temp</code>… I will have to fix that manually</li>
<li>Enrico asked for more information on the RTB stats I gave him yesterday
<ul>
<li>I remembered (again) that we can’t filter Atmire’s CUA stats by date issued</li>
<li>To show, for example, views/downloads in the year 2020 for RTB items issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range</li>
<li>I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
  sed '1d' | \
  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
  csvgrep -c issued -m 2020 | \
  csvcut -c id | \
  sed '1d' | \
  sort | \
  uniq
</code></pre><ul>
<li>So that I remember in the future, this basically does the following:
<ul>
<li>Use csvcut to extract the id and all date issued columns from the CSV</li>
<li>Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)</li>
<li>Use csvsql to concatenate the various date issued columns (coalescing where null)</li>
<li>Use csvgrep to filter items by date issued in 2020</li>
<li>Use csvcut to extract the id column</li>
<li>Use sed to delete the header row</li>
<li>Use sort and uniq to filter out any duplicate IDs (there were three)</li>
</ul>
</li>
<li>Then I have a list of 296 IDs for RTB items issued in 2020</li>
<li>I constructed a JSON file to post to the DSpace Statistics API:</li>
</ul>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-json" data-lang="json">{
    <span style="color:#f92672">"limit"</span>: <span style="color:#ae81ff">100</span>,
    <span style="color:#f92672">"page"</span>: <span style="color:#ae81ff">0</span>,
    <span style="color:#f92672">"dateFrom"</span>: <span style="color:#e6db74">"2020-01-01T00:00:00Z"</span>,
    <span style="color:#f92672">"dateTo"</span>: <span style="color:#e6db74">"2020-12-31T00:00:00Z"</span>,
    <span style="color:#f92672">"items"</span>: [
        <span style="color:#e6db74">"00358715-b70c-4fdd-aa55-730e05ba739e"</span>,
        <span style="color:#e6db74">"004b54bb-f16f-4cec-9fbc-ab6c6345c43d"</span>,
        <span style="color:#e6db74">"02fb7630-d71a-449e-b65d-32b4ea7d6904"</span>,
        <span style="color:#960050;background-color:#1e0010">...</span>
    ]
}
</code></pre></div><ul>
<li>Then I submitted the file three times (changing the page parameter each time):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
</code></pre><ul>
<li>Then I extracted the views and downloads in the most ridiculous way:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
</code></pre><ul>
<li>Out of curiosity I did the same exercise for items issued in 2019 and got the following:
<ul>
<li>Views: 30721</li>
<li>Downloads: 10205</li>
</ul>
</li>
</ul>
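<ul>
<li>A less ridiculous way to total the views and downloads, sketched in Python. It walks each parsed file and adds up every integer stored under the given key, so it assumes nothing about the response layout beyond the <code>views</code> and <code>downloads</code> field names that the grep commands above already rely on; the function names are hypothetical:</li>
</ul>

```python
import json
from pathlib import Path


def sum_field(node, field):
    """Recursively add up every integer stored under `field` in parsed JSON."""
    if isinstance(node, dict):
        total = 0
        for key, value in node.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if key == field and isinstance(value, int) and not isinstance(value, bool):
                total += value
            else:
                total += sum_field(value, field)
        return total
    if isinstance(node, list):
        return sum(sum_field(item, field) for item in node)
    return 0  # strings, numbers, and null contribute nothing


def sum_pages(paths, field):
    """Total `field` across several saved JSON response files."""
    return sum(sum_field(json.loads(Path(p).read_text()), field) for p in paths)
```

<ul>
<li>Calling <code>sum_pages(sorted(glob.glob("/tmp/page*.json")), "views")</code> after <code>import glob</code> should give the same total as the grep pipeline, as long as the files hold the saved responses</li>
</ul>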
<!-- raw HTML omitted -->
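<ul>
<li>For the misplaced <code>openrxv-items</code> alias noted earlier in this entry, a hedged sketch of what the manual fix could look like with Elasticsearch’s <code>_aliases</code> endpoint, which applies the remove and add actions in a single request. The helper names are hypothetical, the base URL is assumed to be the same <code>http://localhost:9200</code> used in the curl commands, and this was not run against AReS:</li>
</ul>

```python
import json
import urllib.request


def build_alias_swap(alias, wrong_index, right_index):
    """Build the request body that moves `alias` from one index to another."""
    return {
        "actions": [
            {"remove": {"index": wrong_index, "alias": alias}},
            {"add": {"index": right_index, "alias": alias}},
        ]
    }


def post_alias_actions(base_url, body):
    """POST the actions to <base_url>/_aliases and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/_aliases",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Body for the case described in these notes; nothing is sent until
# post_alias_actions("http://localhost:9200", swap) is called.
swap = build_alias_swap("openrxv-items", "openrxv-items-temp", "openrxv-items-final")
```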
</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2021-03/">March, 2021</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
<li><a href="/cgspace-notes/2021-02/">February, 2021</a></li>
<li><a href="/cgspace-notes/2021-01/">January, 2021</a></li>
</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>