<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2020" />
<meta property="og:description" content="2020-01-06
Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
Last week Altmetric responded about the item that had a lower score than its DOI
The score is now linked to the DOI
Another item that had the same problem in 2019 is now also linked to the score for its DOI
Another item that had the same problem in 2019 has also been fixed
2020-01-07
Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
The DOI has a score of 259, but the Handle has no score at all
I tweeted the CGSpace repository link
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-01/" />
<meta property="article:published_time" content="2020-01-06T10:48:30+02:00" />
<meta property="article:modified_time" content="2020-01-16T12:49:21+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2020"/>
<meta name="twitter:description" content="2020-01-06
Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
Last week Altmetric responded about the item that had a lower score than its DOI
The score is now linked to the DOI
Another item that had the same problem in 2019 is now also linked to the score for its DOI
Another item that had the same problem in 2019 has also been fixed
2020-01-07
Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
The DOI has a score of 259, but the Handle has no score at all
I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.62.2" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/",
"wordCount": "971",
"datePublished": "2020-01-06T10:48:30+02:00",
"dateModified": "2020-01-16T12:49:21+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-01/">
<title>January, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-01/">January, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-01-06T10:48:30&#43;02:00">Mon Jan 06, 2020</time> by Alan Orth in
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-01-06">2020-01-06</h2>
<ul>
<li>Open <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">a ticket</a> with Atmire to request a quote for the upgrade to DSpace 6</li>
<li>Last week Altmetric responded about the <a href="https://hdl.handle.net/10568/97087">item</a> that had a lower score than its DOI
<ul>
<li>The score is now linked to the DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/91278">item</a> that had the same problem in 2019 is now also linked to the score for its DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/81236">item</a> that had the same problem in 2019 has also been fixed</li>
</ul>
</li>
</ul>
<h2 id="2020-01-07">2020-01-07</h2>
<ul>
<li>Peter Ballantyne highlighted one more WLE <a href="https://hdl.handle.net/10568/101286">item</a> that is missing the Altmetric score that its DOI has
<ul>
<li>The DOI has a score of 259, but the Handle has no score at all</li>
<li>I <a href="https://twitter.com/mralanorth/status/1214471427157626881">tweeted</a> the CGSpace repository link</li>
</ul>
</li>
</ul>
<h2 id="2020-01-08">2020-01-08</h2>
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
</ul>
<pre><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
</code></pre><ul>
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
</ul>
<pre><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
5227: &quot;Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 &quot;
00000001: 4f O
00000002: 75 u
00000003: 65 e
00000004: cc .
00000005: 81 .
00000006: 64 d
00000007: 72 r
</code></pre><ul>
<li><del>According to the blog post linked above the troublesome character is probably the &ldquo;High Octet Preset&rdquo; (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
</ul>
<pre><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
</code></pre><ul>
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database&hellip;</li>
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like &ldquo;ž&rdquo; and &ldquo;é&rdquo; that <em>are</em> legitimate UTF-8 characters</li>
<li>Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings</li>
<li>I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me</li>
<li>Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the <code>5_x-prod</code> (no CG Core v2) branch</li>
</ul>
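<p>The failing character can also be located without iconv. What follows is a minimal sketch in Python, assuming the file has already been read as UTF-8 text. The <code>find_unencodable</code> helper and the sample rows are hypothetical and not part of the original workflow:</p>

```python
def find_unencodable(lines, encoding="windows-1252"):
    """Return (line_number, character, code_point) for every character
    that cannot be represented in the target encoding."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        for char in line:
            try:
                char.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_number, char, f"U+{ord(char):04X}"))
    return problems


# Hypothetical sample rows. The second one starts with the same sequence seen
# in the xxd output: "e" followed by a combining acute accent (bytes cc 81).
sample = ['"Doe, Jane",10', '"Oue\u0301dr",3']

for line_number, char, code_point in find_unencodable(sample):
    print(f"line {line_number}: {code_point}")
```

<p>On the sample this reports the combining acute accent (U+0301) on line 2, which has no windows-1252 equivalent, whereas a precomposed é (U+00E9) encodes without error.</p>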
<h2 id="2020-01-14">2020-01-14</h2>
<ul>
<li>I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
<ul>
<li>I manually ran it on the server as the DSpace user and it said &ldquo;Moving: 51633080 into core statistics-2019&rdquo;</li>
<li>After a few hours it died with the same error that I had seen in the log from the first run:</li>
</ul>
</li>
</ul>
<pre><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
<li>I am not sure how I will fix that shard&hellip;</li>
<li>I discovered a very interesting tool called <a href="https://github.com/LuminosoInsight/python-ftfy">ftfy</a> that attempts to fix errors in UTF-8
<ul>
<li>I'm curious to start checking input files with this to see what it highlights</li>
<li>I ran it on the authors file from last week and it converted characters like those with Spanish accents from decomposed sequences (a base letter followed by a combining accent, like &ldquo;e&rdquo; + U+0301) to single precomposed characters (é, U+00E9, which vim calls a digraph). Vim identifies the two forms as:</li>
<li><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401</code></li>
<li><code>&lt;é&gt; 233, Hex 00e9, Oct 351, Digr e'</code></li>
</ul>
</li>
<li>Ah hah! We need to be <a href="https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html">normalizing characters into their canonical forms</a>!
<ul>
<li>In Python 3.8 we can even <a href="https://docs.python.org/3/library/unicodedata.html">check if the string is normalized using the <code>unicodedata</code> library</a>:</li>
</ul>
</li>
</ul>
<pre><code>In [7]: unicodedata.is_normalized('NFC', 'e\u0301')
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', '\u00e9')
Out[8]: True
</code></pre><h2 id="2020-01-15">2020-01-15</h2>
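<p>For reference, the normalization step discovered on 2020-01-14 can be sketched with the standard library alone. This is an illustration only, not the csv-metadata-quality implementation, and the <code>to_nfc</code> helper name is hypothetical:</p>

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (U+0301)
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (U+00E9)


def to_nfc(value):
    """Return the NFC (canonical composed) form of a string."""
    return unicodedata.normalize("NFC", value)


# The two spellings render identically but compare unequal until normalized.
print(decomposed == precomposed)                 # False
print(to_nfc(decomposed) == precomposed)         # True
print(len(decomposed), len(to_nfc(decomposed)))  # 2 1
```

<p>Unlike <code>unicodedata.is_normalized</code>, which needs Python 3.8, <code>unicodedata.normalize</code> is also available in older Python 3 releases.</p>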
<ul>
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.bioversity&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
<ul>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
<ul>
<li>We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months</li>
<li>Sisay uploaded the records to DSpace Test as <a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a></li>
<li>I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data</li>
<li>I corrected one invalid AGROVOC subject</li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
</ul>
</li>
</ul>
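<p>The kinds of problems found in the IITA records (extra whitespace, invalid multi-value separators, duplicates) can be illustrated with a rough sketch. This is not how csv-metadata-quality itself works: the <code>check_rows</code> helper, the column names, and the sample data are hypothetical, it assumes DSpace's default <code>||</code> multi-value separator, and it treats only fully identical rows as duplicates:</p>

```python
import csv
import io

SEPARATOR = "||"  # assumed: DSpace's default multi-value separator


def check_rows(rows):
    """Return a list of (row_number, column, problem) tuples."""
    problems = []
    seen = set()
    for row_number, row in enumerate(rows, start=1):
        # Simplification: a duplicate is a row identical in every column.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((row_number, None, "duplicate row"))
        seen.add(key)
        for column, value in row.items():
            if value != value.strip() or "  " in value:
                problems.append((row_number, column, "extra whitespace"))
            # A lone "|" left after removing "||" suggests a bad separator.
            if "|" in value.replace(SEPARATOR, ""):
                problems.append((row_number, column, "invalid separator"))
    return problems


# Hypothetical sample data, not taken from the IITA records.
sample = io.StringIO(
    "dc.title,dc.subject\n"
    "Maize yields ,MAIZE||YIELDS\n"
    "Cassava pests,CASSAVA|PESTS\n"
    "Cassava pests,CASSAVA|PESTS\n"
)
rows = list(csv.DictReader(sample))
for problem in check_rows(rows):
    print(problem)
```

<p>On the sample this flags trailing whitespace in the first title, a single-pipe separator in the second and third rows, and the third row as a duplicate of the second.</p>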
<h2 id="2020-01-20">2020-01-20</h2>
<ul>
<li>Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
<ul>
<li>I forwarded it to Peter et al for their comment</li>
</ul>
</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>
<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>
<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>