<li>Open <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">a ticket</a> with Atmire to request a quote for the upgrade to DSpace 6</li>
<li>Last week Altmetric responded about the <a href="https://hdl.handle.net/10568/97087">item</a> that had a lower score than its DOI
<ul>
<li>The score is now linked to the DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/91278">item</a> that had the same problem in 2019 is now also linked to the score for its DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/81236">item</a> that had the same problem in 2019 has also been fixed</li>
</ul>
</li>
</ul>
<h2 id="2020-01-07">2020-01-07</h2>
<ul>
<li>Peter Ballantyne highlighted one more WLE <a href="https://hdl.handle.net/10568/101286">item</a> that is missing the Altmetric score that its DOI has
<ul>
<li>The DOI has a score of 259, but the Handle has no score at all</li>
<li>I <a href="https://twitter.com/mralanorth/status/1214471427157626881">tweeted</a> the CGSpace repository link</li>
</ul>
</li>
</ul>
<h2 id="2020-01-08">2020-01-08</h2>
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error; dumping line 5227 of the export one byte per row shows:</li>
</ul>
<pre><code>$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 "
00000001: 4f O
00000002: 75 u
00000003: 65 e
00000004: cc .
00000005: 81 .
00000006: 64 d
00000007: 72 r
</code></pre><ul>
<li><del>According to the blog post linked above the troublesome character is probably the “High Octet Preset” (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…</li>
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like “ž” and “é” that <em>are</em> legitimate UTF-8 characters</li>
<li>Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings</li>
<li>I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me</li>
<li>Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the <code>5_x-prod</code> (no CG Core v2) branch</li>
</ul>
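<ul>
<li>A minimal sketch for inspecting those bytes in Python, assuming the eight bytes shown in the <code>xxd</code> dump above are the start of line 5227; <code>can_encode</code> is a hypothetical helper, not part of any tool mentioned here:</li>
</ul>

```python
import unicodedata

# Bytes copied from the xxd dump above: 22 4f 75 65 cc 81 64 72
raw = bytes.fromhex("22 4f 75 65 cc 81 64 72")

# Decoding succeeds, so this byte sequence is valid UTF-8
text = raw.decode("utf-8")

# Print each code point with its official Unicode name
for char in text:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")


def can_encode(value: str, encoding: str) -> bool:
    """Return True if every character in value exists in the target encoding."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Try the same kind of conversion that failed on the full file
print(can_encode(text, "windows-1252"))
```

<ul>
<li>If this reading is right, the bytes <code>cc 81</code> together decode to U+0301 (combining acute accent) following a plain “e”, and that code point has no windows-1252 equivalent, which would be consistent with the conversion error</li>
</ul>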
<h2 id="2020-01-14">2020-01-14</h2>
<ul>
<li>I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
<ul>
<li>I manually ran it on the server as the DSpace user and it said “Moving: 51633080 into core statistics-2019”</li>
<li>After a few hours it died with the same error that I had seen in the log from the first run:</li>
</ul>
</li>
</ul>
<pre><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
<li>I am not sure how I will fix that shard…</li>
<li>I discovered a very interesting tool called <a href="https://github.com/LuminosoInsight/python-ftfy">ftfy</a> that attempts to fix errors in UTF-8
<ul>
<li>I'm curious to start checking input files with this to see what it highlights</li>
<li>I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:</li>
<li><code>&lt;é&gt; 233, Hex 00e9, Oct 351, Digr e'</code></li>
</ul>
</li>
<li>Ah hah! We need to be <a href="https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html">normalizing characters into their canonical forms</a>!
<ul>
<li>In Python 3.8 we can even <a href="https://docs.python.org/3/library/unicodedata.html">check if the string is normalized using the <code>unicodedata</code> library</a>:</li>
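<li>A minimal sketch of that check, assuming Python 3.8 or later; the two sample strings are illustrative and not taken from the authors file:</li>
</ul>
</li>
</ul>

```python
import unicodedata

# Illustrative strings only: the same visible "é" built two different ways
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # the single code point U+00E9

# is_normalized() was added to unicodedata in Python 3.8
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFC", precomposed))  # True

# Normalizing to NFC composes the pair into the single code point
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)  # True
```

<ul>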
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata.py</code> script:</li>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
<ul>
<li>We had delayed processing them because DSpace Test (linode19) had been used for testing the CG Core v2 implementation for the last few months</li>
<li>Sisay uploaded the records to DSpace Test as <a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a></li>
<li>I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data</li>
<li>I corrected one invalid AGROVOC subject</li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>