
<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="January, 2020" />
<meta property="og:description" content="2020-01-06

Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
Last week Altmetric responded about the item that had a lower score than its DOI

The score is now linked to the DOI
Another item that had the same problem in 2019 is now also linked to the score for its DOI
Another item that had the same problem in 2019 has also been fixed


2020-01-07

Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has

The DOI has a score of 259, but the Handle has no score at all
I tweeted the CGSpace repository link


" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-01/" />
<meta property="article:published_time" content="2020-01-06T10:48:30+02:00" />
<meta property="article:modified_time" content="2020-03-12T12:58:21+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2020"/>
<meta name="twitter:description" content="2020-01-06

Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
Last week Altmetric responded about the item that had a lower score than its DOI

The score is now linked to the DOI
Another item that had the same problem in 2019 is now also linked to the score for its DOI
Another item that had the same problem in 2019 has also been fixed


2020-01-07

Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has

The DOI has a score of 259, but the Handle has no score at all
I tweeted the CGSpace repository link


"/>
<meta name="generator" content="Hugo 0.73.0" />


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "January, 2020",
  "url": "https://alanorth.github.io/cgspace-notes/2020-01/",
  "wordCount": "3523",
  "datePublished": "2020-01-06T10:48:30+02:00",
  "dateModified": "2020-03-12T12:58:21+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-01/">

    <title>January, 2020 | CGSpace Notes</title>


    <!-- combined, minified CSS -->

    <link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">


    <!-- minified Font Awesome for SVG icons -->

    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->


  </head>

  <body>


    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>


    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>


    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">


<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-01/">January, 2020</a></h2>
    <p class="blog-post-meta"><time datetime="2020-01-06T10:48:30+02:00">Mon Jan 06, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2020-01-06">2020-01-06</h2>
<ul>
<li>Open <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">a ticket</a> with Atmire to request a quote for the upgrade to DSpace 6</li>
<li>Last week Altmetric responded about the <a href="https://hdl.handle.net/10568/97087">item</a> that had a lower score than its DOI
<ul>
<li>The score is now linked to the DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/91278">item</a> that had the same problem in 2019 is now also linked to the score for its DOI</li>
<li>Another <a href="https://hdl.handle.net/10568/81236">item</a> that had the same problem in 2019 has also been fixed</li>
</ul>
</li>
</ul>
<h2 id="2020-01-07">2020-01-07</h2>
<ul>
<li>Peter Ballantyne highlighted one more WLE <a href="https://hdl.handle.net/10568/101286">item</a> that is missing the Altmetric score that its DOI has
<ul>
<li>The DOI has a score of 259, but the Handle has no score at all</li>
<li>I <a href="https://twitter.com/mralanorth/status/1214471427157626881">tweeted</a> the CGSpace repository link</li>
</ul>
</li>
</ul>
<h2 id="2020-01-08">2020-01-08</h2>
<ul>
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
</ul>
<pre><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
</code></pre><ul>
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
</ul>
<pre><code>$ awk 'END {print NR&quot;: &quot;$0}' /tmp/2020-01-08-authors-windows.csv
5227: &quot;Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  &quot;
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
</code></pre><ul>
<li><del>According to the blog post linked above the troublesome character is probably the &ldquo;High Octet Preset&rdquo; (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
</ul>
<pre><code>&lt;e&gt;  101,  Hex 65,  Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
</code></pre><ul>
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it&rsquo;s stored incorrectly in the database&hellip;</li>
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like &ldquo;ž&rdquo; and &ldquo;é&rdquo; that <em>are</em> legitimate UTF-8 characters</li>
<li>Then there is the issue of Russian, Chinese, etc characters, which are simply not representable in any of those encodings</li>
<li>I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me</li>
<li>Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the <code>5_x-prod</code> (no CG Core v2) branch</li>
</ul>
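<ul>
<li>To see which characters block a conversion like the <code>iconv</code> attempt above, a minimal Python sketch can try to encode each character on its own. The function name and the sample rows are hypothetical, and it assumes that Python's <code>windows-1252</code> codec rejects the same characters that iconv does:</li>
</ul>
<pre><code>def find_unencodable(text, encoding="windows-1252"):
    """Return (line, column, character) for every character that the
    target encoding cannot represent."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            try:
                char.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_number, column, char))
    return problems


# Hypothetical sample: the second line has "e" followed by U+0301
# (COMBINING ACUTE ACCENT), the cc 81 bytes seen in the xxd output.
sample = '"Doe, Jane"\n"Oue\u0301dr"\n'
for line_number, column, char in find_unencodable(sample):
    print(line_number, column, "U+{:04X}".format(ord(char)))
</code></pre>
<ul>
<li>Reporting the code point rather than the glyph helps here, because a combining accent is easy to miss when it is rendered on top of the preceding letter</li>
</ul>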
<h2 id="2020-01-14">2020-01-14</h2>
<ul>
<li>I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
<ul>
<li>I manually ran it on the server as the DSpace user and it said &ldquo;Moving: 51633080 into core statistics-2019&rdquo;</li>
<li>After a few hours it died with the same error that I had seen in the log from the first run:</li>
</ul>
</li>
</ul>
<pre><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
<li>I am not sure how I will fix that shard&hellip;</li>
<li>I discovered a very interesting tool called <a href="https://github.com/LuminosoInsight/python-ftfy">ftfy</a> that attempts to fix errors in UTF-8
<ul>
<li>I&rsquo;m curious to start checking input files with this to see what it highlights</li>
<li>I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don&rsquo;t know what it&rsquo;s called?) to digraphs (é→é), which vim identifies as:</li>
<li><code>&lt;e&gt;  101,  Hex 65,  Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401</code></li>
<li><code>&lt;é&gt; 233, Hex 00e9, Oct 351, Digr e'</code></li>
</ul>
</li>
<li>Ah hah! We need to be <a href="https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html">normalizing characters into their canonical forms</a>!
<ul>
<li>In Python 3.8 we can even <a href="https://docs.python.org/3/library/unicodedata.html">check if the string is normalized using the <code>unicodedata</code> library</a>:</li>
</ul>
</li>
</ul>
<pre><code>In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
</code></pre><h2 id="2020-01-15">2020-01-15</h2>
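<ul>
<li>As a minimal sketch of the normalization idea from 2020-01-14 (this is not the csv-metadata-quality code), the standard <code>unicodedata</code> module can turn a decomposed sequence into its precomposed NFC form. The helper name <code>to_nfc</code> is hypothetical, and the two strings are written with explicit escapes because they look identical on screen:</li>
</ul>
<pre><code>import unicodedata

decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE


def to_nfc(value):
    """Return the NFC (canonical composed) form of a string."""
    return unicodedata.normalize("NFC", value)


# The two spellings render the same but are different strings until normalized
print(decomposed == precomposed)
print(to_nfc(decomposed) == precomposed)
print(len(decomposed), len(to_nfc(decomposed)))
</code></pre>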
<ul>
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ilri&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.bioversity&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
<ul>
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.ciat&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
<ul>
<li>We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months</li>
<li>Sisay uploaded the records to DSpace Test as <a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a></li>
<li>I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data</li>
<li>I corrected one invalid AGROVOC subject</li>
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
</ul>
</li>
</ul>
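<ul>
<li>The kinds of sanity checks mentioned above (extra whitespace, invalid multi-value separators, duplicates) can be illustrated with a small Python sketch. This is not how csv-metadata-quality implements them; the function name is hypothetical, and it assumes that <code>||</code> is the multi-value separator and that &ldquo;duplicates&rdquo; means repeated values within one cell:</li>
</ul>
<pre><code>SEPARATOR = "||"  # assumed multi-value separator


def check_value(value):
    """Return a list of problems found in a single CSV cell."""
    problems = []
    # Leading/trailing whitespace or a run of two spaces inside the value
    if value != value.strip() or "  " in value:
        problems.append("extra whitespace")
    # Any pipe left over after removing valid separators is suspicious
    if "|" in value.replace(SEPARATOR, ""):
        problems.append("invalid multi-value separator")
    # The same value repeated within one multi-value cell
    parts = [part.strip() for part in value.split(SEPARATOR)]
    if len(parts) != len(set(parts)):
        problems.append("duplicate value")
    return problems


# Hypothetical cells
for cell in ["Kenya||Uganda", " Kenya", "Kenya|Uganda", "Kenya||Kenya"]:
    print(repr(cell), check_value(cell))
</code></pre>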
<h2 id="2020-01-20">2020-01-20</h2>
<ul>
<li>Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
<ul>
<li>I forwarded it to Peter et al for their comment</li>
<li>We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development</li>
</ul>
</li>
<li>Visit CodeObia to discuss the next phase of AReS development</li>
</ul>
<h2 id="2020-01-21">2020-01-21</h2>
<ul>
<li>Create two accounts on CGSpace for CTA users</li>
<li>Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
<ul>
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/16">HTML syntax fixes</a></li>
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/17">Add LICENSE file</a></li>
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/18">Build main.css using npm build</a></li>
<li>Approved a <a href="https://github.com/AgriculturalSemantics/cg-core/issues/14">wider scope for <code>cg.peer-reviewed</code></a> (renaming the field and using non-boolean values), but there is more discussion needed</li>
</ul>
</li>
<li>I opened a new <a href="https://github.com/AgriculturalSemantics/cg-core/pull/24">pull request</a> on the cg-core repository to validate and fix the formatting of the HTML files</li>
<li>Create more issues for OpenRXV:
<ul>
<li>Based on Peter&rsquo;s feedback on the <a href="https://github.com/ilri/OpenRXV/issues/33">text for labels and tooltips</a></li>
<li>Based on Peter&rsquo;s feedback for the <a href="https://github.com/ilri/OpenRXV/issues/35">export icon</a></li>
<li>Based on Peter&rsquo;s feedback for the <a href="https://github.com/ilri/OpenRXV/issues/31">sort options</a></li>
<li>Based on Abenet&rsquo;s feedback that <a href="https://github.com/ilri/OpenRXV/issues/34">PDF and Word exports are not working</a></li>
</ul>
</li>
</ul>
<h2 id="2020-01-22">2020-01-22</h2>
<ul>
<li>I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:</li>
</ul>
<pre><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
</code></pre><ul>
<li>They started <a href="https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/">limiting public access to the database in December, 2019 due to GDPR and CCPA</a>
<ul>
<li>This will be a problem in the future (see <a href="https://jira.lyrasis.org/browse/DS-4409">DS-4409</a>)</li>
</ul>
</li>
<li>Peter sent me his corrections for the list of authors that I had sent him earlier in the month
<ul>
<li>There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8</li>
<li>I will apply them on CGSpace and DSpace Test using my <code>fix-metadata-values.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
</code></pre><ul>
<li>Then I decided to export them again (with two author columns) so I can apply the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
<li>Peter asked me to send him a list of affiliations to correct
<ul>
<li>First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:</li>
</ul>
</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
</code></pre><ul>
<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
</ul>
<pre><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre><ul>
<li>Then I generated a new list for Peter:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
</code></pre><ul>
<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author &ldquo;Hung, Nguyen&rdquo;
<ul>
<li>I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&amp;R:</li>
</ul>
</li>
</ul>
<pre><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
  46 hung-nguyen-ares-handles.txt
  56 hung-nguyen-atmire-handles.txt
 102 total
</code></pre><ul>
<li>Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven&rsquo;t been indexed yet
<ul>
<li>I am curious to check tomorrow to see if they are there</li>
</ul>
</li>
</ul>
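<ul>
<li>The comparison of the two Handle lists can also be sketched in Python as a set difference. The helper names and the sample Handles are hypothetical; the pattern mirrors the <code>grep -oE</code> expression used above:</li>
</ul>
<pre><code>import re


def extract_handles(text):
    """Collect the unique 10568/NNN Handles that appear in a block of text."""
    return set(re.findall(r"10568/[0-9]+", text))


def missing_handles(reference, candidate):
    """Return Handles present in reference but absent from candidate."""
    return sorted(set(reference) - set(candidate))


# Hypothetical report contents
atmire_report = "10568/1 10568/2 10568/3 10568/2"
ares_report = "10568/1\n10568/3\n"
print(missing_handles(extract_handles(atmire_report),
                      extract_handles(ares_report)))
</code></pre>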
<h2 id="2020-01-23">2020-01-23</h2>
<ul>
<li>I checked AReS and I see that there are now 55 items for author &ldquo;Hung Nguyen-Viet&rdquo;</li>
<li>Linode sent an alert that the outbound traffic rate of CGSpace (linode18) was high for several hours this morning around 5AM UTC+1
<ul>
<li>I checked the nginx logs this morning for the few hours before and after that using goaccess:</li>
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;23/Jan/2020:0[12345678]&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
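<li>goaccess does this aggregation interactively; as a rough Python sketch of the same idea (total bytes sent per client within a time window), assuming nginx writes the standard &ldquo;combined&rdquo; log format. The function name, addresses, and log lines are hypothetical:
<pre><code>import re
from collections import Counter

# host ident user [time] "request" status bytes ...
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d{3} (\d+|-)')


def bytes_by_host(lines, time_prefix=""):
    """Sum response sizes per client for entries whose timestamp starts
    with time_prefix, largest total first."""
    totals = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        host, timestamp, size = match.groups()
        if timestamp.startswith(time_prefix) and size != "-":
            totals[host] += int(size)
    return totals.most_common()


# Hypothetical entries using documentation-only IPv6 addresses
sample_lines = [
    '2001:db8::1 - - [23/Jan/2020:05:00:01 +0100] "GET /a.pdf HTTP/1.1" 200 5000 "-" "bot"',
    '2001:db8::2 - - [23/Jan/2020:05:10:00 +0100] "GET /b.pdf HTTP/1.1" 200 1000 "-" "bot"',
    '2001:db8::1 - - [23/Jan/2020:05:30:00 +0100] "GET /c.pdf HTTP/1.1" 200 3000 "-" "bot"',
    '2001:db8::2 - - [23/Jan/2020:11:00:00 +0100] "GET /d.pdf HTTP/1.1" 200 9000 "-" "bot"',
]
print(bytes_by_host(sample_lines, "23/Jan/2020:05"))
</code></pre>
</li>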
<li>The top two hosts according to the amount of data transferred are:
<ul>
<li>2a01:7e00::f03c:91ff:fe9a:3a37</li>
<li>2a01:7e00::f03c:91ff:fe18:7396</li>
</ul>
</li>
<li>Both are on Linode, and appear to be the new and old ilri.org servers</li>
<li>I will ask the web team</li>
<li>Judging from the <a href="https://www.ilri.org/publications/trade-offs-related-agricultural-use-antimicrobials-and-synergies-emanating-efforts">ILRI publications site</a> it seems they are downloading the PDFs so they can generate higher-quality thumbnails</li>
<li>They are apparently using this Drupal module to generate the thumbnails: <code>sites/all/modules/contrib/pdf_to_imagefield</code></li>
<li>I see some excellent suggestions in this <a href="https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589">ImageMagick thread from 2012</a> that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as <a href="https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/">this blog post</a>:</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
</code></pre><ul>
<li>Here I&rsquo;m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using <code>-flatten</code> like DSpace already does</li>
<li>I did some tests with a modified version of the above that uses <code>-flatten</code> and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):</li>
</ul>
<pre><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
$ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
</code></pre><ul>
<li>This emulates DSpace&rsquo;s method of generating a high-quality image from the PDF and then creating a thumbnail</li>
<li>I put together a proof of concept of this by adding the extra options to dspace-api&rsquo;s <code>ImageMagickThumbnailFilter.java</code> and it works</li>
<li>I need to run tests on a handful of PDFs to see if there are any side effects</li>
<li>The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org&rsquo;s 400KiB PNG!</li>
<li>Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:</li>
</ul>
<pre><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><h2 id="2020-01-26">2020-01-26</h2>
<ul>
<li>Add &ldquo;Gender&rdquo; to controlled vocabulary for CRPs (<a href="https://github.com/ilri/DSpace/pull/442">#442</a>)</li>
<li>Deploy the changes on CGSpace and run all updates on the server and reboot it
<ul>
<li>I had to restart the <code>tomcat7</code> service several times until all Solr statistics cores came up OK</li>
</ul>
</li>
<li>I spent a few hours writing a script (<a href="https://gist.github.com/alanorth/1c7c8b2131a19559e273fbc1e58d6a71">create-thumbnails</a>) to compare the default DSpace thumbnails with the improved parameters above, but when comparing them at 600px I don&rsquo;t really notice much difference, other than the new ones having slightly crisper text
<ul>
<li>So that was a waste of time, though I think our 300px thumbnails are a bit small now</li>
<li><a href="https://www.imagemagick.org/discourse-server/viewtopic.php?t=14561">Another thread on the ImageMagick forum</a> mentions that you need to set the density, then read the image, then set the density again:</li>
</ul>
</li>
</ul>
<pre><code>$ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
</code></pre><ul>
<li>One thing worth mentioning was this syntax for extracting bits from JSON in bash using <code>jq</code>:</li>
</ul>
<pre><code>$ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
$ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName==&quot;ORIGINAL&quot;) | .retrieveLink'
&quot;/bitstreams/172559/retrieve&quot;
</code></pre><h2 id="2020-01-27">2020-01-27</h2>
<ul>
<li>Bizu has been having problems when she logs into CGSpace: she can&rsquo;t see the community list on the front page
<ul>
<li>This last happened for another user in <a href="https://alanorth.github.io/cgspace-notes/2016-11/">2016-11</a>, and it was related to the Tomcat <code>maxHttpHeaderSize</code> being too small because the user was in too many groups</li>
<li>I see that it is similar, with this message appearing in the DSpace log just after she logs in:</li>
</ul>
</li>
</ul>
<pre><code>2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
</code></pre><ul>
<li>Now this appears to be a Solr limit of some kind (&ldquo;too many boolean clauses&rdquo;)
<ul>
<li>I changed the <code>maxBooleanClauses</code> for all Solr cores on DSpace Test from 1024 to 2048 and then she was able to see her communities&hellip;</li>
<li>I made a <a href="https://github.com/ilri/DSpace/pull/443">pull request</a> and merged it to the <code>5_x-prod</code> branch and will deploy on CGSpace later tonight</li>
<li>I am curious if anyone on the dspace-tech mailing list has run into this, so I will try to send a message about this there when I get a chance</li>
</ul>
</li>
</ul>
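<ul>
<li>To illustrate why a user in many groups trips this limit, here is a minimal Python sketch that builds a filter shaped like the one in the log and counts its terms. It is not DSpace's code: the function names and IDs are hypothetical, and it assumes that each OR'd term counts as one boolean clause:</li>
</ul>
<pre><code>MAX_BOOLEAN_CLAUSES = 1024  # the value on DSpace Test before raising it to 2048


def build_read_filter(group_ids, eperson_id):
    """Build a filter like read:(g0 OR e610 OR g3 ...) and return it
    together with the number of OR'd terms."""
    terms = ["g0", "e{}".format(eperson_id)]
    terms += ["g{}".format(group_id) for group_id in group_ids]
    return "read:({})".format(" OR ".join(terms)), len(terms)


def exceeds_limit(clause_count, limit=MAX_BOOLEAN_CLAUSES):
    """True when a query has more clauses than the limit allows."""
    return clause_count > limit


# Hypothetical user who belongs to 1,100 groups
query, clause_count = build_read_filter(range(1, 1101), 610)
print(clause_count, exceeds_limit(clause_count))
print(exceeds_limit(clause_count, limit=2048))
</code></pre>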
<h2 id="2020-01-28">2020-01-28</h2>
<ul>
<li>Generate a list of CIP subjects for Abenet:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.subject.cip&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
COPY 77
</code></pre><ul>
<li>Start looking over the IITA records from earlier this month (<a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a>)
<ul>
<li>Delete one duplicate, map one item from ILRI community</li>
<li>The following items are duplicates or something (there is not enough metadata to tell for sure):
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/106682">https://dspacetest.cgiar.org/handle/10568/106682</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/106653">https://dspacetest.cgiar.org/handle/10568/106653</a></li>
<li><a href="https://dspacetest.cgiar.org/handle/10568/106694">https://dspacetest.cgiar.org/handle/10568/106694</a></li>
</ul>
</li>
<li>This item doesn&rsquo;t exist in the journal, and Weed Science volume 55 was published in 2007, not 2003:
<ul>
<li><a href="https://dspacetest.cgiar.org/handle/10568/106665">https://dspacetest.cgiar.org/handle/10568/106665</a></li>
</ul>
</li>
<li>All items were using <code>cg.journal.title</code> instead of <code>dc.source</code></li>
<li>Several items were missing ISSN despite having a journal title</li>
<li>Many items were missing DOIs, abstracts, etc</li>
<li>I did some metadata enrichment by searching for the items and copying relevant data from journal pages</li>
<li>I asked Bosede to try to do the same for the rest of the journal articles</li>
</ul>
</li>
</ul>
<h2 id="2020-01-29">2020-01-29</h2>
<ul>
<li>Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using old format:</li>
</ul>
<pre><code>UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.youtube.com', 'https://www.youtube.com') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.youtube.com%';
UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.slideshare.net', 'https://www.slideshare.net') WHERE resource_type_id = 2 AND metadata_field_id = 219 AND text_value LIKE 'http://www.slideshare.net%';
</code></pre><ul>
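<li>The same rewrite rules can be sketched in Python, for example to preview the effect on a few values before running the SQL. This is only an illustration: the function name and sample links are hypothetical, and unlike the SQL it does not distinguish between metadata fields:
<pre><code>import re

# Each pattern is anchored at the start of the value, like the LIKE 'prefix%' filters
REWRITES = [
    (re.compile(r"^https?://dx\.doi\.org"), "https://doi.org"),
    (re.compile(r"^http://(www\.)?doi\.org"), "https://doi.org"),
    (re.compile(r"^http://www\.youtube\.com"), "https://www.youtube.com"),
    (re.compile(r"^http://www\.slideshare\.net"), "https://www.slideshare.net"),
]


def normalize_link(url):
    """Apply the first matching rewrite; leave other values untouched."""
    for pattern, replacement in REWRITES:
        if pattern.search(url):
            return pattern.sub(replacement, url, count=1)
    return url


# Hypothetical values
for link in ["http://dx.doi.org/10.1000/example",
             "http://www.doi.org/10.1000/example",
             "https://doi.org/10.1000/example"]:
    print(normalize_link(link))
</code></pre>
</li>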
<li>I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT resource_id as &quot;id&quot;, text_value as &quot;dc.identifier.issn&quot; FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
COPY 23339
</code></pre><ul>
<li>Then, after spending two hours correcting 1,000 ISSNs I realized that I need to normalize the <code>text_lang</code> fields in the database first or else these will all look like changes due to the &ldquo;en_US&rdquo; and NULL, etc (for both ISSN and ISBN):</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
UPDATE 30454
</code></pre><ul>
<li>Then I realized that my initial PostgreSQL query wasn&rsquo;t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when <code>dspace metadata-import</code> sees it, the change will be removed and added, or added and removed, depending on the order it is seen!</li>
<li>A better course of action is to select the distinct ones and then correct them using <code>fix-metadata-values.py</code>&hellip;</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.identifier.issn[en_US]&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
COPY 2900
</code></pre><ul>
<li>I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later</li>
<li>Then I applied 181 fixes for ISSNs using <code>fix-metadata-values.py</code> on DSpace Test and CGSpace (after testing locally):</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
</code></pre><h2 id="2020-01-30">2020-01-30</h2>
<ul>
<li>I&rsquo;m about to start working on the DSpace 6 port, so I&rsquo;m looking at commits that are in the not-yet-tagged DSpace 6.4:
<ul>
<li>[DS-4342] improve the performance of the collections/collection_id/items REST endpoint:
<ul>
<li>c2e6719fa763e291b81b2d61da2f8c758fe38ff3</li>
</ul>
</li>
<li>[DS-4136] Improve OAI import performance for a large install:
<ul>
<li>3f81daf3d89b17ff4d08783ee9899e5a745851dc</li>
<li>37004bbcf4ca3ef2a74ebc6e4774cb605884864e</li>
</ul>
</li>
<li>DS-4110: fix issue in legacy id cleanup of stats records:
<ul>
<li>3752247d6a4b83ee809cc9b197f34a8ff50b9e74</li>
<li>e6004e57f0f2f3ce5f433647fe8a467b0176836b</li>
<li>2fb3751c9adfe7311c6df43dbd51a41479480f5e</li>
</ul>
</li>
<li>Fix DS-4066 by update all IDs to string type in schema:
<ul>
<li>f15cb33ab4272a3970572e608810de3076d541a3</li>
</ul>
</li>
<li>DS-3914: Fix community defiliation:
<ul>
<li>19cc9719879cf69019acad72ee13915a4128e859</li>
<li>b86a7b8d66608ee2bec67fb69b37e27c9a620aa3</li>
</ul>
</li>
<li>[DS-3849] Default ID &lsquo;order by&rsquo; clause for other &lsquo;get items&rsquo; queries:
<ul>
<li>7b888fa558e5792cd780d1d6a7f75564f4da3bf9</li>
<li>8d1aa33f7b9ea5a623e1ed13f139695671c598d4</li>
</ul>
</li>
<li>[DS-3664] ImageMagick: Only execute &ldquo;identify&rdquo; on first page:
<ul>
<li>33ba419f3560639bff8ea002cdfc38345c0fea8d</li>
</ul>
</li>
<li>DS-3658 Configure ReindexerThread disable reindex:
<ul>
<li>1d2f10592ac2d86f28044749f34ac05347ea0e0a</li>
<li>05959ef315d2a1670e4b59eee4db21f93ba238fa</li>
<li>7253095b623069d7ef0a1a13cc5a21385d0878c9</li>
</ul>
</li>
<li>[DS-3602] 6x Port: Incremental Update of Legacy Id fields in Solr Statistics:
<ul>
<li>184f2b2153479045fba6239342c63e7f8564b8b6</li>
</ul>
</li>
<li>Dspace 6 ds 3545 mirage2: custom sitemap.xmap is ignored:
<ul>
<li>71c68f2f54dead69329298810d0fecdf76b59c09</li>
</ul>
</li>
</ul>
</li>
<li>It&rsquo;s annoying that we have to target DSpace 6.3&hellip; I think I should totally cherry-pick these when I&rsquo;m done</li>
<li>For now I just created a new DSpace repository and checked out the <code>dspace-6.3</code> tag and started diffing and copying changes over from our 5.8 repository</li>
<li>There are some things I need to remember to check:
<ul>
<li><code>search.index</code> settings in DSpace 5&rsquo;s dspace.cfg (dunno where they are now)</li>
<li><code>thumbnail-fallback-files.xml</code></li>
</ul>
</li>
<li>The code currently lives in the <code>6_x-dev</code> branch</li>
</ul>
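<ul>
<li>For the cherry-picking idea mentioned above, a minimal Python sketch that only builds the <code>git</code> commands rather than running them. Whether the commits should land on <code>6_x-dev</code>, and in which order, are assumptions here; the helper name <code>cherry_pick_commands</code> is hypothetical, and the list order above may not match the order the commits landed upstream:</li>
</ul>

```python
import shlex


def cherry_pick_commands(commits, branch="6_x-dev"):
    """Build the git commands to replay upstream commits onto a branch."""
    commands = ["git checkout " + shlex.quote(branch)]
    for commit in commits:
        # -x appends a "cherry picked from commit ..." line to the new message
        commands.append("git cherry-pick -x " + shlex.quote(commit))
    return commands


# The two DS-4136 commits from the list above, in the order listed there
# (assumption: that order is safe to apply; check the upstream history first)
ds_4136 = [
    "3f81daf3d89b17ff4d08783ee9899e5a745851dc",
    "37004bbcf4ca3ef2a74ebc6e4774cb605884864e",
]
for command in cherry_pick_commands(ds_4136):
    print(command)
```

<ul>
<li>Printing the commands first makes it easy to review them before pasting them into a shell.</li>
</ul>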


</article>


        </div> <!-- /.blog-main -->



      </div> <!-- /.row -->
    </div> <!-- /.container -->


    <footer class="blog-footer">
      <p dir="auto">

      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

      </p>
    </footer>


  </body>

</html>