mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-22 11:43:00 +01:00
475 lines
23 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
<meta property="og:title" content="January, 2020" />
|
|
<meta property="og:description" content="2020-01-06
|
|
|
|
Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
|
|
Last week Altmetric responded about the item that had a lower score than its DOI
|
|
|
|
The score is now linked to the DOI
|
|
Another item that had the same problem in 2019 is now also linked to the score for its DOI
|
|
Another item that had the same problem in 2019 has also been fixed
|
|
|
|
|
|
|
|
2020-01-07
|
|
|
|
Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
|
|
|
|
The DOI has a score of 259, but the Handle has no score at all
|
|
I tweeted the CGSpace repository link
|
|
|
|
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-01/" />
|
|
<meta property="article:published_time" content="2020-01-06T10:48:30+02:00" />
|
|
<meta property="article:modified_time" content="2020-01-23T12:46:39+02:00" />
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="January, 2020"/>
|
|
<meta name="twitter:description" content="2020-01-06
|
|
|
|
Open a ticket with Atmire to request a quote for the upgrade to DSpace 6
|
|
Last week Altmetric responded about the item that had a lower score than its DOI
|
|
|
|
The score is now linked to the DOI
|
|
Another item that had the same problem in 2019 is now also linked to the score for its DOI
|
|
Another item that had the same problem in 2019 has also been fixed
|
|
|
|
|
|
|
|
2020-01-07
|
|
|
|
Peter Ballantyne highlighted one more WLE item that is missing the Altmetric score that its DOI has
|
|
|
|
The DOI has a score of 259, but the Handle has no score at all
|
|
I tweeted the CGSpace repository link
|
|
|
|
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.62.2" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "January, 2020",
|
|
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/",
|
|
"wordCount": "2117",
|
|
"datePublished": "2020-01-06T10:48:30+02:00",
|
|
"dateModified": "2020-01-23T12:46:39+02:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-01/">
|
|
|
|
<title>January, 2020 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">
|
|
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-01/">January, 2020</a></h2>
|
|
<p class="blog-post-meta"><time datetime="2020-01-06T10:48:30+02:00">Mon Jan 06, 2020</time> by Alan Orth in
|
|
<i class="fa fa-folder" aria-hidden="true"></i> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2020-01-06">2020-01-06</h2>
|
|
<ul>
|
|
<li>Open <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">a ticket</a> with Atmire to request a quote for the upgrade to DSpace 6</li>
|
|
<li>Last week Altmetric responded about the <a href="https://hdl.handle.net/10568/97087">item</a> that had a lower score than its DOI
|
|
<ul>
|
|
<li>The score is now linked to the DOI</li>
|
|
<li>Another <a href="https://hdl.handle.net/10568/91278">item</a> that had the same problem in 2019 is now also linked to the score for its DOI</li>
|
|
<li>Another <a href="https://hdl.handle.net/10568/81236">item</a> that had the same problem in 2019 has also been fixed</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2020-01-07">2020-01-07</h2>
|
|
<ul>
|
|
<li>Peter Ballantyne highlighted one more WLE <a href="https://hdl.handle.net/10568/101286">item</a> that is missing the Altmetric score that its DOI has
|
|
<ul>
|
|
<li>The DOI has a score of 259, but the Handle has no score at all</li>
|
|
<li>I <a href="https://twitter.com/mralanorth/status/1214471427157626881">tweeted</a> the CGSpace repository link</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2020-01-08">2020-01-08</h2>
|
|
<ul>
|
|
<li>Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
</code></pre><ul>
|
|
<li>As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:</li>
|
|
</ul>
|
|
<pre><code>$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
</code></pre><ul>
|
|
<li>According to <a href="https://www.datafix.com.au/BASHing/2018-09-13.html">this trick</a> the troublesome character is on line 5227:</li>
|
|
</ul>
|
|
<pre><code>$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 "
00000001: 4f O
00000002: 75 u
00000003: 65 e
00000004: cc .
00000005: 81 .
00000006: 64 d
00000007: 72 r
</code></pre><ul>
|
|
<li><del>According to the blog post linked above the troublesome character is probably the “High Octet Preset” (81)</del>, which vim identifies (using <code>ga</code> on the character) as:</li>
|
|
</ul>
|
|
<pre><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401
|
|
</code></pre><ul>
|
|
<li>If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database…</li>
|
|
<li>Other encodings like <code>windows-1251</code> and <code>windows-1257</code> also fail on different characters like “ž” and “é” that <em>are</em> legitimate UTF-8 characters</li>
|
|
<li>Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings</li>
|
|
<li>I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me</li>
|
|
<li>Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the <code>5_x-prod</code> (no CG Core v2) branch</li>
|
|
</ul>
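<ul>
<li>A rough Python sketch of the <code>iconv</code> failure above, assuming the author value contains a decomposed “é”: the letter <code>e</code> followed by U+0301, whose UTF-8 bytes are the <code>cc 81</code> seen in the <code>xxd</code> output. The helper name <code>unencodable_chars</code> is hypothetical and is not part of any tool used in these notes:</li>
</ul>
<pre><code>import unicodedata


def unencodable_chars(text, encoding="windows-1252"):
    """Return (index, character name) pairs for characters the encoding cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, unicodedata.name(char, "UNKNOWN")))
    return problems


# "e" followed by U+0301 COMBINING ACUTE ACCENT, as in the xxd output above
decomposed = '"Oue\u0301dr'
print(unencodable_chars(decomposed))

# NFC normalization merges the pair into the single character U+00E9,
# which windows-1252 can represent
composed = unicodedata.normalize("NFC", decomposed)
print(unencodable_chars(composed))
</code></pre>
<ul>
<li>With the decomposed string the only character reported is the combining accent at index 4, and the NFC-normalized string produces an empty list</li>
</ul>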
|
|
<h2 id="2020-01-14">2020-01-14</h2>
|
|
<ul>
|
|
<li>I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
|
|
<ul>
|
|
<li>I manually ran it on the server as the DSpace user and it said “Moving: 51633080 into core statistics-2019”</li>
|
|
<li>After a few hours it died with the same error that I had seen in the log from the first run:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
</code></pre><ul>
|
|
<li>I am not sure how I will fix that shard…</li>
|
|
<li>I discovered a very interesting tool called <a href="https://github.com/LuminosoInsight/python-ftfy">ftfy</a> that attempts to fix errors in UTF-8
|
|
<ul>
|
|
<li>I'm curious to start checking input files with this to see what it highlights</li>
|
|
<li>I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:</li>
|
|
<li><code>&lt;e&gt; 101, Hex 65, Octal 145 &lt; ́&gt; 769, Hex 0301, Octal 1401</code></li>
<li><code>&lt;é&gt; 233, Hex 00e9, Oct 351, Digr e'</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>Ah hah! We need to be <a href="https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html">normalizing characters into their canonical forms</a>!
|
|
<ul>
|
|
<li>In Python 3.8 we can even <a href="https://docs.python.org/3/library/unicodedata.html">check if the string is normalized using the <code>unicodedata</code> library</a>:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>In [7]: unicodedata.is_normalized('NFC', 'é')
|
|
Out[7]: False
|
|
|
|
In [8]: unicodedata.is_normalized('NFC', 'é')
|
|
Out[8]: True
|
|
</code></pre><h2 id="2020-01-15">2020-01-15</h2>
|
|
<ul>
|
|
<li>I added support for Unicode normalization to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool in <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0">v0.4.0</a></li>
|
|
<li>Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
</code></pre><ul>
|
|
<li>She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC</li>
|
|
<li>I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my <code>fix-metadata-values.py</code> script:</li>
|
|
</ul>
|
|
<pre><code>$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
|
|
</code></pre><h2 id="2020-01-16">2020-01-16</h2>
|
|
<ul>
|
|
<li>Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
</code></pre><ul>
|
|
<li>Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
|
|
<ul>
|
|
<li>We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months</li>
|
|
<li>Sisay uploaded the records to DSpace Test as <a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a></li>
|
|
<li>I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data</li>
|
|
<li>I corrected one invalid AGROVOC subject</li>
|
|
<li>Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
|
|
<ul>
|
|
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
|
|
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
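<ul>
<li>A rough Python sketch of the kinds of checks described above (extra whitespace, invalid multi-value separators, duplicates). This is not how csv-metadata-quality implements them: the function name <code>check_value</code>, the assumption that <code>||</code> is the only valid separator, and the sample values are all made up for illustration:</li>
</ul>
<pre><code>def check_value(value, separator="||"):
    """Return a list of problems found in one multi-value metadata cell."""
    problems = []
    parts = value.split(separator)
    for part in parts:
        # Assumption: any "|" left over after splitting is a broken separator
        if "|" in part:
            problems.append("invalid multi-value separator in " + repr(part))
        # Leading, trailing, or repeated spaces count as extra whitespace
        if part != " ".join(part.split()):
            problems.append("extra whitespace in " + repr(part))
    cleaned = [" ".join(part.split()) for part in parts]
    if len(set(cleaned)) != len(cleaned):
        problems.append("duplicate values")
    return problems


print(check_value("Kenya||Uganda"))
print(check_value(" Kenya||Uganda|Nigeria||Kenya"))
</code></pre>
<ul>
<li>The first call returns an empty list, while the second reports all three kinds of problem in a single cell</li>
</ul>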
|
|
<h2 id="2020-01-20">2020-01-20</h2>
|
|
<ul>
|
|
<li>Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
|
|
<ul>
|
|
<li>I forwarded it to Peter et al. for their comments</li>
|
|
<li>We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development</li>
|
|
</ul>
|
|
</li>
|
|
<li>Visit CodeObia to discuss the next phase of AReS development</li>
|
|
</ul>
|
|
<h2 id="2020-01-21">2020-01-21</h2>
|
|
<ul>
|
|
<li>Create two accounts on CGSpace for CTA users</li>
|
|
<li>Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
|
|
<ul>
|
|
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/16">HTML syntax fixes</a></li>
|
|
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/17">Add LICENSE file</a></li>
|
|
<li>Merged: <a href="https://github.com/AgriculturalSemantics/cg-core/pull/18">Build main.css using npm build</a></li>
|
|
<li>Approved a <a href="https://github.com/AgriculturalSemantics/cg-core/issues/14">wider scope for <code>cg.peer-reviewed</code></a> (renaming the field and using non-boolean values), but there is more discussion needed</li>
|
|
</ul>
|
|
</li>
|
|
<li>I opened a new <a href="https://github.com/AgriculturalSemantics/cg-core/pull/24">pull request</a> on the cg-core repository to validate and fix the formatting of the HTML files</li>
|
|
<li>Create more issues for OpenRXV:
|
|
<ul>
|
|
<li>Based on Peter's feedback on the <a href="https://github.com/ilri/OpenRXV/issues/33">text for labels and tooltips</a></li>
|
|
<li>Based on Peter's feedback for the <a href="https://github.com/ilri/OpenRXV/issues/35">export icon</a></li>
|
|
<li>Based on Peter's feedback for the <a href="https://github.com/ilri/OpenRXV/issues/31">sort options</a></li>
|
|
<li>Based on Abenet's feedback that <a href="https://github.com/ilri/OpenRXV/issues/34">PDF and Word exports are not working</a></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2020-01-22">2020-01-22</h2>
|
|
<ul>
|
|
<li>I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:</li>
|
|
</ul>
|
|
<pre><code>Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
|
|
</code></pre><ul>
|
|
<li>They started <a href="https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/">limiting public access to the database in December, 2019 due to GDPR and CCPA</a>
|
|
<ul>
|
|
<li>This will be a problem in the future (see <a href="https://jira.lyrasis.org/browse/DS-4409">DS-4409</a>)</li>
|
|
</ul>
|
|
</li>
|
|
<li>Peter sent me his corrections for the list of authors that I had sent him earlier in the month
|
|
<ul>
|
|
<li>There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8</li>
|
|
<li>I will apply them on CGSpace and DSpace Test using my <code>fix-metadata-values.py</code> script:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
|
|
</code></pre><ul>
|
|
<li>Then I decided to export them again (with two author columns) so I can run the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
</code></pre><ul>
|
|
<li>Peter asked me to send him a list of affiliations to correct
|
|
<ul>
|
|
<li>First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
</code></pre><ul>
|
|
<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
|
|
</ul>
|
|
<pre><code>$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
|
</code></pre><ul>
|
|
<li>Then I generated a new list for Peter:</li>
|
|
</ul>
|
|
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
|
|
COPY 6162
|
|
</code></pre><ul>
|
|
<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author “Hung, Nguyen”
|
|
<ul>
|
|
<li>I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
46 hung-nguyen-ares-handles.txt
56 hung-nguyen-atmire-handles.txt
102 total
</code></pre><ul>
|
|
<li>Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
|
|
<ul>
|
|
<li>I am curious to check tomorrow to see if they are there</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
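<ul>
<li>The Handle comparison above could also be sketched in Python with set arithmetic. The function names and the sample report text are hypothetical; only the <code>10568/[0-9]+</code> pattern comes from the <code>grep</code> command above:</li>
</ul>
<pre><code>import re

HANDLE_PATTERN = re.compile(r"10568/[0-9]+")


def extract_handles(text):
    """Return the set of unique 10568/NNNN Handles found in a block of text."""
    return set(HANDLE_PATTERN.findall(text))


def missing_handles(reference_text, other_text):
    """Return Handles that appear in reference_text but not in other_text, sorted."""
    return sorted(extract_handles(reference_text) - extract_handles(other_text))


# Made-up placeholder data standing in for the two exported reports
atmire_report = "10568/100 some title\n10568/101 another\n10568/102 third\n"
ares_report = "10568/100\n10568/102\n"
print(missing_handles(atmire_report, ares_report))
</code></pre>
<ul>
<li>For the placeholder data this prints the one Handle that is only in the first report, <code>10568/101</code></li>
</ul>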
|
|
<h2 id="2020-01-23">2020-01-23</h2>
|
|
<ul>
|
|
<li>I checked AReS and I see that there are now 55 items for author “Hung Nguyen-Viet”</li>
|
|
<li>Linode sent an alert that the outbound traffic rate of CGSpace (linode18) was high for several hours this morning around 5AM UTC+1
|
|
<ul>
|
|
<li>I checked the nginx logs this morning for the few hours before and after that using goaccess:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
|
|
</code></pre><ul>
|
|
<li>The top two hosts according to the amount of data transferred are:
|
|
<ul>
|
|
<li>2a01:7e00::f03c:91ff:fe9a:3a37</li>
|
|
<li>2a01:7e00::f03c:91ff:fe18:7396</li>
|
|
</ul>
|
|
</li>
|
|
<li>Both are on Linode, and appear to be the new and old ilri.org servers</li>
|
|
<li>I will ask the web team</li>
|
|
<li>Judging from the <a href="https://www.ilri.org/publications/trade-offs-related-agricultural-use-antimicrobials-and-synergies-emanating-efforts">ILRI publications site</a> it seems they are downloading the PDFs so they can generate higher-quality thumbnails</li>
|
|
<li>They are apparently using this Drupal module to generate the thumbnails: <code>sites/all/modules/contrib/pdf_to_imagefield</code></li>
|
|
<li>I see some excellent suggestions in this <a href="https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589">ImageMagick thread from 2012</a> that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as <a href="https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/">this blog post</a>:</li>
|
|
</ul>
|
|
<pre><code>$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
|
|
</code></pre><ul>
|
|
<li>Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using <code>-flatten</code> like DSpace already does</li>
|
|
<li>I did some tests with a modified version of the above that uses <code>-flatten</code> and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):</li>
|
|
</ul>
|
|
<pre><code>$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
$ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
</code></pre><ul>
|
|
<li>This emulates DSpace's method of generating a high-quality image from the PDF and then creating a thumbnail</li>
|
|
<li>I put together a proof of concept of this by adding the extra options to dspace-api's <code>ImageMagickThumbnailFilter.java</code> and it works</li>
|
|
<li>I need to run tests on a handful of PDFs to see if there are any side effects</li>
|
|
<li>The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org's 400KiB PNG!</li>
|
|
<li>Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:</li>
|
|
</ul>
|
|
<pre><code>$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><!-- raw HTML omitted -->
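<ul>
<li>A minimal sketch of what NFC normalization does to one CSV column, using only Python's standard library. This is an illustration and not the code in csv-metadata-quality: the function name <code>normalize_column</code> and the sample affiliation are made up, and <code>unicodedata.is_normalized</code> needs Python 3.8 or newer:</li>
</ul>
<pre><code>import csv
import io
import unicodedata


def normalize_column(rows, column):
    """Yield a copy of each row with the given column normalized to NFC."""
    for row in rows:
        fixed = dict(row)
        fixed[column] = unicodedata.normalize("NFC", row[column])
        yield fixed


# In-memory CSV with a made-up affiliation; "e\u0301" is a decomposed "é"
sample = io.StringIO("correct\nUniversite\u0301 de Exemple\n")
rows = list(normalize_column(csv.DictReader(sample), "correct"))
print(unicodedata.is_normalized("NFC", rows[0]["correct"]))
</code></pre>
<ul>
<li>After normalization the value uses the single precomposed character U+00E9, so the final check prints <code>True</code></li>
</ul>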
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2020-01/">January, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-12/">December, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-11/">November, 2019</a></li>
|
|
|
|
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2019-10/">October, 2019</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|