mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-18 19:22:18 +01:00
914 lines
67 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="March, 2023" />
|
|
<meta property="og:description" content="2023-03-01
|
|
|
|
Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
|
|
iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
|
|
I finally got through with porting the input form from DSpace 6 to DSpace 7
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-03/" />
|
|
<meta property="article:published_time" content="2023-03-01T07:58:36+03:00" />
|
|
<meta property="article:modified_time" content="2023-04-02T09:16:25+03:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="March, 2023"/>
|
|
<meta name="twitter:description" content="2023-03-01
|
|
|
|
Remove cg.subject.wle and cg.identifier.wletheme from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)
|
|
iso-codes 4.13.0 was released, which incorporates my changes to the common names for Iran, Laos, and Syria
|
|
I finally got through with porting the input form from DSpace 6 to DSpace 7
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.121.2">
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "March, 2023",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2023-03/",
|
|
"wordCount": "4810",
|
|
"datePublished": "2023-03-01T07:58:36+03:00",
|
|
"dateModified": "2023-04-02T09:16:25+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-03/">
|
|
|
|
<title>March, 2023 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-03/">March, 2023</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2023-03-01T07:58:36+03:00">Wed Mar 01, 2023</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2023-03-01">2023-03-01</h2>
|
|
<ul>
|
|
<li>Remove <code>cg.subject.wle</code> and <code>cg.identifier.wletheme</code> from CGSpace input form after confirming with IWMI colleagues that they no longer need them (WLE closed in 2021)</li>
|
|
<li><a href="https://salsa.debian.org/iso-codes-team/iso-codes/-/blob/main/CHANGELOG.md#4130-2023-02-28">iso-codes 4.13.0 was released</a>, which incorporates my changes to the common names for Iran, Laos, and Syria</li>
|
|
<li>I finally got through with porting the input form from DSpace 6 to DSpace 7</li>
|
|
</ul>
|
|
<ul>
|
|
<li>I can’t put my finger on it, but the input form has to be formatted very particularly: for example, if your rows have more than two fields in them without a sufficient Bootstrap grid style, or if you use a <code>twobox</code>, the entire form step appears blank</li>
|
|
</ul>
|
|
<h2 id="2023-03-02">2023-03-02</h2>
|
|
<ul>
|
|
<li>I did some experiments with the new <a href="https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i">Pandas 2.0.0rc0 Apache Arrow support</a>
|
|
<ul>
|
|
<li>There is a change to the way nulls are handled and it causes my tests for <code>pd.isna(field)</code> to fail</li>
|
|
<li>I think we need to consider blanks as null, but I’m not sure</li>
|
|
</ul>
|
|
</li>
|
|
<li>I made some adjustments to the Discovery sidebar facets on DSpace 6 while I was looking at the DSpace 7 configuration
|
|
<ul>
|
|
<li>I downgraded CIFOR subject, Humidtropics subject, Drylands subject, ICARDA subject, and Language from DiscoverySearchFilterFacet to DiscoverySearchFilter in <code>discovery.xml</code> since we are no longer using them in sidebar facets</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
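<ul>
<li>A minimal sketch of the “consider blanks as null” idea from the Pandas experiments above, assuming that empty or whitespace-only strings should count as missing. The <code>is_blank</code> helper is hypothetical and uses only the standard library, so it works on plain Python values rather than on Pandas or Arrow scalars:</li>
</ul>

```python
import math


def is_blank(value):
    """Return True if value is None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    # float("nan") is the usual missing-value marker for numeric columns
    if isinstance(value, float) and math.isnan(value):
        return True
    # Treat "" and "   " as missing as well (an assumption, not Pandas behavior)
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


fields = ["Orth, Alan", "", "   ", None, float("nan"), 0]
blank_flags = [is_blank(field) for field in fields]
print(blank_flags)
```

<ul>
<li>Keeping the blank check in one helper would avoid repeating an <code>or field == ""</code> test at every call site, though whether blanks really should be null is still an open question</li>
</ul>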
|
|
<h2 id="2023-03-03">2023-03-03</h2>
|
|
<ul>
|
|
<li>Atmire merged one of my old pull requests into COUNTER-Robots:
|
|
<ul>
|
|
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/54">COUNTER_Robots_list.json: Add new bots</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>I will update the local ILRI overrides in our DSpace spider agents file</li>
|
|
</ul>
|
|
<h2 id="2023-03-04">2023-03-04</h2>
|
|
<ul>
|
|
<li>Submit a <a href="https://github.com/flyingcircusio/pycountry/pull/156">pull request on pycountry to use iso-codes 4.13.0</a></li>
|
|
</ul>
|
|
<h2 id="2023-03-05">2023-03-05</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2023-03-06">2023-03-06</h2>
|
|
<ul>
|
|
<li>Export CGSpace to do Initiative collection mappings
|
|
<ul>
|
|
<li>There were thirty-three that needed updating</li>
|
|
</ul>
|
|
</li>
|
|
<li>Send Abenet and Sam a list of twenty-one CAS publications that had been marked as “multiple documents” that we uploaded as metadata-only items
|
|
<ul>
|
|
<li>Goshu will download the PDFs for each and upload them to the items on CGSpace manually</li>
|
|
</ul>
|
|
</li>
|
|
<li>I spent some time trying to get csv-metadata-quality working with the new Arrow backend for Pandas 2.0.0rc0
|
|
<ul>
|
|
<li>It seems there is a problem recognizing empty strings as na with <code>pd.isna()</code></li>
|
|
<li>If I do <code>pd.isna(field) or field == ""</code> then it works as expected, but that feels hacky</li>
|
|
<li>I’m going to test again on the next release…</li>
|
|
<li>Note that I had been setting both of these global options:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>pd.options.mode.dtype_backend = 'pyarrow'
|
|
pd.options.mode.nullable_dtypes = True
|
|
</code></pre><ul>
|
|
<li>Then reading the CSV like this:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>df = pd.read_csv(args.input_file, engine='pyarrow', dtype='string[pyarrow]')
|
|
</code></pre><h2 id="2023-03-07">2023-03-07</h2>
|
|
<ul>
|
|
<li>Create a PostgreSQL 14 instance on my local environment to start testing compatibility with DSpace 6 as well as all my scripts:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman pull docker.io/library/postgres:14-alpine
|
|
</span></span><span style="display:flex;"><span>$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5432:5432 -d postgres:14-alpine
|
|
</span></span><span style="display:flex;"><span>$ createuser -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres --pwprompt dspacetest
|
|
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5432</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspacetest
|
|
</span></span></code></pre></div><ul>
|
|
<li>Peter sent me a list of items that had ILRI affiliation on Altmetric, but that didn’t have Handles
|
|
<ul>
|
|
<li>I ran a duplicate check on them to find if they exist or if we can import them</li>
|
|
<li>There were about ninety matches, but a few dozen of those were pre-prints!</li>
|
|
<li>After excluding those there were about sixty-one items we already have on CGSpace so I will add their DOIs to the existing items
|
|
<ul>
|
|
<li>After joining these with the records from CGSpace and inspecting the DOIs I found that only forty-four were new DOIs</li>
|
|
<li>Surprisingly some of the DOIs on Altmetric were not working, though we also had some that were not working (specifically the Journal of Agricultural Economics seems to have reassigned DOIs)</li>
|
|
</ul>
|
|
</li>
|
|
<li>For the rest of the ~359 items I extracted their DOIs and looked up the metadata on Crossref using my <code>crossref-doi-lookup.py</code> script
|
|
<ul>
|
|
<li>After spending some time cleaning the data in OpenRefine I realized we don’t get access status from Crossref</li>
|
|
<li>We can infer it if the item is Creative Commons licensed, but otherwise I might be able to use <a href="https://unpaywall.org/products/api">Unpaywall’s API</a></li>
|
|
<li>I found some false positives in Unpaywall, so I might only use their data when it says the DOI is not OA…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>During this process I updated my <code>crossref-doi-lookup.py</code> script to get more information from Crossref like ISSNs, ISBNs, full journal title, and subjects</li>
|
|
<li>An unscientific comparison of duplicate checking Peter’s file with ~500 titles on PostgreSQL 12 and PostgreSQL 14:
|
|
<ul>
|
|
<li>PostgreSQL 12: <code>0.11s user 0.04s system 0% cpu 19:24.65 total</code></li>
|
|
<li>PostgreSQL 14: <code>0.12s user 0.04s system 0% cpu 18:13.47 total</code></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
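<ul>
<li>To put the unscientific PostgreSQL 12 versus 14 comparison above into numbers, here is a small sketch that converts the two <code>total</code> wall-clock values into seconds and computes the difference. The <code>total_seconds</code> helper is hypothetical and assumes the <code>MM:SS.ss</code> format shown in the timing output:</li>
</ul>

```python
def total_seconds(elapsed):
    """Convert a "MM:SS.ss" or "H:MM:SS.ss" total into seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Wall-clock totals copied from the two duplicate checker runs
pg12 = total_seconds("19:24.65")
pg14 = total_seconds("18:13.47")

saved = pg12 - pg14
percent_faster = saved / pg12 * 100

print(f"PostgreSQL 14 saved {saved:.2f}s ({percent_faster:.1f}% less wall-clock time)")
```

<ul>
<li>With a single run on each version, a difference of this size could easily be noise</li>
</ul>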
|
|
<h2 id="2023-03-08">2023-03-08</h2>
|
|
<ul>
|
|
<li>I am wondering how to speed up PostgreSQL trgm searches more
|
|
<ul>
|
|
<li>I see my local PostgreSQL is using vanilla configuration and I should update some configs:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';
|
|
</span></span><span style="display:flex;"><span> setting │ unit
|
|
</span></span><span style="display:flex;"><span>─────────┼──────
|
|
</span></span><span style="display:flex;"><span> 16384 │ 8kB
|
|
</span></span><span style="display:flex;"><span>(1 row)
|
|
</span></span></code></pre></div><ul>
|
|
<li>I re-created my PostgreSQL 14 container with some extra memory settings:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman run --name dspacedb14 -v dspacedb14_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5432:5432 -d postgres:14-alpine -c shared_buffers<span style="color:#f92672">=</span>1024MB -c random_page_cost<span style="color:#f92672">=</span>1.1
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then I created a GiST <a href="https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search">index on the <code>metadatavalue</code> table to try to speed up the trgm similarity operations</a>:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=64)); # \di+ shows index size is 795MB
|
|
</span></span></code></pre></div><ul>
|
|
<li>That took a few minutes to build… then the duplicate checker ran in 12 minutes: <code>0.07s user 0.02s system 0% cpu 12:43.08 total</code></li>
|
|
<li>On a hunch, I tried with a GIN index:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin(text_value gin_trgm_ops); # \di+ shows index size is 274MB
|
|
</span></span></code></pre></div><ul>
|
|
<li>This ran in 19 minutes: <code>0.08s user 0.01s system 0% cpu 19:49.73 total</code>
|
|
<ul>
|
|
<li>So clearly the GiST index is better for this task</li>
|
|
<li>I am curious what happens if I increase the signature length in the GiST index from 64 to 256 (which I expect will increase the index size):</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ CREATE INDEX metadatavalue_text_value_trgm_gist_idx ON metadatavalue USING gist(text_value gist_trgm_ops(siglen=256)); # \di+ shows index size is 716MB, which is less than the previous GiST index...
|
|
</span></span></code></pre></div><ul>
|
|
<li>This one finished in ten minutes: <code>0.07s user 0.02s system 0% cpu 10:04.04 total</code></li>
|
|
<li>I might also want to <a href="https://stackoverflow.com/questions/43008382/postgresql-gin-index-slower-than-gist-for-pg-trgm">increase my <code>work_mem</code></a> (default 4MB):</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT setting, unit FROM pg_settings WHERE name = 'work_mem';
|
|
</span></span><span style="display:flex;"><span> setting │ unit
|
|
</span></span><span style="display:flex;"><span>─────────┼──────
|
|
</span></span><span style="display:flex;"><span> 4096 │ kB
|
|
</span></span><span style="display:flex;"><span>(1 row)
|
|
</span></span></code></pre></div><ul>
|
|
<li>After updating my Crossref lookup script and checking the remaining ~359 items I found eight more duplicates already existing on CGSpace</li>
|
|
<li>Wow, I found a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-1-fetching-and-parsing-html">really cool way to fetch URLs in OpenRefine</a>
|
|
<ul>
|
|
<li>I used this to fetch the open access status for each DOI from Unpaywall</li>
|
|
</ul>
|
|
</li>
|
|
<li>First, create a new column called “url” based on the DOI that builds the request URL. I used a Jython expression:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>unpaywall_baseurl <span style="color:#f92672">=</span> <span style="color:#e6db74">'https://api.unpaywall.org/v2/'</span>
|
|
</span></span><span style="display:flex;"><span>email <span style="color:#f92672">=</span> <span style="color:#e6db74">"a.orth+unpaywall@cgiar.org"</span>
|
|
</span></span><span style="display:flex;"><span>doi <span style="color:#f92672">=</span> value<span style="color:#f92672">.</span>replace(<span style="color:#e6db74">"https://doi.org/"</span>, <span style="color:#e6db74">""</span>)
|
|
</span></span><span style="display:flex;"><span>request_url <span style="color:#f92672">=</span> unpaywall_baseurl <span style="color:#f92672">+</span> doi <span style="color:#f92672">+</span> <span style="color:#e6db74">'?email='</span> <span style="color:#f92672">+</span> email
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> request_url
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then create a new column based on fetching the values in that column. I called it “unpaywall_status”</li>
|
|
<li>Then you get a JSON blob in each and you can extract the Open Access status with a GREL like <code>value.parseJson()['is_oa']</code>
|
|
<ul>
|
|
<li>I checked a handful of results manually and found that the limited access status was more trustworthy from Unpaywall than the open access, so I will just tag the limited access ones</li>
|
|
</ul>
|
|
</li>
|
|
<li>I merged the funders and affiliations from Altmetric into my file, then used the same technique to get Crossref data for open access items directly into OpenRefine and parsed the abstracts
|
|
<ul>
|
|
<li>The syntax was hairy because it’s marked up with tags like <code><jats:p></code>, but this got me most of the way there:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.replace("jats:p", "jats-p").parseHtml().select("jats-p")[0].innerHtml()
|
|
</span></span><span style="display:flex;"><span>value.replace("<jats:italic>","").replace("</jats:italic>", "")
|
|
</span></span><span style="display:flex;"><span>value.replace("<jats:sub>","").replace("</jats:sub>", "").replace("<jats:sup>","").replace("</jats:sup>", "")
|
|
</span></span></code></pre></div><ul>
|
|
<li>I uploaded the 350 items to DSpace Test so Peter and Abenet can explore them</li>
|
|
<li>I exported a list of authors, affiliations, and funders from the new items to let Peter correct them:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c dc.contributor.author /tmp/new-items.csv | sed -e 1d -e <span style="color:#e6db74">'s/"//g'</span> -e <span style="color:#e6db74">'s/||/\n/g'</span> | sort | uniq -c | sort -nr | awk <span style="color:#e6db74">'{$1=""; print $0}'</span> | sed -e <span style="color:#e6db74">'s/^ //'</span> > /tmp/new-authors.csv
|
|
</span></span></code></pre></div><ul>
|
|
<li>Meeting with FAO AGRIS team about how to detect duplicates
|
|
<ul>
|
|
<li>They are currently using a sha256 hash on titles, which will work, but will only return exact matches</li>
|
|
<li>I told them to try to normalize the string, drop stop words, etc to increase the possibility that the hash matches</li>
|
|
</ul>
|
|
</li>
|
|
<li>Meeting with Abenet to discuss CGSpace issues
|
|
<ul>
|
|
<li>She reminded me about needing a metadata field for first author when the affiliation is ILRI</li>
|
|
<li>I said I prefer to write a small script for her that will check the first author and first affiliation… I could do it easily in Python, but would need to put a web frontend on it for her</li>
|
|
<li>Unless we could do that in AReS reports somehow</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
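<ul>
<li>A rough sketch of the title normalization I suggested to the AGRIS team above, so that a sha256 hash matches titles that differ only in case, punctuation, accents, or stop words. The stop word list, the helper name, and the sample titles are all made up for illustration, and this is not what AGRIS actually runs:</li>
</ul>

```python
import hashlib
import re
import unicodedata

# A tiny illustrative stop word list; a real one would be longer
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def normalized_title_hash(title):
    """Return a sha256 hex digest of a normalized form of the title."""
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then replace anything that is not a letter or digit with a space
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    words = [word for word in text.split() if word not in STOP_WORDS]
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


# Hypothetical titles for illustration only
title_a = "The Impact of Climate Change on Livestock Production in Kenya"
title_b = "Impact of climate change on livestock production in Kenya."
title_c = "Impact of climate change on crop production in Kenya"

same = normalized_title_hash(title_a) == normalized_title_hash(title_b)
different = normalized_title_hash(title_a) != normalized_title_hash(title_c)
print(same, different)
```

<ul>
<li>This still only finds exact matches after normalization, so typos or reworded titles would need something fuzzier, like the trigram similarity I use in PostgreSQL</li>
</ul>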
|
|
<h2 id="2023-03-09">2023-03-09</h2>
|
|
<ul>
|
|
<li>Apply a bunch of corrections to authors, affiliations, and donors on the new items on DSpace Test</li>
|
|
<li>Meeting with Peter and Abenet about future OpenRXV developments, DSpace 7, etc
|
|
<ul>
|
|
<li>I submitted an <a href="https://github.com/CodeObia/MEL/issues/11173">issue on MEL asking them to add provenance metadata when submitting to CGSpace</a></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-03-10">2023-03-10</h2>
|
|
<ul>
|
|
<li>CKM is getting ready to launch their new website and they display CGSpace thumbnails at 255x362px
|
|
<ul>
|
|
<li>Our thumbnails are 300px so they get up-scaled and look bad</li>
|
|
<li>I realized that the last time we <a href="https://github.com/ilri/DSpace/commit/5de61e220124c1d0441c87cd7d36d18cb2293c03">increased the size of our thumbnails was in 2013</a>, from 94x130 to 300px</li>
|
|
<li>I offered to CKM that we increase them again to 400 or 600px</li>
|
|
<li>I did some tests to check the thumbnail file sizes for 300px, 400px, 500px, and 600px on <a href="https://hdl.handle.net/10568/126388">this item</a>:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ls -lh 10568-126388-*
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 31K Mar 10 12:42 10568-126388-300px.jpg
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 52K Mar 10 12:41 10568-126388-400px.jpg
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 76K Mar 10 12:43 10568-126388-500px.jpg
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 106K Mar 10 12:44 10568-126388-600px.jpg
|
|
</span></span></code></pre></div><ul>
|
|
<li>The 600px thumbnail is about 3.4 times the file size of the 300px one (106K versus 31K), so maybe we should shoot for 400px or 500px
|
|
<ul>
|
|
<li>I decided on 500px</li>
|
|
<li>I started re-generating new thumbnails for the ILRI Publications, CGIAR Initiatives, and other collections</li>
|
|
</ul>
|
|
</li>
|
|
<li>On that note, I also re-worked the XMLUI item display to show larger thumbnails (from a max-width of 128px to 200px)</li>
|
|
<li>And now that I’m looking at thumbnails I am curious what it would take to get DSpace to generate WebP or AVIF thumbnails</li>
|
|
<li>Peter sent me citations and ILRI subjects for the 350 new ILRI publications
|
|
<ul>
|
|
<li>I guess he edited it in Excel because there are a bunch of encoding issues with accents</li>
|
|
<li>I merged Peter’s citations and subjects with the other metadata, ran one last duplicate check (and found one item!), then ran the items through csv-metadata-quality and uploaded them to CGSpace</li>
|
|
<li>In the end it was only 348 items for some reason…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
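<ul>
<li>For reference, a quick sketch of the thumbnail file size comparison from the <code>ls -lh</code> output above, expressed as ratios against the current 300px thumbnail. The sizes are the rounded values that <code>ls</code> printed, so the ratios are approximate:</li>
</ul>

```python
# File sizes as reported by ls -lh above, keyed by thumbnail width in pixels
sizes_k = {300: 31, 400: 52, 500: 76, 600: 106}

baseline = sizes_k[300]
ratios = {width: round(size / baseline, 2) for width, size in sizes_k.items()}

for width, ratio in ratios.items():
    print(f"{width}px: {sizes_k[width]}K, {ratio}x the 300px file size")
```

<ul>
<li>By this measure 500px costs roughly two and a half times the bytes of 300px, compared to almost three and a half times for 600px</li>
</ul>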
|
|
<h2 id="2023-03-12">2023-03-12</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2023-03-13">2023-03-13</h2>
|
|
<ul>
|
|
<li>Extract a list of DOIs from the Creative Commons licensed ILRI journal articles that I uploaded last week, skipping any that are “no derivatives” (ND):</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">'dc.description.provenance[en]'</span> -m <span style="color:#e6db74">'Made available in DSpace on 2023-03-10'</span> /tmp/ilri-articles.csv <span style="color:#ae81ff">\
|
|
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c 'dcterms.license[en_US]' -r 'CC(0|\-BY)' \
|
|
</span></span><span style="display:flex;"><span> | csvgrep -c 'dcterms.license[en_US]' -i -r '\-ND\-' \
|
|
</span></span><span style="display:flex;"><span> | csvcut -c 'id,cg.identifier.doi[en_US],dcterms.type[en_US]' > 2023-03-13-journal-articles.csv
|
|
</span></span></code></pre></div><ul>
|
|
<li>I want to write a script to download the PDFs and create thumbnails for them, then upload to CGSpace
|
|
<ul>
|
|
<li>I wrote one based on <code>post_ciat_pdfs.py</code> but it seems there is an issue uploading anything other than a PDF</li>
|
|
<li>When I upload a JPG or a PNG the file begins with:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Content-Disposition: form-data; name="file"; filename="10.1017-s0031182013001625.pdf.jpg"
|
|
</span></span></code></pre></div><ul>
|
|
<li>… this means it is invalid…
|
|
<ul>
|
|
<li>I tried in both the <code>ORIGINAL</code> and <code>THUMBNAIL</code> bundle, and with different filenames</li>
|
|
<li>I tried manually on the command line with <code>http</code> and both PDF and PNG work… hmmmm</li>
|
|
<li>Hmm, this seems to have been due to some difference in behavior between the <code>files</code> and <code>data</code> parameters of <code>requests.post()</code></li>
|
|
<li>I finalized the <code>post_bitstreams.py</code> script and uploaded eighty-five PDF thumbnails</li>
|
|
</ul>
|
|
</li>
|
|
<li>It seems Bizu uploaded covers for a handful so I deleted them and ran them through the script to get proper thumbnails</li>
|
|
</ul>
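<ul>
<li>The license filtering in the <code>csvgrep</code> pipeline at the start of this day’s notes could also be sketched in Python with the same two patterns: keep CC0 or CC-BY licenses, then drop anything “no derivatives”. This is only an illustration with made-up sample rows and a hypothetical <code>filter_cc_without_nd</code> helper, and it leaves out the provenance date filter:</li>
</ul>

```python
import csv
import io
import re

# Same patterns as the csvgrep commands: keep CC0 or CC-BY*, drop anything with -ND-
CC_PATTERN = re.compile(r"CC(0|-BY)")
ND_PATTERN = re.compile(r"-ND-")


def filter_cc_without_nd(rows, license_column="dcterms.license[en_US]"):
    """Yield rows whose license is Creative Commons but not "no derivatives"."""
    for row in rows:
        license_value = row.get(license_column) or ""
        if CC_PATTERN.search(license_value) and not ND_PATTERN.search(license_value):
            yield row


# Hypothetical sample data standing in for /tmp/ilri-articles.csv
sample_csv = """id,cg.identifier.doi[en_US],dcterms.license[en_US]
1,https://doi.org/10.1000/example-a,CC-BY-4.0
2,https://doi.org/10.1000/example-b,CC-BY-NC-ND-4.0
3,https://doi.org/10.1000/example-c,CC0-1.0
4,https://doi.org/10.1000/example-d,Copyrighted; all rights reserved
"""

kept = list(filter_cc_without_nd(csv.DictReader(io.StringIO(sample_csv))))
kept_ids = [row["id"] for row in kept]
print(kept_ids)
```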
|
|
<h2 id="2023-03-14">2023-03-14</h2>
|
|
<ul>
|
|
<li>Add twelve IFPRI authors to our controlled vocabulary for authors and ORCID identifiers
|
|
<ul>
|
|
<li>I also tagged their existing items on CGSpace</li>
|
|
</ul>
|
|
</li>
|
|
<li>Export all our ORCIDs and resolve their names to see if any have changed:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort -u > /tmp/2023-03-14-orcids.txt
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/resolve_orcids.py -i /tmp/2023-03-14-orcids.txt -o /tmp/2023-03-14-orcids-names.txt -d
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then update them in the database:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -m <span style="color:#ae81ff">247</span>
|
|
</span></span></code></pre></div><h2 id="2023-03-15">2023-03-15</h2>
|
|
<ul>
|
|
<li>Jawoo was asking about possibilities to harvest PDFs from CGSpace for some kind of AI chatbot integration
|
|
<ul>
|
|
<li>I see we have 45,000 PDFs (format ID 2)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
|
|
</span></span><span style="display:flex;"><span> count
|
|
</span></span><span style="display:flex;"><span>───────
|
|
</span></span><span style="display:flex;"><span> 45281
|
|
</span></span><span style="display:flex;"><span>(1 row)
|
|
</span></span></code></pre></div><ul>
|
|
<li>Rework some of my Python scripts to use a common <code>db_connect</code> function from util</li>
|
|
<li>I reworked my <code>post_bitstreams.py</code> script to be able to overwrite bitstreams if requested
|
|
<ul>
|
|
<li>The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers</li>
|
|
<li>I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-03-16">2023-03-16</h2>
|
|
<ul>
|
|
<li>Continue working on the ILRI publication thumbnails
|
|
<ul>
|
|
<li>There were about sixty-four that had existing PNG “journal cover” thumbnails that didn’t get replaced because I only overwrote the JPEG ones yesterday</li>
|
|
<li>Now I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API</li>
|
|
</ul>
|
|
</li>
|
|
<li>I made a <a href="https://github.com/DSpace/DSpace/pull/8722">pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF</a></li>
|
|
<li>Export CGSpace to perform mappings to Initiatives collections</li>
|
|
<li>I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely submitted a low-quality “journal cover” for the item
|
|
<ul>
|
|
<li>I found about 330 of them and got most of their PDFs from Sci-Hub and replaced the crappy thumbnails with real ones where Sci-Hub had them (~245)</li>
|
|
</ul>
|
|
</li>
|
|
<li>In related news, I realized you can get an <a href="https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests">API key from Elsevier and download the PDFs from their API</a>:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> requests
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span>api_key <span style="color:#f92672">=</span> <span style="color:#e6db74">'fuuuuuuuuu'</span>
|
|
</span></span><span style="display:flex;"><span>doi <span style="color:#f92672">=</span> <span style="color:#e6db74">"10.1016/j.foodqual.2021.104362"</span>
|
|
</span></span><span style="display:flex;"><span>request_url <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">'https://api.elsevier.com/content/article/doi:</span><span style="color:#e6db74">{</span>doi<span style="color:#e6db74">}</span><span style="color:#e6db74">'</span>
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span>headers <span style="color:#f92672">=</span> {
|
|
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">'X-ELS-APIKEY'</span>: api_key,
|
|
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">'Accept'</span>: <span style="color:#e6db74">'application/pdf'</span>
|
|
</span></span><span style="display:flex;"><span>}
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> requests<span style="color:#f92672">.</span>get(request_url, stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, headers<span style="color:#f92672">=</span>headers) <span style="color:#66d9ef">as</span> r:
|
|
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> r<span style="color:#f92672">.</span>status_code <span style="color:#f92672">==</span> <span style="color:#ae81ff">200</span>:
|
|
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">"article.pdf"</span>, <span style="color:#e6db74">"wb"</span>) <span style="color:#66d9ef">as</span> f:
|
|
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> r<span style="color:#f92672">.</span>iter_content(chunk_size<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span><span style="color:#f92672">*</span><span style="color:#ae81ff">1024</span>):
|
|
</span></span><span style="display:flex;"><span> f<span style="color:#f92672">.</span>write(chunk)
|
|
</span></span></code></pre></div><ul>
|
|
<li>The question is, how do we know if a DOI is Elsevier or not…</li>
|
|
<li>CGIAR Repositories Working Group meeting
|
|
<ul>
|
|
<li>We discussed controlled vocabularies for funders</li>
|
|
<li>I suggested checking our combined lists against Crossref and ROR</li>
|
|
</ul>
|
|
</li>
|
|
<li>Export a list of donors from <code>cg.contributor.donor</code> on CGSpace:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
|
|
</span></span><span style="display:flex;"><span>COPY 1521
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then resolve them against Crossref’s funders API:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
|
|
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
|
|
</span></span><span style="display:flex;"><span>472
|
|
</span></span><span style="display:flex;"><span>$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
|
|
</span></span><span style="display:flex;"><span>1521
|
|
</span></span></code></pre></div><ul>
|
|
<li>That’s a 31% hit rate, but I see some simple things like “Bill and Melinda Gates Foundation” instead of “Bill & Melinda Gates Foundation”</li>
|
|
</ul>
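<ul>
<li>A minimal sketch of how near-misses like the “and” versus “&” case above could be caught by normalizing names before comparing them. The <code>normalize_funder_name</code> helper is hypothetical and is not part of <code>crossref_funders_lookup.py</code>; whether looser matching is acceptable for a controlled vocabulary is a separate question.</li>
</ul>

```python
import re


def normalize_funder_name(name):
    """Return a simplified form of a funder name for loose comparison."""
    # Treat "&" the same as the word "and"
    name = name.replace("&", " and ")
    # Drop punctuation that often differs between sources
    name = re.sub(r"[.,()]", " ", name)
    # Collapse repeated whitespace and ignore case
    return " ".join(name.split()).casefold()


donor = "Bill and Melinda Gates Foundation"
crossref_name = "Bill & Melinda Gates Foundation"

# Both spellings reduce to the same normalized string
print(normalize_funder_name(donor) == normalize_funder_name(crossref_name))
```

<ul>
<li>The normalized form would only be used as a comparison key; the value written back to the metadata would still be the name as the registry spells it.</li>
</ul>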
|
|
<h2 id="2023-03-17">2023-03-17</h2>
|
|
<ul>
|
|
<li>I did the same lookup of CGSpace donors on ROR’s 2022-12-01 data dump:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/ror_lookup.py -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv -r v1.15-2022-12-01-ror-data.json
|
|
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
|
|
</span></span><span style="display:flex;"><span>407
|
|
</span></span><span style="display:flex;"><span>$ sed 1d ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
|
|
</span></span><span style="display:flex;"><span>1521
|
|
</span></span></code></pre></div><ul>
|
|
<li>That’s a 26.7% hit rate</li>
|
|
<li>As for the number of funders in each dataset
|
|
<ul>
|
|
<li>Crossref has about 34,000</li>
|
|
<li>ROR has 15,000 if “FundRef” data is a proxy for that:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c -rsI FundRef v1.15-2022-12-01-ror-data.json
|
|
</span></span><span style="display:flex;"><span>15162
|
|
</span></span></code></pre></div><ul>
|
|
<li>On a related note, I remembered that Crossref has a list of DOI prefixes and publishers: <a href="https://doi.crossref.org/getPrefixPublisher">https://doi.crossref.org/getPrefixPublisher</a>
|
|
<ul>
|
|
<li>In Python I can look up publishers by prefix easily, here with a list comprehension and a filter:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>In [10]: [publisher for publisher in publishers if '10.3390' in publisher['prefixes']]
|
|
</span></span><span style="display:flex;"><span>Out[10]:
|
|
</span></span><span style="display:flex;"><span>[{'prefixes': ['10.1989', '10.32545', '10.20944', '10.3390', '10.35995'],
|
|
</span></span><span style="display:flex;"><span> 'name': 'MDPI AG',
|
|
</span></span><span style="display:flex;"><span> 'memberId': 1968}]
|
|
</span></span></code></pre></div><ul>
|
|
<li>And in OpenRefine, if I create a new column based on the DOI using Jython:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">"/home/aorth/src/git/DSpace/publisher-doi-prefixes.json"</span>, <span style="color:#e6db74">"rb"</span>) <span style="color:#66d9ef">as</span> f:
|
|
</span></span><span style="display:flex;"><span> publishers <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>load(f)
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span>doi_prefix <span style="color:#f92672">=</span> value<span style="color:#f92672">.</span>split(<span style="color:#e6db74">"/"</span>)[<span style="color:#ae81ff">3</span>]
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span>publisher <span style="color:#f92672">=</span> [publisher <span style="color:#66d9ef">for</span> publisher <span style="color:#f92672">in</span> publishers <span style="color:#66d9ef">if</span> doi_prefix <span style="color:#f92672">in</span> publisher[<span style="color:#e6db74">'prefixes'</span>]]
|
|
</span></span><span style="display:flex;"><span>
|
|
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> publisher[<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">'name'</span>]
|
|
</span></span></code></pre></div><ul>
|
|
<li>… though this is very slow and hung OpenRefine when I tried it</li>
|
|
<li>I added the ability to overwrite multiple bitstream formats at once in <code>post_bitstreams.py</code></li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p <span style="color:#e6db74">'fffnjnjn'</span> -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n
|
|
</span></span><span style="display:flex;"><span>Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2
|
|
</span></span><span style="display:flex;"><span>Opened test.csv
|
|
</span></span><span style="display:flex;"><span>384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle
|
|
</span></span><span style="display:flex;"><span>> <span style="color:#f92672">(</span>DRY RUN<span style="color:#f92672">)</span> Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg <span style="color:#f92672">(</span>16883cb0-1fc8-4786-a04f-32132e0617d4<span style="color:#f92672">)</span>
|
|
</span></span><span style="display:flex;"><span>> <span style="color:#f92672">(</span>DRY RUN<span style="color:#f92672">)</span> Deleting bitstream: AgroEcol_Newsletter_2.png <span style="color:#f92672">(</span>7e9cd434-45a6-4d55-8d56-4efa89d73813<span style="color:#f92672">)</span>
|
|
</span></span><span style="display:flex;"><span>> <span style="color:#f92672">(</span>DRY RUN<span style="color:#f92672">)</span> Uploading file: 10568-129666.pdf.jpg
|
|
</span></span></code></pre></div><ul>
|
|
<li>I learned how to use Python’s built-in <code>logging</code> module and it simplifies all my debug and info printing
|
|
<ul>
|
|
<li>I re-factored a few scripts to use the new logging</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
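<ul>
<li>The Jython expression above reloads the JSON file and scans the whole publisher list for every value, which may be part of why it was slow (this is a guess, not something measured). A sketch of an alternative is to build a prefix-to-name dictionary once and reuse it. The helper names and the example DOI are made up; the sample entry is the MDPI record from the output above.</li>
</ul>

```python
# Sample entry copied from the lookup output above
publishers = [
    {
        "prefixes": ["10.1989", "10.32545", "10.20944", "10.3390", "10.35995"],
        "name": "MDPI AG",
        "memberId": 1968,
    },
]


def build_prefix_index(publishers):
    """Map each DOI prefix to its publisher name for dictionary lookups."""
    index = {}
    for publisher in publishers:
        for prefix in publisher["prefixes"]:
            index[prefix] = publisher["name"]
    return index


def publisher_for_doi(doi_url, index):
    """Return the publisher name for a DOI URL, or None if it is unknown."""
    # "https://doi.org/10.3390/example" splits into
    # ["https:", "", "doi.org", "10.3390", "example"]
    try:
        prefix = doi_url.split("/")[3]
    except IndexError:
        return None
    return index.get(prefix)


index = build_prefix_index(publishers)
print(publisher_for_doi("https://doi.org/10.3390/example", index))
```

<ul>
<li>Unlike the list comprehension, this returns <code>None</code> for an unknown prefix instead of raising an error on an empty list.</li>
</ul>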
|
|
<h2 id="2023-03-18">2023-03-18</h2>
|
|
<ul>
|
|
<li>I applied changes for publishers on 16,000 items in batches of 5,000</li>
|
|
<li>While working on my <code>post_bitstreams.py</code> script I realized the Tomcat Crawler Session Manager valve that groups bot user agents into sessions is causing my login to fail the first time, every time
|
|
<ul>
|
|
<li>I’ve disabled it for now and will check the Munin session graphs after some time to see if it makes a difference</li>
|
|
<li>In any case I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-03-19">2023-03-19</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2023-03-20">2023-03-20</h2>
|
|
<ul>
|
|
<li>Minor updates to a few of my DSpace Python scripts to fix the logging</li>
|
|
<li>Minor updates to some records for Mazingira reported by Sonja</li>
|
|
<li>Upgrade PostgreSQL on DSpace Test from version 12 to 14, the same way I did from 10 to 12 last year:
|
|
<ul>
|
|
<li>First, I installed the new version of PostgreSQL via the Ansible playbook scripts</li>
|
|
<li>Then I stopped Tomcat and all PostgreSQL clusters and used <code>pg_upgradecluster</code> to upgrade the old cluster:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># systemctl stop tomcat7
|
|
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">12</span> main stop
|
|
</span></span><span style="display:flex;"><span># tar -cvzpf var-lib-postgresql-12.tar.gz /var/lib/postgresql/12
|
|
</span></span><span style="display:flex;"><span># tar -cvzpf etc-postgresql-12.tar.gz /etc/postgresql/12
|
|
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">14</span> main stop
|
|
</span></span><span style="display:flex;"><span># pg_dropcluster <span style="color:#ae81ff">14</span> main
|
|
</span></span><span style="display:flex;"><span># pg_upgradecluster <span style="color:#ae81ff">12</span> main
|
|
</span></span><span style="display:flex;"><span># pg_ctlcluster <span style="color:#ae81ff">14</span> main start
|
|
</span></span></code></pre></div><ul>
|
|
<li>After that I <a href="https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/">re-indexed the database indexes using a query</a>:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ su - postgres
|
|
</span></span><span style="display:flex;"><span>$ cat /tmp/generate-reindex.sql
|
|
</span></span><span style="display:flex;"><span>SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
|
|
</span></span><span style="display:flex;"><span>FROM pg_class C
|
|
</span></span><span style="display:flex;"><span>LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
|
|
</span></span><span style="display:flex;"><span>WHERE nspname = 'public'
|
|
</span></span><span style="display:flex;"><span> AND C.relkind = 'r'
|
|
</span></span><span style="display:flex;"><span> AND nspname !~ '^pg_toast'
|
|
</span></span><span style="display:flex;"><span>ORDER BY pg_total_relation_size(C.oid) ASC;
|
|
</span></span><span style="display:flex;"><span>$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
|
|
</span></span><span style="display:flex;"><span>$ <trim the extra stuff from /tmp/reindex.sql>
|
|
</span></span><span style="display:flex;"><span>$ psql dspace < /tmp/reindex.sql
|
|
</span></span></code></pre></div><ul>
|
|
<li>The index on <code>metadatavalue</code> shrank by 90MB, and the others by a bit less
|
|
<ul>
|
|
<li>This is nice, but not as drastic as I noticed last year when upgrading to PostgreSQL 12</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
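<ul>
<li>For illustration, the logic of the generator query above (one <code>REINDEX TABLE CONCURRENTLY</code> statement per table, smallest first) can be sketched in Python. The function names and the sizes are made up, and this version always double-quotes identifiers, whereas PostgreSQL’s <code>quote_ident</code> only quotes when needed.</li>
</ul>

```python
def quote_identifier(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_reindex_statements(table_sizes):
    """Return REINDEX statements ordered from the smallest table to the largest.

    table_sizes maps table names to their total size in bytes.
    """
    ordered = sorted(table_sizes.items(), key=lambda item: item[1])
    return [
        "REINDEX TABLE CONCURRENTLY " + quote_identifier(name) + ";"
        for name, _size in ordered
    ]


# Made-up sizes, for illustration only
example_sizes = {"metadatavalue": 3000000, "item": 200000}

for statement in build_reindex_statements(example_sizes):
    print(statement)
```

<ul>
<li>Generating bare statements like this would avoid the manual trimming step, since there is no header or row count to remove.</li>
</ul>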
|
|
<h2 id="2023-03-21">2023-03-21</h2>
|
|
<ul>
|
|
<li>Leigh sent me a list of IFPRI authors with ORCID identifiers so I combined them with our list and resolved all their names with <code>resolve_orcids.py</code>
|
|
<ul>
|
|
<li>It adds 154 new ORCID identifiers</li>
|
|
</ul>
|
|
</li>
|
|
<li>I did a follow-up to the publisher names from last week using the prefix list from Crossref
|
|
<ul>
|
|
<li>Last week I only updated items with a DOI that had <em>no</em> publisher, but now I was curious to see how our existing publisher information compared</li>
|
|
<li>I checked a dozen or so manually and, other than CIFOR/ICRAF and CIAT/Alliance, the publisher names from the prefix list were better than our existing data, so I overwrote ours</li>
|
|
</ul>
|
|
</li>
|
|
<li>I spent some time trying to figure out how to get ssimulacra2 running so I could compare thumbnails in JPEG and WebP
|
|
<ul>
|
|
<li>I realized that we can’t directly compare JPEG to WebP; we need to convert to JPEG/WebP, then convert each to lossless PNG</li>
|
|
<li>Also, we shouldn’t be comparing the resulting images against each other, but rather against the original, so I also need a straight PDF to lossless PNG version</li>
|
|
<li>After playing with WebP at Q82 and Q92, I see it has lower ssimulacra2 scores than JPEG Q92 for the dozen test files</li>
|
|
<li>Could it just be something with ImageMagick?</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-03-22">2023-03-22</h2>
|
|
<ul>
|
|
<li>I updated csv-metadata-quality to use pandas 2.0.0rc1 and everything seems to work…?
|
|
<ul>
|
|
<li>So the issues with nulls (isna) when I tried the first release candidate a few weeks ago were resolved?</li>
|
|
</ul>
|
|
</li>
|
|
<li>Meeting with Jawoo and others about a “ChatGPT-like” thing for CGIAR data using CGSpace documents and metadata</li>
|
|
</ul>
|
|
<h2 id="2023-03-23">2023-03-23</h2>
|
|
<ul>
|
|
<li>Add a missing IFPRI ORCID identifier to CGSpace and tag the author’s items</li>
|
|
<li>A super unscientific comparison between csv-metadata-quality’s pytest regimen using Pandas 1.5.3 and Pandas 2.0.0rc1
|
|
<ul>
|
|
<li>The data was gathered using <a href="https://justine.lol/rusage">rusage</a>, and these are the results of the last of three consecutive runs:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code># Pandas 1.5.3
|
|
RL: took 1,585,999µs wall time
|
|
RL: ballooned to 272,380kb in size
|
|
RL: needed 2,093,947µs cpu (25% kernel)
|
|
RL: caused 55,856 page faults (100% memcpy)
|
|
RL: 699 context switches (1% consensual)
|
|
RL: performed 0 reads and 16 write i/o operations
|
|
|
|
# Pandas 2.0.0rc1
|
|
RL: took 1,625,718µs wall time
|
|
RL: ballooned to 262,116kb in size
|
|
RL: needed 2,148,425µs cpu (24% kernel)
|
|
RL: caused 63,934 page faults (100% memcpy)
|
|
RL: 461 context switches (2% consensual)
|
|
RL: performed 0 reads and 16 write i/o operations
|
|
</code></pre><ul>
|
|
<li>So it seems that Pandas 2.0.0rc1 used about ten megabytes less RAM… interesting to see that the PyArrow-backed dtypes seem to make a measurable difference even on my small test set
|
|
<ul>
|
|
<li>I should try to compare runs of larger input files</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
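<ul>
<li>To put the two rusage runs above side by side, here is a small sketch that computes the relative change for each metric. The figures are copied from the output above; the dictionary keys and the <code>percent_change</code> helper are names chosen for this example.</li>
</ul>

```python
# Figures copied from the rusage output above
runs = {
    "1.5.3": {
        "wall_us": 1585999,
        "peak_kb": 272380,
        "cpu_us": 2093947,
        "page_faults": 55856,
    },
    "2.0.0rc1": {
        "wall_us": 1625718,
        "peak_kb": 262116,
        "cpu_us": 2148425,
        "page_faults": 63934,
    },
}


def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


old, new = runs["1.5.3"], runs["2.0.0rc1"]
for metric in old:
    change = percent_change(old[metric], new[metric])
    print(f"{metric}: {old[metric]} to {new[metric]} ({change:+.1f}%)")
```

<ul>
<li>On these single runs the peak size is about 3.8% lower with 2.0.0rc1, while wall time and CPU time are each about 2.5% higher, so the memory difference is the only one in favor of the newer version.</li>
</ul>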
|
|
<h2 id="2023-03-24">2023-03-24</h2>
|
|
<ul>
|
|
<li>I added a Flyway SQL migration for the PNG bitstream format registry changes on DSpace 7.6</li>
|
|
</ul>
|
|
<h2 id="2023-03-26">2023-03-26</h2>
|
|
<ul>
|
|
<li>There seems to be a slightly high load on CGSpace
|
|
<ul>
|
|
<li>I don’t see any locks in PostgreSQL, but there’s some new bot I have never heard of:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET /handle/10568/16500/discover?filtertype_0=impactarea&filter_relational_operator_0=equals&filter_0=Climate+adaptation+and+mitigation&filtertype=sdg&filter_relational_operator=equals&filter=SDG+11+-+Sustainable+cities+and+communities HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"
|
|
</span></span></code></pre></div><ul>
|
|
<li>In the last week I see a handful of IPs making requests with this agent:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.<span style="color:#f92672">{</span>2,3,4,5,6,7<span style="color:#f92672">}</span>.gz | grep gocolly | awk '{print $1}' | sort | uniq -c | sort -h
|
|
</span></span><span style="display:flex;"><span> 2 194.233.95.37
|
|
</span></span><span style="display:flex;"><span> 4304 92.119.18.142
|
|
</span></span><span style="display:flex;"><span> 9496 5.180.208.152
|
|
</span></span><span style="display:flex;"><span> 27477 92.119.18.13
|
|
</span></span></code></pre></div><ul>
|
|
<li>Most of these come from Packethub S.A. / ASN 62240 (CLOUVIDER Clouvider - Global ASN, GB)</li>
|
|
<li>Oh, I’ve apparently seen this user agent before, as it is in our ILRI spider user agent overrides</li>
|
|
<li>I exported CGSpace to check for missing Initiative collection mappings</li>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
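<ul>
<li>The per-IP count above can also be sketched in Python with <code>collections.Counter</code>, which might be handy if the counts need further processing. The <code>count_ips</code> helper is hypothetical and the sample lines are shortened, made-up entries in the same layout as the nginx log line above.</li>
</ul>

```python
from collections import Counter


def count_ips(lines, needle="gocolly"):
    """Count requests per client IP for log lines containing needle."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # The client IP is the first whitespace-separated field
            counts[line.split()[0]] += 1
    return counts


# Shortened, made-up log lines for illustration
sample = [
    '92.119.18.13 - - [26/Mar/2023:18:41:47 +0200] "GET /handle/10568/16500/discover HTTP/2.0" 200 7856 "-" "colly - https://github.com/gocolly/colly"',
    '92.119.18.13 - - [26/Mar/2023:18:41:48 +0200] "GET /handle/10568/16500 HTTP/2.0" 200 1234 "-" "colly - https://github.com/gocolly/colly"',
    '5.180.208.152 - - [26/Mar/2023:18:41:49 +0200] "GET / HTTP/2.0" 200 4321 "-" "colly - https://github.com/gocolly/colly"',
    '192.0.2.1 - - [26/Mar/2023:18:41:50 +0200] "GET / HTTP/2.0" 200 4321 "-" "Mozilla/5.0"',
]

# Print the least active addresses first, like "sort -h" in the pipeline
for ip, hits in sorted(count_ips(sample).items(), key=lambda pair: pair[1]):
    print(hits, ip)
```

<ul>
<li>For the rotated logs, the lines could be read with <code>gzip.open(path, "rt")</code> from the standard library and passed to the same function.</li>
</ul>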
|
|
<h2 id="2023-03-27">2023-03-27</h2>
|
|
<ul>
|
|
<li>The harvest on AReS was incredibly slow and I stopped it about halfway through, twelve hours later
|
|
<ul>
|
|
<li>Then I relied on the plugins to get missing items, which caused a high load on the server but actually worked fine</li>
|
|
</ul>
|
|
</li>
|
|
<li>Continue working on thumbnails on DSpace</li>
|
|
</ul>
|
|
<h2 id="2023-03-28">2023-03-28</h2>
|
|
<ul>
|
|
<li>Regarding ImageMagick there are a few things I’ve learned
|
|
<ul>
|
|
<li>The <code>-quality</code> setting does different things for different output formats, see: <a href="https://imagemagick.org/script/command-line-options.php#quality">https://imagemagick.org/script/command-line-options.php#quality</a></li>
|
|
<li>The <code>-compress</code> setting controls the compression algorithm for image data, and is unrelated to lossless/lossy
|
|
<ul>
|
|
<li>On that note, <code>-compress lossless</code> for JPEGs refers to Lossless JPEG, which is not well defined or supported and should be avoided</li>
|
|
<li>See: <a href="https://imagemagick.org/script/command-line-options.php#compress">https://imagemagick.org/script/command-line-options.php#compress</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>The way DSpace currently does its supersampling by exporting to a JPEG, then making a thumbnail of the JPEG, is a double lossy operation
|
|
<ul>
|
|
<li>We should be exporting to something lossless like PNG, PPM, or MIFF, then making a thumbnail from that</li>
|
|
</ul>
|
|
</li>
|
|
<li>The PNG format is always lossless so the <code>-quality</code> setting controls compression and filtering, but has no effect on the appearance or signature of PNG images</li>
|
|
<li>You can use <code>-quality n</code> with WebP’s <code>-define webp:lossless=true</code>, but I’m not sure about the interaction between ImageMagick quality and WebP lossless…
|
|
<ul>
|
|
<li>Also, if converting from a lossless format to WebP lossless in the same command, ImageMagick will ignore quality settings</li>
|
|
</ul>
|
|
</li>
|
|
<li>The MIFF format is useful for piping between ImageMagick commands, but it is also lossless and the quality setting is ignored</li>
|
|
<li>You can use a format specifier when piping between ImageMagick commands without writing a file</li>
|
|
<li>For example, I want to create a lossless PNG from a distorted JPEG for comparison:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ magick convert reference.jpg -quality <span style="color:#ae81ff">85</span> jpg:- | magick convert - distorted-lossless.png
|
|
</span></span></code></pre></div><ul>
|
|
<li>If I convert the JPEG to PNG directly it will ignore the quality setting, so I set the quality and the output format, then pipe it to ImageMagick again to convert to lossless PNG</li>
|
|
<li>In an attempt to quantify the generation loss from DSpace’s “JPG → JPG” method of creating thumbnails I wrote a script called <code>generation-loss.sh</code> to test against a new “PNG → JPG” method
|
|
<ul>
|
|
<li>With my sample set of seventeen PDFs from CGSpace I found that <em>the “JPG → JPG” method of thumbnailing results in scores an average of 1.6% lower than with the “PNG → JPG” method</em>.</li>
|
|
<li>The average file size with <em>the “PNG → JPG” method was only 200 bytes larger</em>.</li>
|
|
</ul>
|
|
</li>
|
|
<li>In my brief testing, the relationship between ImageMagick’s <code>-quality</code> setting and WebP’s <code>-define webp:lossless=true</code> setting is completely unpredictable:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ magick convert img/10568-103447.pdf.png /tmp/10568-103447.webp
|
|
</span></span><span style="display:flex;"><span>$ magick convert img/10568-103447.pdf.png -define webp:lossless<span style="color:#f92672">=</span>true /tmp/10568-103447-lossless.webp
|
|
</span></span><span style="display:flex;"><span>$ magick convert img/10568-103447.pdf.png -define webp:lossless<span style="color:#f92672">=</span>true -quality <span style="color:#ae81ff">50</span> /tmp/10568-103447-lossless-q50.webp
|
|
</span></span><span style="display:flex;"><span>$ magick convert img/10568-103447.pdf.png -quality <span style="color:#ae81ff">10</span> -define webp:lossless<span style="color:#f92672">=</span>true /tmp/10568-103447-lossless-q10.webp
|
|
</span></span><span style="display:flex;"><span>$ magick convert img/10568-103447.pdf.png -quality <span style="color:#ae81ff">90</span> -define webp:lossless<span style="color:#f92672">=</span>true /tmp/10568-103447-lossless-q90.webp
|
|
</span></span><span style="display:flex;"><span>$ ls -l /tmp/10568-103447*
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 359258 Mar 28 21:16 /tmp/10568-103447-lossless-q10.webp
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 303850 Mar 28 21:15 /tmp/10568-103447-lossless-q50.webp
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 296832 Mar 28 21:16 /tmp/10568-103447-lossless-q90.webp
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 299566 Mar 28 21:13 /tmp/10568-103447-lossless.webp
|
|
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 190718 Mar 28 21:13 /tmp/10568-103447.webp
|
|
</span></span></code></pre></div><ul>
|
|
<li>I’m curious to see how the ImageMagick <code>-define webp:emulate-jpeg-size=true</code> option (aka <code>-jpeg_like</code> in cwebp) compares to normal lossy WebP quality:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert img/10568-103447.pdf.png -quality $q -define webp:emulate-jpeg-size<span style="color:#f92672">=</span>true /tmp/10568-103447-lossy-emulate-jpeg-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert /tmp/10568-103447-lossy-emulate-jpeg-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp /tmp/10568-103447-lossy-emulate-jpeg-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp.png; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-emulate-jpeg-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp.png 2>/dev/null; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>81.29082887
|
|
</span></span><span style="display:flex;"><span>84.42134524
|
|
</span></span><span style="display:flex;"><span>85.84458964
|
|
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert img/10568-103447.pdf.png -quality $q /tmp/10568-103447-lossy-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert /tmp/10568-103447-lossy-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp /tmp/10568-103447-lossy-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp.png; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">70</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> ssimulacra2 img/10568-103447.pdf.png /tmp/10568-103447-lossy-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp.png 2>/dev/null; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>77.25789006
|
|
</span></span><span style="display:flex;"><span>80.79140936
|
|
</span></span><span style="display:flex;"><span>84.79108246
|
|
</span></span></code></pre></div><ul>
|
|
<li>Using <code>-define webp:method=6</code> (versus default 4) gets a ~0.5% increase on ssimulacra2 score</li>
|
|
</ul>
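<ul>
<li>To make the <code>emulate-jpeg-size</code> comparison above easier to read, this sketch lines up the two sets of ssimulacra2 scores per quality value. The scores are copied from the runs above; the variable names are chosen for this example.</li>
</ul>

```python
# Scores copied from the ssimulacra2 runs above, keyed by -quality value
emulate_jpeg = {70: 81.29082887, 80: 84.42134524, 90: 85.84458964}
plain_lossy = {70: 77.25789006, 80: 80.79140936, 90: 84.79108246}

for quality in sorted(emulate_jpeg):
    # Positive values mean emulate-jpeg-size scored higher at this quality
    gain = emulate_jpeg[quality] - plain_lossy[quality]
    print(
        f"Q{quality}: {plain_lossy[quality]:.2f} to "
        f"{emulate_jpeg[quality]:.2f} ({gain:+.2f})"
    )
```

<ul>
<li>For this one test image the gap narrows as quality goes up. The scores alone do not show whether the <code>emulate-jpeg-size</code> files were also larger, so file sizes would need to be compared before drawing a conclusion.</li>
</ul>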
|
|
<h2 id="2023-03-29">2023-03-29</h2>
|
|
<ul>
|
|
<li>I looked at the <code>-define webp:near-lossless=$q</code> option in ImageMagick and I don’t think it’s working:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">20</span> <span style="color:#ae81ff">40</span> <span style="color:#ae81ff">60</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert -flatten data/10568-103447.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -define webp:near-lossless<span style="color:#f92672">=</span>$q -verbose /tmp/10568-103447-near-lossless-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.090u 0:00.043
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-near-lossless-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 80440B 0.080u 0:00.043
|
|
</span></span></code></pre></div><ul>
|
|
<li>The file sizes are all the same…</li>
|
|
<li>If I try with <code>-quality $q</code> it works:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">20</span> <span style="color:#ae81ff">40</span> <span style="color:#ae81ff">60</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert -flatten data/10568-103447.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality $q -verbose /tmp/10568-103447-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-q20.webp PDF 595x842 595x842+0+0 16-bit sRGB 52602B 0.080u 0:00.045
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-q40.webp PDF 595x842 595x842+0+0 16-bit sRGB 64604B 0.090u 0:00.045
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-q60.webp PDF 595x842 595x842+0+0 16-bit sRGB 73584B 0.080u 0:00.045
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-q80.webp PDF 595x842 595x842+0+0 16-bit sRGB 88652B 0.090u 0:00.045
|
|
</span></span><span style="display:flex;"><span>data/10568-103447.pdf[0]=>/tmp/10568-103447-q90.webp PDF 595x842 595x842+0+0 16-bit sRGB 113186B 0.100u 0:00.049
|
|
</span></span></code></pre></div><ul>
|
|
<li>I don’t see this mentioned in the ImageMagick GitHub issues, so I guess I have to file a bug
|
|
<ul>
|
|
<li>I first <a href="https://github.com/ImageMagick/ImageMagick/discussions/6204">asked a question on their discussion board</a> because I see that the near-lossless option should have been added to ImageMagick sometime after 2020 according to another discussion</li>
|
|
</ul>
|
|
</li>
|
|
<li>Meeting with Maria about the Alliance metadata on CGSpace
|
|
<ul>
|
|
<li>As the Alliance is not a legal entity they want to reflect that somehow in CGSpace</li>
|
|
<li>We discussed updating all metadata, but so many documents issued in the last few years have the Alliance indicated inside them and as affiliations in journal article acknowledgements, etc, we decided it is not the best option</li>
|
|
<li>Instead, we propose to:
|
|
<ul>
|
|
<li>Remove <code>Alliance of Bioversity International and CIAT</code> from the controlled vocabulary for affiliations ASAP</li>
|
|
<li>Add <code>Bioversity International and the International Center for Tropical Agriculture</code> to the controlled vocabulary for affiliations ASAP</li>
|
|
<li>Add a prominent note to the item page for every item in the Alliance community via a custom XMLUI theme (Maria and the Alliance publishing team to send the text)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-03-30">2023-03-30</h2>
<ul>
<li>The ImageMagick developers confirmed <a href="https://github.com/ImageMagick/ImageMagick/discussions/6204">my bug report</a> and created a patch on master
<ul>
<li>I’m not entirely sure how it works, but the developer seemed to imply we can use lossless mode plus a quality?</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ magick convert -flatten data/10568-103447.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -define webp:lossless<span style="color:#f92672">=</span>true -quality <span style="color:#ae81ff">90</span> /tmp/10568-103447.pdf.webp
</span></span></code></pre></div><ul>
<li>Now I see a difference between near-lossless and normal quality mode:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">20</span> <span style="color:#ae81ff">40</span> <span style="color:#ae81ff">60</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert -flatten data/10568-103447.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -define webp:lossless<span style="color:#f92672">=</span>true -quality $q /tmp/10568-103447-near-lossless-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>$ ls -l /tmp/10568-103447-near-lossless-q*
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 108186 Mar 30 11:36 /tmp/10568-103447-near-lossless-q20.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 97170 Mar 30 11:36 /tmp/10568-103447-near-lossless-q40.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 97382 Mar 30 11:36 /tmp/10568-103447-near-lossless-q60.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 106090 Mar 30 11:36 /tmp/10568-103447-near-lossless-q80.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 105926 Mar 30 11:36 /tmp/10568-103447-near-lossless-q90.webp
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> q in <span style="color:#ae81ff">20</span> <span style="color:#ae81ff">40</span> <span style="color:#ae81ff">60</span> <span style="color:#ae81ff">80</span> 90; <span style="color:#66d9ef">do</span> magick convert -flatten data/10568-103447.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality $q /tmp/10568-103447-q<span style="color:#e6db74">${</span>q<span style="color:#e6db74">}</span>.webp; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>$ ls -l /tmp/10568-103447-q*
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 52602 Mar 30 11:37 /tmp/10568-103447-q20.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 64604 Mar 30 11:37 /tmp/10568-103447-q40.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 73584 Mar 30 11:37 /tmp/10568-103447-q60.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 88652 Mar 30 11:37 /tmp/10568-103447-q80.webp
</span></span><span style="display:flex;"><span>-rw-r--r-- 1 aorth aorth 113186 Mar 30 11:37 /tmp/10568-103447-q90.webp
</span></span></code></pre></div><ul>
<li>But after reading the source code in <code>coders/webp.c</code> I am not sure I understand, so I asked for clarification in the discussion</li>
<li>Both Bosede and Abenet said mapping on CGSpace is taking a long time, and I don’t see any stuck locks, so I decided to quickly restart PostgreSQL</li>
</ul>
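<ul>
<li>A rough sketch for putting the two listings above side by side: the <code>ratio</code> helper is hypothetical (not part of ImageMagick), and the byte counts are copied by hand from the <code>ls -l</code> output rather than read from the files:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Hypothetical helper: print a / b rounded to two decimal places
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Quality, lossless size, and normal size in bytes, copied from the ls -l listings
for row in '20 108186 52602' '40 97170 64604' '60 97382 73584' '80 106090 88652' '90 105926 113186'; do
  set -- $row
  echo "q$1 lossless/normal = $(ratio "$2" "$3")"
done
</code></pre></div>
<ul>
<li>With these numbers the lossless files are about twice the size of the normal ones at quality 20, and slightly smaller at quality 90</li>
</ul>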
<h2 id="2023-03-31">2023-03-31</h2>
<ul>
<li>Meeting with Daniel and Naim from Alliance in Cali about CGSpace metadata, TIP, etc</li>
</ul>
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>