cgspace-notes/docs/2022-11/index.html

620 lines
37 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2022" />
<meta property="og:description" content="2022-11-01
Last night I re-synced DSpace 7 Test from CGSpace
I also updated all my local 7_x-dev branches on the latest upstreams
I spent some time updating the authorizations in Alliance collections
I want to make sure they use groups instead of individuals where possible!
I reverted the Cocoon autosave change because it was more of a nuissance that Peter can&rsquo;t upload CSVs from the web interface and is a very low severity security issue
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" />
<meta property="article:published_time" content="2022-11-01T09:11:36+03:00" />
<meta property="article:modified_time" content="2022-11-27T12:38:48+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2022"/>
<meta name="twitter:description" content="2022-11-01
Last night I re-synced DSpace 7 Test from CGSpace
I also updated all my local 7_x-dev branches on the latest upstreams
I spent some time updating the authorizations in Alliance collections
I want to make sure they use groups instead of individuals where possible!
I reverted the Cocoon autosave change because it was more of a nuissance that Peter can&rsquo;t upload CSVs from the web interface and is a very low severity security issue
"/>
<meta name="generator" content="Hugo 0.105.0">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
"wordCount": "2297",
"datePublished": "2022-11-01T09:11:36+03:00",
"dateModified": "2022-11-27T12:38:48+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-11/">
<title>November, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-11/">November, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-11-01T09:11:36+03:00">Tue Nov 01, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-11-01">2022-11-01</h2>
<ul>
<li>Last night I re-synced DSpace 7 Test from CGSpace
<ul>
<li>I also updated all my local <code>7_x-dev</code> branches on the latest upstreams</li>
</ul>
</li>
<li>I spent some time updating the authorizations in Alliance collections
<ul>
<li>I want to make sure they use groups instead of individuals where possible!</li>
</ul>
</li>
<li>I reverted the Cocoon autosave change because it was more of a nuissance that Peter can&rsquo;t upload CSVs from the web interface and is a very low severity security issue</li>
</ul>
<ul>
<li>I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!</li>
<li>I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test</li>
<li>Tim merged my <a href="https://github.com/DSpace/DSpace/pull/8553">pull request to override the ImageMagick PDF density in DSpace 7</a>
<ul>
<li>I ported it to DSpace 6.x and submitted a pull request: <a href="https://github.com/DSpace/DSpace/pull/8560">https://github.com/DSpace/DSpace/pull/8560</a></li>
</ul>
</li>
</ul>
<h2 id="2022-11-02">2022-11-02</h2>
<ul>
<li>I joined the FAOCGIAR AGROVOC results sharing meeting
<ul>
<li>From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion</li>
</ul>
</li>
<li>Doing duplicate check on IFPRI&rsquo;s batch upload and I found one duplicate uploaded by IWMI earlier this year
<ul>
<li>I will update the metadata of that item and map it to the correct Initiative collection</li>
</ul>
</li>
</ul>
<h2 id="2022-11-03">2022-11-03</h2>
<ul>
<li>I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace</li>
<li>I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
</span></span><span style="display:flex;"><span>COPY 1268
</span></span></code></pre></div><ul>
<li>Then I started a test run on DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r collection; <span style="color:#66d9ef">do</span> chrt -b <span style="color:#ae81ff">0</span> dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; <span style="color:#66d9ef">done</span> &lt; /tmp/collections.txt
</span></span></code></pre></div><ul>
<li>I&rsquo;ll be curious to check the log after it&rsquo;s all done.
<ul>
<li>After a few hours I see:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Action: remove&#39;</span> /tmp/FixLowQualityThumbnails.log
</span></span><span style="display:flex;"><span>626
</span></span></code></pre></div><ul>
<li>Not bad, because last week I did a more manual selection of collections and deleted ~200
<ul>
<li>I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool</li>
</ul>
</li>
<li>I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
<ul>
<li>Export a list of items with PDFs linked there:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE &#39;%ciat-library%&#39;) to /tmp/ciat-library-items.csv;
</span></span><span style="display:flex;"><span>COPY 4621
</span></span></code></pre></div><ul>
<li>After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones&hellip;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
</span></span><span style="display:flex;"><span>2752
</span></span></code></pre></div><ul>
<li>I&rsquo;m not sure how we&rsquo;ll handle the duplicates because many items are book chapters or something where they share a PDF</li>
</ul>
<h2 id="2022-11-04">2022-11-04</h2>
<ul>
<li>I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description &ldquo;Generated Thumbnail&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value=&#39;Generated Thumbnail&#39;) to /tmp/old-thumbnails.txt;
</span></span><span style="display:flex;"><span>COPY 1147
</span></span><span style="display:flex;"><span>$ grep -v <span style="color:#e6db74">&#39;\\N&#39;</span> /tmp/old-thumbnails.txt &gt; /tmp/old-thumbnail-handles.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/old-thumbnail-handles.txt
</span></span><span style="display:flex;"><span>987 /tmp/old-thumbnail-handles.txt
</span></span></code></pre></div><ul>
<li>A bunch of these have <code>\N</code> for some reason when I use the <code>ds6_bitstream2itemhandle</code> function to get their handles so I had to exclude those&hellip;
<ul>
<li>I forced the media-filter for these items on CGSpace:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r handle; <span style="color:#66d9ef">do</span> JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -i $handle -f -v; <span style="color:#66d9ef">done</span> &lt; /tmp/old-thumbnail-handles.txt
</span></span></code></pre></div><ul>
<li>Upload some batch records via CSV for Peter</li>
<li>Update the about page on CGSpace with new text from Peter</li>
<li>Add a few more ORCID identifiers and names to my growing file <code>2022-09-22-add-orcids.csv</code>
<ul>
<li>I tagged fifty-four new authors using this list</li>
</ul>
</li>
<li>I deleted and mapped one duplicate item for Maria Garruccio</li>
<li>I updated the CG Core website from Bootstrap v4.6 to v5.2</li>
</ul>
<h2 id="2022-11-07">2022-11-07</h2>
<ul>
<li>I did a harvest on AReS last night but it seems that MELSpace&rsquo;s sitemap is broken again because we have 10,000 fewer records</li>
<li>I filed <a href="https://github.com/ecrmnn/iso-3166-1/issues/10">an issue</a> on the iso-3166-1 npm package to update the name of Turkey to Türkiye
<ul>
<li>I also filed <a href="https://github.com/flyingcircusio/pycountry/issues/148">an issue</a> and <a href="https://github.com/flyingcircusio/pycountry/pull/149">a pull request</a> on the pycountry package</li>
<li>I also filed <a href="https://github.com/konstantinstadler/country_converter/issues/121">an issue</a> and <a href="https://github.com/konstantinstadler/country_converter/pull/122">a pull request</a> on the country-converter package</li>
<li>I also changed one item on CGSpace that had been submitted since the name was changed</li>
<li>I also imported the new iso-codes 4.12.0 into cgspace-java-helpers</li>
<li>I also updated it in the DSpace <code>input-forms.xml</code></li>
<li>I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
<ul>
<li>I submitted a <a href="https://github.com/ecrmnn/iso-3166-1/pull/11">pull request</a> to update this upstream</li>
</ul>
</li>
</ul>
</li>
<li>Since I was making all these pull requests I also made <a href="https://github.com/konstantinstadler/country_converter/pull/123">one on country-converter for the UN M.49 region &ldquo;South-eastern Asia&rdquo;</a></li>
<li>Port the <a href="https://github.com/DSpace/DSpace/pull/8550">ImageMagick PDF cropbox fix</a> to DSpace 6.x
<ul>
<li>I deployed it on CGSpace, ran all updates, and rebooted the host</li>
<li>I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v -f -i 10568/78 &gt;&amp; /tmp/filter-media-cropbox.log
</span></span></code></pre></div><ul>
<li>But looking at the items it processed, I&rsquo;m not sure it&rsquo;s working as expected
<ul>
<li>I looked at a few dozen</li>
</ul>
</li>
<li>I found some links to the Bioversity website on CGSpace that are not redirecting properly:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
</span></span><span style="display:flex;"><span>Accept: */*
</span></span><span style="display:flex;"><span>Accept-Encoding: gzip, deflate
</span></span><span style="display:flex;"><span>Connection: keep-alive
</span></span><span style="display:flex;"><span>Host: www.bioversityinternational.org
</span></span><span style="display:flex;"><span>User-Agent: HTTPie/3.2.1
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>HTTP/1.1 302 Found
</span></span><span style="display:flex;"><span>Connection: Keep-Alive
</span></span><span style="display:flex;"><span>Content-Length: 275
</span></span><span style="display:flex;"><span>Content-Type: text/html; charset=iso-8859-1
</span></span><span style="display:flex;"><span>Date: Mon, 07 Nov 2022 16:35:21 GMT
</span></span><span style="display:flex;"><span>Keep-Alive: timeout=15, max=100
</span></span><span style="display:flex;"><span>Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>Server: Apache
</span></span></code></pre></div><ul>
<li>The <code>Location</code> header is clearly wrong, and if I try https directly I get an HTTP 500</li>
</ul>
<h2 id="2022-11-08">2022-11-08</h2>
<ul>
<li>Looking at the Solr statistics hits on CGSpace for 2022-11
<ul>
<li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li>
<li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li>
<li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent</li>
<li>I will purge all these hits and proably add China Unicom&rsquo;s subnet mask to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from there</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 8975 hits from 221.219.100.42 in statistics
</span></span><span style="display:flex;"><span>Purging 7577 hits from 122.10.101.60 in statistics
</span></span><span style="display:flex;"><span>Purging 6536 hits from 135.125.21.38 in statistics
</span></span><span style="display:flex;"><span>Purging 23950 hits from 163.237.216.11 in statistics
</span></span><span style="display:flex;"><span>Purging 4093 hits from 51.254.154.148 in statistics
</span></span><span style="display:flex;"><span>Purging 2797 hits from 221.219.103.211 in statistics
</span></span><span style="display:flex;"><span>Purging 2618 hits from 216.218.223.53 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546
</span></span></code></pre></div><ul>
<li>Also interesting to see a few new user agents:
<ul>
<li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
<li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li>
<li><code>MEL</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
</ul>
</li>
<li>I will purge all these:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
</span></span><span style="display:flex;"><span>Purging 6155 hits from RStudio in statistics
</span></span><span style="display:flex;"><span>Purging 1929 hits from rstudio in statistics
</span></span><span style="display:flex;"><span>Purging 1454 hits from MEL in statistics
</span></span><span style="display:flex;"><span>Purging 1094 hits from Gov employment data scraper in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632
</span></span></code></pre></div><ul>
<li>Work on the CIAT Library items a bit again in OpenRefine
<ul>
<li>I flagged items with:
<ul>
<li>URL containing &ldquo;#page&rdquo; at the end (these are linking to book chapters, but we don&rsquo;t want to upload the PDF multiple times)</li>
<li>Same URL used by more than one item (&ldquo;Duplicates&rdquo; facet in OpenRefine, these are some corner case I don&rsquo;t want to handle right now)</li>
<li>URL containing &ldquo;:8080&rdquo; to CIAT&rsquo;s old DSpace (this server is no longer live)</li>
</ul>
</li>
<li>I want to try to handle the simple cases that should cover most of the items first</li>
</ul>
</li>
</ul>
<h2 id="2022-11-09">2022-11-09</h2>
<ul>
<li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
<ul>
<li>I got the basic functionality working</li>
</ul>
</li>
</ul>
<h2 id="2022-11-12">2022-11-12</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-15">2022-11-15</h2>
<ul>
<li>Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
<ul>
<li>We agreed to continue adding the feedback for each of the proposed actions</li>
<li>The others will start filling in definitions for the types</li>
<li>Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
<ul>
<li>We need to be careful especially with regards to author&rsquo;s outputs that will be reported in the PRMS</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="2022-11-16">2022-11-16</h2>
<ul>
<li>Maria asked if we can extend the timeout for XMLUI sessions
<ul>
<li>According to <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">this issue</a> it seems to be 30 minutes by default, as a Tomcat default</li>
<li>I think we could extend this to an hour, as there is no real security risk (we&rsquo;re not a bank) and most user&rsquo;s lock screens would have activated after ten minutes or so anyways</li>
</ul>
</li>
</ul>
<h2 id="2022-11-20">2022-11-20</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-22">2022-11-22</h2>
<ul>
<li>Check and upload some items to CGSpace for Peter
<ul>
<li>I am waiting for some feedback from him about some duplicates and metadata issues for the rest</li>
</ul>
</li>
</ul>
<h2 id="2022-11-23">2022-11-23</h2>
<ul>
<li>Fix some authorization issues for ABC&rsquo;s TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)</li>
<li>Peter sent me feedback about the duplicates and metadata questions from yesterday
<ul>
<li>I uploaded the eight items for COHESA and sixty-two for Gender</li>
</ul>
</li>
<li>I ran the script to tag ORCID identifiers with my <code>2022-09-22-add-orcids.csv</code> file and tagged twenty-seven</li>
<li>Maria asked for help uploading a large PDF to CGSpace
<ul>
<li>The PDF is only two pages, but it is 139MB!</li>
<li>I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ gs -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-screen.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf
</span></span><span style="display:flex;"><span>$ gs -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-ebook.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf
</span></span></code></pre></div><ul>
<li>The ebook one looks really good and is only 2.4MB&hellip;</li>
<li>But for reference, this free Adobe tool seems to work: <a href="https://www.adobe.com/acrobat/online/compress-pdf.html">https://www.adobe.com/acrobat/online/compress-pdf.html</a></li>
</ul>
<h2 id="2022-11-24">2022-11-24</h2>
<ul>
<li>My script finished downloading the CIAT Library PDFs
<ul>
<li>I did some more work on my <code>post-ciat-pdfs.py</code> script and tested uploading the items to my local DSpace and DSpace Test</li>
<li>Then I ran the script on CGSpace, uploading ~1,500 PDFs to to existing items</li>
</ul>
</li>
</ul>
<h2 id="2022-11-25">2022-11-25</h2>
<ul>
<li>Tony Murray, who is working on IFPRI&rsquo;s CGSpace integration, emailed me to ask some questions about the REST API</li>
<li>Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v -f -i 10568/77010
</span></span><span style="display:flex;"><span>The following MediaFilters are enabled:
</span></span><span style="display:flex;"><span>Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
</span></span><span style="display:flex;"><span>org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>IM Thumbnail tropentag2016_marshall.pdf is replacable.
</span></span><span style="display:flex;"><span>File: tropentag2016_marshall.pdf.jpg
</span></span><span style="display:flex;"><span>ERROR filtering, skipping bitstream:
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> Item Handle: 10568/77010
</span></span><span style="display:flex;"><span> Bundle Name: ORIGINAL
</span></span><span style="display:flex;"><span> File Size: 1486580
</span></span><span style="display:flex;"><span> Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
</span></span><span style="display:flex;"><span> Asset Store: 0
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>Salem gave me a list of CGSpace collections that have double spaces in the names
<ul>
<li>Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!</li>
<li>He sent me a list of about ten collection UUIDs so I fixed them</li>
</ul>
</li>
<li>I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses&hellip; I updated about fifty of them</li>
</ul>
<h2 id="2022-11-26">2022-11-26</h2>
<ul>
<li>Sync DSpace Test with CGSpace</li>
<li>I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
<ul>
<li>See: <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44</a></li>
</ul>
</li>
<li>I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
<ul>
<li>Then after coming back up the handle server won&rsquo;t start</li>
<li>The <code>handle-server.log</code> file shows:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Shutting down...
</span></span><span style="display:flex;"><span>&#34;2022/11/26 02:12:17 CET&#34; 25 Rotating log files
</span></span><span style="display:flex;"><span>Error: null
</span></span><span style="display:flex;"><span> (see the error log for details.)
</span></span></code></pre></div><ul>
<li>In the <code>error.log</code> file I see:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&#34;2022/11/26 02:12:18 CET&#34; 25 Started new run.
</span></span><span style="display:flex;"><span>java.lang.UnsupportedOperationException
</span></span><span style="display:flex;"><span> at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
</span></span><span style="display:flex;"><span> at java.lang.System.runFinalizersOnExit(System.java:1059)
</span></span><span style="display:flex;"><span> at net.handle.server.Main.initialize(Main.java:124)
</span></span><span style="display:flex;"><span> at net.handle.server.Main.main(Main.java:75)
</span></span><span style="display:flex;"><span>Shutting down...
</span></span></code></pre></div><ul>
<li>Ah, it seems to be due to an <a href="https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1">issue in OpenJDK 1.8.0_352</a></li>
<li>I see the server upgraded to the new JDK version on 2022-11-10:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
</span></span><span style="display:flex;"><span>End-Date: 2022-11-10 04:10:45
</span></span></code></pre></div><ul>
<li>As highlighted in the dspace-tech mailing list thread above, <a href="https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html">this OpenJDK release deprecated <code>Runtime.runFinalizersOnExit</code></a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> - JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
</span></span></code></pre></div><ul>
<li>I downloaded the previous versions of the packages from Launchpad:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
</span></span><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
</span></span><span style="display:flex;"><span># dpkg -i openjdk-8-j*8u342-b07*.deb
</span></span></code></pre></div><ul>
<li>Then the handle-server process starts up fine, so I held these OpenJDK versions for now:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
</span></span><span style="display:flex;"><span>openjdk-8-jdk-headless set on hold.
</span></span><span style="display:flex;"><span>openjdk-8-jre-headless set on hold.
</span></span></code></pre></div><ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-27">2022-11-27</h2>
<ul>
<li>I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
<ul>
<li>For PDFs with only one page I was seeing this in the filter-media output:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span></code></pre></div><ul>
<li>It turns out that <a href="https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-">PDDocument&rsquo;s getPage() is zero-based</a></li>
<li>I also updated PDFBox from 2.0.24 to 2.0.27</li>
<li>I synced DSpace 7 Test with CGSpace
<ul>
<li>I had to follow my notes from 2022-03 to delete the missing Atmire migrations</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-11/">November, 2022</a></li>
<li><a href="/cgspace-notes/2022-10/">October, 2022</a></li>
<li><a href="/cgspace-notes/2022-09/">September, 2022</a></li>
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>