<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2022" />
<meta property="og:description" content="2022-11-01
Last night I re-synced DSpace 7 Test from CGSpace
I also updated all my local 7_x-dev branches on the latest upstreams
I spent some time updating the authorizations in Alliance collections
I want to make sure they use groups instead of individuals where possible!
I reverted the Cocoon autosave change because it is a very low severity security issue, and it was more of a nuisance that Peter couldn&rsquo;t upload CSVs from the web interface
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" />
<meta property="article:published_time" content="2022-11-01T09:11:36+03:00" />
<meta property="article:modified_time" content="2023-01-04T10:53:02+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2022"/>
<meta name="twitter:description" content="2022-11-01
Last night I re-synced DSpace 7 Test from CGSpace
I also updated all my local 7_x-dev branches on the latest upstreams
I spent some time updating the authorizations in Alliance collections
I want to make sure they use groups instead of individuals where possible!
I reverted the Cocoon autosave change because it is a very low severity security issue, and it was more of a nuisance that Peter couldn&rsquo;t upload CSVs from the web interface
"/>
<meta name="generator" content="Hugo 0.126.1">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-11/",
"wordCount": "3411",
"datePublished": "2022-11-01T09:11:36+03:00",
"dateModified": "2023-01-04T10:53:02+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-11/">
<title>November, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-11/">November, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-11-01T09:11:36+03:00">Tue Nov 01, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-11-01">2022-11-01</h2>
<ul>
<li>Last night I re-synced DSpace 7 Test from CGSpace
<ul>
<li>I also updated all my local <code>7_x-dev</code> branches on the latest upstreams</li>
</ul>
</li>
<li>I spent some time updating the authorizations in Alliance collections
<ul>
<li>I want to make sure they use groups instead of individuals where possible!</li>
</ul>
</li>
<li>I reverted the Cocoon autosave change because it is a very low severity security issue, and it was more of a nuisance that Peter couldn&rsquo;t upload CSVs from the web interface</li>
</ul>
<ul>
<li>I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!</li>
<li>I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test</li>
<li>Tim merged my <a href="https://github.com/DSpace/DSpace/pull/8553">pull request to override the ImageMagick PDF density in DSpace 7</a>
<ul>
<li>I ported it to DSpace 6.x and submitted a pull request: <a href="https://github.com/DSpace/DSpace/pull/8560">https://github.com/DSpace/DSpace/pull/8560</a></li>
</ul>
</li>
</ul>
<h2 id="2022-11-02">2022-11-02</h2>
<ul>
<li>I joined the FAO-CGIAR AGROVOC results sharing meeting
<ul>
<li>From June to October 2022 we suggested 39 new keywords, with 27 added to AGROVOC, 4 rejected, and 9 still under discussion</li>
</ul>
</li>
<li>While doing a duplicate check on IFPRI&rsquo;s batch upload I found one duplicate uploaded by IWMI earlier this year
<ul>
<li>I will update the metadata of that item and map it to the correct Initiative collection</li>
</ul>
</li>
</ul>
<h2 id="2022-11-03">2022-11-03</h2>
<ul>
<li>I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace</li>
<li>I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
</span></span><span style="display:flex;"><span>COPY 1268
</span></span></code></pre></div><ul>
<li>Then I started a test run on DSpace Test:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r collection; <span style="color:#66d9ef">do</span> chrt -b <span style="color:#ae81ff">0</span> dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; <span style="color:#66d9ef">done</span> &lt; /tmp/collections.txt
</span></span></code></pre></div><ul>
<li>I&rsquo;ll be curious to check the log after it&rsquo;s all done.
<ul>
<li>After a few hours I see:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Action: remove&#39;</span> /tmp/FixLowQualityThumbnails.log
</span></span><span style="display:flex;"><span>626
</span></span></code></pre></div><ul>
<li>Not bad, because last week I did a more manual selection of collections and deleted ~200
<ul>
<li>I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool</li>
</ul>
</li>
<li>I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
<ul>
<li>Export a list of items with PDFs linked there:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE &#39;%ciat-library%&#39;) to /tmp/ciat-library-items.csv;
</span></span><span style="display:flex;"><span>COPY 4621
</span></span></code></pre></div><ul>
<li>After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones&hellip;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
</span></span><span style="display:flex;"><span>2752
</span></span></code></pre></div><ul>
<li>I&rsquo;m not sure how we&rsquo;ll handle the duplicates because many items are book chapters or similar that share a single PDF</li>
</ul>
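<ul>
<li>As a rough illustration of the clean-up described above (strip the page anchors, drop the JSPUI links, count what is unique), here is a minimal Python sketch. The function name and the sample URLs are hypothetical and only stand in for the real export; this is not the method actually used, which was csvcut, sed, grep, and sort.</li>
</ul>

```python
from urllib.parse import urldefrag


def unique_pdf_urls(urls):
    """Return sorted unique URLs with any #fragment removed, skipping JSPUI links."""
    cleaned = set()
    for url in urls:
        # urldefrag() splits "file.pdf#page=12" into the bare URL and the fragment
        bare_url, _fragment = urldefrag(url.strip())
        # Skip blank lines and links that point at the dead JSPUI instance
        if not bare_url or "jspui" in bare_url:
            continue
        cleaned.add(bare_url)
    return sorted(cleaned)


# Hypothetical sample rows; the real export has thousands of lines
sample = [
    "http://ciat-library.example.org/articulos/book.pdf#page=12",
    "http://ciat-library.example.org/articulos/book.pdf#page=40",
    "http://ciat-library.example.org/jspui/handle/123/456",
    "http://ciat-library.example.org/articulos/report.pdf",
]

print(len(unique_pdf_urls(sample)))
```

<ul>
<li>Two chapters that point at different pages of the same PDF collapse into one entry, which is why the unique count is so much lower than the number of exported rows.</li>
</ul>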
<h2 id="2022-11-04">2022-11-04</h2>
<ul>
<li>I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description &ldquo;Generated Thumbnail&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value=&#39;Generated Thumbnail&#39;) to /tmp/old-thumbnails.txt;
</span></span><span style="display:flex;"><span>COPY 1147
</span></span><span style="display:flex;"><span>$ grep -v <span style="color:#e6db74">&#39;\\N&#39;</span> /tmp/old-thumbnails.txt &gt; /tmp/old-thumbnail-handles.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/old-thumbnail-handles.txt
</span></span><span style="display:flex;"><span>987 /tmp/old-thumbnail-handles.txt
</span></span></code></pre></div><ul>
<li>A bunch of these have <code>\N</code> (PostgreSQL&rsquo;s NULL marker in <code>COPY</code> output) when I use the <code>ds6_bitstream2itemhandle</code> function to get their handles, so I had to exclude those&hellip;
<ul>
<li>I forced the media-filter for these items on CGSpace:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r handle; <span style="color:#66d9ef">do</span> JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx512m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -i $handle -f -v; <span style="color:#66d9ef">done</span> &lt; /tmp/old-thumbnail-handles.txt
</span></span></code></pre></div><ul>
<li>Upload some batch records via CSV for Peter</li>
<li>Update the about page on CGSpace with new text from Peter</li>
<li>Add a few more ORCID identifiers and names to my growing file <code>2022-09-22-add-orcids.csv</code>
<ul>
<li>I tagged fifty-four new authors using this list</li>
</ul>
</li>
<li>I deleted and mapped one duplicate item for Maria Garruccio</li>
<li>I updated the CG Core website from Bootstrap v4.6 to v5.2</li>
</ul>
<h2 id="2022-11-07">2022-11-07</h2>
<ul>
<li>I did a harvest on AReS last night but it seems that MELSpace&rsquo;s sitemap is broken again because we have 10,000 fewer records</li>
<li>I filed <a href="https://github.com/ecrmnn/iso-3166-1/issues/10">an issue</a> on the iso-3166-1 npm package to update the name of Turkey to Türkiye
<ul>
<li>I also filed <a href="https://github.com/flyingcircusio/pycountry/issues/148">an issue</a> and <a href="https://github.com/flyingcircusio/pycountry/pull/149">a pull request</a> on the pycountry package</li>
<li>I also filed <a href="https://github.com/konstantinstadler/country_converter/issues/121">an issue</a> and <a href="https://github.com/konstantinstadler/country_converter/pull/122">a pull request</a> on the country-converter package</li>
<li>I also changed one item on CGSpace that had been submitted since the name was changed</li>
<li>I also imported the new iso-codes 4.12.0 into cgspace-java-helpers</li>
<li>I also updated it in the DSpace <code>input-forms.xml</code></li>
<li>I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
<ul>
<li>I submitted a <a href="https://github.com/ecrmnn/iso-3166-1/pull/11">pull request</a> to update this upstream</li>
</ul>
</li>
</ul>
</li>
<li>Since I was making all these pull requests I also made <a href="https://github.com/konstantinstadler/country_converter/pull/123">one on country-converter for the UN M.49 region &ldquo;South-eastern Asia&rdquo;</a></li>
<li>Port the <a href="https://github.com/DSpace/DSpace/pull/8550">ImageMagick PDF cropbox fix</a> to DSpace 6.x
<ul>
<li>I deployed it on CGSpace, ran all updates, and rebooted the host</li>
<li>I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v -f -i 10568/78 &gt;&amp; /tmp/filter-media-cropbox.log
</span></span></code></pre></div><ul>
<li>But looking at the items it processed, I&rsquo;m not sure it&rsquo;s working as expected
<ul>
<li>I looked at a few dozen</li>
</ul>
</li>
<li>I found some links to the Bioversity website on CGSpace that are not redirecting properly:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
</span></span><span style="display:flex;"><span>Accept: */*
</span></span><span style="display:flex;"><span>Accept-Encoding: gzip, deflate
</span></span><span style="display:flex;"><span>Connection: keep-alive
</span></span><span style="display:flex;"><span>Host: www.bioversityinternational.org
</span></span><span style="display:flex;"><span>User-Agent: HTTPie/3.2.1
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>HTTP/1.1 302 Found
</span></span><span style="display:flex;"><span>Connection: Keep-Alive
</span></span><span style="display:flex;"><span>Content-Length: 275
</span></span><span style="display:flex;"><span>Content-Type: text/html; charset=iso-8859-1
</span></span><span style="display:flex;"><span>Date: Mon, 07 Nov 2022 16:35:21 GMT
</span></span><span style="display:flex;"><span>Keep-Alive: timeout=15, max=100
</span></span><span style="display:flex;"><span>Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
</span></span><span style="display:flex;"><span>Server: Apache
</span></span></code></pre></div><ul>
<li>The <code>Location</code> header is clearly wrong, and if I try https directly I get an HTTP 500</li>
</ul>
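<ul>
<li>To make the problem with the <code>Location</code> header easier to see, here is a small Python sketch that parses both URLs from the output above. The idea that the server joins the host and the path without a separating slash is only my guess, and the &ldquo;expected&rdquo; URL is an assumption about what the redirect was meant to be.</li>
</ul>

```python
from urllib.parse import urlsplit

request_url = "http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html"
location = "https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html"

requested = urlsplit(request_url)
redirect = urlsplit(location)

# The redirect host has absorbed the first path segment ("nc")
print(requested.hostname)  # www.bioversityinternational.org
print(redirect.hostname)   # www.bioversityinternational.orgnc

# Assumption: the intended redirect only upgrades the scheme to HTTPS
expected = requested._replace(scheme="https").geturl()
print(expected)


def same_host(url_a, url_b):
    """Return True when both URLs point at the same hostname."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


print(same_host(request_url, location))
```

<ul>
<li>A check like <code>same_host</code> could be run over other Bioversity links on CGSpace to find redirects that are broken in the same way.</li>
</ul>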
<h2 id="2022-11-08">2022-11-08</h2>
<ul>
<li>Looking at the Solr statistics hits on CGSpace for 2022-11
<ul>
<li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li>
<li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li>
<li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent</li>
<li>I will purge all these hits and probably add China Unicom&rsquo;s subnets to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from there</li>
</ul>
</li>
</ul>
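<ul>
<li>Before adding a network range to the nginx configuration it helps to check which of the offending addresses it would actually cover. The Python sketch below does that with the standard <code>ipaddress</code> module. The <code>/22</code> range is hypothetical: I picked it only because it contains both China Unicom addresses listed above, and the provider&rsquo;s real allocations would need to be looked up.</li>
</ul>

```python
import ipaddress

# Addresses from the list above
observed_ips = [
    "221.219.100.42",
    "122.10.101.60",
    "221.219.103.211",
]

# Hypothetical range chosen only because it covers both 221.219.x.x addresses
candidate_network = ipaddress.ip_network("221.219.100.0/22")


def ips_in_network(ips, network):
    """Return the subset of ips that fall inside the given network."""
    return [ip for ip in ips if ipaddress.ip_address(ip) in network]


print(ips_in_network(observed_ips, candidate_network))
```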
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
</span></span><span style="display:flex;"><span>Purging 8975 hits from 221.219.100.42 in statistics
</span></span><span style="display:flex;"><span>Purging 7577 hits from 122.10.101.60 in statistics
</span></span><span style="display:flex;"><span>Purging 6536 hits from 135.125.21.38 in statistics
</span></span><span style="display:flex;"><span>Purging 23950 hits from 163.237.216.11 in statistics
</span></span><span style="display:flex;"><span>Purging 4093 hits from 51.254.154.148 in statistics
</span></span><span style="display:flex;"><span>Purging 2797 hits from 221.219.103.211 in statistics
</span></span><span style="display:flex;"><span>Purging 2618 hits from 216.218.223.53 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546
</span></span></code></pre></div><ul>
<li>Also interesting to see a few new user agents:
<ul>
<li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
<li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li>
<li><code>MEL</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
</ul>
</li>
<li>I will purge all these:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
</span></span><span style="display:flex;"><span>Purging 6155 hits from RStudio in statistics
</span></span><span style="display:flex;"><span>Purging 1929 hits from rstudio in statistics
</span></span><span style="display:flex;"><span>Purging 1454 hits from MEL in statistics
</span></span><span style="display:flex;"><span>Purging 1094 hits from Gov employment data scraper in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632
</span></span></code></pre></div><ul>
<li>Work on the CIAT Library items a bit again in OpenRefine
<ul>
<li>I flagged items with:
<ul>
<li>URL containing &ldquo;#page&rdquo; at the end (these are linking to book chapters, but we don&rsquo;t want to upload the PDF multiple times)</li>
<li>Same URL used by more than one item (&ldquo;Duplicates&rdquo; facet in OpenRefine, these are corner cases I don&rsquo;t want to handle right now)</li>
<li>URL containing &ldquo;:8080&rdquo; to CIAT&rsquo;s old DSpace (this server is no longer live)</li>
</ul>
</li>
<li>I want to try to handle the simple cases that should cover most of the items first</li>
</ul>
</li>
</ul>
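<ul>
<li>The three flagging rules above were applied in OpenRefine, but they can be sketched in Python as well. This is only an illustration under assumptions: the function name, the reason labels, and the sample URLs are all hypothetical.</li>
</ul>

```python
from collections import Counter


def flag_urls(urls):
    """Map each distinct URL to the list of reasons it should be set aside."""
    counts = Counter(urls)
    flags = {}
    for url in counts:
        reasons = []
        # Rule 1: links into a page of a larger PDF (book chapters)
        if "#page" in url:
            reasons.append("page-anchor")
        # Rule 2: the same URL is used by more than one item
        if counts[url] != 1:
            reasons.append("shared-url")
        # Rule 3: links to the old DSpace on port 8080
        if ":8080" in url:
            reasons.append("old-dspace")
        flags[url] = reasons
    return flags


# Hypothetical sample data
sample = [
    "http://ciat-library.example.org/a.pdf",
    "http://ciat-library.example.org/b.pdf#page=10",
    "http://ciat-library.example.org/c.pdf",
    "http://ciat-library.example.org/c.pdf",
    "http://ciat-library.example.org:8080/jspui/d.pdf",
]

flags = flag_urls(sample)
simple_cases = [url for url, reasons in flags.items() if not reasons]
print(simple_cases)
```

<ul>
<li>URLs with no reasons attached are the simple cases to handle first; everything else stays flagged for later.</li>
</ul>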
<h2 id="2022-11-09">2022-11-09</h2>
<ul>
<li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
<ul>
<li>I got the basic functionality working</li>
</ul>
</li>
</ul>
<h2 id="2022-11-12">2022-11-12</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-15">2022-11-15</h2>
<ul>
<li>Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
<ul>
<li>We agreed to continue adding the feedback for each of the proposed actions</li>
<li>The others will start filling in definitions for the types</li>
<li>Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
<ul>
<li>We need to be careful especially with regard to authors&rsquo; outputs that will be reported in the PRMS</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="2022-11-16">2022-11-16</h2>
<ul>
<li>Maria asked if we can extend the timeout for XMLUI sessions
<ul>
<li>According to <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">this issue</a> it seems to be 30 minutes by default, as a Tomcat default</li>
<li>I think we could extend this to an hour, as there is no real security risk (we&rsquo;re not a bank) and most users&rsquo; lock screens would have activated after ten minutes or so anyway</li>
</ul>
</li>
</ul>
<h2 id="2022-11-20">2022-11-20</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-22">2022-11-22</h2>
<ul>
<li>Check and upload some items to CGSpace for Peter
<ul>
<li>I am waiting for some feedback from him about some duplicates and metadata issues for the rest</li>
</ul>
</li>
</ul>
<h2 id="2022-11-23">2022-11-23</h2>
<ul>
<li>Fix some authorization issues for ABC&rsquo;s TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)</li>
<li>Peter sent me feedback about the duplicates and metadata questions from yesterday
<ul>
<li>I uploaded the eight items for COHESA and sixty-two for Gender</li>
</ul>
</li>
<li>I ran the script to tag ORCID identifiers with my <code>2022-09-22-add-orcids.csv</code> file and tagged twenty-seven</li>
<li>Maria asked for help uploading a large PDF to CGSpace
<ul>
<li>The PDF is only two pages, but it is 139MB!</li>
<li>I decided to compress it with Ghostscript, first with the screen profile (72dpi), then with the ebook profile (150dpi):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ gs -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-screen.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf
</span></span><span style="display:flex;"><span>$ gs -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-ebook.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf
</span></span></code></pre></div><ul>
<li>The ebook one looks really good and is only 2.4MB&hellip;</li>
<li>But for reference, this free Adobe tool seems to work: <a href="https://www.adobe.com/acrobat/online/compress-pdf.html">https://www.adobe.com/acrobat/online/compress-pdf.html</a></li>
</ul>
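<ul>
<li>If this needs to be repeated for other files, the two Ghostscript invocations above could be generated by a small helper. The Python sketch below only builds the argument list, using the same flags as the commands above; the function name is hypothetical and nothing is executed here.</li>
</ul>

```python
from pathlib import Path


def gs_compress_command(source, profile):
    """Build the Ghostscript argument list for one PDFSETTINGS profile."""
    source = Path(source)
    # Mirror the naming used above: "name-screen.pdf", "name-ebook.pdf"
    output = source.with_name(f"{source.stem}-{profile}{source.suffix}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{profile}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={output}",
        str(source),
    ]


for profile in ("screen", "ebook"):
    print(gs_compress_command("Key facts from a traditional colombian food market.pdf", profile))
```

<ul>
<li>Passing the list to <code>subprocess.run(command, check=True)</code> would run it without any shell quoting, so the spaces in the file name need no escaping.</li>
</ul>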
<h2 id="2022-11-24">2022-11-24</h2>
<ul>
<li>My script finished downloading the CIAT Library PDFs
<ul>
<li>I did some more work on my <code>post-ciat-pdfs.py</code> script and tested uploading the items to my local DSpace and DSpace Test</li>
<li>Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items</li>
</ul>
</li>
</ul>
<h2 id="2022-11-25">2022-11-25</h2>
<ul>
<li>Tony Murray, who is working on IFPRI&rsquo;s CGSpace integration, emailed me to ask some questions about the REST API</li>
<li>Oh no, I realized there is a logic issue with the PDFBox cropbox code I added a few weeks ago:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Xmx1024m -Dfile.encoding=UTF-8&#34;</span> dspace filter-media -p <span style="color:#e6db74">&#34;ImageMagick PDF Thumbnail&#34;</span> -v -f -i 10568/77010
</span></span><span style="display:flex;"><span>The following MediaFilters are enabled:
</span></span><span style="display:flex;"><span>Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
</span></span><span style="display:flex;"><span>org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>IM Thumbnail tropentag2016_marshall.pdf is replacable.
</span></span><span style="display:flex;"><span>File: tropentag2016_marshall.pdf.jpg
</span></span><span style="display:flex;"><span>ERROR filtering, skipping bitstream:
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> Item Handle: 10568/77010
</span></span><span style="display:flex;"><span> Bundle Name: ORIGINAL
</span></span><span style="display:flex;"><span> File Size: 1486580
</span></span><span style="display:flex;"><span> Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
</span></span><span style="display:flex;"><span> Asset Store: 0
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>Salem gave me a list of CGSpace collections that have double spaces in the names
<ul>
<li>Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!</li>
<li>He sent me a list of about ten collection UUIDs so I fixed them</li>
</ul>
</li>
<li>I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses&hellip; I updated about fifty of them</li>
</ul>
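<ul>
<li>For the collection names with double spaces mentioned above, a check like the following Python sketch could find and normalize them in an exported list. The function names and sample names are hypothetical, and this is not how the ten collections were actually fixed.</li>
</ul>

```python
import re

# Two or more consecutive space characters
REPEATED_SPACES = re.compile(r" {2,}")


def has_repeated_spaces(name):
    """Return True when the name contains a run of two or more spaces."""
    return REPEATED_SPACES.search(name) is not None


def collapse_spaces(name):
    """Replace each run of spaces with a single space and trim the ends."""
    return REPEATED_SPACES.sub(" ", name).strip()


# Hypothetical collection names
names = [
    "Project  reports",
    "Working papers",
    "Annual   reports ",
]

for name in names:
    if has_repeated_spaces(name):
        print(repr(name), "becomes", repr(collapse_spaces(name)))
```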
<h2 id="2022-11-26">2022-11-26</h2>
<ul>
<li>Sync DSpace Test with CGSpace</li>
<li>I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
<ul>
<li>See: <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44</a></li>
</ul>
</li>
<li>I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
<ul>
<li>After the machine came back up, the handle server wouldn&rsquo;t start</li>
<li>The <code>handle-server.log</code> file shows:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Shutting down...
</span></span><span style="display:flex;"><span>&#34;2022/11/26 02:12:17 CET&#34; 25 Rotating log files
</span></span><span style="display:flex;"><span>Error: null
</span></span><span style="display:flex;"><span> (see the error log for details.)
</span></span></code></pre></div><ul>
<li>In the <code>error.log</code> file I see:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&#34;2022/11/26 02:12:18 CET&#34; 25 Started new run.
</span></span><span style="display:flex;"><span>java.lang.UnsupportedOperationException
</span></span><span style="display:flex;"><span> at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
</span></span><span style="display:flex;"><span> at java.lang.System.runFinalizersOnExit(System.java:1059)
</span></span><span style="display:flex;"><span> at net.handle.server.Main.initialize(Main.java:124)
</span></span><span style="display:flex;"><span> at net.handle.server.Main.main(Main.java:75)
</span></span><span style="display:flex;"><span>Shutting down...
</span></span></code></pre></div><ul>
<li>Ah, it seems to be due to an <a href="https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1">issue in OpenJDK 1.8.0_352</a></li>
<li>I see the server upgraded to the new JDK version on 2022-11-10:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
</span></span><span style="display:flex;"><span>End-Date: 2022-11-10 04:10:45
</span></span></code></pre></div><ul>
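<li>To pull the old and new versions out of an <code>Upgrade:</code> line like the one above, a small Python sketch; the function name <code>parse_upgrade_line</code> is hypothetical, and the regular expression only assumes the <code>package (old, new)</code> layout visible in the excerpt:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re

# Matches entries such as "pkg:arch (old_version, new_version)"
ENTRY = re.compile(r"(\S+) \(([^,()]+), ([^,()]+)\)")


def parse_upgrade_line(line):
    """Return {package: (old_version, new_version)} for an 'Upgrade:' line."""
    prefix = "Upgrade: "
    if not line.startswith(prefix):
        return {}
    return {pkg: (old, new) for pkg, old, new in ENTRY.findall(line[len(prefix):])}


line = (
    "Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), "
    "openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)"
)
for package, (old, new) in parse_upgrade_line(line).items():
    print(package, old, new)
</code></pre></div><ul>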
<li>As highlighted in the dspace-tech mailing list thread above, <a href="https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html">this OpenJDK release retired <code>Runtime.runFinalizersOnExit</code></a> so that it always throws <code>UnsupportedOperationException</code>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> - JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
</span></span></code></pre></div><ul>
<li>I downloaded the previous versions of the packages from Launchpad:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
</span></span><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
</span></span><span style="display:flex;"><span># dpkg -i openjdk-8-j*8u342-b07*.deb
</span></span></code></pre></div><ul>
<li>Then the handle-server process starts up fine, so I held these OpenJDK versions for now:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
</span></span><span style="display:flex;"><span>openjdk-8-jdk-headless set on hold.
</span></span><span style="display:flex;"><span>openjdk-8-jre-headless set on hold.
</span></span></code></pre></div><ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-11-27">2022-11-27</h2>
<ul>
<li>I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
<ul>
<li>For PDFs with only one page I was seeing this in the filter-media output:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span></code></pre></div><ul>
<li>It turns out that <a href="https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-">PDDocument&rsquo;s getPage() is zero-based</a></li>
<li>I also updated PDFBox from 2.0.24 to 2.0.27</li>
<li>I synced DSpace 7 Test with CGSpace
<ul>
<li>I had to follow my notes from 2022-03 to delete the missing Atmire migrations</li>
</ul>
</li>
</ul>
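<ul>
<li>To illustrate the off-by-one above without PDFBox, a small Python sketch that uses a plain list to stand in for a document&rsquo;s pages; <code>first_page</code> is a hypothetical helper and this is not the actual dspace-api code:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def first_page(pages):
    """Return the first page of a document whose pages are indexed from zero."""
    if not pages:
        raise ValueError("document has no pages")
    # With zero-based indexing the first page is index 0; index 1 is the
    # second page, which does not exist in a one-page document.
    return pages[0]


one_page_document = ["cover"]
print(first_page(one_page_document))

try:
    one_page_document[1]
except IndexError as error:
    print("index 1 fails on a one-page document:", error)
</code></pre></div>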
<h2 id="2022-11-28">2022-11-28</h2>
<ul>
<li>Update <code>ilri/fix-metadata-values.py</code> to update the <code>last_modified</code> date for items when it updates metadata
<ul>
<li>This should allow us to use the normal <code>index-discovery</code> (without <code>-b</code>) as well as having REST API responses show a correct last modified date</li>
</ul>
</li>
<li>Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
<ul>
<li>I also changed <code>add-orcid-identifiers-csv.py</code> to update the <code>last_modified</code> timestamp of the item</li>
</ul>
</li>
<li>I re-factored my CGSpace Python scripts to use a helper <code>util.py</code> module with common functions
<ul>
<li>For now it only has the one for updating an item&rsquo;s <code>last_modified</code> timestamp but I will gradually add more</li>
</ul>
</li>
<li>I also ran our list of ORCID identifiers against ORCID&rsquo;s API to see if anyone changed their name format
<ul>
<li>Then I ran them on CGSpace with <code>ilri/update-orcids.py</code> to fix them</li>
</ul>
</li>
<li>Normalize the <code>text_lang</code> values for CGSpace metadata again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang │ count
</span></span><span style="display:flex;"><span>───────────┼─────────
</span></span><span style="display:flex;"><span> en_US │ 2912429
</span></span><span style="display:flex;"><span> │ 108387
</span></span><span style="display:flex;"><span> en │ 12457
</span></span><span style="display:flex;"><span> fr │ 2
</span></span><span style="display:flex;"><span> vi │ 2
</span></span><span style="display:flex;"><span> es │ 1
</span></span><span style="display:flex;"><span> ␀ │ 0
</span></span><span style="display:flex;"><span>(7 rows)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 624.651 ms
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ BEGIN;
</span></span><span style="display:flex;"><span>BEGIN
</span></span><span style="display:flex;"><span>Time: 0.130 ms
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;&#39;);
</span></span><span style="display:flex;"><span>UPDATE 120844
</span></span><span style="display:flex;"><span>Time: 4074.879 ms (00:04.075)
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang │ count
</span></span><span style="display:flex;"><span>───────────┼─────────
</span></span><span style="display:flex;"><span> en_US │ 3033273
</span></span><span style="display:flex;"><span> fr │ 2
</span></span><span style="display:flex;"><span> vi │ 2
</span></span><span style="display:flex;"><span> es │ 1
</span></span><span style="display:flex;"><span> ␀ │ 0
</span></span><span style="display:flex;"><span>(5 rows)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 346.913 ms
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ COMMIT;
</span></span></code></pre></div><ul>
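<li>The same update-then-verify pattern can be sketched in Python; this is only an illustration against a toy in-memory SQLite table (not the real DSpace schema or PostgreSQL), and the function names are hypothetical:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import sqlite3


def normalize_text_lang(connection):
    """Set 'en' and empty text_lang values to 'en_US'; return rows changed."""
    # Using the connection as a context manager commits on success and
    # rolls back if the block raises.
    with connection:
        cursor = connection.execute(
            "UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang IN ('en', '')"
        )
        return cursor.rowcount


def count_by_text_lang(connection):
    """Return {text_lang: count}; NULL values appear under the key None."""
    rows = connection.execute(
        "SELECT text_lang, COUNT(*) FROM metadatavalue GROUP BY text_lang"
    )
    return dict(rows.fetchall())


# Toy in-memory table, not the real DSpace schema
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE metadatavalue (text_lang TEXT)")
connection.executemany(
    "INSERT INTO metadatavalue VALUES (?)",
    [("en_US",), ("en",), ("",), ("fr",), (None,)],
)
connection.commit()

print(normalize_text_lang(connection))
print(count_by_text_lang(connection))
</code></pre></div><ul>
<li>Note that <code>IN ('en', '')</code> does not match NULL values, so those rows are left untouched, just as in the psql session above</li>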
<li>Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
<ul>
<li>The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones</li>
<li>I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy stuff later</li>
<li>Also, I noticed that <a href="https://github.com/konstantinstadler/country_converter/issues/124">country_converter was using the wrong UN M.49 region for Myanmar</a></li>
<li>I submitted a <a href="https://github.com/konstantinstadler/country_converter/pull/125">pull request</a></li>
</ul>
</li>
<li>I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
<ul>
<li>To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter</li>
<li>This fixed regions for about fifty items</li>
</ul>
</li>
<li>I dumped the UN M.49 regions from the CSV on the UNSD website:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -d<span style="color:#e6db74">&#34;;&#34;</span> -c <span style="color:#e6db74">&#39;Region Name,Sub-region Name,Intermediate Region Name&#39;</span> ~/Downloads/UNSD<span style="color:#ae81ff">\ </span><span style="color:#ae81ff">\ </span>Methodology.csv | sed -e 1d -e <span style="color:#e6db74">&#39;s/,/\n/g&#39;</span> | sort -u
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Africa
</span></span><span style="display:flex;"><span>Americas
</span></span><span style="display:flex;"><span>Asia
</span></span><span style="display:flex;"><span>Australia and New Zealand
</span></span><span style="display:flex;"><span>Caribbean
</span></span><span style="display:flex;"><span>Central America
</span></span><span style="display:flex;"><span>Central Asia
</span></span><span style="display:flex;"><span>Channel Islands
</span></span><span style="display:flex;"><span>Eastern Africa
</span></span><span style="display:flex;"><span>Eastern Asia
</span></span><span style="display:flex;"><span>Eastern Europe
</span></span><span style="display:flex;"><span>Europe
</span></span><span style="display:flex;"><span>Latin America and the Caribbean
</span></span><span style="display:flex;"><span>Melanesia
</span></span><span style="display:flex;"><span>Micronesia
</span></span><span style="display:flex;"><span>Middle Africa
</span></span><span style="display:flex;"><span>Northern Africa
</span></span><span style="display:flex;"><span>Northern America
</span></span><span style="display:flex;"><span>Northern Europe
</span></span><span style="display:flex;"><span>Oceania
</span></span><span style="display:flex;"><span>Polynesia
</span></span><span style="display:flex;"><span>South America
</span></span><span style="display:flex;"><span>South-eastern Asia
</span></span><span style="display:flex;"><span>Southern Africa
</span></span><span style="display:flex;"><span>Southern Asia
</span></span><span style="display:flex;"><span>Southern Europe
</span></span><span style="display:flex;"><span>Sub-Saharan Africa
</span></span><span style="display:flex;"><span>Western Africa
</span></span><span style="display:flex;"><span>Western Asia
</span></span><span style="display:flex;"><span>Western Europe
</span></span></code></pre></div><ul>
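<li>The same list of regions could be extracted in Python with the standard <code>csv</code> module; a rough sketch, where <code>unique_regions</code> is a hypothetical name and the two sample rows are hand-written stand-ins for the real UNSD file:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
import io

REGION_COLUMNS = ("Region Name", "Sub-region Name", "Intermediate Region Name")


def unique_regions(csv_file):
    """Return the sorted, de-duplicated region names from a semicolon-delimited CSV."""
    regions = set()
    for row in csv.DictReader(csv_file, delimiter=";"):
        for column in REGION_COLUMNS:
            value = (row.get(column) or "").strip()
            # Skip empty cells, e.g. countries without an intermediate region
            if value:
                regions.add(value)
    return sorted(regions)


# Two hand-written sample rows, not the real UNSD file
sample = io.StringIO(
    "Region Name;Sub-region Name;Intermediate Region Name;Country or Area\n"
    "Africa;Sub-Saharan Africa;Eastern Africa;Kenya\n"
    "Asia;South-eastern Asia;;Myanmar\n"
)
print(unique_regions(sample))
</code></pre></div><ul>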
<li>For now I will combine it with our existing list, which contains a few legacy regions, while we discuss a long-term plan with Peter and Abenet</li>
<li>Peter wrote to ask me to change the PIM CRP&rsquo;s full name from <code>Policies, Institutions and Markets</code> to <code>Policies, Institutions, and Markets</code>
<ul>
<li>It&rsquo;s apparently the only CRP with an Oxford comma&hellip;?</li>
<li>I updated them all on CGSpace</li>
</ul>
</li>
<li>Also, I ran <code>index-discovery</code> without <code>-b</code>, since my metadata update scripts now update the <code>last_modified</code> timestamp as well; it finished in fifteen minutes and I see the changes in the Discovery search and facets</li>
</ul>
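<ul>
<li>A helper for updating an item&rsquo;s <code>last_modified</code> timestamp, like the one mentioned above for the Python scripts, might look roughly like this; it is a sketch against a toy in-memory SQLite table with a made-up UUID, not the actual <code>util.py</code> code, and the function name is hypothetical:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import sqlite3


def update_item_last_modified(connection, item_uuid):
    """Set last_modified to the current time for one item; return rows changed."""
    with connection:
        cursor = connection.execute(
            # SQLite spells the current time CURRENT_TIMESTAMP;
            # PostgreSQL would use NOW() instead.
            "UPDATE item SET last_modified=CURRENT_TIMESTAMP WHERE uuid=?",
            (item_uuid,),
        )
        return cursor.rowcount


# Toy in-memory table with a made-up UUID, not the real DSpace schema
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE item (uuid TEXT PRIMARY KEY, last_modified TEXT)")
connection.execute("INSERT INTO item VALUES ('00000000-0000-0000-0000-000000000001', NULL)")
connection.commit()

print(update_item_last_modified(connection, "00000000-0000-0000-0000-000000000001"))
</code></pre></div>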
<h2 id="2022-11-29">2022-11-29</h2>
<ul>
<li>Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about <code>dcterms.type</code> for CG Core
<ul>
<li>We discussed some of the feedback from Peter</li>
</ul>
</li>
<li>Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
<ul>
<li>I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form</li>
<li>These are UN M.49 regions</li>
</ul>
</li>
</ul>
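<ul>
<li>The region renames above amount to a small lookup table; a Python sketch, where <code>update_regions</code> is a hypothetical helper and only the two mappings mentioned above are included:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Legacy region names and their UN M.49 replacements, as described above
LEGACY_TO_M49 = {
    "Pacific": "Oceania",
    "Central Africa": "Middle Africa",
}


def update_regions(regions):
    """Replace legacy names, keeping order and dropping resulting duplicates."""
    updated = []
    for region in regions:
        replacement = LEGACY_TO_M49.get(region, region)
        if replacement not in updated:
            updated.append(replacement)
    return updated


print(update_regions(["Pacific", "Oceania", "Central Africa"]))
</code></pre></div>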
<h2 id="2022-11-30">2022-11-30</h2>
<ul>
<li>I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
<ul>
<li>It fixed some whitespace issues and added missing regions to about 1,200 items</li>
</ul>
</li>
<li>I thought of a way to delete duplicate metadata values, since the CSV upload method can&rsquo;t detect them correctly
<ul>
<li>First, I wrote a <a href="https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/">SQL query</a> to identify metadata values with the same <code>text_value</code>, <code>metadata_field_id</code>, and <code>dspace_object_id</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
</span></span><span style="display:flex;"><span> FROM metadatavalue a
</span></span><span style="display:flex;"><span> JOIN (
</span></span><span style="display:flex;"><span> SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) &gt; 1
</span></span><span style="display:flex;"><span> ) b
</span></span><span style="display:flex;"><span> ON a.dspace_object_id = b.dspace_object_id
</span></span><span style="display:flex;"><span> AND a.text_value = b.text_value
</span></span><span style="display:flex;"><span> AND a.metadata_field_id = b.metadata_field_id
</span></span><span style="display:flex;"><span> ORDER BY a.text_value) TO /tmp/duplicates.txt
</span></span></code></pre></div><ul>
<li>(This query excludes metadata for accession and available dates, provenance, format, etc)</li>
<li>Then, I sorted the file by fields four and one (<code>dspace_object_id</code> and <code>text_value</code>) so that the duplicate metadata for each item were next to each other, used awk to print the second field (<code>metadata_value_id</code>) from every <em>other</em> line, and created a SQL script to delete the metadata:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sort -t$&#39;\t&#39; -k4,4 -k1,1 /tmp/duplicates.txt | <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> awk -F&#39;\t&#39; &#39;NR%2==0 {print $2}&#39; | \
</span></span><span style="display:flex;"><span> sed &#39;s/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/&#39; &gt; /tmp/delete-duplicates.sql
</span></span></code></pre></div><ul>
<li>This worked very well, but some metadata values were tripled or quadrupled, so a single run did not remove all of their duplicates
<ul>
<li>I ran it two more times to catch the remaining duplicates, and now we have none!</li>
</ul>
</li>
<li>I also generated another SQL file with commands to update the last modified timestamps of these items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk -F<span style="color:#e6db74">&#39;\t&#39;</span> <span style="color:#e6db74">&#39;{print $4}&#39;</span> /tmp/duplicates.txt | sort -u | sed <span style="color:#e6db74">&#34;s/^\(.*\)</span>$<span style="color:#e6db74">/UPDATE item SET last_modified=NOW() WHERE uuid=&#39;\1&#39;;/&#34;</span> &gt; /tmp/update-timestamp.sql
</span></span></code></pre></div><ul>
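<li>Since some values were tripled or quadrupled, a single-pass alternative to the every-other-line approach would be to group the rows and delete everything except one row per group; a Python sketch, assuming the tab-separated column order of <code>/tmp/duplicates.txt</code> above, with a hypothetical function name and made-up sample rows:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import io
from collections import defaultdict


def delete_statements(tsv_file):
    """Yield DELETE statements for every duplicate except the first in each group."""
    groups = defaultdict(list)
    for line in tsv_file:
        text_value, value_id, field_id, object_id = line.rstrip("\n").split("\t")
        groups[(object_id, field_id, text_value)].append(int(value_id))
    for value_ids in groups.values():
        # Keep the lowest metadata_value_id, delete all the others
        for value_id in sorted(value_ids)[1:]:
            yield f"DELETE FROM metadatavalue WHERE metadata_value_id={value_id};"


# Made-up rows: text_value, metadata_value_id, metadata_field_id, dspace_object_id
sample = io.StringIO(
    "Kenya\t101\t228\tuuid-a\n"
    "Kenya\t102\t228\tuuid-a\n"
    "Kenya\t103\t228\tuuid-a\n"
    "Maize\t201\t187\tuuid-b\n"
    "Maize\t202\t187\tuuid-b\n"
)
for statement in delete_statements(sample):
    print(statement)
</code></pre></div><ul>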
<li>Tezira said she was having trouble archiving submissions
<ul>
<li>In the afternoon I looked and found a high number of locks:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | grep -o -E <span style="color:#e6db74">&#39;(dspaceWeb|dspaceApi|dspaceCli)&#39;</span> | sort | uniq -c | sort -n
</span></span><span style="display:flex;"><span> 60 dspaceCli
</span></span><span style="display:flex;"><span> 176 dspaceApi
</span></span><span style="display:flex;"><span> 1194 dspaceWeb
</span></span></code></pre></div><p><img src="/cgspace-notes/2022/11/postgres_locks_cgspace-day.png" alt="PostgreSQL database locks"></p>
<ul>
<li>The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning
<ul>
<li>I restarted Tomcat and PostgreSQL and everything was back to normal</li>
</ul>
</li>
<li>I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the <code>dcterms.language</code> field was &ldquo;other&rdquo;
<ul>
<li>That&rsquo;s so unfortunate! These languages are not in ISO 639-1, but they are in ISO 639-3, which uses three-letter (alpha-3) codes and therefore has room for many more languages</li>
<li>I changed them from &ldquo;other&rdquo; to the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future</li>
</ul>
</li>
<li>Send feedback to Salem about some metadata issues with MEL submissions to CGSpace</li>
</ul>
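<ul>
<li>The per-application lock tally from the <code>grep | sort | uniq -c</code> pipeline above can also be done in Python; a sketch in which <code>count_names</code> is a hypothetical helper and the sample text is a made-up stand-in for real psql output:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re
from collections import Counter

# The three names the grep above looks for
NAME = re.compile(r"dspaceWeb|dspaceApi|dspaceCli")


def count_names(text):
    """Count every occurrence of the three names in some text."""
    return Counter(NAME.findall(text))


# Made-up stand-in for psql output, one lock per line
sample = "row dspaceWeb\nrow dspaceWeb\nrow dspaceCli\n"
for name, count in sorted(count_names(sample).items(), key=lambda item: item[1]):
    print(count, name)
</code></pre></div>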
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>