<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="November, 2022" /> <meta property="og:description" content="2022-11-01 Last night I re-synced DSpace 7 Test from CGSpace I also updated all my local 7_x-dev branches on the latest upstreams I spent some time updating the authorizations in Alliance collections I want to make sure they use groups instead of individuals where possible! I reverted the Cocoon autosave change because it was more of a nuisance that Peter can’t upload CSVs from the web interface and is a very low severity security issue " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-11/" /> <meta property="article:published_time" content="2022-11-01T09:11:36+03:00" /> <meta property="article:modified_time" content="2023-01-04T10:53:02+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="November, 2022"/> <meta name="twitter:description" content="2022-11-01 Last night I re-synced DSpace 7 Test from CGSpace I also updated all my local 7_x-dev branches on the latest upstreams I spent some time updating the authorizations in Alliance collections I want to make sure they use groups instead of individuals where possible! 
I reverted the Cocoon autosave change because it was more of a nuisance that Peter can’t upload CSVs from the web interface and is a very low severity security issue "/> <meta name="generator" content="Hugo 0.118.2"> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "November, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-11/", "wordCount": "3411", "datePublished": "2022-11-01T09:11:36+03:00", "dateModified": "2023-01-04T10:53:02+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-11/"> <title>November, 2022 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a 
href="https://alanorth.github.io/cgspace-notes/2022-11/">November, 2022</a></h2> <p class="blog-post-meta"> <time datetime="2022-11-01T09:11:36+03:00">Tue Nov 01, 2022</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2022-11-01">2022-11-01</h2> <ul> <li>Last night I re-synced DSpace 7 Test from CGSpace <ul> <li>I also updated all my local <code>7_x-dev</code> branches on the latest upstreams</li> </ul> </li> <li>I spent some time updating the authorizations in Alliance collections <ul> <li>I want to make sure they use groups instead of individuals where possible!</li> </ul> </li> <li>I reverted the Cocoon autosave change because it is more of a nuisance than it is worth: Peter can’t upload CSVs from the web interface, and the security issue it addresses is very low severity</li> </ul> <ul> <li>I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!</li> <li>I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test</li> <li>Tim merged my <a href="https://github.com/DSpace/DSpace/pull/8553">pull request to override the ImageMagick PDF density in DSpace 7</a> <ul> <li>I ported it to DSpace 6.x and submitted a pull request: <a href="https://github.com/DSpace/DSpace/pull/8560">https://github.com/DSpace/DSpace/pull/8560</a></li> </ul> </li> </ul> <h2 id="2022-11-02">2022-11-02</h2> <ul> <li>I joined the FAO–CGIAR AGROVOC results sharing meeting <ul> <li>From June to October 2022 we suggested 39 new keywords; 27 were added to AGROVOC, 4 were rejected, and 9 are still under discussion</li> </ul> </li> <li>While doing a duplicate check on IFPRI’s batch upload I found one duplicate uploaded by IWMI earlier this year <ul> <li>I will update the metadata of that item and map it to the correct Initiative collection</li> </ul> </li> </ul> <h2 id="2022-11-03">2022-11-03</h2> <ul> <li>I added countries to the twenty-three 
IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace</li> <li>I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt </span></span><span style="display:flex;"><span>COPY 1268 </span></span></code></pre></div><ul> <li>Then I started a test run on DSpace Test:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r collection; <span style="color:#66d9ef">do</span> chrt -b <span style="color:#ae81ff">0</span> dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; <span style="color:#66d9ef">done</span> < /tmp/collections.txt </span></span></code></pre></div><ul> <li>I’ll be curious to check the log after it’s all done. 
<ul> <li>After a few hours I see:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">'Action: remove'</span> /tmp/FixLowQualityThumbnails.log </span></span><span style="display:flex;"><span>626 </span></span></code></pre></div><ul> <li>Not bad, because last week I did a more manual selection of collections and deleted ~200 <ul> <li>I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool</li> </ul> </li> <li>I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down <ul> <li>Export a list of items with PDFs linked there:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv; </span></span><span style="display:flex;"><span>COPY 4621 </span></span></code></pre></div><ul> <li>After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones…</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l </span></span><span style="display:flex;"><span>2752 </span></span></code></pre></div><ul> <li>I’m not sure how we’ll handle the duplicates because many items are book chapters or something where 
they share a PDF</li> </ul> <h2 id="2022-11-04">2022-11-04</h2> <ul> <li>I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description “Generated Thumbnail”:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt; </span></span><span style="display:flex;"><span>COPY 1147 </span></span><span style="display:flex;"><span>$ grep -v <span style="color:#e6db74">'\\N'</span> /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt </span></span><span style="display:flex;"><span>$ wc -l /tmp/old-thumbnail-handles.txt </span></span><span style="display:flex;"><span>987 /tmp/old-thumbnail-handles.txt </span></span></code></pre></div><ul> <li>A bunch of these have <code>\N</code> for some reason when I use the <code>ds6_bitstream2itemhandle</code> function to get their handles so I had to exclude those… <ul> <li>I forced the media-filter for these items on CGSpace:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r handle; <span style="color:#66d9ef">do</span> JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">"-Xmx512m -Dfile.encoding=UTF-8"</span> dspace filter-media -p <span style="color:#e6db74">"ImageMagick PDF Thumbnail"</span> -i $handle -f -v; <span style="color:#66d9ef">done</span> < /tmp/old-thumbnail-handles.txt </span></span></code></pre></div><ul> <li>Upload some batch 
records via CSV for Peter</li> <li>Update the about page on CGSpace with new text from Peter</li> <li>Add a few more ORCID identifiers and names to my growing file <code>2022-09-22-add-orcids.csv</code> <ul> <li>I tagged fifty-four new authors using this list</li> </ul> </li> <li>I deleted and mapped one duplicate item for Maria Garruccio</li> <li>I updated the CG Core website from Bootstrap v4.6 to v5.2</li> </ul> <h2 id="2022-11-07">2022-11-07</h2> <ul> <li>I did a harvest on AReS last night but it seems that MELSpace’s sitemap is broken again because we have 10,000 fewer records</li> <li>I filed <a href="https://github.com/ecrmnn/iso-3166-1/issues/10">an issue</a> on the iso-3166-1 npm package to update the name of Turkey to Türkiye <ul> <li>I also filed <a href="https://github.com/flyingcircusio/pycountry/issues/148">an issue</a> and <a href="https://github.com/flyingcircusio/pycountry/pull/149">a pull request</a> on the pycountry package</li> <li>I also filed <a href="https://github.com/konstantinstadler/country_converter/issues/121">an issue</a> and <a href="https://github.com/konstantinstadler/country_converter/pull/122">a pull request</a> on the country-converter package</li> <li>I also changed one item on CGSpace that had been submitted since the name was changed</li> <li>I also imported the new iso-codes 4.12.0 into cgspace-java-helpers</li> <li>I also updated it in the DSpace <code>input-forms.xml</code></li> <li>I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork <ul> <li>I submitted a <a href="https://github.com/ecrmnn/iso-3166-1/pull/11">pull request</a> to update this upstream</li> </ul> </li> </ul> </li> <li>Since I was making all these pull requests I also made <a href="https://github.com/konstantinstadler/country_converter/pull/123">one on country-converter for the UN M.49 region “South-eastern Asia”</a></li> <li>Port the <a href="https://github.com/DSpace/DSpace/pull/8550">ImageMagick PDF cropbox 
fix</a> to DSpace 6.x <ul> <li>I deployed it on CGSpace, ran all updates, and rebooted the host</li> <li>I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">"-Xmx1024m -Dfile.encoding=UTF-8"</span> dspace filter-media -p <span style="color:#e6db74">"ImageMagick PDF Thumbnail"</span> -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log </span></span></code></pre></div><ul> <li>But looking at the items it processed, I’m not sure it’s working as expected <ul> <li>I looked at a few dozen</li> </ul> </li> <li>I found some links to the Bioversity website on CGSpace that are not redirecting properly:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html </span></span><span style="display:flex;"><span>GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1 </span></span><span style="display:flex;"><span>Accept: */* </span></span><span style="display:flex;"><span>Accept-Encoding: gzip, deflate </span></span><span style="display:flex;"><span>Connection: keep-alive </span></span><span style="display:flex;"><span>Host: www.bioversityinternational.org </span></span><span style="display:flex;"><span>User-Agent: HTTPie/3.2.1 </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span 
style="color:#960050;background-color:#1e0010"></span>HTTP/1.1 302 Found </span></span><span style="display:flex;"><span>Connection: Keep-Alive </span></span><span style="display:flex;"><span>Content-Length: 275 </span></span><span style="display:flex;"><span>Content-Type: text/html; charset=iso-8859-1 </span></span><span style="display:flex;"><span>Date: Mon, 07 Nov 2022 16:35:21 GMT </span></span><span style="display:flex;"><span>Keep-Alive: timeout=15, max=100 </span></span><span style="display:flex;"><span>Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html </span></span><span style="display:flex;"><span>Server: Apache </span></span></code></pre></div><ul> <li>The <code>Location</code> header is clearly wrong (it is missing the slash between the hostname and the path), and if I try https directly I get an HTTP 500</li> </ul> <h2 id="2022-11-08">2022-11-08</h2> <ul> <li>Looking at the Solr statistics hits on CGSpace for 2022-11 <ul> <li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li> <li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li> <li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li> <li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li> <li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li> <li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li> <li>I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent</li> <li>I will purge all these hits and probably add China Unicom’s subnets to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from 
there</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p </span></span><span style="display:flex;"><span>Purging 8975 hits from 221.219.100.42 in statistics </span></span><span style="display:flex;"><span>Purging 7577 hits from 122.10.101.60 in statistics </span></span><span style="display:flex;"><span>Purging 6536 hits from 135.125.21.38 in statistics </span></span><span style="display:flex;"><span>Purging 23950 hits from 163.237.216.11 in statistics </span></span><span style="display:flex;"><span>Purging 4093 hits from 51.254.154.148 in statistics </span></span><span style="display:flex;"><span>Purging 2797 hits from 221.219.103.211 in statistics </span></span><span style="display:flex;"><span>Purging 2618 hits from 216.218.223.53 in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546 </span></span></code></pre></div><ul> <li>Also interesting to see a few new user agents: <ul> <li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li> <li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li> <li><code>MEL</code></li> <li><code>Gov employment data scraper ([[your email]])</code></li> <li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li> </ul> </li> <li>I will purge all these:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ 
./ilri/check-spider-hits.sh -f /tmp/agents.txt -p </span></span><span style="display:flex;"><span>Purging 6155 hits from RStudio in statistics </span></span><span style="display:flex;"><span>Purging 1929 hits from rstudio in statistics </span></span><span style="display:flex;"><span>Purging 1454 hits from MEL in statistics </span></span><span style="display:flex;"><span>Purging 1094 hits from Gov employment data scraper in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632 </span></span></code></pre></div><ul> <li>Work on the CIAT Library items a bit more in OpenRefine <ul> <li>I flagged items with: <ul> <li>URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)</li> <li>Same URL used by more than one item (“Duplicates” facet in OpenRefine, these are corner cases I don’t want to handle right now)</li> <li>URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)</li> </ul> </li> <li>I want to try to handle the simple cases that should cover most of the items first</li> </ul> </li> </ul> <h2 id="2022-11-09">2022-11-09</h2> <ul> <li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace <ul> <li>I got the basic functionality working</li> </ul> </li> </ul> <h2 id="2022-11-12">2022-11-12</h2> <ul> <li>Start a harvest on AReS</li> </ul> <h2 id="2022-11-15">2022-11-15</h2> <ul> <li>Meeting with Marie-Angelique, Sara, and Valentina about CG Core types <ul> <li>We agreed to continue adding the feedback for each of the proposed actions</li> <li>The others will start filling in definitions for the types</li> <li>Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several 
systems are submitting items directly into the repository <ul> <li>We need to be careful especially with regard to authors’ outputs that will be reported in the PRMS</li> </ul> </li> </ul> </li> </ul> <h2 id="2022-11-16">2022-11-16</h2> <ul> <li>Maria asked if we can extend the timeout for XMLUI sessions <ul> <li>According to <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">this issue</a> the timeout seems to be 30 minutes, which is Tomcat’s default</li> <li>I think we could extend this to an hour, as there is no real security risk (we’re not a bank) and most users’ lock screens would have activated after ten minutes or so anyway</li> </ul> </li> </ul> <h2 id="2022-11-20">2022-11-20</h2> <ul> <li>Start a harvest on AReS</li> </ul> <h2 id="2022-11-22">2022-11-22</h2> <ul> <li>Check and upload some items to CGSpace for Peter <ul> <li>I am waiting for some feedback from him about some duplicates and metadata issues for the rest</li> </ul> </li> </ul> <h2 id="2022-11-23">2022-11-23</h2> <ul> <li>Fix some authorization issues for ABC’s TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)</li> <li>Peter sent me feedback about the duplicates and metadata questions from yesterday <ul> <li>I uploaded the eight items for COHESA and sixty-two for Gender</li> </ul> </li> <li>I ran the script to tag ORCID identifiers with my <code>2022-09-22-add-orcids.csv</code> file and tagged twenty-seven</li> <li>Maria asked for help uploading a large PDF to CGSpace <ul> <li>The PDF is only two pages, but it is 139MB!</li> <li>I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ gs -sDEVICE<span 
style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-screen.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf </span></span><span style="display:flex;"><span>$ gs -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dPDFSETTINGS<span style="color:#f92672">=</span>/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile<span style="color:#f92672">=</span>Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market-ebook.pdf Key<span style="color:#ae81ff">\ </span>facts<span style="color:#ae81ff">\ </span>from<span style="color:#ae81ff">\ </span>a<span style="color:#ae81ff">\ </span>traditional<span style="color:#ae81ff">\ </span>colombian<span style="color:#ae81ff">\ </span>food<span style="color:#ae81ff">\ </span>market.pdf </span></span></code></pre></div><ul> <li>The ebook one looks really good and is only 2.4MB…</li> <li>But for reference, this free Adobe tool seems to work: <a 
href="https://www.adobe.com/acrobat/online/compress-pdf.html">https://www.adobe.com/acrobat/online/compress-pdf.html</a></li> </ul> <h2 id="2022-11-24">2022-11-24</h2> <ul> <li>My script finished downloading the CIAT Library PDFs <ul> <li>I did some more work on my <code>post-ciat-pdfs.py</code> script and tested uploading the items to my local DSpace and DSpace Test</li> <li>Then I ran the script on CGSpace, uploading ~1,500 PDFs to existing items</li> </ul> </li> </ul> <h2 id="2022-11-25">2022-11-25</h2> <ul> <li>Tony Murray, who is working on IFPRI’s CGSpace integration, emailed me to ask some questions about the REST API</li> <li>Oh no, I realized there is a logic issue with the PDFBox cropbox code I added a few weeks ago:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">"-Xmx1024m -Dfile.encoding=UTF-8"</span> dspace filter-media -p <span style="color:#e6db74">"ImageMagick PDF Thumbnail"</span> -v -f -i 10568/77010 </span></span><span style="display:flex;"><span>The following MediaFilters are enabled: </span></span><span style="display:flex;"><span>Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter </span></span><span style="display:flex;"><span>org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter </span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM </span></span><span style="display:flex;"><span>Changes have been processed </span></span><span style="display:flex;"><span>IM Thumbnail tropentag2016_marshall.pdf is replacable. 
</span></span><span style="display:flex;"><span>File: tropentag2016_marshall.pdf.jpg </span></span><span style="display:flex;"><span>ERROR filtering, skipping bitstream: </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> Item Handle: 10568/77010 </span></span><span style="display:flex;"><span> Bundle Name: ORIGINAL </span></span><span style="display:flex;"><span> File Size: 1486580 </span></span><span style="display:flex;"><span> Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5) </span></span><span style="display:flex;"><span> Asset Store: 0 </span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2 </span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2 </span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325) </span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248) </span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543) </span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167) </span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27) </span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103) </span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61) </span></span><span style="display:flex;"><span> at 
org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181) </span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159) </span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232) </span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) </span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) </span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) </span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498) </span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229) </span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81) </span></span></code></pre></div><ul> <li>Salem gave me a list of CGSpace collections that have double spaces in the names <ul> <li>Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!</li> <li>He sent me a list of about ten collection UUIDs so I fixed them</li> </ul> </li> <li>I found a bunch of LIVES presentations on CGSpace that are also on SlideShare and have incorrect licenses… I updated about fifty of them</li> </ul> <h2 id="2022-11-26">2022-11-26</h2> <ul> <li>Sync DSpace Test with CGSpace</li> <li>I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago <ul> <li>See: <a 
href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44</a></li> </ul> </li> <li>I re-built DSpace on CGSpace, ran all updates, and rebooted the machine <ul> <li>After the machine came back up, the handle server wouldn’t start</li> <li>The <code>handle-server.log</code> file shows:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Shutting down... </span></span><span style="display:flex;"><span>"2022/11/26 02:12:17 CET" 25 Rotating log files </span></span><span style="display:flex;"><span>Error: null </span></span><span style="display:flex;"><span> (see the error log for details.) </span></span></code></pre></div><ul> <li>In the <code>error.log</code> file I see:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>"2022/11/26 02:12:18 CET" 25 Started new run. </span></span><span style="display:flex;"><span>java.lang.UnsupportedOperationException </span></span><span style="display:flex;"><span> at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287) </span></span><span style="display:flex;"><span> at java.lang.System.runFinalizersOnExit(System.java:1059) </span></span><span style="display:flex;"><span> at net.handle.server.Main.initialize(Main.java:124) </span></span><span style="display:flex;"><span> at net.handle.server.Main.main(Main.java:75) </span></span><span style="display:flex;"><span>Shutting down... 
</span></span></code></pre></div><ul> <li>Ah, it seems to be due to an <a href="https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1">issue in OpenJDK 1.8.0_352</a></li> <li>I see the server was upgraded to the new JDK version on 2022-11-10:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04) </span></span><span style="display:flex;"><span>End-Date: 2022-11-10 04:10:45 </span></span></code></pre></div><ul> <li>As highlighted in the dspace-tech mailing list thread above, <a href="https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html">this OpenJDK release retired <code>Runtime.runFinalizersOnExit</code></a>, so it now always throws <code>UnsupportedOperationException</code>:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> - JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE </span></span></code></pre></div><ul> <li>I downloaded the previous versions of the packages from Launchpad:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb </span></span><span style="display:flex;"><span># wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb </span></span><span
style="display:flex;"><span># dpkg -i openjdk-8-j*8u342-b07*.deb </span></span></code></pre></div><ul> <li>Then the handle-server process starts up fine, so I held these OpenJDK versions for now:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64 </span></span><span style="display:flex;"><span>openjdk-8-jdk-headless set on hold. </span></span><span style="display:flex;"><span>openjdk-8-jre-headless set on hold. </span></span></code></pre></div><ul> <li>Start a harvest on AReS</li> </ul> <h2 id="2022-11-27">2022-11-27</h2> <ul> <li>I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago <ul> <li>For PDFs with only one page I was seeing this in the filter-media output:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2 </span></span></code></pre></div><ul> <li>It turns out that <a href="https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-">PDDocument’s getPage() is zero-based</a>, so requesting index 1 on a single-page PDF asks for a second page that does not exist</li> <li>I also updated PDFBox from 2.0.24 to 2.0.27</li> <li>I synced DSpace 7 Test with CGSpace <ul> <li>I had to follow my notes from 2022-03 to delete the missing Atmire migrations</li> </ul> </li> </ul> <h2 id="2022-11-28">2022-11-28</h2> <ul> <li>Update <code>ilri/fix-metadata-values.py</code> to update the <code>last_modified</code> date for items when it updates metadata <ul> <li>This should allow us to use the normal <code>index-discovery</code> (without <code>-b</code>) as well as having REST
API responses showing a correct last modified date</li> </ul> </li> <li>Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary <ul> <li>I also updated the <code>add-orcid-identifiers-csv.py</code> to update the <code>last_modified</code> timestamp of the item</li> </ul> </li> <li>I re-factored my CGSpace Python scripts to use a helper <code>util.py</code> module with common functions <ul> <li>For now it only has the one for updating an item’s <code>last_modified</code> timestamp but I will gradually add more</li> </ul> </li> <li>I also ran our list of ORCID identifiers against ORCID’s API to see if anyone changed their name format <ul> <li>Then I ran them on CGSpace with <code>ilri/update-orcids.py</code> to fix them</li> </ul> </li> <li>Normalize the <code>text_lang</code> values for CGSpace metadata again:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC; </span></span><span style="display:flex;"><span> text_lang │ count </span></span><span style="display:flex;"><span>───────────┼───────── </span></span><span style="display:flex;"><span> en_US │ 2912429 </span></span><span style="display:flex;"><span> │ 108387 </span></span><span style="display:flex;"><span> en │ 12457 </span></span><span style="display:flex;"><span> fr │ 2 </span></span><span style="display:flex;"><span> vi │ 2 </span></span><span style="display:flex;"><span> es │ 1 </span></span><span style="display:flex;"><span> ␀ │ 0 </span></span><span style="display:flex;"><span>(7 rows) </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span 
style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 624.651 ms </span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ BEGIN; </span></span><span style="display:flex;"><span>BEGIN </span></span><span style="display:flex;"><span>Time: 0.130 ms </span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', ''); </span></span><span style="display:flex;"><span>UPDATE 120844 </span></span><span style="display:flex;"><span>Time: 4074.879 ms (00:04.075) </span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC; </span></span><span style="display:flex;"><span> text_lang │ count </span></span><span style="display:flex;"><span>───────────┼───────── </span></span><span style="display:flex;"><span> en_US │ 3033273 </span></span><span style="display:flex;"><span> fr │ 2 </span></span><span style="display:flex;"><span> vi │ 2 </span></span><span style="display:flex;"><span> es │ 1 </span></span><span style="display:flex;"><span> ␀ │ 0 </span></span><span style="display:flex;"><span>(5 rows) </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 346.913 ms </span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ COMMIT; </span></span></code></pre></div><ul> <li>Discussing the UN M.49 regions on CGSpace with Valentina and Abenet <ul> <li>The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones</li> <li>I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy 
stuff later</li> <li>Also, I noticed that <a href="https://github.com/konstantinstadler/country_converter/issues/124">country_converter was using the wrong UN M.49 region for Myanmar</a></li> <li>I submitted a <a href="https://github.com/konstantinstadler/country_converter/pull/125">pull request</a></li> </ul> </li> <li>I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions <ul> <li>To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter</li> <li>This fixed regions for about fifty items</li> </ul> </li> <li>I dumped the UN M.49 regions from the CSV on the UNSD website:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -d<span style="color:#e6db74">";"</span> -c <span style="color:#e6db74">'Region Name,Sub-region Name,Intermediate Region Name'</span> ~/Downloads/UNSD<span style="color:#ae81ff">\ </span>—<span style="color:#ae81ff">\ </span>Methodology.csv | sed -e 1d -e <span style="color:#e6db74">'s/,/\n/g'</span> | sort -u </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Africa </span></span><span style="display:flex;"><span>Americas </span></span><span style="display:flex;"><span>Asia </span></span><span style="display:flex;"><span>Australia and New Zealand </span></span><span style="display:flex;"><span>Caribbean </span></span><span style="display:flex;"><span>Central America </span></span><span style="display:flex;"><span>Central Asia </span></span><span style="display:flex;"><span>Channel Islands </span></span><span
style="display:flex;"><span>Eastern Africa </span></span><span style="display:flex;"><span>Eastern Asia </span></span><span style="display:flex;"><span>Eastern Europe </span></span><span style="display:flex;"><span>Europe </span></span><span style="display:flex;"><span>Latin America and the Caribbean </span></span><span style="display:flex;"><span>Melanesia </span></span><span style="display:flex;"><span>Micronesia </span></span><span style="display:flex;"><span>Middle Africa </span></span><span style="display:flex;"><span>Northern Africa </span></span><span style="display:flex;"><span>Northern America </span></span><span style="display:flex;"><span>Northern Europe </span></span><span style="display:flex;"><span>Oceania </span></span><span style="display:flex;"><span>Polynesia </span></span><span style="display:flex;"><span>South America </span></span><span style="display:flex;"><span>South-eastern Asia </span></span><span style="display:flex;"><span>Southern Africa </span></span><span style="display:flex;"><span>Southern Asia </span></span><span style="display:flex;"><span>Southern Europe </span></span><span style="display:flex;"><span>Sub-Saharan Africa </span></span><span style="display:flex;"><span>Western Africa </span></span><span style="display:flex;"><span>Western Asia </span></span><span style="display:flex;"><span>Western Europe </span></span></code></pre></div><ul> <li>For now I will combine it with our existing list, which contains a few legacy regions, while we discuss a long-term plan with Peter and Abenet</li> <li>Peter wrote to ask me to change the PIM CRP’s full name from <code>Policies, Institutions and Markets</code> to <code>Policies, Institutions, and Markets</code> <ul> <li>It’s apparently the only CRP with an Oxford comma…?</li> <li>I updated them all on CGSpace</li> </ul> </li> <li>Also, I ran an <code>index-discovery</code> without the <code>-b</code> since now my metadata update scripts update the <code>last_modified</code> timestamp
as well, and it finished in fifteen minutes; I can see the changes in the Discovery search and facets</li> </ul> <h2 id="2022-11-29">2022-11-29</h2> <ul> <li>Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about <code>dcterms.type</code> for CG Core <ul> <li>We discussed some of the feedback from Peter</li> </ul> </li> <li>Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback <ul> <li>I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form</li> <li>These are UN M.49 regions</li> </ul> </li> </ul> <h2 id="2022-11-30">2022-11-30</h2> <ul> <li>I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields <ul> <li>It fixed some whitespace issues and added missing regions to about 1,200 items</li> </ul> </li> <li>I thought of a way to delete duplicate metadata values, since the CSV upload method can’t detect them correctly <ul> <li>First, I wrote a <a href="https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/">SQL query</a> to identify metadata values with the same <code>text_value</code>, <code>metadata_field_id</code>, and <code>dspace_object_id</code>:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id </span></span><span style="display:flex;"><span> FROM metadatavalue a </span></span><span style="display:flex;"><span> JOIN ( </span></span><span style="display:flex;"><span> SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id,
text_value, metadata_field_id HAVING COUNT(*) > 1 </span></span><span style="display:flex;"><span> ) b </span></span><span style="display:flex;"><span> ON a.dspace_object_id = b.dspace_object_id </span></span><span style="display:flex;"><span> AND a.text_value = b.text_value </span></span><span style="display:flex;"><span> AND a.metadata_field_id = b.metadata_field_id </span></span><span style="display:flex;"><span> ORDER BY a.text_value) TO /tmp/duplicates.txt </span></span></code></pre></div><ul> <li>(This query excludes metadata for accession and available dates, provenance, format, etc.)</li> <li>Then, I sorted the file by fields four and one (<code>dspace_object_id</code> and <code>text_value</code>) so that the duplicate metadata for each item were next to each other, used awk to print the second field (<code>metadata_value_id</code>) from every <em>other</em> line, and created a SQL script to delete the metadata</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sort -k4,1 /tmp/duplicates.txt | <span style="color:#ae81ff">\ </span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> awk -F'\t' 'NR%2==0 {print $2}' | \ </span></span><span style="display:flex;"><span> sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql </span></span></code></pre></div><ul> <li>This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate <ul> <li>I just ran it again two more times to find the last duplicates, and now we have none!</li> </ul> </li> <li>I also generated another SQL file with commands to update the last modified timestamps of these items:</li> </ul> <div class="highlight"><pre tabindex="0"
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk -F<span style="color:#e6db74">'\t'</span> <span style="color:#e6db74">'{print $4}'</span> /tmp/duplicates.txt | sort -u | sed <span style="color:#e6db74">"s/^\(.*\)</span>$<span style="color:#e6db74">/UPDATE item SET last_modified=NOW() WHERE uuid='\1';/"</span> > /tmp/update-timestamp.sql </span></span></code></pre></div><ul> <li>Tezira said she was having trouble archiving submissions <ul> <li>In the afternoon I looked and found a high number of locks:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'</span> | grep -o -E <span style="color:#e6db74">'(dspaceWeb|dspaceApi|dspaceCli)'</span> | sort | uniq -c | sort -n </span></span><span style="display:flex;"><span> 60 dspaceCli </span></span><span style="display:flex;"><span> 176 dspaceApi </span></span><span style="display:flex;"><span> 1194 dspaceWeb </span></span></code></pre></div><p><img src="/cgspace-notes/2022/11/postgres_locks_cgspace-day.png" alt="PostgreSQL database locks"></p> <ul> <li>The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning. <ul> <li>I restarted Tomcat and PostgreSQL and everything was back to normal</li> </ul> </li> <li>I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the <code>dcterms.language</code> field was “other” <ul> <li>That’s so unfortunate! 
These languages are not in ISO 639-1, but they are in ISO 639-3, which uses three-letter (alpha-3) codes and has room for many more languages</li> <li>I changed them from “other” to the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future</li> </ul> </li> <li>Send feedback to Salem about some metadata issues with MEL submissions to CGSpace</li> </ul> <!-- raw HTML omitted --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2023-09/">September, 2023</a></li> <li><a href="/cgspace-notes/2023-08/">August, 2023</a></li> <li><a href="/cgspace-notes/2023-07/">July, 2023</a></li> <li><a href="/cgspace-notes/2023-06/">June, 2023</a></li> <li><a href="/cgspace-notes/2023-05/">May, 2023</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>