<li>Last night I re-synced DSpace 7 Test from CGSpace
<ul>
<li>I also updated all my local <code>7_x-dev</code> branches on the latest upstreams</li>
</ul>
</li>
<li>I spent some time updating the authorizations in Alliance collections
<ul>
<li>I want to make sure they use groups instead of individuals where possible!</li>
</ul>
</li>
<li>I reverted the Cocoon autosave change because it was more of a nuisance that Peter can’t upload CSVs from the web interface, and it is a very low severity security issue</li>
</ul>
<ul>
<li>I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!</li>
<li>I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test</li>
<li>Tim merged my <a href="https://github.com/DSpace/DSpace/pull/8553">pull request to override the ImageMagick PDF density in DSpace 7</a>
<ul>
<li>I ported it to DSpace 6.x and submitted a pull request: <a href="https://github.com/DSpace/DSpace/pull/8560">https://github.com/DSpace/DSpace/pull/8560</a></li>
<li>I joined the FAO–CGIAR AGROVOC results sharing meeting
<ul>
<li>From June to October 2022 we suggested 39 new keywords: 27 were added to AGROVOC, 4 were rejected, and 9 are still under discussion</li>
</ul>
</li>
<li>Doing duplicate check on IFPRI’s batch upload and I found one duplicate uploaded by IWMI earlier this year
<ul>
<li>I will update the metadata of that item and map it to the correct Initiative collection</li>
</ul>
</li>
</ul>
<h2 id="2022-11-03">2022-11-03</h2>
<ul>
<li>I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace</li>
<li>I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
</span></span></code></pre></div><ul>
<li>Not bad, because last week I did a more manual selection of collections and deleted ~200
<ul>
<li>I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool</li>
</ul>
</li>
<li>I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
<ul>
<li>Export a list of items with PDFs linked there:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
</span></span></code></pre></div><ul>
<li>I’m not sure how we’ll handle the duplicates because many items are book chapters or something where they share a PDF</li>
</ul>
<h2 id="2022-11-04">2022-11-04</h2>
<ul>
<li>I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description “Generated Thumbnail”:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM bitstream) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
</span></span></code></pre></div><ul>
<li>A bunch of these have <code>\N</code> for some reason when I use the <code>ds6_bitstream2itemhandle</code> function to get their handles so I had to exclude those…
<ul>
<li>I forced the media-filter for these items on CGSpace:</li>
<li>Upload some batch records via CSV for Peter</li>
<li>Update the about page on CGSpace with new text from Peter</li>
<li>Add a few more ORCID identifiers and names to my growing file <code>2022-09-22-add-orcids.csv</code>
<ul>
<li>I tagged fifty-four new authors using this list</li>
</ul>
</li>
<li>I deleted and mapped one duplicate item for Maria Garruccio</li>
<li>I updated the CG Core website from Bootstrap v4.6 to v5.2</li>
</ul>
<h2 id="2022-11-07">2022-11-07</h2>
<ul>
<li>I did a harvest on AReS last night but it seems that MELSpace’s sitemap is broken again because we have 10,000 fewer records</li>
<li>I filed <a href="https://github.com/ecrmnn/iso-3166-1/issues/10">an issue</a> on the iso-3166-1 npm package to update the name of Turkey to Türkiye
<ul>
<li>I also filed <a href="https://github.com/flyingcircusio/pycountry/issues/148">an issue</a> and <a href="https://github.com/flyingcircusio/pycountry/pull/149">a pull request</a> on the pycountry package</li>
<li>I also filed <a href="https://github.com/konstantinstadler/country_converter/issues/121">an issue</a> and <a href="https://github.com/konstantinstadler/country_converter/pull/122">a pull request</a> on the country-converter package</li>
<li>I also changed one item on CGSpace that had been submitted since the name was changed</li>
<li>I also imported the new iso-codes 4.12.0 into cgspace-java-helpers</li>
<li>I also updated it in the DSpace <code>input-forms.xml</code></li>
<li>I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
<ul>
<li>I submitted a <a href="https://github.com/ecrmnn/iso-3166-1/pull/11">pull request</a> to update this upstream</li>
</ul>
</li>
</ul>
</li>
<li>Since I was making all these pull requests I also made <a href="https://github.com/konstantinstadler/country_converter/pull/123">one on country-converter for the UN M.49 region “South-eastern Asia”</a></li>
<li>Port the <a href="https://github.com/DSpace/DSpace/pull/8550">ImageMagick PDF cropbox fix</a> to DSpace 6.x
<ul>
<li>I deployed it on CGSpace, ran all updates, and rebooted the host</li>
<li>I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:</li>
<li>The <code>Location</code> header is clearly wrong, and if I try https directly I get an HTTP 500</li>
</ul>
<h2 id="2022-11-08">2022-11-08</h2>
<ul>
<li>Looking at the Solr statistics hits on CGSpace for 2022-11
<ul>
<li>I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent</li>
<li>I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent</li>
<li>I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection</li>
<li>I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent</li>
<li>I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent</li>
<li>I will purge all these hits and probably add China Unicom’s subnet to my nginx <code>bot-network.conf</code> file to tag them as bots since there are SO many bad and malicious requests coming from there</li>
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 56546
</span></span></code></pre></div><ul>
<li>Also interesting to see a few new user agents:
<ul>
<li><code>RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
<li><code>rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)</code></li>
<li><code>MEL</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li><code>RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)</code></li>
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 10632
</span></span></code></pre></div><ul>
<li>Work on the CIAT Library items a bit again in OpenRefine
<ul>
<li>I flagged items with:
<ul>
<li>URL containing “#page” at the end (these are linking to book chapters, but we don’t want to upload the PDF multiple times)</li>
<li>Same URL used by more than one item (“Duplicates” facet in OpenRefine; these are corner cases I don’t want to handle right now)</li>
<li>URL containing “:8080” to CIAT’s old DSpace (this server is no longer live)</li>
</ul>
</li>
<li>I want to try to handle the simple cases that should cover most of the items first</li>
</ul>
</li>
</ul>
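<p>For reference, the three flagging rules above could be sketched outside OpenRefine roughly like this. It is only an illustration and not the facets I actually used: the <code>flag_urls</code> helper, the flag names, and the example URLs are all hypothetical.</p>

```python
from collections import Counter


def flag_urls(urls):
    """Map each URL to the list of reasons it should be set aside for now."""
    counts = Counter(urls)
    flags = {}
    for url in urls:
        reasons = []
        # Links into a page of a shared PDF (book chapters and similar)
        if "#page" in url:
            reasons.append("page-anchor")
        # The same URL is used by more than one item
        if counts[url] > 1:
            reasons.append("duplicate")
        # Links to the old DSpace on port 8080, which is no longer live
        if ":8080" in url:
            reasons.append("old-dspace")
        flags[url] = reasons
    return flags


if __name__ == "__main__":
    # Hypothetical URLs, only to show the shape of the result
    sample = [
        "http://ciat-library.example.org/a.pdf#page=12",
        "http://ciat-library.example.org/b.pdf",
        "http://ciat-library.example.org/b.pdf",
        "http://ciat-library.example.org:8080/c.pdf",
    ]
    for url, reasons in flag_urls(sample).items():
        print(url, reasons)
```

<p>URLs with an empty list of reasons would be the simple cases to handle first.</p>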
<h2 id="2022-11-09">2022-11-09</h2>
<ul>
<li>Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace</li>
<li>Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
<ul>
<li>We agreed to continue adding the feedback for each of the proposed actions</li>
<li>The others will start filling in definitions for the types</li>
<li>Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
<ul>
<li>We need to be careful, especially with regard to authors’ outputs that will be reported in the PRMS</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="2022-11-16">2022-11-16</h2>
<ul>
<li>Maria asked if we can extend the timeout for XMLUI sessions
<ul>
<li>According to <a href="https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44">this issue</a> it seems to be 30 minutes, which is Tomcat’s default</li>
<li>I think we could extend this to an hour, as there is no real security risk (we’re not a bank) and most users’ lock screens would have activated after ten minutes or so anyway</li>
<li>The ebook one looks really good and is only 2.4MB…</li>
<li>But for reference, this free Adobe tool seems to work: <a href="https://www.adobe.com/acrobat/online/compress-pdf.html">https://www.adobe.com/acrobat/online/compress-pdf.html</a></li>
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
</span></span><span style="display:flex;"><span> at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
</span></span><span style="display:flex;"><span> at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
</span></span><span style="display:flex;"><span> at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>Salem gave me a list of CGSpace collections that have double spaces in the names
<ul>
<li>Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!</li>
<li>He sent me a list of about ten collection UUIDs so I fixed them</li>
</ul>
</li>
<li>I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses… I updated about fifty of them</li>
</ul>
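<p>A minimal sketch of how collection names with double spaces could be spotted and normalized, assuming the names have already been exported into a mapping of UUID to name. The <code>collapse_spaces</code> and <code>names_needing_fix</code> helpers and the placeholder identifiers are hypothetical, and this is not how I actually fixed the names.</p>

```python
import re


def collapse_spaces(name):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", name)


def names_needing_fix(collections):
    """Return {uuid: fixed_name} for names that contain repeated spaces.

    `collections` is assumed to be a dict mapping a collection UUID to its
    current name, for example built from a database or REST API export.
    """
    fixes = {}
    for uuid, name in collections.items():
        fixed = collapse_spaces(name)
        if fixed != name:
            fixes[uuid] = fixed
    return fixes


if __name__ == "__main__":
    # Placeholder identifiers and names, not real CGSpace collections
    sample = {
        "uuid-1": "Example  collection",
        "uuid-2": "Another collection",
    }
    print(names_needing_fix(sample))
```

<p>Only the names that actually change are returned, so the result doubles as a list of collections to update.</p>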
<h2 id="2022-11-26">2022-11-26</h2>
<ul>
<li>Sync DSpace Test with CGSpace</li>
<li>I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
</span></span><span style="display:flex;"><span> (see the error log for details.)
</span></span></code></pre></div><ul>
<li>In the <code>error.log</code> file I see:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>"2022/11/26 02:12:18 CET" 25 Started new run.
</span></span></code></pre></div><ul>
<li>Ah, it seems to be due to an <a href="https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1">issue in OpenJDK 1.8.0_352</a></li>
<li>I see the server upgraded to the new JDK version on 2022-11-10:</li>
<li>As highlighted in the dspace-tech mailing list thread above, <a href="https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html">this OpenJDK release retired <code>Runtime.runFinalizersOnExit</code></a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> - JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
</span></span></code></pre></div><ul>
<li>I downloaded the previous versions of the packages from Launchpad:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
</span></span></code></pre></div><ul>
<li>I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
<ul>
<li>For PDFs with only one page I was seeing this in the filter-media output:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
</span></span></code></pre></div><ul>
<li>It turns out that <a href="https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-">PDDocument’s getPage() is zero-based</a></li>
<li>I also updated PDFBox from 2.0.24 to 2.0.27</li>
<li>I synced DSpace 7 Test with CGSpace
<ul>
<li>I had to follow my notes from 2022-03 to delete the missing Atmire migrations</li>
<li>Update <code>ilri/fix-metadata-values.py</code> to update the <code>last_modified</code> date for items when it updates metadata
<ul>
<li>This should allow us to use the normal <code>index-discovery</code> (without <code>-b</code>) as well as having REST API responses show a correct last modified date</li>
</ul>
</li>
<li>Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
<ul>
<li>I also updated the <code>add-orcid-identifiers-csv.py</code> to update the <code>last_modified</code> timestamp of the item</li>
</ul>
</li>
<li>I re-factored my CGSpace Python scripts to use a helper <code>util.py</code> module with common functions
<ul>
<li>For now it only has the one for updating an item’s <code>last_modified</code> timestamp but I will gradually add more</li>
</ul>
</li>
<li>I also ran our list of ORCID identifiers against ORCID’s API to see if anyone changed their name format
<ul>
<li>Then I ran them on CGSpace with <code>ilri/update-orcids.py</code> to fix them</li>
</ul>
</li>
<li>Normalize the <code>text_lang</code> values for CGSpace metadata again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span>Time: 0.130 ms
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
</span></span><span style="display:flex;"><span>Time: 4074.879 ms (00:04.075)
</span></span><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span></code></pre></div><ul>
<li>Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
<ul>
<li>The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones</li>
<li>I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy stuff later</li>
<li>Also, I noticed that <a href="https://github.com/konstantinstadler/country_converter/issues/124">country_converter was using the wrong UN M.49 region for Myanmar</a></li>
<li>I submitted a <a href="https://github.com/konstantinstadler/country_converter/pull/125">pull request</a></li>