Add notes for 2023-11-11

2023-11-13 16:54:36 +03:00
parent 01fb17950b
commit d14dd7114a
137 changed files with 259 additions and 172 deletions

@ -23,7 +23,7 @@ Start a harvest on AReS
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-11/" />
<meta property="article:published_time" content="2023-11-02T12:59:36+03:00" />
<meta property="article:modified_time" content="2023-11-02T20:58:43+03:00" />
<meta property="article:modified_time" content="2023-11-08T08:20:31+03:00" />
@ -42,7 +42,7 @@ I improved the filtering and wrote some Python using pandas to merge my sources
Export CGSpace to check missing Initiative collection mappings
Start a harvest on AReS
"/>
<meta name="generator" content="Hugo 0.120.3">
<meta name="generator" content="Hugo 0.120.4">
@ -52,9 +52,9 @@ Start a harvest on AReS
"@type": "BlogPosting",
"headline": "November, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-11/",
"wordCount": "487",
"wordCount": "831",
"datePublished": "2023-11-02T12:59:36+03:00",
"dateModified": "2023-11-02T20:58:43+03:00",
"dateModified": "2023-11-08T08:20:31+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -222,6 +222,53 @@ tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
</span></span><span style="display:flex;"><span>7
</span></span></code></pre></div><ul>
<li>I don&rsquo;t know what is happening&hellip; I will increase the heap size from 6144m to 7168m again&hellip;</li>
<li>I did some work on the value mappings in AReS
<ul>
<li>I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine</li>
<li>Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first</li>
<li>Then I started a new harvest on AReS to make sure the mappings are applied</li>
</ul>
</li>
</ul>
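<p>The JSON-to-CSV step above could be sketched roughly as follows. This is only an illustration: the shape of the export (a JSON list of flat objects) and the <code>find</code>/<code>replace</code> field names in the sample are assumptions, not taken from the actual AReS export.</p>

```python
import csv
import io
import json


def mappings_json_to_csv(json_text: str) -> str:
    """Convert a JSON list of flat objects to CSV text for use in OpenRefine."""
    records = json.loads(json_text)
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data; the real export may use different field names.
sample = '[{"find": "Kenia", "replace": "Kenya"}, {"find": "Viet Nam", "replace": "Vietnam"}]'
print(mappings_json_to_csv(sample))
```

<p>Missing fields are written as empty cells, which keeps the columns aligned when the file is opened in OpenRefine.</p>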
<h2 id="2023-11-09">2023-11-09</h2>
<ul>
<li>Ryan asked me for help uploading a large PDF to CGSpace
<ul>
<li>I tried my usual Ghostscript prepress invocation and found that the size decreased significantly, but some minor artifacts appeared in the images</li>
<li>Interestingly, the <a href="https://ghostscript.com/docs/9.54.0/VectorDevices.htm">Ghostscript docs</a> mention that <code>prepress</code> doesn&rsquo;t give the best results:</li>
</ul>
</li>
</ul>
<blockquote>
<p>Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The &lsquo;best&rsquo; quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).</p>
</blockquote>
<ul>
<li>Also, I found <a href="https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality">a question on StackOverflow discussing some further techniques for PDFs with images</a>:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ gs -sOutputFile<span style="color:#f92672">=</span>137166-default-dct.pdf -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS<span style="color:#f92672">=</span>/default -c <span style="color:#e6db74">&#34;&lt;&lt; /ColorACSImageDict &lt;&lt; /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 &gt;&gt; /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged &gt;&gt; setdistillerparams&#34;</span> -f 137166.pdf
</span></span></code></pre></div><ul>
<li>This looks much better, and is still much smaller than the original
<ul>
<li>Also, I used <code>pdfimages</code> to extract all the images from the original and the one above and found:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ du -sh images-*
</span></span><span style="display:flex;"><span>886M images-default-dct
</span></span><span style="display:flex;"><span>1012M images-original
</span></span></code></pre></div><ul>
<li>And from <a href="https://www.wecompress.com/en/analyze">WeCompress&rsquo;s analysis</a> I see that the images account for 85% of the PDF&rsquo;s size</li>
</ul>
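<p>As a quick check on the <code>du</code> figures above, the relative saving in the extracted images can be computed directly. This is a small sketch using only the two sizes reported (886M and 1012M); the function name is just for illustration.</p>

```python
def percent_reduction(original_size: float, new_size: float) -> float:
    """Return how much smaller new_size is than original_size, in percent."""
    return (original_size - new_size) / original_size * 100


# Sizes in MiB as reported by `du -sh images-*` above.
reduction = percent_reduction(1012, 886)
print(f"{reduction:.1f}%")  # prints 12.5%
```

<p>So the extracted images from the re-encoded PDF are about 12.5% smaller than those from the original.</p>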
<h2 id="2023-11-10">2023-11-10</h2>
<ul>
<li>I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace</li>
</ul>
<h2 id="2023-11-11">2023-11-11</h2>
<ul>
<li>Salem fixed a bug on OpenRXV that was splitting country values by &ldquo;,&rdquo; before matching them with ISO countries</li>
<li>I exported CGSpace to check for missing Initiative collection mappings</li>
<li>I started a fresh harvest on AReS</li>
</ul>
<!-- raw HTML omitted -->