mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-18 04:37:04 +01:00
346 lines
13 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="November, 2023" />
|
|
<meta property="og:description" content="2023-11-01

Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis

I improved the filtering and wrote some Python using pandas to merge my sources more reliably

2023-11-02

Export CGSpace to check missing Initiative collection mappings
Start a harvest on AReS
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-11/" />
|
|
<meta property="article:published_time" content="2023-11-02T12:59:36+03:00" />
|
|
<meta property="article:modified_time" content="2023-11-08T08:20:31+03:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="November, 2023"/>
|
|
<meta name="twitter:description" content="2023-11-01

Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis

I improved the filtering and wrote some Python using pandas to merge my sources more reliably

2023-11-02

Export CGSpace to check missing Initiative collection mappings
Start a harvest on AReS
"/>
|
|
<meta name="generator" content="Hugo 0.120.4">
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "November, 2023",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2023-11/",
|
|
"wordCount": "831",
|
|
"datePublished": "2023-11-02T12:59:36+03:00",
|
|
"dateModified": "2023-11-08T08:20:31+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-11/">
|
|
|
|
<title>November, 2023 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-11/">November, 2023</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2023-11-02T12:59:36+03:00">Thu Nov 02, 2023</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2023-11-01">2023-11-01</h2>
|
|
<ul>
|
|
<li>Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
|
|
<ul>
|
|
<li>I improved the filtering and wrote some Python using pandas to merge my sources more reliably</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-11-02">2023-11-02</h2>
|
|
<ul>
|
|
<li>Export CGSpace to check missing Initiative collection mappings</li>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<ul>
|
|
<li>IFPRI contacted us about importing their Slideshare presentations to CGSpace
|
|
<ul>
|
|
<li>There are ~1,700 of them, dating back as far as 2008</li>
|
|
<li>I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2023-11-03">2023-11-03</h2>
|
|
<ul>
|
|
<li>A little bit of work on the CGIAR Climate Change Synthesis</li>
|
|
<li>Discuss some CGSpace migration plans with Leigh from IFPRI
|
|
<ul>
|
|
<li>For their Slideshare content we agreed:
|
|
<ul>
|
|
<li>Exclude private</li>
|
|
<li>Exclude deleted</li>
|
|
<li>Exclude non-presentation types</li>
|
|
<li>Exclude duplicates within the collection for now until we can sort them out</li>
|
|
</ul>
|
|
</li>
|
|
<li>That leaves about 1,500 items out of the 1,700</li>
|
|
</ul>
|
|
</li>
|
|
<li>I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those</li>
|
|
</ul>
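<p>The notes do not say how the similarity score was computed. As a rough stand-in, the sketch below compares normalized titles with Python's <code>difflib.SequenceMatcher</code> and keeps only candidates scoring 1.0. The function names and sample titles are hypothetical, and the real check may well have used a different measure.</p>

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def find_exact_duplicates(candidates, existing, threshold=1.0):
    """Return candidate titles whose best similarity to any existing title
    reaches the threshold (1.0 means identical after normalization)."""
    existing_norm = [normalize(t) for t in existing]
    matches = []
    for title in candidates:
        norm = normalize(title)
        best = max(
            (SequenceMatcher(None, norm, other).ratio() for other in existing_norm),
            default=0.0,
        )
        if best >= threshold:
            matches.append(title)
    return matches


# Hypothetical sample titles, not taken from the real export
repository_titles = ["Food policy in Ethiopia", "Maize markets in Kenya"]
slideshare_titles = ["Food Policy in  Ethiopia", "Rice value chains in Ghana"]

duplicates = find_exact_duplicates(slideshare_titles, repository_titles)
print(duplicates)
```

<p>Lowering <code>threshold</code> would surface near matches for manual review instead of outright removal.</p>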
|
|
<h2 id="2023-11-04">2023-11-04</h2>
|
|
<ul>
|
|
<li>Export CGSpace to check for missing Initiative collection mappings</li>
|
|
<li>I ran through the list of potential duplicates on the IFPRI Slideshare presentations</li>
|
|
</ul>
|
|
<h2 id="2023-11-05">2023-11-05</h2>
|
|
<ul>
|
|
<li>Work with Salem to migrate AReS to the new version</li>
|
|
</ul>
|
|
<h2 id="2023-11-07">2023-11-07</h2>
|
|
<ul>
|
|
<li>DSpace 7 Test went down and there is very high load on the server
|
|
<ul>
|
|
<li>I saw very high load from Java but didn’t have time to check exactly what was wrong so I just rebooted the host</li>
|
|
<li>A few hours after the restart, the system went down again, again with very high load from Java</li>
|
|
<li>I see lots of messages like this in the Tomcat log:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
</code></pre><ul>
|
|
<li>I see some messages in <code>dspace.log</code> about heap space:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>Caused by: java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
|
|
<li>I will increase Tomcat’s heap from 4096m to 5120m
|
|
<ul>
|
|
<li>A few hours later it happened again, so I increased the heap from 5120m to 6144m</li>
|
|
<li>Not sure what’s going on today…</li>
|
|
</ul>
|
|
</li>
|
|
<li>I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
</span></span><span style="display:flex;"><span>$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
</span></span><span style="display:flex;"><span>$ dspace index-discovery -r 10947/2516
</span></span><span style="display:flex;"><span>$ dspace index-discovery -r 10947/2515
</span></span><span style="display:flex;"><span>$ dspace index-discovery -r 10568/83389
</span></span><span style="display:flex;"><span>$ dspace index-discovery
</span></span></code></pre></div><ul>
|
|
<li>I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive</li>
|
|
<li>I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode as I had done before in <a href="/cgspace-notes/2023-09/">September, 2023</a></li>
|
|
</ul>
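<p>To read the full-GC lines in the Tomcat log above more systematically, a small parser can pull out the heap figures. This is a minimal sketch that assumes the line format shown in this entry. The regular expression and the <code>parse_pause_full</code> helper are illustrative, not part of any DSpace or Tomcat tooling.</p>

```python
import re

# Matches the before/after/capacity/duration fields of a G1 "Pause Full" line
PAUSE_FULL = re.compile(
    r"Pause Full \(.*?\) (?P<before>\d+)M->(?P<after>\d+)M\((?P<capacity>\d+)M\) "
    r"(?P<millis>[\d.]+)ms"
)


def parse_pause_full(line):
    """Return heap figures (in the log's M units) and pause time, or None."""
    match = PAUSE_FULL.search(line)
    if match is None:
        return None
    before = int(match["before"])
    after = int(match["after"])
    capacity = int(match["capacity"])
    return {
        "freed_m": before - after,             # memory reclaimed by the full GC
        "occupancy_after": after / capacity,   # fraction of heap still in use
        "pause_ms": float(match["millis"]),
    }


line = (
    "tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full "
    "(G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
)
stats = parse_pause_full(line)
print(
    f"freed {stats['freed_m']}M, heap {stats['occupancy_after']:.1%} full "
    f"after a {stats['pause_ms']} ms pause"
)
```

<p>For the sample line, the full collection reclaims only 5M and leaves the 4096M heap about 99.6% occupied, which is consistent with the <code>OutOfMemoryError</code> messages in <code>dspace.log</code>.</p>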
|
|
<h2 id="2023-11-08">2023-11-08</h2>
|
|
<ul>
|
|
<li>DSpace 7 Test has very high load again and I see more Java heap space errors in the log</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># grep -c <span style="color:#e6db74">'Caused by: java.lang.OutOfMemoryError: Java heap space'</span> /home/dspace7/log/dspace.log-2023-11-07
</span></span><span style="display:flex;"><span>35
</span></span><span style="display:flex;"><span># grep -c <span style="color:#e6db74">'Caused by: java.lang.OutOfMemoryError: Java heap space'</span> /home/dspace7/log/dspace.log
</span></span><span style="display:flex;"><span>7
</span></span></code></pre></div><ul>
|
|
<li>I don’t know what is happening… I will increase the heap size from 6144m to 7168m again…</li>
|
|
<li>I did some work on the value mappings in AReS
|
|
<ul>
|
|
<li>I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine</li>
|
|
<li>Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first</li>
|
|
<li>Then I started a new harvest on AReS to make sure the mappings are applied</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
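<p>The JSON to CSV round trip for the value mappings can be sketched with the standard library alone. The notes do not show the structure of the AReS export, so the <code>find</code> and <code>replace</code> field names and the sample values below are hypothetical.</p>

```python
import csv
import io
import json


def mappings_to_csv(json_text, fields=("find", "replace")):
    """Convert a JSON array of mapping objects to CSV text with the given columns."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(fields), extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_mappings(csv_text):
    """Convert edited CSV text back to a list of mapping dicts for re-import."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


# Hypothetical export with two value mappings
sample = json.dumps(
    [
        {"find": "Viet Nam", "replace": "Vietnam"},
        {"find": "KENYA", "replace": "Kenya"},
    ]
)

csv_text = mappings_to_csv(sample)
print(csv_text)
print(json.dumps(csv_to_mappings(csv_text), indent=2))
```

<p>The CSV text can be opened in OpenRefine, and the edited file converted back to JSON with <code>csv_to_mappings</code> before importing.</p>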
|
|
<h2 id="2023-11-09">2023-11-09</h2>
|
|
<ul>
|
|
<li>Ryan asked me for help uploading a large PDF to CGSpace
|
|
<ul>
|
|
<li>I tried my usual GhostScript prepress invocation and found that the size decreased significantly, but some minor artifacts appeared in the images</li>
|
|
<li>Interestingly, the <a href="https://ghostscript.com/docs/9.54.0/VectorDevices.htm">GhostScript docs</a> mention that <code>prepress</code> doesn’t give the best results:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<blockquote>
|
|
<p>Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The ‘best’ quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).</p>
|
|
</blockquote>
|
|
<ul>
|
|
<li>Also, I found <a href="https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality">a question on StackOverflow discussing some further techniques for PDFs with images</a>:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ gs -sOutputFile<span style="color:#f92672">=</span>137166-default-dct.pdf -sDEVICE<span style="color:#f92672">=</span>pdfwrite -dCompatibilityLevel<span style="color:#f92672">=</span>1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS<span style="color:#f92672">=</span>/default -c <span style="color:#e6db74">"<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams"</span> -f 137166.pdf
</span></span></code></pre></div><ul>
|
|
<li>This looks much better, and is still much smaller than the original
|
|
<ul>
|
|
<li>Also, I used <code>pdfimages</code> to extract all the images from the original and the one above and found:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ du -sh images-*
</span></span><span style="display:flex;"><span>886M images-default-dct
</span></span><span style="display:flex;"><span>1012M images-original
</span></span></code></pre></div><ul>
|
|
<li>And from <a href="https://www.wecompress.com/en/analyze">WeCompress’s analysis</a> I see that the images account for 85% of the size of the PDF</li>
|
|
</ul>
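<p>For repeated use, the Ghostscript invocation from this entry can be assembled in Python so that the JPEG <code>/QFactor</code> is easy to vary between runs. This is only a sketch: the <code>build_gs_command</code> helper is hypothetical, and every flag and distiller parameter is copied from the command above rather than chosen independently.</p>

```python
import shlex


def build_gs_command(source, output, qfactor=0.08):
    """Build the Ghostscript argument list used above, with a tunable /QFactor."""
    distiller_params = (
        "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] "
        f"/QFactor {qfactor} /Blend 1 >> "
        "/ColorImageDownsampleType /Bicubic "
        "/ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams"
    )
    return [
        "gs",
        f"-sOutputFile={output}",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dNOPAUSE",
        "-dBATCH",
        "-dPDFSETTINGS=/default",
        "-c",
        distiller_params,
        "-f",
        source,
    ]


command = build_gs_command("137166.pdf", "137166-default-dct.pdf")
# Print a shell-quoted version; the list itself can be passed to subprocess.run
print(shlex.join(command))
```

<p>Passing the list to <code>subprocess.run(command, check=True)</code> avoids shell quoting problems with the PostScript dictionary syntax.</p>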
|
|
<h2 id="2023-11-10">2023-11-10</h2>
|
|
<ul>
|
|
<li>I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace</li>
|
|
</ul>
|
|
<h2 id="2023-11-11">2023-11-11</h2>
|
|
<ul>
|
|
<li>Salem fixed a bug in OpenRXV that was splitting country values on “,” before matching them with ISO countries</li>
|
|
<li>I exported CGSpace to check for missing Initiative collection mappings</li>
|
|
<li>Start a fresh harvest on AReS</li>
|
|
</ul>
|
|
<!-- raw HTML omitted -->
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2023-11/">November, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-10/">October, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|