
<!DOCTYPE html>
<html lang="en" >

  <head>
    <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">


<meta property="og:title" content="April, 2023" />
<meta property="og:description" content="2023-04-02

Run all system updates on CGSpace and reboot it
I exported CGSpace to CSV to check for any missing Initiative collection mappings

I also did a check for missing country/region mappings with csv-metadata-quality


Start a harvest on AReS
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-04/" />
<meta property="article:published_time" content="2023-04-02T08:19:36+03:00" />
<meta property="article:modified_time" content="2023-04-02T09:16:25+03:00" />


<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2023"/>
<meta name="twitter:description" content="2023-04-02

Run all system updates on CGSpace and reboot it
I exported CGSpace to CSV to check for any missing Initiative collection mappings

I also did a check for missing country/region mappings with csv-metadata-quality


Start a harvest on AReS
"/>
<meta name="generator" content="Hugo 0.111.3">


<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "April, 2023",
  "url": "https://alanorth.github.io/cgspace-notes/2023-04/",
  "wordCount": "569",
  "datePublished": "2023-04-02T08:19:36+03:00",
  "dateModified": "2023-04-02T09:16:25+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>


    <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2023-04/">

    <title>April, 2023 | CGSpace Notes</title>

    
    <!-- combined, minified CSS -->
    
    <link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
    

    <!-- minified Font Awesome for SVG icons -->
    
    <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>

    <!-- RSS 2.0 feed -->
    

  </head>

  <body>

    
    <div class="blog-masthead">
      <div class="container">
        <nav class="nav blog-nav">
          <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
        </nav>
      </div>
    </div>
    

    <header class="blog-header">
      <div class="container">
        <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
        <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
      </div>
    </header>
    
    
    <div class="container">
      <div class="row">
        <div class="col-sm-8 blog-main">

          
<article class="blog-post">
  <header>
    <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2023-04/">April, 2023</a></h2>
    <p class="blog-post-meta">
<time datetime="2023-04-02T08:19:36+03:00">Sun Apr 02, 2023</time>
 in 
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>


</p>
  </header>
  <h2 id="2023-04-02">2023-04-02</h2>
<ul>
<li>Run all system updates on CGSpace and reboot it</li>
<li>I exported CGSpace to CSV to check for any missing Initiative collection mappings
<ul>
<li>I also did a check for missing country/region mappings with csv-metadata-quality</li>
</ul>
</li>
<li>Start a harvest on AReS</li>
</ul>
<ul>
<li>I&rsquo;m getting annoyed with my shell script for running ImageMagick tests and am looking to rewrite it in something object-oriented like Python
<ul>
<li>There doesn&rsquo;t seem to be an official ImageMagick Python binding on pypi.org; perhaps I can use <a href="https://docs.wand-py.org">Wand</a>?</li>
</ul>
</li>
<li>Testing Wand in Python:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> wand.image <span style="color:#f92672">import</span> Image
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> Image(filename<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;data/10568-103447.pdf[0]&#39;</span>, resolution<span style="color:#f92672">=</span><span style="color:#ae81ff">144</span>) <span style="color:#66d9ef">as</span> first_page:
</span></span><span style="display:flex;"><span>    print(first_page<span style="color:#f92672">.</span>height)
</span></span></code></pre></div><ul>
<li>I spent more time re-working my thumbnail scripts to compare the resized images and other minor changes
<ul>
<li>I am realizing that generating the thumbnails directly from the source improves the ssimulacra2 score by 1 to 3 percentage points compared to DSpace&rsquo;s method of creating a lossy supersample followed by a lossy resized thumbnail</li>
</ul>
</li>
</ul>
<h2 id="2023-04-03">2023-04-03</h2>
<ul>
<li>The harvest on AReS that I started yesterday never finished, and actually seems to have died&hellip;
<ul>
<li>Also, Fabio and Patrizio from Alliance emailed me to ask if there is something wrong with the REST API because they are having problems</li>
<li>I stopped the harvest and started the plugins to get the remaining items via the sitemap&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2023-04-04">2023-04-04</h2>
<ul>
<li>Presentation about CGSpace metadata, controlled vocabularies, and curation to Pooja&rsquo;s communications and development team at UNEP
<ul>
<li>I uploaded the presentation to CGSpace here: <a href="https://hdl.handle.net/10568/129896">https://hdl.handle.net/10568/129896</a></li>
</ul>
</li>
<li>Someone from the system organization contacted me to ask how to download a few thousand PDFs from a spreadsheet with DOIs and Handles</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c Handle ~/Downloads/2023-04-04-Donald.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    | sed \
</span></span><span style="display:flex;"><span>        -e 1d \
</span></span><span style="display:flex;"><span>        -e &#39;s_https://hdl.handle.net/__&#39; \
</span></span><span style="display:flex;"><span>        -e &#39;s_https://cgspace.cgiar.org/handle/__&#39; \
</span></span><span style="display:flex;"><span>        -e &#39;s_http://hdl.handle.net/__&#39; \
</span></span><span style="display:flex;"><span>    | sort -u &gt; /tmp/handles.txt
</span></span></code></pre></div><ul>
<li>Then I used the <code>get_dspace_pdfs.py</code> script to download them</li>
</ul>
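<ul>
<li>A minimal Python sketch of the same handle cleanup, for comparison with the <code>csvcut</code> and <code>sed</code> pipeline above; the function name <code>extract_handles</code> and the sample rows are hypothetical, and unlike <code>sed</code> it only strips the URL when it is at the start of the value and it drops empty cells:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
import io

# URL prefixes stripped by the sed expressions in the shell pipeline above
PREFIXES = (
    "https://hdl.handle.net/",
    "https://cgspace.cgiar.org/handle/",
    "http://hdl.handle.net/",
)


def extract_handles(csv_file, column="Handle"):
    """Return the sorted, de-duplicated handles from one CSV column."""
    handles = set()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        for prefix in PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        if value:
            handles.add(value)
    return sorted(handles)


if __name__ == "__main__":
    # Hypothetical sample rows standing in for the real spreadsheet
    sample = io.StringIO(
        "Handle,DOI\n"
        "https://hdl.handle.net/10568/1,\n"
        "http://hdl.handle.net/10568/2,\n"
        "https://cgspace.cgiar.org/handle/10568/1,\n"
    )
    print("\n".join(extract_handles(sample)))
</code></pre></div>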
<h2 id="2023-04-05">2023-04-05</h2>
<ul>
<li>After some cleanup on Donald&rsquo;s DOIs I started the <code>get_scihub_pdfs.py</code> script</li>
</ul>
<h2 id="2023-04-06">2023-04-06</h2>
<ul>
<li>I did some more work to clean up and streamline my next generation of DSpace thumbnail testing scripts
<ul>
<li>I think I found a bug in ImageMagick 7.1.1.5 where CMYK to sRGB conversion fails if we use image operations like <code>-density</code> or <code>-define</code> before reading the input file</li>
<li>I started <a href="https://github.com/ImageMagick/ImageMagick/discussions/6234">a discussion on the ImageMagick GitHub</a> to ask</li>
</ul>
</li>
<li>Yesterday I started downloading the rest of the PDFs from Donald, those that had DOIs
<ul>
<li>As a measure of caution, I extracted the list of DOIs and used my <code>crossref_doi_lookup.py</code> script to get their licenses from Crossref:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/crossref_doi_lookup.py -e xxxx@i.org -i /tmp/dois.txt -o /tmp/donald-crossref-dois.csv -d
</span></span></code></pre></div><ul>
<li>Then I did some CSV manipulation to extract the DOIs that were Creative Commons licensed, excluding any that were &ldquo;No Derivatives&rdquo;, and re-formatting the DOIs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c doi,license /tmp/donald-crossref-dois.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  | csvgrep -c license -m &#39;creativecommons&#39; \
</span></span><span style="display:flex;"><span>  | csvgrep -c license -i -r &#39;by-(nd|nc-nd)&#39; \
</span></span><span style="display:flex;"><span>  | sed -e &#39;s_^10_https://doi.org/10_&#39; \
</span></span><span style="display:flex;"><span>    -e &#39;s/\(am\|tdm\|unspecified\|vor\): //&#39; \
</span></span><span style="display:flex;"><span>  | tee /tmp/donald-open-dois.csv \
</span></span><span style="display:flex;"><span>  | wc -l
</span></span><span style="display:flex;"><span>4268
</span></span></code></pre></div><ul>
<li>From those I filtered for the DOIs for which I had downloaded PDFs (those with a non-empty <code>filename</code> column in the output of the Sci-Hub script) and copied the PDFs to a separate directory:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">for</span> file in <span style="color:#66d9ef">$(</span>csvjoin -c doi /tmp/donald-doi-pdfs.csv /tmp/donald-open-dois.csv | csvgrep -c filename -i -r <span style="color:#e6db74">&#39;^$&#39;</span> | csvcut -c filename | sed 1d<span style="color:#66d9ef">)</span>; <span style="color:#66d9ef">do</span> cp --reflink<span style="color:#f92672">=</span>always <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> <span style="color:#e6db74">&#34;creative-commons-licensed/</span>$file<span style="color:#e6db74">&#34;</span>; <span style="color:#66d9ef">done</span>
</span></span></code></pre></div><ul>
<li>I used BTRFS copy-on-write via reflinks to make sure I didn&rsquo;t duplicate the files :-D</li>
<li>I ran out of time and had to stop the process around 3,127 PDFs
<ul>
<li>I zipped them up and sent them to the others, along with a CSV of the DOIs, PDF filenames, and licenses</li>
</ul>
</li>
</ul>
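<ul>
<li>The Creative Commons filter from the <code>csvgrep</code> and <code>sed</code> step above could probably be sketched in Python like this; the function name <code>open_licensed</code> and the sample rows are hypothetical, and I am assuming the CSV has <code>doi</code> and <code>license</code> columns as in the <code>csvcut</code> call:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import csv
import io
import re

# Licenses with a No Derivatives clause, same pattern as the csvgrep call
NO_DERIVATIVES = re.compile(r"by-(nd|nc-nd)")
# Labels such as "vor: " that the sed expression removes (first match only)
LICENSE_LABEL = re.compile(r"(am|tdm|unspecified|vor): ")


def open_licensed(rows):
    """Yield (doi_url, license) for Creative Commons rows without ND."""
    for row in rows:
        doi = row["doi"]
        license_ = row["license"]
        if "creativecommons" not in license_:
            continue
        if NO_DERIVATIVES.search(license_):
            continue
        if doi.startswith("10"):
            doi = "https://doi.org/" + doi
        yield doi, LICENSE_LABEL.sub("", license_, count=1)


if __name__ == "__main__":
    # Hypothetical rows; the real input is /tmp/donald-crossref-dois.csv
    sample = io.StringIO(
        "doi,license\n"
        "10.1000/a,vor: https://creativecommons.org/licenses/by/4.0/\n"
        "10.1000/b,vor: https://creativecommons.org/licenses/by-nc-nd/4.0/\n"
    )
    for doi, license_ in open_licensed(csv.DictReader(sample)):
        print(doi, license_)
</code></pre></div>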
<!-- raw HTML omitted -->

  
</article> 


        </div> <!-- /.blog-main -->

        <aside class="col-sm-3 ml-auto blog-sidebar">
  

        <section class="sidebar-module">
    <h4>Recent Posts</h4>
    <ol class="list-unstyled">


<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>

<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>

<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>

<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>

<li><a href="/cgspace-notes/2022-12/">December, 2022</a></li>

    </ol>
  </section>

  
  <section class="sidebar-module">
    <h4>Links</h4>
    <ol class="list-unstyled">
      
      <li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
      
      <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
      
      <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
      
    </ol>
  </section>
  
</aside>


      </div> <!-- /.row -->
    </div> <!-- /.container -->
    

    <footer class="blog-footer">
      <p dir="auto">
      
      Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
      
      </p>
      <p>
      <a href="#">Back to top</a>
      </p>
    </footer>
    

  </body>

</html>