From 9248744c520b2aedcffe9ace6a66e78aa2f98d51 Mon Sep 17 00:00:00 2001
From: Alan Orth <alan.orth@gmail.com>
Date: Fri, 12 Feb 2016 15:44:32 +0200
Subject: [PATCH] Add notes for 2016-02-11 and 2016-02-12

Signed-off-by: Alan Orth <alan.orth@gmail.com>
---
 content/2016-02.md          | 30 ++++++++++++++++++++++++++++++
 public/2016-02/index.html   | 34 ++++++++++++++++++++++++++++++++++
 public/index.xml            | 34 ++++++++++++++++++++++++++++++++++
 public/tags/notes/index.xml | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 132 insertions(+)
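
Note on the `filename` transform added in this patch: `value.split('/')[-1]` keeps only the last `/`-separated segment of the `dc.identifier.url` value. A minimal Python sketch of the same idea follows; the function name and the example URL are hypothetical, for illustration only, and are not taken from the CIAT data.

```python
def filename_from_url(url: str) -> str:
    """Return the last '/'-separated segment of a URL string."""
    # Same expression as the transform in the notes, applied to a plain string.
    return url.split('/')[-1]


if __name__ == "__main__":
    # Hypothetical URL, used only to illustrate the transform.
    print(filename_from_url("https://example.org/reports/64661.pdf"))
```

A URL that ends in a trailing slash yields an empty string with this approach, so such records would need checking before download.
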
diff --git a/content/2016-02.md b/content/2016-02.md
index 01d4cfeab..801f75475 100644
--- a/content/2016-02.md
+++ b/content/2016-02.md
@@ -154,3 +154,33 @@ Swap:          255         57        198
 ```
 
 - So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
+
+## 2016-02-11
+
+- Massaging some CIAT data in OpenRefine
+- There are 1200 records that have PDFs, and will need to be imported into CGSpace
+- I created a `filename` column based on the `dc.identifier.url` column using the following transform:
+
+```
+value.split('/')[-1]
+```
+
+- Then I wrote a tool called [`generate-thumbnails.py`](https://gist.github.com/alanorth/2206f24483fe5f0454fc) to download the PDFs and generate thumbnails for them, for example:
+
+```
+$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+> Downloading 64661.pdf
+> Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+> Downloading 64195.pdf
+> Creating thumbnail for 64195.pdf
+```
+
+## 2016-02-12
+
+- Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)
+- A few items are using the exact same PDF
+- A few items are using HTM or DOC files
+- A few items link to PDFs on IFPRI's e-Library or ResearchGate
+- A few items have no item
diff --git a/public/2016-02/index.html b/public/2016-02/index.html
index 7f96cc4fb..8b4bb57c1 100644
--- a/public/2016-02/index.html
+++ b/public/2016-02/index.html
@@ -248,6 +248,40 @@ Swap:          255         57        198
 
 <ul>
 <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
+</ul>
+
+<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
+
+<ul>
+<li>Massaging some CIAT data in OpenRefine</li>
+<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
+<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
+</ul>
+
+<pre><code>value.split('/')[-1]
+</code></pre>
+
+<ul>
+<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
+</ul>
+
+<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+&gt; Downloading 64661.pdf
+&gt; Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+&gt; Downloading 64195.pdf
+&gt; Creating thumbnail for 64195.pdf
+</code></pre>
+
+<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
+
+<ul>
+<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
+<li>A few items are using the exact same PDF</li>
+<li>A few items are using HTM or DOC files</li>
+<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or ResearchGate</li>
+<li>A few items have no item</li>
 </ul>
 
   </section>
diff --git a/public/index.xml b/public/index.xml
index 1c509b748..786f04533 100644
--- a/public/index.xml
+++ b/public/index.xml
@@ -187,6 +187,40 @@ Swap:          255         57        198
 &lt;ul&gt;
 &lt;li&gt;So I&amp;rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)&lt;/li&gt;
 &lt;/ul&gt;
+
+&lt;h2 id=&#34;2016-02-11:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-11&lt;/h2&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Massaging some CIAT data in OpenRefine&lt;/li&gt;
+&lt;li&gt;There are 1200 records that have PDFs, and will need to be imported into CGSpace&lt;/li&gt;
+&lt;li&gt;I created a &lt;code&gt;filename&lt;/code&gt; column based on the &lt;code&gt;dc.identifier.url&lt;/code&gt; column using the following transform:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;pre&gt;&lt;code&gt;value.split(&#39;/&#39;)[-1]
+&lt;/code&gt;&lt;/pre&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Then I wrote a tool called &lt;a href=&#34;https://gist.github.com/alanorth/2206f24483fe5f0454fc&#34;&gt;&lt;code&gt;generate-thumbnails.py&lt;/code&gt;&lt;/a&gt; to download the PDFs and generate thumbnails for them, for example:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;pre&gt;&lt;code&gt;$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+&amp;gt; Downloading 64661.pdf
+&amp;gt; Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+&amp;gt; Downloading 64195.pdf
+&amp;gt; Creating thumbnail for 64195.pdf
+&lt;/code&gt;&lt;/pre&gt;
+
+&lt;h2 id=&#34;2016-02-12:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-12&lt;/h2&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Looking at CIAT&amp;rsquo;s records again, there are some problems with a dozen or so files (out of 1200)&lt;/li&gt;
+&lt;li&gt;A few items are using the exact same PDF&lt;/li&gt;
+&lt;li&gt;A few items are using HTM or DOC files&lt;/li&gt;
+&lt;li&gt;A few items link to PDFs on IFPRI&amp;rsquo;s e-Library or ResearchGate&lt;/li&gt;
+&lt;li&gt;A few items have no item&lt;/li&gt;
+&lt;/ul&gt;
 </description>
     </item>
     
diff --git a/public/tags/notes/index.xml b/public/tags/notes/index.xml
index a913ad2fd..f554f8101 100644
--- a/public/tags/notes/index.xml
+++ b/public/tags/notes/index.xml
@@ -187,6 +187,40 @@ Swap:          255         57        198
 &lt;ul&gt;
 &lt;li&gt;So I&amp;rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)&lt;/li&gt;
 &lt;/ul&gt;
+
+&lt;h2 id=&#34;2016-02-11:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-11&lt;/h2&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Massaging some CIAT data in OpenRefine&lt;/li&gt;
+&lt;li&gt;There are 1200 records that have PDFs, and will need to be imported into CGSpace&lt;/li&gt;
+&lt;li&gt;I created a &lt;code&gt;filename&lt;/code&gt; column based on the &lt;code&gt;dc.identifier.url&lt;/code&gt; column using the following transform:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;pre&gt;&lt;code&gt;value.split(&#39;/&#39;)[-1]
+&lt;/code&gt;&lt;/pre&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Then I wrote a tool called &lt;a href=&#34;https://gist.github.com/alanorth/2206f24483fe5f0454fc&#34;&gt;&lt;code&gt;generate-thumbnails.py&lt;/code&gt;&lt;/a&gt; to download the PDFs and generate thumbnails for them, for example:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;pre&gt;&lt;code&gt;$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+&amp;gt; Downloading 64661.pdf
+&amp;gt; Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+&amp;gt; Downloading 64195.pdf
+&amp;gt; Creating thumbnail for 64195.pdf
+&lt;/code&gt;&lt;/pre&gt;
+
+&lt;h2 id=&#34;2016-02-12:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-12&lt;/h2&gt;
+
+&lt;ul&gt;
+&lt;li&gt;Looking at CIAT&amp;rsquo;s records again, there are some problems with a dozen or so files (out of 1200)&lt;/li&gt;
+&lt;li&gt;A few items are using the exact same PDF&lt;/li&gt;
+&lt;li&gt;A few items are using HTM or DOC files&lt;/li&gt;
+&lt;li&gt;A few items link to PDFs on IFPRI&amp;rsquo;s e-Library or ResearchGate&lt;/li&gt;
+&lt;li&gt;A few items have no item&lt;/li&gt;
+&lt;/ul&gt;
 </description>
     </item>