From 9248744c520b2aedcffe9ace6a66e78aa2f98d51 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Fri, 12 Feb 2016 15:44:32 +0200
Subject: [PATCH] Add notes for 2016-02-11 and 2016-02-12

Signed-off-by: Alan Orth
---
 content/2016-02.md          | 30 ++++++++++++++++++++++++++++++
 public/2016-02/index.html   | 34 ++++++++++++++++++++++++++++++++++
 public/index.xml            | 34 ++++++++++++++++++++++++++++++++++
 public/tags/notes/index.xml | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 132 insertions(+)

diff --git a/content/2016-02.md b/content/2016-02.md
index 01d4cfeab..801f75475 100644
--- a/content/2016-02.md
+++ b/content/2016-02.md
@@ -154,3 +154,33 @@ Swap: 255 57 198
 ```

 - So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
+
+## 2016-02-11
+
+- Massaging some CIAT data in OpenRefine
+- There are 1200 records that have PDFs, and will need to be imported into CGSpace
+- I created a `filename` column based on the `dc.identifier.url` column using the following transform:
+
+```
+value.split('/')[-1]
+```
+
+- Then I wrote a tool called [`generate-thumbnails.py`](https://gist.github.com/alanorth/2206f24483fe5f0454fc) to download the PDFs and generate thumbnails for them, for example:
+
+```
+$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+> Downloading 64661.pdf
+> Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+> Downloading 64195.pdf
+> Creating thumbnail for 64195.pdf
+```
+
+## 2016-02-12
+
+- Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)
+- A few items are using the same exact PDF
+- A few items are using HTM or DOC files
+- A few items link to PDFs on IFPRI's e-Library or Research Gate
+- A few items have no item
diff --git a/public/2016-02/index.html b/public/2016-02/index.html
index 7f96cc4fb..8b4bb57c1 100644
--- a/public/2016-02/index.html
+++ b/public/2016-02/index.html
@@ -248,6 +248,40 @@ Swap: 255 57 198
+
+

2016-02-11

+ + + +
value.split('/')[-1]
+
+ + + +
$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+> Downloading 64661.pdf
+> Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+> Downloading 64195.pdf
+> Creating thumbnail for 64195.pdf
+
+ +

2016-02-12

+
+
diff --git a/public/index.xml b/public/index.xml
index 1c509b748..786f04533 100644
--- a/public/index.xml
+++ b/public/index.xml
@@ -187,6 +187,40 @@ Swap: 255 57 198
 <ul>
 <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
 </ul>
+
+<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
+
+<ul>
+<li>Massaging some CIAT data in OpenRefine</li>
+<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
+<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
+</ul>
+
+<pre><code>value.split('/')[-1]
+</code></pre>
+
+<ul>
+<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
+</ul>
+
+<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+&gt; Downloading 64661.pdf
+&gt; Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+&gt; Downloading 64195.pdf
+&gt; Creating thumbnail for 64195.pdf
+</code></pre>
+
+<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
+
+<ul>
+<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
+<li>A few items are using the same exact PDF</li>
+<li>A few items are using HTM or DOC files</li>
+<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li>
+<li>A few items have no item</li>
+</ul>
diff --git a/public/tags/notes/index.xml b/public/tags/notes/index.xml
index a913ad2fd..f554f8101 100644
--- a/public/tags/notes/index.xml
+++ b/public/tags/notes/index.xml
@@ -187,6 +187,40 @@ Swap: 255 57 198
 <ul>
 <li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
 </ul>
+
+<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
+
+<ul>
+<li>Massaging some CIAT data in OpenRefine</li>
+<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
+<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
+</ul>
+
+<pre><code>value.split('/')[-1]
+</code></pre>
+
+<ul>
+<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
+</ul>
+
+<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
+Processing 64661.pdf
+&gt; Downloading 64661.pdf
+&gt; Creating thumbnail for 64661.pdf
+Processing 64195.pdf
+&gt; Downloading 64195.pdf
+&gt; Creating thumbnail for 64195.pdf
+</code></pre>
+
+<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
+
+<ul>
+<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
+<li>A few items are using the same exact PDF</li>
+<li>A few items are using HTM or DOC files</li>
+<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or Research Gate</li>
+<li>A few items have no item</li>
+</ul>
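
Note on the 2016-02-12 checks: the problems listed in the notes (repeated PDFs, HTM or DOC files, links to other sites, records with nothing attached) could in principle be flagged from the CSV before downloading anything. The following is only a rough sketch under stated assumptions, not the logic of `generate-thumbnails.py`. The function names, the `id` column, the `allowed_hosts` parameter, the problem labels, and the `example.org` URLs are all hypothetical; only the `dc.identifier.url` column name and the last-path-segment rule come from the notes.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment, mirroring value.split('/')[-1]."""
    return url.strip().split("/")[-1]


def classify_records(rows, url_field="dc.identifier.url", allowed_hosts=()):
    """Return a list of (url, problems) pairs, one per input row.

    The labels are illustrative; they loosely follow the problem
    categories mentioned in the 2016-02-12 notes.
    """
    urls = [(row.get(url_field) or "").strip() for row in rows]
    counts = Counter(u for u in urls if u)
    report = []
    for url in urls:
        problems = []
        if not url:
            problems.append("no URL")
        else:
            if not filename_from_url(url).lower().endswith(".pdf"):
                problems.append("not a PDF")
            if counts[url] > 1:
                problems.append("duplicate URL")
            host = urlparse(url).netloc.lower()
            if allowed_hosts and host not in allowed_hosts:
                problems.append("external host")
        report.append((url, problems))
    return report


# Made-up sample data; the real CSV layout is not shown in the notes.
sample = io.StringIO(
    "id,dc.identifier.url\n"
    "1,http://example.org/reports/64661.pdf\n"
    "2,http://example.org/reports/64661.pdf\n"
    "3,http://example.org/reports/notes.doc\n"
    "4,http://other.example.com/files/1.pdf\n"
    "5,\n"
)
rows = list(csv.DictReader(sample))
report = classify_records(rows, allowed_hosts={"example.org"})
for url, problems in report:
    print(url or "(empty)", problems)
```

Counting exact URL strings only catches items that point at the identical link; detecting the same PDF served from different URLs would need a content checksum after download.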