Add notes for 2016-02-11 and 2016-02-12

Signed-off-by: Alan Orth <alan.orth@gmail.com>
Alan Orth 2016-02-12 15:44:32 +02:00
parent 380ca762af
commit 9248744c52
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 132 additions and 0 deletions

@ -154,3 +154,33 @@ Swap: 255 57 198
```
- So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
## 2016-02-11
- Massaging some CIAT data in OpenRefine
- There are 1200 records that have PDFs, and will need to be imported into CGSpace
- I created a `filename` column based on the `dc.identifier.url` column using the following transform:
```
value.split('/')[-1]
```
- Then I wrote a tool called [`generate-thumbnails.py`](https://gist.github.com/alanorth/2206f24483fe5f0454fc) to download the PDFs and generate thumbnails for them, for example:
```
$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
> Downloading 64661.pdf
> Creating thumbnail for 64661.pdf
Processing 64195.pdf
> Downloading 64195.pdf
> Creating thumbnail for 64195.pdf
```
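- For reference, a minimal Python sketch of the same filename logic applied to a CSV (this is not the code from `generate-thumbnails.py`; the sample rows, the `example.org` URLs, and the helper names `filename_from_url` and `filenames_from_csv` are assumptions for illustration, and only the `dc.identifier.url` column name comes from the notes above):

```python
import csv
import io


def filename_from_url(url):
    """Return the last path segment of a URL, like value.split('/')[-1]."""
    return url.split('/')[-1]


def filenames_from_csv(csv_file):
    """Yield (url, filename) pairs for every row that has a URL."""
    for row in csv.DictReader(csv_file):
        # Rows without a URL are skipped rather than treated as errors
        url = (row.get('dc.identifier.url') or '').strip()
        if url:
            yield url, filename_from_url(url)


# Hypothetical sample data; the real CSV is not shown in these notes
sample = io.StringIO(
    "dc.title,dc.identifier.url\n"
    "Report A,http://example.org/reports/64661.pdf\n"
    "Report B,http://example.org/reports/64195.pdf\n"
)
for url, filename in filenames_from_csv(sample):
    print(f"Processing {filename}")
```

- Downloading and thumbnail creation are left out of the sketch on purpose, since the notes do not say how the real script does either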
## 2016-02-12
- Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)
- A few items are using the same exact PDF
- A few items are using HTM or DOC files
- A few items link to PDFs on IFPRI's e-Library or ResearchGate
- A few items have no file
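- A rough sketch of how these problem categories could be flagged automatically, assuming each record is an `(item_id, url)` pair and that the last bullet means records without a file URL (the function name `find_problems`, the `expected_host` parameter, and the sample hostnames are hypothetical and not part of any existing tool):

```python
from collections import Counter
from urllib.parse import urlparse


def find_problems(records, expected_host):
    """Group record IDs by the kinds of problems listed above.

    records: iterable of (item_id, url) pairs; url may be empty.
    expected_host: hostname the PDFs are expected to live on (an assumption).
    """
    records = list(records)
    # Count how many records point at each URL to spot shared PDFs
    url_counts = Counter(url for _, url in records if url)
    problems = {'missing': [], 'duplicate': [], 'not_pdf': [], 'external': []}
    for item_id, url in records:
        if not url:
            problems['missing'].append(item_id)
            continue
        parsed = urlparse(url)
        if url_counts[url] > 1:
            problems['duplicate'].append(item_id)
        if not parsed.path.lower().endswith('.pdf'):
            problems['not_pdf'].append(item_id)
        if parsed.hostname != expected_host:
            problems['external'].append(item_id)
    return problems


# Hypothetical records covering each category once
records = [
    ('1', 'http://ciat.example/1.pdf'),
    ('2', 'http://ciat.example/1.pdf'),
    ('3', 'http://ciat.example/3.htm'),
    ('4', 'http://other.example/4.pdf'),
    ('5', ''),
]
problems = find_problems(records, 'ciat.example')
for kind, ids in problems.items():
    print(kind, ids)
```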

@ -248,6 +248,40 @@ Swap: 255 57 198
<ul>
<li>So I&rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
</ul>
<h2 id="2016-02-11:124a59adbaa8ef13e1518d003fc03981">2016-02-11</h2>
<ul>
<li>Massaging some CIAT data in OpenRefine</li>
<li>There are 1200 records that have PDFs, and will need to be imported into CGSpace</li>
<li>I created a <code>filename</code> column based on the <code>dc.identifier.url</code> column using the following transform:</li>
</ul>
<pre><code>value.split('/')[-1]
</code></pre>
<ul>
<li>Then I wrote a tool called <a href="https://gist.github.com/alanorth/2206f24483fe5f0454fc"><code>generate-thumbnails.py</code></a> to download the PDFs and generate thumbnails for them, for example:</li>
</ul>
<pre><code>$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&gt; Downloading 64661.pdf
&gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&gt; Downloading 64195.pdf
&gt; Creating thumbnail for 64195.pdf
</code></pre>
<h2 id="2016-02-12:124a59adbaa8ef13e1518d003fc03981">2016-02-12</h2>
<ul>
<li>Looking at CIAT&rsquo;s records again, there are some problems with a dozen or so files (out of 1200)</li>
<li>A few items are using the same exact PDF</li>
<li>A few items are using HTM or DOC files</li>
<li>A few items link to PDFs on IFPRI&rsquo;s e-Library or ResearchGate</li>
<li>A few items have no file</li>
</ul>
</section>

@ -187,6 +187,40 @@ Swap: 255 57 198
&lt;ul&gt;
&lt;li&gt;So I&amp;rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-02-11:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-11&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Massaging some CIAT data in OpenRefine&lt;/li&gt;
&lt;li&gt;There are 1200 records that have PDFs, and will need to be imported into CGSpace&lt;/li&gt;
&lt;li&gt;I created a &lt;code&gt;filename&lt;/code&gt; column based on the &lt;code&gt;dc.identifier.url&lt;/code&gt; column using the following transform:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value.split(&#39;/&#39;)[-1]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Then I wrote a tool called &lt;a href=&#34;https://gist.github.com/alanorth/2206f24483fe5f0454fc&#34;&gt;&lt;code&gt;generate-thumbnails.py&lt;/code&gt;&lt;/a&gt; to download the PDFs and generate thumbnails for them, for example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&amp;gt; Downloading 64661.pdf
&amp;gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&amp;gt; Downloading 64195.pdf
&amp;gt; Creating thumbnail for 64195.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;2016-02-12:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-12&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Looking at CIAT&amp;rsquo;s records again, there are some problems with a dozen or so files (out of 1200)&lt;/li&gt;
&lt;li&gt;A few items are using the same exact PDF&lt;/li&gt;
&lt;li&gt;A few items are using HTM or DOC files&lt;/li&gt;
&lt;li&gt;A few items link to PDFs on IFPRI&amp;rsquo;s e-Library or ResearchGate&lt;/li&gt;
&lt;li&gt;A few items have no file&lt;/li&gt;
&lt;/ul&gt;
</description>
</item>

@ -187,6 +187,40 @@ Swap: 255 57 198
&lt;ul&gt;
&lt;li&gt;So I&amp;rsquo;ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-02-11:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-11&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Massaging some CIAT data in OpenRefine&lt;/li&gt;
&lt;li&gt;There are 1200 records that have PDFs, and will need to be imported into CGSpace&lt;/li&gt;
&lt;li&gt;I created a &lt;code&gt;filename&lt;/code&gt; column based on the &lt;code&gt;dc.identifier.url&lt;/code&gt; column using the following transform:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value.split(&#39;/&#39;)[-1]
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Then I wrote a tool called &lt;a href=&#34;https://gist.github.com/alanorth/2206f24483fe5f0454fc&#34;&gt;&lt;code&gt;generate-thumbnails.py&lt;/code&gt;&lt;/a&gt; to download the PDFs and generate thumbnails for them, for example:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
&amp;gt; Downloading 64661.pdf
&amp;gt; Creating thumbnail for 64661.pdf
Processing 64195.pdf
&amp;gt; Downloading 64195.pdf
&amp;gt; Creating thumbnail for 64195.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;2016-02-12:124a59adbaa8ef13e1518d003fc03981&#34;&gt;2016-02-12&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Looking at CIAT&amp;rsquo;s records again, there are some problems with a dozen or so files (out of 1200)&lt;/li&gt;
&lt;li&gt;A few items are using the same exact PDF&lt;/li&gt;
&lt;li&gt;A few items are using HTM or DOC files&lt;/li&gt;
&lt;li&gt;A few items link to PDFs on IFPRI&amp;rsquo;s e-Library or ResearchGate&lt;/li&gt;
&lt;li&gt;A few items have no file&lt;/li&gt;
&lt;/ul&gt;
</description>
</item>