Add notes for 2016-02-11 and 2016-02-12

Signed-off-by: Alan Orth <alan.orth@gmail.com>
2016-02-12 15:44:32 +02:00
parent 380ca762af
commit 9248744c52
4 changed files with 132 additions and 0 deletions

@ -154,3 +154,33 @@ Swap: 255 57 198
```
- So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)
## 2016-02-11
- Massaging some CIAT data in OpenRefine
- There are 1200 records that have PDFs and will need to be imported into CGSpace
- I created a `filename` column based on the `dc.identifier.url` column using the following transform:
```
value.split('/')[-1]
```
- Then I wrote a tool called [`generate-thumbnails.py`](https://gist.github.com/alanorth/2206f24483fe5f0454fc) to download the PDFs and generate thumbnails for them, for example:
```
$ ./generate-thumbnails.py ciat-reports.csv
Processing 64661.pdf
> Downloading 64661.pdf
> Creating thumbnail for 64661.pdf
Processing 64195.pdf
> Downloading 64195.pdf
> Creating thumbnail for 64195.pdf
```
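- For reference, the `filename` transform just keeps the last path segment of each URL. A minimal Python sketch of the same idea, assuming the exported CSV still has a `dc.identifier.url` column; the helper names and the `example.org` sample rows are hypothetical, and this is not the code from `generate-thumbnails.py`:

```python
import csv
import io


def filename_from_url(url):
    """Return the last path segment of a URL, like value.split('/')[-1]."""
    return url.strip().split('/')[-1]


def filenames_from_csv(handle, column='dc.identifier.url'):
    """Yield (url, filename) pairs for every row with a non-empty URL."""
    for row in csv.DictReader(handle):
        url = (row.get(column) or '').strip()
        if url:
            yield url, filename_from_url(url)


# Hypothetical sample data; example.org is only a placeholder host
sample = io.StringIO(
    "dc.title,dc.identifier.url\n"
    "Report A,http://example.org/reports/64661.pdf\n"
    "Report B,http://example.org/reports/64195.pdf\n"
)

pairs = list(filenames_from_csv(sample))
for url, filename in pairs:
    print(f"Processing {filename}")
```

- Rows with an empty URL are skipped here rather than raising an error, which seemed like the safer default for a spreadsheet of 1200 records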
## 2016-02-12
- Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)
- A few items are using the same exact PDF
- A few items are using HTM or DOC files
  - A few items link to PDFs on IFPRI's e-Library or ResearchGate
  - A few items have no file
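- Checks like these could be automated with something along the following lines. This is only a sketch: `find_problems`, the problem labels, and the sample records are hypothetical, an empty URL is assumed to represent the last case in the list, and the set of external hosts has to be supplied by hand:

```python
from collections import Counter
from urllib.parse import urlparse


def find_problems(records, external_hosts):
    """Map item id -> list of problem labels.

    records is an iterable of (item_id, url) pairs.
    """
    records = list(records)
    # Count how many items point at each URL to spot shared files
    url_counts = Counter(url for _, url in records if url)
    problems = {}
    for item_id, url in records:
        found = []
        if not url:
            found.append('missing URL')
        else:
            parsed = urlparse(url)
            if url_counts[url] > 1:
                found.append('duplicate file')
            if not parsed.path.lower().endswith('.pdf'):
                found.append('not a PDF')
            if parsed.hostname in external_hosts:
                found.append('external host')
        if found:
            problems[item_id] = found
    return problems


# Hypothetical sample records; example.org is only a placeholder host
records = [
    ('1', 'http://example.org/a.pdf'),
    ('2', 'http://example.org/a.pdf'),
    ('3', 'http://example.org/b.htm'),
    ('4', 'http://www.researchgate.net/c.pdf'),
    ('5', ''),
    ('6', 'http://example.org/d.pdf'),
]
problems = find_problems(records, external_hosts={'www.researchgate.net'})
for item_id, labels in problems.items():
    print(item_id, ', '.join(labels))
```

- Duplicates are matched on the exact URL string only, so two different URLs that serve the same PDF would not be caught by this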