Add notes for 2017-04-20

Alan Orth 2017-04-20 14:06:33 +03:00
parent 96298cc5bf
commit 949cf17a98
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
5 changed files with 88 additions and 8 deletions

@@ -275,3 +275,40 @@ value.split('/')[-1].replace(/#.*$/,"")
- The `replace` part is because some URLs have an anchor like `#page=14` which we obviously don't want on the filename
- Also, we need to only use the PDF on the item corresponding with page 1, so we don't end up with literally hundreds of duplicate PDFs
- Alternatively, I could export each page to a standalone PDF...
## 2017-04-20
- Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful
- I re-enabled it with a hidden config key `workflow.stats.enabled = true` on DSpace Test and will evaluate adding it on CGSpace
- Looking at the CIAT data again, a bunch of items have metadata values ending in `||`, which might cause blank fields to be added at import time
- Cleaning them up with OpenRefine:
```
value.replace(/\|\|$/,"")
```
- Working with the CIAT data in OpenRefine to remove the filename from all but the first item that requires a particular PDF: many items point to the same PDF, which would add hundreds of duplicates if we included them all in the SAF bundle
- I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items
![Flagging and filtering duplicates in OpenRefine](/cgspace-notes/2017/04/openrefine-flagging-duplicates.png)
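The flag-and-filter step above could also be scripted. A minimal sketch, assuming the rows are dicts with a `filename` key (a hypothetical column name) and that the first row referencing a PDF is the one that should keep it:

```python
def keep_first_filename(rows, key="filename"):
    """Blank the filename on every row after the first one that references it.

    Assumes row order already puts the item that should own the PDF first.
    """
    seen = set()
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        name = row.get(key, "")
        if name:
            if name in seen:
                row[key] = ""  # duplicate reference: drop the filename
            else:
                seen.add(name)
        cleaned.append(row)
    return cleaned
```

Rows that share a PDF keep their metadata; only the repeated filename is cleared, so the SAF bundle would carry each PDF once.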
- Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace
- Unbelievable, there are also metadata values like:
```
COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
```
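The trailing `||` and the stray whitespace around separators can be handled in one pass. A rough sketch outside OpenRefine, assuming `||` is the only multi-value separator and that empty parts should simply be dropped:

```python
def normalize_multivalue(value, separator="||"):
    """Trim each part of a multi-value field and drop empty parts."""
    parts = (part.strip() for part in value.split(separator))
    return separator.join(part for part in parts if part)
```

For the value shown above this would give `COLLETOTRICHUM LINDEMUTHIANUM||FUSARIUM||GERMPLASM`, and a value ending in `||` loses the trailing separator.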
- Add a description to the file names using:
```
value + "__description:" + cells["dc.type"].value
```
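The same transformation as a plain function, for illustration only. Leaving blank filenames and blank types untouched is an assumption here, not something the expression above does:

```python
def add_description(filename, dc_type):
    """Append a "__description:" suffix to a filename cell.

    Blank filenames (items whose PDF was removed as a duplicate) and
    blank types are returned unchanged; that handling is an assumption.
    """
    if not filename or not dc_type:
        return filename
    return f"{filename}__description:{dc_type}"
```

With hypothetical inputs `book.pdf` and `Book`, the cell becomes `book.pdf__description:Book`.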
- Test import of 933 records:
```
$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
```
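The `wc -l` check works because the mapfile gets one line per imported item. A small sketch for inspecting it further, assuming each line has the form `item_directory handle`:

```python
def parse_mapfile(text):
    """Map each item directory to its handle from mapfile text.

    Assumes one "item_directory handle" pair per line; blank lines are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right: handles contain no whitespace.
        item_dir, handle = line.rsplit(None, 1)
        mapping[item_dir] = handle
    return mapping
```

Comparing `len(parse_mapfile(...))` against the expected record count would also catch an item directory listed twice, which a plain line count would not.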

@@ -30,7 +30,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
<meta property="article:published_time" content="2017-04-02T17:08:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-04-19T15:39:19&#43;03:00"/>
<meta property="article:modified_time" content="2017-04-19T18:37:04&#43;03:00"/>
@@ -79,9 +79,9 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
"@type": "BlogPosting",
"headline": "April, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-04/",
"wordCount": "1877",
"wordCount": "2085",
"datePublished": "2017-04-02T17:08:52&#43;02:00",
"dateModified": "2017-04-19T15:39:19&#43;03:00",
"dateModified": "2017-04-19T18:37:04&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -477,6 +477,49 @@ $ rails -s
<li>Alternatively, I could export each page to a standalone PDF&hellip;</li>
</ul>
<h2 id="2017-04-20">2017-04-20</h2>
<ul>
<li>Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful</li>
<li>I re-enabled it with a hidden config key <code>workflow.stats.enabled = true</code> on DSpace Test and will evaluate adding it on CGSpace</li>
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>
<pre><code>value.replace(/\|\|$/,&quot;&quot;)
</code></pre>
<ul>
<li>Working with the CIAT data in OpenRefine to remove the filename from all but the first item that requires a particular PDF: many items point to the same PDF, which would add hundreds of duplicates if we included them all in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
</ul>
<p><img src="/cgspace-notes/2017/04/openrefine-flagging-duplicates.png" alt="Flagging and filtering duplicates in OpenRefine" /></p>
<ul>
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>
<pre><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre>
<ul>
<li>Add a description to the file names using:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre>
<ul>
<li>Test import of 933 records:</li>
</ul>
<pre><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre>

Binary file not shown (182 KiB).

@@ -3,7 +3,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-04/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
</url>
<url>
@@ -93,7 +93,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>
@@ -104,19 +104,19 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>

Binary file not shown (182 KiB).