mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-26 00:18:21 +01:00
Add notes for 2017-04-20
parent
96298cc5bf
commit
949cf17a98
@ -275,3 +275,40 @@ value.split('/')[-1].replace(/#.*$/,"")
- The `replace` part is because some URLs have an anchor like `#page=14` which we obviously don't want on the filename
- Also, we need to only use the PDF on the item corresponding with page 1, so we don't end up with literally hundreds of duplicate PDFs
- Alternatively, I could export each page to a standalone PDF...
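
For readers outside OpenRefine, the filename expression shown in the hunk header (`value.split('/')[-1].replace(/#.*$/,"")`) can be approximated in Python. This is only a sketch: the function name `filename_from_url` and the sample URL are hypothetical and do not come from the notes.

```python
import re


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL without any trailing #anchor."""
    last_segment = url.split("/")[-1]
    # Same idea as the GREL replace(/#.*$/, ""): drop '#' and everything after it
    return re.sub(r"#.*$", "", last_segment)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration
    print(filename_from_url("https://example.org/docs/book.pdf#page=14"))
```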

## 2017-04-20

- Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful
- I re-enabled it with a hidden config key `workflow.stats.enabled = true` on DSpace Test and will evaluate adding it on CGSpace
- Looking at the CIAT data again, a bunch of items have metadata values ending in `||`, which might cause blank fields to be added at import time
- Cleaning them up with OpenRefine:

```
value.replace(/\|\|$/,"")
```
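
The same cleanup could be done outside OpenRefine. The following is a minimal Python sketch of the equivalent regular expression; the function name `strip_trailing_separator` is made up for illustration.

```python
import re


def strip_trailing_separator(value: str) -> str:
    """Remove one `||` separator if it sits at the very end of the value."""
    # Same pattern as the GREL expression: an escaped "||" anchored to the end
    return re.sub(r"\|\|$", "", value)


if __name__ == "__main__":
    # Made-up sample value, not taken from the CIAT data
    print(strip_trailing_separator("FIRST VALUE||SECOND VALUE||"))
```

Values without a trailing `||` pass through unchanged, so it should be safe to apply to a whole column.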

- Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle
- I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items

![Flagging and filtering duplicates in OpenRefine](/cgspace-notes/2017/04/openrefine-flagging-duplicates.png)
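
The flagging itself was done interactively in OpenRefine. Purely as an illustration, the rule "keep the filename only on the first item that uses a given PDF" might look like this in Python. The row layout (a list of dicts) and the `filename` column name are assumptions, not the actual CIAT spreadsheet structure.

```python
def keep_first_filename(rows, column="filename"):
    """Blank the filename on every row after the first one that uses it."""
    seen = set()
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are not modified
        name = row.get(column, "")
        if name and name in seen:
            row[column] = ""  # later item pointing at an already-used PDF
        elif name:
            seen.add(name)  # first item for this PDF keeps its filename
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    # Hypothetical rows, for illustration only
    sample = [
        {"title": "Chapter 1", "filename": "book.pdf"},
        {"title": "Chapter 2", "filename": "book.pdf"},
        {"title": "Other report", "filename": "report.pdf"},
    ]
    for item in keep_first_filename(sample):
        print(item)
```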

- Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace
- Unbelievable, there are also metadata values like:

```
COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
```
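
Assuming the problem in the value above is the stray space after a `||` separator (the notes do not say so explicitly), one way to normalize such multi-value fields is to split on the separator, trim each part, and rejoin. This Python sketch is an illustration only; `normalize_multivalue` is a hypothetical helper.

```python
def normalize_multivalue(value: str, separator: str = "||") -> str:
    """Trim whitespace around each sub-value and drop empty sub-values."""
    parts = [part.strip() for part in value.split(separator)]
    # Dropping empty parts also removes a trailing separator such as "A||"
    return separator.join(part for part in parts if part)


if __name__ == "__main__":
    print(normalize_multivalue("COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM"))
```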

- Add a description to the file names using:

```
value + "__description:" + cells["dc.type"].value
```
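
As a sketch of what this expression produces, here is the same concatenation in Python. The function name and the sample filename and type are hypothetical; only the `__description:` marker comes from the notes.

```python
def with_description(filename: str, dc_type: str) -> str:
    """Append the item's dc.type to the filename as a description suffix."""
    return filename + "__description:" + dc_type


if __name__ == "__main__":
    # Hypothetical filename and type, for illustration only
    print(with_description("book.pdf", "Book"))
```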

- Test import of 933 records:

```
$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
```
@ -30,7 +30,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
<meta property="article:published_time" content="2017-04-02T17:08:52+02:00"/>
<meta property="article:modified_time" content="2017-04-19T15:39:19+03:00"/>
<meta property="article:modified_time" content="2017-04-19T18:37:04+03:00"/>
@ -79,9 +79,9 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
"@type": "BlogPosting",
"headline": "April, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-04/",
"wordCount": "1877",
"wordCount": "2085",
"datePublished": "2017-04-02T17:08:52+02:00",
"dateModified": "2017-04-19T15:39:19+03:00",
"dateModified": "2017-04-19T18:37:04+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -477,6 +477,49 @@ $ rails -s
<li>Alternatively, I could export each page to a standalone PDF…</li>
</ul>

<h2 id="2017-04-20">2017-04-20</h2>

<ul>
<li>Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful</li>
<li>I re-enabled it with a hidden config key <code>workflow.stats.enabled = true</code> on DSpace Test and will evaluate adding it on CGSpace</li>
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
<li>Cleaning them up with OpenRefine:</li>
</ul>

<pre><code>value.replace(/\|\|$/,"")
</code></pre>

<ul>
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle</li>
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
</ul>

<p><img src="/cgspace-notes/2017/04/openrefine-flagging-duplicates.png" alt="Flagging and filtering duplicates in OpenRefine" /></p>

<ul>
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
<li>Unbelievable, there are also metadata values like:</li>
</ul>

<pre><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
</code></pre>

<ul>
<li>Add a description to the file names using:</li>
</ul>

<pre><code>value + "__description:" + cells["dc.type"].value
</code></pre>

<ul>
<li>Test import of 933 records:</li>
</ul>

<pre><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
</code></pre>

BIN
public/2017/04/openrefine-flagging-duplicates.png
Normal file
Binary file not shown.
Size: 182 KiB
@ -3,7 +3,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-04/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
</url>

<url>
@ -93,7 +93,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>
@ -104,19 +104,19 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-04-19T15:39:19+03:00</lastmod>
<lastmod>2017-04-19T18:37:04+03:00</lastmod>
<priority>0</priority>
</url>
BIN
static/2017/04/openrefine-flagging-duplicates.png
Normal file
Binary file not shown.
Size: 182 KiB