Add notes for 2017-04-20

2017-04-20 14:06:33 +03:00
parent 96298cc5bf
commit 949cf17a98
5 changed files with 88 additions and 8 deletions

@@ -275,3 +275,40 @@ value.split('/')[-1].replace(/#.*$/,"")
- The `replace` part is because some URLs have an anchor like `#page=14` which we obviously don't want on the filename
- Also, we need to only use the PDF on the item corresponding with page 1, so we don't end up with literally hundreds of duplicate PDFs
- Alternatively, I could export each page to a standalone PDF...
## 2017-04-20
- Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful
- I re-enabled it with a hidden config key `workflow.stats.enabled = true` on DSpace Test and will evaluate adding it on CGSpace
- Looking at the CIAT data again, a bunch of items have metadata values ending in `||`, which might cause blank fields to be added at import time
- Cleaning them up with OpenRefine:
```
value.replace(/\|\|$/,"")
```
- Working with the CIAT data in OpenRefine to remove the filename from every item except the first one that requires a particular PDF; many items point to the same PDF, so including them all in the SAF bundle would add hundreds of duplicates
- I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items
![Flagging and filtering duplicates in OpenRefine](/cgspace-notes/2017/04/openrefine-flagging-duplicates.png)
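- For reference, a minimal Python sketch of the same idea outside OpenRefine, assuming the rows are already sorted so that the item that should keep the PDF comes first; the `filename` and `id` column names and the sample rows are hypothetical, not taken from the CIAT data:

```python
def keep_first_filename(rows, column="filename"):
    """Blank the filename on every row after the first one that uses it.

    Assumes rows are ordered so the item that should keep the PDF is first.
    Returns new dicts and leaves the input rows untouched.
    """
    seen = set()
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        name = row.get(column, "")
        if name:
            if name in seen:
                row[column] = ""  # duplicate reference: drop the filename
            else:
                seen.add(name)
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows for illustration only
rows = [
    {"id": "1", "filename": "book-a.pdf"},
    {"id": "2", "filename": "book-a.pdf"},
    {"id": "3", "filename": "book-b.pdf"},
]
for row in keep_first_filename(rows):
    print(row["id"], repr(row["filename"]))
```

- This only replaces the star/flag filtering step; deciding which item is "first" still depends on how the data is sorted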
- Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace
- Unbelievable, there are also metadata values with stray whitespace after the `||` separator, like:
```
COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
```
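- A rough Python sketch of the combined cleanup (trailing `||`, plus whitespace around each value), assuming `||` is the multi-value separator throughout; the function name `clean_multivalue` is made up for illustration and this is not the GREL I ran:

```python
def clean_multivalue(value, sep="||"):
    """Split on the separator, trim each part, and drop empty parts."""
    parts = [part.strip() for part in value.split(sep)]
    return sep.join(part for part in parts if part)


print(clean_multivalue("COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM"))
# COLLETOTRICHUM LINDEMUTHIANUM||FUSARIUM||GERMPLASM
```

- Dropping empty parts also covers values ending in `||`, which would otherwise leave a blank trailing value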
- Add a description to the file names using:
```
value + "__description:" + cells["dc.type"].value
```
- Test import of 933 records:
```
$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
```
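- The same sanity check as a small Python sketch, assuming the mapfile has one line per imported item (which is what the `wc -l` check above relies on); the helper name `count_mapped_items` is hypothetical:

```python
def count_mapped_items(mapfile):
    """Count the non-blank lines in an import mapfile."""
    with open(mapfile, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

- Calling `count_mapped_items("/tmp/ciat")` should then match the number of records in the SAF bundle, 933 in this test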