mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2017-04-20
@@ -275,3 +275,40 @@ value.split('/')[-1].replace(/#.*$/,"")
- The `replace` part is needed because some URLs have an anchor like `#page=14`, which we obviously don't want in the filename
- Also, we should only attach the PDF to the item corresponding to page 1, so we don't end up with literally hundreds of duplicate PDFs
- Alternatively, I could export each page to a standalone PDF...
## 2017-04-20
- Atmire responded about the Workflow Statistics, saying that it had been disabled because many environments needed customization to be useful
- I re-enabled it with a hidden config key `workflow.stats.enabled = true` on DSpace Test and will evaluate adding it on CGSpace
- Looking at the CIAT data again, a bunch of items have metadata values ending in `||`, which might cause blank fields to be added at import time
- Cleaning them up with OpenRefine:
```
value.replace(/\|\|$/,"")
```
- Working with the CIAT data in OpenRefine to remove the filename column from all but the first item that requires a particular PDF; many items point to the same PDF, which would add hundreds of duplicates if we included them all in the SAF bundle
- I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items
- Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace
- Unbelievable, there are also metadata values like:
```
COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
```
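
For checking the cleanup outside OpenRefine, here is a minimal Python sketch of the same idea: split on `||`, trim each value, and drop empty values (which also removes a trailing `||`). The function name `clean_multivalue` is hypothetical, and the sketch assumes `||` is always the multi-value separator in this dataset.

```python
def clean_multivalue(value: str, sep: str = "||") -> str:
    """Trim whitespace around each value and drop empty values.

    Assumption: `sep` ("||") is the only multi-value separator in the data.
    """
    parts = (part.strip() for part in value.split(sep))
    return sep.join(part for part in parts if part)


# The value quoted above has a stray space after the first separator
print(clean_multivalue("COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM"))
# A trailing separator would otherwise create a blank field at import time
print(clean_multivalue("GERMPLASM||"))
```

Dropping empty parts handles the trailing `||` case and the whitespace case in one pass, so a separate trailing-separator regex is not needed in this sketch.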
- Add a description to the file names using:
```
value + "__description:" + cells["dc.type"].value
```
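
The filename handling in this entry (keep a filename only on one item per PDF, then append a description) could be sketched in Python roughly as follows. This is not how it was done in OpenRefine, where the duplicates were flagged by hand. The function name `assign_filenames` and the `filename` column name are assumptions, and the sketch simply keeps the first row that references each PDF, which may not be the page-1 item unless the rows are sorted that way.

```python
def assign_filenames(rows):
    """Keep a filename only on the first row that references each PDF.

    Assumptions: each row is a dict with a "filename" column (hypothetical
    name) and a "dc.type" column, and the first row seen for a given PDF is
    the one that should carry the bitstream.
    """
    seen = set()
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        filename = row.get("filename", "").strip()
        if filename and filename not in seen:
            seen.add(filename)
            # Same shape as the expression above: value + "__description:" + dc.type
            row["filename"] = f"{filename}__description:{row.get('dc.type', '')}"
        else:
            # Later rows pointing at the same PDF get no bitstream
            row["filename"] = ""
        result.append(row)
    return result
```

Calling it on two rows that both reference `book.pdf` (a made-up filename) would leave the suffixed filename on the first row and an empty filename on the second.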
- Test import of 933 records:
```
$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
$ wc -l /tmp/ciat
933 /tmp/ciat
```
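
As a rough cross-check of the `wc -l` count, the number of mapfile lines can be compared with the number of item directories in the Simple Archive Format folder. The sketch below assumes the mapfile has one non-blank line per imported item and that each item is a subdirectory of the SAF directory; the function names are hypothetical.

```python
from pathlib import Path


def count_mapfile_items(mapfile_text: str) -> int:
    """Count non-blank lines in a mapfile (assumed: one imported item per line)."""
    return sum(1 for line in mapfile_text.splitlines() if line.strip())


def count_saf_items(saf_dir: str) -> int:
    """Count subdirectories of the SAF directory (assumed: one per item)."""
    return sum(1 for entry in Path(saf_dir).iterdir() if entry.is_dir())
```

For the import above, one would compare `count_mapfile_items(Path("/tmp/ciat").read_text())` with `count_saf_items("/home/aorth/src/CIAT-Books/SimpleArchiveFormat/")`; a mismatch would suggest some items were skipped.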