mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2018-06-27
@@ -308,4 +308,43 @@ dc.contributor.author,cg.creator.id
- It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
- Ah, I think I just need to run `dspace oai import`

## 2018-06-27

- Vika from CIFOR sent back his annotations on the duplicates for the "CIFOR_May_9" archive import that I sent him last week
- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
- First, get the 62 deletes from Vika's file and remove them from the collection:

```
$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
```

- This iterates over the handles for deletion and uses `sed` with an alternative address delimiter of '#' (which must be introduced with a backslash), because the pattern itself contains a '/'
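- A possible alternative to the loop, as a sketch only: `grep -v -F -f` filters out all the handles in one pass and treats them as fixed strings, so the '/' needs no special delimiter. The `remove_handles` function name is hypothetical, and like the `sed` version it matches a handle anywhere in the line, not only in the handle column:

```shell
# Hypothetical helper: drop every CSV line that mentions a handle from the list.
# Usage: remove_handles <handles-file> <csv-file>   (result goes to stdout)
remove_handles() {
    # -v inverts the match, -F treats each handle as a fixed string,
    # -f reads the patterns (one handle per line) from the handles file
    grep -v -F -f "$1" "$2"
}
```

- It would be called as `remove_handles cifor-handle-to-delete.txt 10568-92904.csv > filtered.csv` (the output name is made up), which leaves the original CSV untouched so the line counts can be compared afterwards. A blank line in the handles file would match every line and empty the output, so the list has to be clean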
- The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:

```
$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
```

- I can either get them from the database, or programmatically export the metadata using `dspace metadata-export -i 10568/xxxxx`...
- Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the `id` and `collection` columns using [csvkit](https://csvkit.readthedocs.io/):

```
$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
```
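
- A variation on the concatenation step, as a sketch: instead of deleting every line that starts with `id`, `awk` can skip only the first line of each file after the first one. This keeps a single header row and cannot drop a data line that happens to begin with "id". The `concat_keep_header` name is hypothetical, and this assumes `id` and `collection` are the first two columns of every export, as the `csvcut -c 1,2` above implies:

```shell
# Hypothetical helper: concatenate CSV exports, keeping only the first file's header.
# Usage: concat_keep_header file1.csv file2.csv ...   (result goes to stdout)
concat_keep_header() {
    # FNR is the line number within the current file, NR the overall line number,
    # so "FNR == 1 && NR != 1" is true for the first line of every file but the first
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

- Piping `concat_keep_header 10568-*.csv` into `csvcut -c 1,2` would then give the same two columns, with one header row on top. Only the first file's header survives, so the other columns are not aligned across items and should not be relied on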

- Then I can use Open Refine to add the "CIFOR Archive" collection to the mappings
- Importing the 2398 items via `dspace metadata-import` ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
- After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
- I'll let Abenet take one last look and then move them to CGSpace
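- For the batches of 1,000 mentioned above, a rough sketch of how the CSV could be split while repeating the header row in each piece. The `split_csv_batches` name and the output naming scheme are hypothetical, and the split is line-based, so it assumes no quoted field contains an embedded newline:

```shell
# Hypothetical helper: split a CSV into numbered batch files, repeating the header in each.
# Usage: split_csv_batches <csv-file> <rows-per-batch> <output-prefix>
split_csv_batches() {
    awk -v size="$2" -v prefix="$3" '
        NR == 1 { header = $0; next }     # remember the header line
        (NR - 2) % size == 0 {            # first data row of a new batch
            if (out != "") close(out)
            out = sprintf("%s-%02d.csv", prefix, ++batch)
            print header > out
        }
        { print > out }                   # write the data row to the current batch
    ' "$1"
}
```

- With 2,398 data rows and a batch size of 1,000 this would write three files (1,000, 1,000, and 398 rows, each with its own header), which could then be passed to `dspace metadata-import` one at a time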

<!-- vim: set sw=2 ts=2: -->