Add notes for 2018-06-27

2018-06-27 17:16:24 +03:00
parent 6ecb0c84d6
commit a6f5f0c6b2
4 changed files with 97 additions and 14 deletions


@ -308,4 +308,43 @@ dc.contributor.author,cg.creator.id
- It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
- Ah, I think I just need to run `dspace oai import`
## 2018-06-27
- Vika from CIFOR sent back his annotations on the duplicates for the "CIFOR_May_9" archive import that I sent him last week
- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
- First, get the 62 deletes from Vika's file and remove them from the collection:
```
$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
```
- This iterates over the handles for deletion and uses `sed` with an alternative pattern delimiter of '#' (the opening one must be escaped with a backslash), because the pattern itself contains a '/'
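- To sanity check the `sed` delimiter trick outside of the real data, here is a minimal sketch on made-up handles (the file names and handles are hypothetical, and `sed -i` without a suffix assumes GNU sed); it also shows a single-pass `grep -v -F -f` variant for comparison:

```shell
# Hypothetical sample data: these handles are made up for illustration
demo=$(mktemp -d)
printf '%s\n' \
  'id,collection,dc.identifier.uri' \
  '1,10568/11111,https://hdl.handle.net/10568/22222' \
  '2,10568/11111,https://hdl.handle.net/10568/33333' \
  '3,10568/11111,https://hdl.handle.net/10568/44444' > "$demo/items.csv"
printf '%s\n' '10568/33333' > "$demo/to-delete.txt"
cp "$demo/items.csv" "$demo/items-sed.csv"

# Approach from the notes: \#...# makes '#' the address delimiter,
# so the '/' inside each handle does not need escaping (GNU sed for -i)
while read -r handle; do
  sed -i "\#$handle#d" "$demo/items-sed.csv"
done < "$demo/to-delete.txt"

# Single-pass alternative: drop every line containing any listed handle
grep -v -F -f "$demo/to-delete.txt" "$demo/items.csv" > "$demo/items-grep.csv"

cat "$demo/items-sed.csv"
```

- Both variants match the handle anywhere in the line and as a substring, so they are only safe while no listed handle is a prefix of another handle in the file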
- The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:
```
$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
```
- I can either get them from the database, or programmatically export the metadata using `dspace metadata-export -i 10568/xxxxx`...
- Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the `id` and `collection` columns using [csvkit](https://csvkit.readthedocs.io/):
```
$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
```
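- A possible variation on the concatenation step, sketched on two made-up single-item exports: keep the header from the first file only, so the combined file still starts with `id,collection`. The file names and values here are hypothetical, and plain `cut` stands in for `csvcut` only so the sketch runs without csvkit (it is not CSV-aware, so it assumes the first two columns contain no commas):

```shell
# Two hypothetical single-item exports mimicking the id,collection,... layout
exports=$(mktemp -d)
printf '%s\n' 'id,collection,dc.title' '101,10568/11111,First title' \
  > "$exports/10568-22222.csv"
printf '%s\n' 'id,collection,dc.title' '102,10568/11111,Second title' \
  > "$exports/10568-33333.csv"

# Skip line 1 of every file except the first, so exactly one header survives,
# then keep only the first two columns (id and collection)
awk 'FNR == 1 && NR != 1 { next } { print }' "$exports"/10568-*.csv \
  | cut -d, -f1,2 > "$exports/map.csv"

cat "$exports/map.csv"
```

- With csvkit installed, replacing the `cut` stage with `csvcut -c 1,2` as in the command above would handle quoted fields properly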
- Then I can use OpenRefine to add the "CIFOR Archive" collection to the mappings
- Importing the 2,398 items (the 2,399 lines counted above, minus the CSV header row) via `dspace metadata-import` ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
- After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
- I'll let Abenet take one last look and then move them to CGSpace
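- For the batches of 1,000 mentioned above, a rough sketch of one way to split the CSV while repeating the header in every batch. The input here is generated dummy data and the `batch-` prefix is an arbitrary choice; it also assumes every record fits on one line, which does not hold if any quoted field contains a line break:

```shell
# Hypothetical input: a header plus 2,398 one-line rows of dummy data
batchdir=$(mktemp -d)
input="$batchdir/items.csv"
{
  echo 'id,collection,dc.title'
  seq 1 2398 | awk '{ print $0 ",10568/11111,Example" }'
} > "$input"

# Split the data rows into chunks of at most 1,000 lines
header=$(head -n 1 "$input")
tail -n +2 "$input" | split -l 1000 - "$batchdir/batch-"

# Put the header back on top of each chunk and give it a .csv extension
for chunk in "$batchdir"/batch-*; do
  { echo "$header"; cat "$chunk"; } > "$chunk.csv"
  rm "$chunk"
done

wc -l "$batchdir"/batch-*.csv
```

- Each resulting file could then be passed to `dspace metadata-import` separately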
<!-- vim: set sw=2 ts=2: -->