diff --git a/content/posts/2018-06.md b/content/posts/2018-06.md
index 9b5fd46e0..a6461a705 100644
--- a/content/posts/2018-06.md
+++ b/content/posts/2018-06.md
@@ -308,4 +308,43 @@ dc.contributor.author,cg.creator.id
 - It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
 - Ah, I think I just need to run `dspace oai import`
+## 2018-06-27
+
+- Vika from CIFOR sent back his annotations on the duplicates for the "CIFOR_May_9" archive import that I sent him last week
+- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
+- First, get the 62 deletes from Vika's file and remove them from the collection:
+
+```
+$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
+$ wc -l cifor-handle-to-delete.txt
+62 cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2461 10568-92904.csv
+$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2399 10568-92904.csv
+```
+
+- This iterates over the handles marked for deletion and uses `sed` with an alternative pattern delimiter of '#' (which must be escaped), because the pattern itself contains a '/'
+- The mapped ones will be more difficult because we need their internal IDs in order to map them, and there are 50 of them:
+
+```
+$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
+$ wc -l cifor-handle-to-map.txt
+50 cifor-handle-to-map.txt
+```
+
+- I can either get them from the database, or programmatically export the metadata using `dspace metadata-export -i 10568/xxxxx`...
+- Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the `id` and `collection` columns using [csvkit](https://csvkit.readthedocs.io/):
+
+```
+$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
+$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
+```
+
+- Then I can use OpenRefine to add the "CIFOR Archive" collection to the mappings
+- Importing the 2,398 items via `dspace metadata-import` ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
+- After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading the 2,398 unique items, there are a total of 2,448 items added in this batch
+- I'll let Abenet take one last look and then move them to CGSpace
+
diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
index 754e3c0fc..6b03d3fa6 100644
--- a/docs/2018-06/index.html
+++ b/docs/2018-06/index.html
@@ -41,7 +41,7 @@ sys 2m7.289s
-
+
@@ -93,9 +93,9 @@ sys 2m7.289s
     "@type": "BlogPosting",
     "headline": "June, 2018",
     "url": "https://alanorth.github.io/cgspace-notes/2018-06/",
-    "wordCount": "2420",
+    "wordCount": "2734",
     "datePublished": "2018-06-04T19:49:54-07:00",
-    "dateModified": "2018-06-24T17:38:07+03:00",
+    "dateModified": "2018-06-26T17:17:55+03:00",
     "author": {
       "@type": "Person",
       "name": "Alan Orth"
@@ -517,6 +517,50 @@ Done.
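The delete step from the markdown above (extract handles with `grep`, then drop matching CSV rows with `sed` using `#` as the address delimiter) can be sketched as a standalone script. The file names, handles, and data below are invented for illustration; they are not from the real archive:

```shell
#!/bin/bash
# Invented sample data: a small duplicate report and a collection export
cat > duplicates.txt <<'EOF'
10568/11111 delete (duplicate of an existing item)
10568/22222 keep
10568/33333 delete (duplicate)
EOF
cat > collection.csv <<'EOF'
id,collection,dc.title
101,10568/92904,First item with handle 10568/11111
102,10568/92904,Second item with handle 10568/22222
103,10568/92904,Third item with handle 10568/33333
EOF

# Same pattern as the post: keep only the lines marked "delete",
# then pull out the bare handles
grep delete duplicates.txt | grep -o -E '[0-9]{5}/[0-9]{5}' > handles-to-delete.txt

# Delete each matching CSV row in place; the sed address uses '#' as the
# delimiter (escaped as \#) because the handle itself contains '/'
while read handle; do
    sed -i "\#$handle#d" collection.csv
done < handles-to-delete.txt

wc -l < collection.csv   # header plus the one surviving row
```

The escape is required because a non-`/` delimiter in a sed *address* (as opposed to an `s###` command) must be introduced with a backslash.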
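The export-and-concatenate step above can be sketched without a DSpace instance by faking the per-item CSVs that `dspace metadata-export` would write. All file names and values here are invented, and plain `cut` stands in for csvkit's `csvcut` so the sketch runs without extra dependencies (safe only because the sample fields contain no embedded commas):

```shell
#!/bin/bash
# Fake two per-item exports like `dspace metadata-export -i <handle> -f <file>`
# would produce (invented data; real exports have many more columns)
cat > 10568-11111.csv <<'EOF'
id,collection,dc.title
74001,10568/92904,First item
EOF
cat > 10568-22222.csv <<'EOF'
id,collection,dc.title
74002,10568/92904,Second item
EOF

# The ${line/\//-} expansion from the post turns a handle such as
# 10568/11111 into a safe filename such as 10568-11111.csv (bash-only)
line=10568/11111
echo "${line/\//-}.csv"   # prints: 10568-11111.csv

# Concatenate the exports, drop every header line, keep the first two
# columns (id and collection)
sed '/^id/d' 10568-*.csv | cut -d, -f1,2 > map-to-cifor-archive.csv
cat map-to-cifor-archive.csv
```

The resulting two-column CSV is then ready for the OpenRefine step, where the mapping target collection is added per row.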