diff --git a/content/posts/2018-06.md b/content/posts/2018-06.md
index 9b5fd46e0..a6461a705 100644
--- a/content/posts/2018-06.md
+++ b/content/posts/2018-06.md
@@ -308,4 +308,43 @@ dc.contributor.author,cg.creator.id
 - It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
 - Ah, I think I just need to run `dspace oai import`
+## 2018-06-27
+
+- Vika from CIFOR sent back his annotations on the duplicates for the "CIFOR_May_9" archive import that I sent him last week
+- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
+- First, get the 62 deletes from Vika's file and remove them from the collection:
+
+```
+$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
+$ wc -l cifor-handle-to-delete.txt
+62 cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2461 10568-92904.csv
+$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2399 10568-92904.csv
+```
+
+- This iterates over the handles for deletion and uses `sed` with an alternative pattern delimiter of '#' (which must be escaped), because the pattern itself contains a '/'
+- The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:
+
+```
+$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
+$ wc -l cifor-handle-to-map.txt
+50 cifor-handle-to-map.txt
+```
+
+- I can either get them from the database, or programmatically export the metadata using `dspace metadata-export -i 10568/xxxxx`...
+- Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the `id` and `collection` columns using [csvkit](https://csvkit.readthedocs.io/):
+
+```
+$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
+$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
+```
+
+- Then I can use OpenRefine to add the "CIFOR Archive" collection to the mappings
+- Importing the 2,398 items via `dspace metadata-import` ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
+- After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
+- I'll let Abenet take one last look and then move them to CGSpace
+
diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
index 754e3c0fc..6b03d3fa6 100644
--- a/docs/2018-06/index.html
+++ b/docs/2018-06/index.html
@@ -41,7 +41,7 @@ sys 2m7.289s
-
+
@@ -93,9 +93,9 @@ sys 2m7.289s
 "@type": "BlogPosting",
 "headline": "June, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-06/",
- "wordCount": "2420",
+ "wordCount": "2734",
 "datePublished": "2018-06-04T19:49:54-07:00",
- "dateModified": "2018-06-24T17:38:07+03:00",
+ "dateModified": "2018-06-26T17:17:55+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -517,6 +517,50 @@ Done.
 Ah, I think I just need to run dspace oai import
+
+2018-06-27
+
+$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
+$ wc -l cifor-handle-to-delete.txt
+62 cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2461 10568-92904.csv
+$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
+$ wc -l 10568-92904.csv
+2399 10568-92904.csv
+
+$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
+$ wc -l cifor-handle-to-map.txt
+50 cifor-handle-to-map.txt
+
+$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
+$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
+
+
diff --git a/docs/robots.txt b/docs/robots.txt
index f80484daa..12c3510ab 100644
--- a/docs/robots.txt
+++ b/docs/robots.txt
@@ -36,7 +36,7 @@ Disallow: /cgspace-notes/2015-12/
 Disallow: /cgspace-notes/2015-11/
 Disallow: /cgspace-notes/
 Disallow: /cgspace-notes/categories/
-Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/categories/notes/
+Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/posts/
 Disallow: /cgspace-notes/tags/
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index a6d0a29f6..ca097b361 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-06/
- 2018-06-24T17:38:07+03:00
+ 2018-06-26T17:17:55+03:00
@@ -169,7 +169,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2018-06-24T17:38:07+03:00
+ 2018-06-26T17:17:55+03:00
 0
@@ -178,27 +178,27 @@
 0
-
- https://alanorth.github.io/cgspace-notes/tags/notes/
- 2018-06-24T17:38:07+03:00
- 0
-
-
 https://alanorth.github.io/cgspace-notes/categories/notes/
 2018-03-09T22:10:33+02:00
 0
+
+ https://alanorth.github.io/cgspace-notes/tags/notes/
+ 2018-06-26T17:17:55+03:00
+ 0
+
+
 https://alanorth.github.io/cgspace-notes/posts/
- 2018-06-24T17:38:07+03:00
+ 2018-06-26T17:17:55+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2018-06-24T17:38:07+03:00
+ 2018-06-26T17:17:55+03:00
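
The delete step in the 2018-06-27 notes loops over the handles and runs `sed -i` once per handle. A minimal sketch of an alternative, not what the notes actually ran: a single `grep -v -F -f` pass that treats each handle as a fixed string, so the '/' needs no escaping or alternative delimiter. The file names (`handles-to-delete.txt`, `items.csv`, `items-kept.csv`), the sample handles, and the CSV columns are all hypothetical stand-ins, not the real CIFOR data.

```shell
#!/usr/bin/env bash
# Sketch only: drop CSV rows whose handle appears in a delete list, in one pass.
# Every file name and sample row below is a hypothetical stand-in.
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical handles marked for deletion, one per line
printf '%s\n' '10568/11111' '10568/22222' > "$workdir/handles-to-delete.txt"

# Hypothetical metadata export: a header row plus four items
cat > "$workdir/items.csv" <<'EOF'
id,collection,dc.identifier.uri
1,10568/92904,https://hdl.handle.net/10568/11111
2,10568/92904,https://hdl.handle.net/10568/22222
3,10568/92904,https://hdl.handle.net/10568/33333
4,10568/92904,https://hdl.handle.net/10568/44444
EOF

# -F: patterns are fixed strings, not regexes
# -f: read one pattern per line from the handle list
# -v: keep only the lines that match none of the patterns
grep -v -F -f "$workdir/handles-to-delete.txt" "$workdir/items.csv" \
  > "$workdir/items-kept.csv"

# Count the remaining lines (header included)
wc -l < "$workdir/items-kept.csv"
```

Like the `sed` loop, this matches substrings, so a handle would also remove any row that merely contains it as part of a longer value. It also writes to a new file instead of editing in place, which leaves the original export available for comparison.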