mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-11-22 06:35:03 +01:00)
Add notes for 2018-06-27
parent 6ecb0c84d6
commit a6f5f0c6b2
@ -308,4 +308,43 @@ dc.contributor.author,cg.creator.id
- It's actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
- Ah, I think I just need to run `dspace oai import`

## 2018-06-27

- Vika from CIFOR sent back his annotations on the duplicates for the "CIFOR_May_9" archive import that I sent him last week
- I'll have to figure out how to separate those we're keeping, deleting, and mapping into CIFOR's archive collection
- First, get the 62 deletes from Vika's file and remove them from the collection:

```
$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done < cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
```

- This iterates over the handles for deletion and uses `sed` with an alternative pattern delimiter of '#' (which must be escaped), because the pattern itself contains a '/'
- The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:

```
$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
```
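
As an aside on the deletion step: the `sed` loop rewrites the CSV once per handle. A single-pass alternative might be `grep` with `-F` (fixed strings, so the '/' needs no escaping), `-f` (read the patterns from a file), and `-v` (invert the match). This is only a sketch on made-up data; the file names and handles below are hypothetical, not the real CIFOR files:

```shell
# Hypothetical sample data standing in for the real CSV and handle list
workdir=$(mktemp -d)
printf '%s\n' 'id,collection,dc.identifier.uri' \
  '1,10568/92904,https://hdl.handle.net/10568/11111' \
  '2,10568/92904,https://hdl.handle.net/10568/22222' \
  '3,10568/92904,https://hdl.handle.net/10568/33333' > "$workdir/items.csv"
printf '%s\n' '10568/22222' > "$workdir/handles-to-delete.txt"

# Keep every line that contains none of the listed handles
grep -v -F -f "$workdir/handles-to-delete.txt" "$workdir/items.csv" > "$workdir/items-kept.csv"

# Count the remaining lines (header plus the rows we kept)
kept=$(wc -l < "$workdir/items-kept.csv" | tr -d ' ')
echo "$kept"
```

Like the `sed` version, this matches a handle anywhere in the line, so it assumes a handle marked for deletion never appears as a substring of a row that should be kept.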

- I can either get them from the database, or programmatically export the metadata using `dspace metadata-export -i 10568/xxxxx`...
- Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the `id` and `collection` columns using [csvkit](https://csvkit.readthedocs.io/):

```
$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
```
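
The `${line/\//-}` expansion in the export loop is a bash feature that replaces the first '/' in the handle with '-', so each item gets a file name that is valid on disk. A portable sketch of the same renaming using `tr`, with a hypothetical handle rather than one from the real list:

```shell
# Hypothetical handle; the real ones come from cifor-handle-to-map.txt
handle='10568/12345'

# Turn the '/' into '-' and append the extension, without relying on bash
filename="$(printf '%s' "$handle" | tr '/' '-').csv"
echo "$filename"
```

For this input the result is `10568-12345.csv`, which matches the `10568-*.csv` glob used when concatenating the exports.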

- Then I can use OpenRefine to add the "CIFOR Archive" collection to the mappings
- Importing the 2,398 items via `dspace metadata-import` ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000
- After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch
- I'll let Abenet take one last look and then move them to CGSpace
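
One way the batching could be done is to split the CSV into fixed-size chunks and put the header row back on each chunk before importing. A minimal sketch, assuming no CSV field contains an embedded newline (otherwise a line-based split would cut a record in half); the file names, columns, and batch size of 2 are made up for illustration, and the real run would use 1,000:

```shell
# Hypothetical 5-row CSV standing in for the real 2,398-item file
workdir=$(mktemp -d)
{
  echo 'id,collection,dc.title'
  for i in 1 2 3 4 5; do echo "$i,10568/92904,Title $i"; done
} > "$workdir/items.csv"

batch_size=2   # would be 1000 for the real import
header=$(head -n 1 "$workdir/items.csv")

# Split the data rows (everything after the header) into chunks
tail -n +2 "$workdir/items.csv" | split -l "$batch_size" - "$workdir/batch-"

# Prepend the header to each chunk so every batch is a valid CSV
for part in "$workdir"/batch-*; do
  { echo "$header"; cat "$part"; } > "$part.csv"
  rm "$part"
done

batches=$(ls "$workdir"/batch-*.csv | wc -l | tr -d ' ')
echo "$batches"
```

Each resulting `batch-*.csv` could then be passed to `dspace metadata-import` on its own.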

<!-- vim: set sw=2 ts=2: -->
@ -41,7 +41,7 @@ sys 2m7.289s
<meta property="article:published_time" content="2018-06-04T19:49:54-07:00"/>

<meta property="article:modified_time" content="2018-06-24T17:38:07+03:00"/>
<meta property="article:modified_time" content="2018-06-26T17:17:55+03:00"/>
@ -93,9 +93,9 @@ sys 2m7.289s
"@type": "BlogPosting",
"headline": "June, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-06/",
"wordCount": "2420",
"wordCount": "2734",
"datePublished": "2018-06-04T19:49:54-07:00",
"dateModified": "2018-06-24T17:38:07+03:00",
"dateModified": "2018-06-26T17:17:55+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -517,6 +517,50 @@ Done.
<li>Ah, I think I just need to run <code>dspace oai import</code></li>
</ul>

<h2 id="2018-06-27">2018-06-27</h2>

<ul>
<li>Vika from CIFOR sent back his annotations on the duplicates for the “CIFOR_May_9” archive import that I sent him last week</li>
<li>I’ll have to figure out how to separate those we’re keeping, deleting, and mapping into CIFOR’s archive collection</li>
<li>First, get the 62 deletes from Vika’s file and remove them from the collection:</li>
</ul>

<pre><code>$ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-delete.txt
$ wc -l cifor-handle-to-delete.txt
62 cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2461 10568-92904.csv
$ while read line; do sed -i "\#$line#d" 10568-92904.csv; done &lt; cifor-handle-to-delete.txt
$ wc -l 10568-92904.csv
2399 10568-92904.csv
</code></pre>

<ul>
<li>This iterates over the handles for deletion and uses <code>sed</code> with an alternative pattern delimiter of ‘#’ (which must be escaped), because the pattern itself contains a ‘/’</li>
<li>The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:</li>
</ul>

<pre><code>$ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' &gt; cifor-handle-to-map.txt
$ wc -l cifor-handle-to-map.txt
50 cifor-handle-to-map.txt
</code></pre>

<ul>
<li>I can either get them from the database, or programmatically export the metadata using <code>dspace metadata-export -i 10568/xxxxx</code>…</li>
<li>Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the <code>id</code> and <code>collection</code> columns using <a href="https://csvkit.readthedocs.io/">csvkit</a>:</li>
</ul>

<pre><code>$ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done &lt; /tmp/cifor-handle-to-map.txt
$ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 &gt; map-to-cifor-archive.csv
</code></pre>

<ul>
<li>Then I can use OpenRefine to add the “CIFOR Archive” collection to the mappings</li>
<li>Importing the 2,398 items via <code>dspace metadata-import</code> ends up with a Java garbage collection error, so I think I need to do it in batches of 1,000</li>
<li>After deleting the 62 duplicates, mapping the 50 items from elsewhere in CGSpace, and uploading 2,398 unique items, there are a total of 2,448 items added in this batch</li>
<li>I’ll let Abenet take one last look and then move them to CGSpace</li>
</ul>

<!-- vim: set sw=2 ts=2: -->
@ -36,7 +36,7 @@ Disallow: /cgspace-notes/2015-12/
Disallow: /cgspace-notes/2015-11/
Disallow: /cgspace-notes/
Disallow: /cgspace-notes/categories/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/posts/
Disallow: /cgspace-notes/tags/
@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-06/</loc>
<lastmod>2018-06-24T17:38:07+03:00</lastmod>
<lastmod>2018-06-26T17:17:55+03:00</lastmod>
</url>

<url>
@ -169,7 +169,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-06-24T17:38:07+03:00</lastmod>
<lastmod>2018-06-26T17:17:55+03:00</lastmod>
<priority>0</priority>
</url>
@ -178,27 +178,27 @@
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-06-24T17:38:07+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-06-26T17:17:55+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-06-24T17:38:07+03:00</lastmod>
<lastmod>2018-06-26T17:17:55+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-06-24T17:38:07+03:00</lastmod>
<lastmod>2018-06-26T17:17:55+03:00</lastmod>
<priority>0</priority>
</url>