mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-26 00:18:21 +01:00
Update notes for 2019-09-20
parent
063d4e1725
commit
77bc2f3b6b
@@ -236,6 +236,7 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut

## 2019-09-20

- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
- Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
  - They want to do some enrichment of the metadata to add countries and regions
  - Also, they noticed that some items have a blank ISSN in the citation like "ISSN:"
@@ -248,4 +249,46 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
$ perl-rename -n 's/_{2,3}/_/g' *.pdf
```

- I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
  - There are a *few dozen* that have completely fucked up names due to some encoding error
  - To make matters worse, when I tried to download them, some of the links in the "URL" column that Francesco included are wrong, so I had to go to the permalink and get a link that worked
  - After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:

```
$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
```

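For machines without either `rename` variant, the same cleanup could be sketched in Python. This is only an illustration: `collapse_underscores`, `rename_pdfs`, and the `dry_run` flag are hypothetical names, and unlike the two `rename` passes above it collapses a run of underscores of any length, not just two or three.

```python
import re
from pathlib import Path


def collapse_underscores(name: str) -> str:
    """Collapse every run of two or more underscores into a single one."""
    return re.sub(r"_{2,}", "_", name)


def rename_pdfs(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename *.pdf files in `directory`; return the (old, new) name pairs."""
    changes = []
    for path in sorted(Path(directory).glob("*.pdf")):
        new_name = collapse_underscores(path.name)
        if new_name == path.name:
            continue  # nothing to fix in this name
        target = path.with_name(new_name)
        if target.exists():
            # Never clobber an existing file; leave the collision for manual review.
            continue
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes
```

Calling `rename_pdfs(".")` with the default `dry_run=True` only reports what would change, much like the `-n` flag of `perl-rename`.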
- I'm still waiting to hear what Carol and Francesca want to do with the `1195.pdf.LCK` file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
- I wrote two fairly long GREL expressions to clean up the institutional author names in the `dc.contributor.author` and `dc.identifier.citation` fields using OpenRefine
  - The first targets acronyms in parentheses like "International Livestock Research Institute (ILRI)":

```
value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
```
  - The second targets cities and countries after names like "International Livestock Research Institute, Kenya":

```
value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
```

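To try the two patterns outside OpenRefine, they can be approximated with Python's `re` module. This is a rough sketch, not the expressions actually used: `ACRONYMS` and `PLACES` are short illustrative subsets of the lists above, and `clean_institution` is a hypothetical helper name.

```python
import re

# Illustrative subsets only; the GREL expressions above list many more terms.
ACRONYMS = ["CIAT", "FAO", "ILRI", "IPGRI", "Italy", "Kenya"]
PLACES = ["Nairobi", "New Delhi", "Rome", "Lima"]

# Same shape as /,? ?\((A|B|...)\)/ : optional comma, optional space, term in parentheses.
PAREN_RE = re.compile(r",? ?\((" + "|".join(map(re.escape, ACRONYMS)) + r")\)")
# Same shape as /,? ?(A|B|...)/ : optional comma, optional space, bare place name.
PLACE_RE = re.compile(r",? ?(" + "|".join(map(re.escape, PLACES)) + r")")


def clean_institution(value: str) -> str:
    """Strip parenthesised acronyms first, then bare place names."""
    value = PAREN_RE.sub("", value)
    return PLACE_RE.sub("", value)


print(clean_institution("International Livestock Research Institute (ILRI)"))
print(clean_institution("Bioversity International, Rome"))
```

Because the second pattern is not anchored, it also matches inside longer words (for example the "Rome" in "Romero"), so it may be worth reviewing the matches in a facet before applying it, or anchoring it with `$` if the place name always comes last.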
- I imported the 1,427 Bioversity records with bitstreams to a new collection called [2019-09-20 Bioversity Migration Test](https://dspacetest.cgiar.org/handle/10568/103688) on DSpace Test (after splitting them into two batches of about 700 each):

```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
```

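The notes don't say how the batches were split. One possible way, assuming the Simple Archive Format layout of one subdirectory per item, is sketched below; `split_into_batches` and its parameters are hypothetical names, not part of DSpace or SAFBuilder.

```python
import shutil
from pathlib import Path


def split_into_batches(source: str, dest_prefix: str, batches: int = 2) -> list[int]:
    """Move item subdirectories of `source` into `dest_prefix1`, `dest_prefix2`, ...

    Returns the number of items placed in each batch.
    """
    items = sorted(p for p in Path(source).iterdir() if p.is_dir())
    size = -(-len(items) // batches)  # ceiling division
    counts = []
    for n in range(batches):
        chunk = items[n * size:(n + 1) * size]
        target = Path(f"{dest_prefix}{n + 1}")
        target.mkdir(parents=True, exist_ok=True)
        for item in chunk:
            # Each item directory moves as a whole, keeping its metadata and bitstreams together.
            shutil.move(str(item), str(target / item.name))
        counts.append(len(chunk))
    return counts
```

With 1,427 item directories and the default of two batches this would produce batches of 714 and 713 items, each of which can then be passed to `dspace import -s` separately.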
- After that I exported the collection again and started doing some quality checks and cleanups:
  - Change all DOIs to use https://doi.org format
  - Change all bioversityinternational.org links to use https://
  - Fix ten authors with invalid names like "Orth,." by checking the correct name in the citation
  - Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!
  - Fix some citations that were using "ISSN" instead of ISBN
- The next steps are:
  - Check for duplicates
  - Continue with institutional author normalization
  - Ask which collection to map items with type Brochure, Journal Item, and Thesis?

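The DOI and link cleanups listed above could be scripted along these lines. This is a minimal sketch under assumptions: the notes don't say which DOI forms actually appear in the export, so the prefixes handled here (`http(s)://dx.doi.org/`, `http(s)://doi.org/`, `doi:`) are guesses, and `normalize_doi` and `force_https` are hypothetical helper names.

```python
import re

# Common DOI prefixes to strip before re-adding the canonical one (assumed, not from the notes).
DOI_PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Rewrite a DOI in any of the handled forms to https://doi.org/<doi>."""
    bare = DOI_PREFIX_RE.sub("", value.strip())
    return f"https://doi.org/{bare}"


def force_https(url: str, domain: str = "bioversityinternational.org") -> str:
    """Upgrade http:// links on `domain` (and its subdomains) to https://."""
    pattern = r"^http://((?:[\w-]+\.)*" + re.escape(domain) + r")"
    return re.sub(pattern, r"https://\1", url, flags=re.IGNORECASE)


# 10.1234/example is a made-up DOI used only for illustration.
print(normalize_doi("http://dx.doi.org/10.1234/example"))
print(force_https("http://www.bioversityinternational.org/report.pdf"))
```

Links on other domains are left untouched by `force_https`, since those sites may not serve HTTPS.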
<!-- vim: set sw=2 ts=2: -->
@@ -40,7 +40,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-09/" />
<meta property="article:published_time" content="2019-09-01T10:17:51+03:00" />
<meta property="article:modified_time" content="2019-09-20T12:55:11+03:00" />
<meta property="article:modified_time" content="2019-09-20T13:25:59+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2019"/>
@@ -85,9 +85,9 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
"@type": "BlogPosting",
"headline": "September, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-09\/",
"wordCount": "1778",
"wordCount": "2166",
"datePublished": "2019-09-01T10:17:51\x2b03:00",
"dateModified": "2019-09-20T12:55:11\x2b03:00",
"dateModified": "2019-09-20T13:25:59\x2b03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -438,6 +438,8 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<h2 id="2019-09-20">2019-09-20</h2>

<ul>
<li>Deploy a fresh snapshot of CGSpace’s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>

<li><p>Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace</p>

<ul>
@@ -452,6 +454,60 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<pre><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
</code></pre></li>
</ul></li>

<li><p>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)</p>

<ul>
<li>There are a <em>few dozen</em> that have completely fucked up names due to some encoding error</li>
<li>To make matters worse, when I tried to download them, some of the links in the “URL” column that Francesco included are wrong, so I had to go to the permalink and get a link that worked</li>

<li><p>After downloading everything I had to use Ubuntu’s version of rename to get rid of all the double and triple underscores:</p>

<pre><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
</code></pre></li>
</ul></li>

<li><p>I’m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</p></li>

<li><p>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine</p>

<ul>
<li><p>The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)”:</p>

<pre><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
</code></pre></li>

<li><p>The second targets cities and countries after names like “International Livestock Research Institute, Kenya”:</p>

<pre><code>value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
</code></pre></li>
</ul></li>

<li><p>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them into two batches of about 700 each):</p>

<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre></li>

<li><p>After that I exported the collection again and started doing some quality checks and cleanups:</p>

<ul>
<li>Change all DOIs to use <a href="https://doi.org">https://doi.org</a> format</li>
<li>Change all bioversityinternational.org links to use https://</li>
<li>Fix ten authors with invalid names like “Orth,.” by checking the correct name in the citation</li>
<li>Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!</li>
<li>Fix some citations that were using “ISSN” instead of ISBN</li>
</ul></li>

<li><p>The next steps are:</p>

<ul>
<li>Check for duplicates</li>
<li>Continue with institutional author normalization</li>
<li>Ask which collection to map items with type Brochure, Journal Item, and Thesis?</li>
</ul></li>
</ul>

<!-- vim: set sw=2 ts=2: -->
@@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod>
<lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod>
<lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod>
<lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-09/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod>
<lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod>
<lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>

<url>