Update notes for 2019-09-20

Alan Orth 2019-09-21 02:25:19 +03:00
parent 063d4e1725
commit 77bc2f3b6b
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 107 additions and 8 deletions


@ -236,6 +236,7 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
## 2019-09-20
- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
- Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
- They want to do some enrichment of the metadata to add countries and regions
- Also, they noticed that some items have a blank ISSN in the citation like "ISSN:"
@ -248,4 +249,46 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
$ perl-rename -n 's/_{2,3}/_/g' *.pdf
```
- I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
- There are a *few dozen* that have completely fucked up names due to some encoding error
- To make matters worse, when I tried to download them, some of the links in the "URL" column that Francesca included are wrong, so I had to go to the permalink and get a link that worked
- After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:
```
$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
```
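The same underscore cleanup can be sketched in Python. This is only an illustration, not the command that was run: the function names `collapse_underscores` and `plan_renames` are hypothetical, and the pattern collapses a run of any length, which goes slightly beyond the two-pass `rename` commands above.

```python
import re
from pathlib import Path


def collapse_underscores(name: str) -> str:
    """Collapse any run of two or more underscores into a single underscore."""
    return re.sub(r"_{2,}", "_", name)


def plan_renames(directory: str) -> list:
    """Return (old, new) name pairs for PDFs whose names would change.

    This is a dry run: nothing on disk is renamed.
    """
    pairs = []
    for path in sorted(Path(directory).glob("*.pdf")):
        new_name = collapse_underscores(path.name)
        if new_name != path.name:
            pairs.append((path.name, new_name))
    return pairs
```

Reviewing the pairs from `plan_renames` before renaming anything serves the same purpose as the `-n` flag of `perl-rename`.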
- I'm still waiting to hear what Carol and Francesca want to do with the `1195.pdf.LCK` file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
- I wrote two fairly long GREL expressions to clean up the institutional author names in the `dc.contributor.author` and `dc.identifier.citation` fields using OpenRefine
- The first targets acronyms in parentheses like "International Livestock Research Institute (ILRI)":
```
value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
```
- The second targets cities and countries after names like "International Livestock Research Institute, Kenya":
```
value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
```
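The two expressions above can be tried outside OpenRefine with a rough Python equivalent. This is a sketch under assumptions: the lists are short illustrative subsets of the alternations above, the name `normalize_institution` is hypothetical, and the `$` anchor on the place pattern is an addition (not in the GREL) so that a place name in the middle of a value is left alone.

```python
import re

# Illustrative subsets of the alternations used in the GREL expressions.
ACRONYMS = ["CIAT", "FAO", "ILRI", "IPGRI", "United Kingdom"]
PLACES = ["Nairobi", "New Delhi", "Rome", "Peru"]

# Matches an optional comma and space followed by "(ACRONYM)".
ACRONYM_RE = re.compile(r",? ?\((" + "|".join(map(re.escape, ACRONYMS)) + r")\)")

# Assumption: only strip a place name at the very end of the value.
PLACE_RE = re.compile(r",? ?(" + "|".join(map(re.escape, PLACES)) + r")$")


def normalize_institution(value: str) -> str:
    """Strip a parenthesised acronym, then a trailing city or country."""
    value = ACRONYM_RE.sub("", value)
    return PLACE_RE.sub("", value)
```

Without the end anchor, a short alternative can match inside a longer word, so the unanchored form is worth testing on a facet of real values before applying it to the whole column.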
- I imported the 1,427 Bioversity records with bitstreams to a new collection called [2019-09-20 Bioversity Migration Test](https://dspacetest.cgiar.org/handle/10568/103688) on DSpace Test (after splitting them into two batches of about 700 each):
```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
```
- After that I exported the collection again and started doing some quality checks and cleanups:
- Change all DOIs to use https://doi.org format
- Change all bioversityinternational.org links to use https://
- Fix ten authors with invalid names like "Orth,." by checking the correct name in the citation
- Fix several invalid ISBNs, but several more items have incorrect ISBNs printed in the PDFs themselves!
- Fix some citations that were using "ISSN" instead of ISBN
- The next steps are:
- Check for duplicates
- Continue with institutional author normalization
- Ask which collection the items with type Brochure, Journal Item, and Thesis should be mapped to
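
Two of the quality checks above (DOI format and invalid ISBNs) lend themselves to a small script. The following is a minimal sketch, not the procedure that was actually used: `normalize_doi` and `is_valid_isbn13` are hypothetical helper names, the DOI pattern only covers a few common prefixes, and only the ISBN-13 check digit is verified.

```python
import re

# Accepts a bare DOI or one prefixed with "doi:" or an http(s) doi.org URL.
DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def normalize_doi(value: str) -> str:
    """Rewrite a recognised DOI to the https://doi.org/ form.

    Values that do not look like a DOI are returned unchanged.
    """
    match = DOI_RE.match(value.strip())
    if match is None:
        return value
    return "https://doi.org/" + match.group(1)


def is_valid_isbn13(value: str) -> bool:
    """Check the ISBN-13 check digit, ignoring hyphens and spaces."""
    digits = [c for c in value if c in "0123456789"]
    if len(digits) != 13:
        return False
    # Weights alternate 1, 3, 1, 3, ... and the total must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```

A check like this only catches ISBNs that are malformed. An ISBN that is well formed but belongs to a different publication still has to be compared against the PDF by hand.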
<!-- vim: set sw=2 ts=2: -->


@ -40,7 +40,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-09/" />
<meta property="article:published_time" content="2019-09-01T10:17:51+03:00" />
<meta property="article:modified_time" content="2019-09-20T12:55:11+03:00" /> <meta property="article:modified_time" content="2019-09-20T13:25:59+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2019"/>
@ -85,9 +85,9 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
"@type": "BlogPosting",
"headline": "September, 2019",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-09\/",
"wordCount": "1778", "wordCount": "2166",
"datePublished": "2019-09-01T10:17:51\x2b03:00",
"dateModified": "2019-09-20T12:55:11\x2b03:00", "dateModified": "2019-09-20T13:25:59\x2b03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -438,6 +438,8 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<h2 id="2019-09-20">2019-09-20</h2>
<ul>
<li>Deploy a fresh snapshot of CGSpace&rsquo;s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>
<li><p>Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace</p>
<ul>
@ -452,6 +454,60 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
<pre><code>$ perl-rename -n 's/_{2,3}/_/g' *.pdf
</code></pre></li>
</ul></li>
<li><p>I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)</p>
<ul>
<li>There are a <em>few dozen</em> that have completely fucked up names due to some encoding error</li>
<li>To make matters worse, when I tried to download them, some of the links in the &ldquo;URL&rdquo; column that Francesca included are wrong, so I had to go to the permalink and get a link that worked</li>
<li><p>After downloading everything I had to use Ubuntu&rsquo;s version of rename to get rid of all the double and triple underscores:</p>
<pre><code>$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
</code></pre></li>
</ul></li>
<li><p>I&rsquo;m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I&rsquo;ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</p></li>
<li><p>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine</p>
<ul>
<li><p>The first targets acronyms in parentheses like &ldquo;International Livestock Research Institute (ILRI)&rdquo;:</p>
<pre><code>value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,&quot;&quot;)
</code></pre></li>
<li><p>The second targets cities and countries after names like &ldquo;International Livestock Research Institute, Kenya&rdquo;:</p>
<pre><code>value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,&quot;&quot;)
</code></pre></li>
</ul></li>
<li><p>I imported the 1,427 Bioversity records with bitstreams to a new collection called <a href="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them into two batches of about 700 each):</p>
<pre><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
</code></pre></li>
<li><p>After that I exported the collection again and started doing some quality checks and cleanups:</p>
<ul>
<li>Change all DOIs to use <a href="https://doi.org">https://doi.org</a> format</li>
<li>Change all bioversityinternational.org links to use https://</li>
<li>Fix ten authors with invalid names like &ldquo;Orth,.&rdquo; by checking the correct name in the citation</li>
<li>Fix several invalid ISBNs, but several more items have incorrect ISBNs printed in the PDFs themselves!</li>
<li>Fix some citations that were using &ldquo;ISSN&rdquo; instead of ISBN</li>
</ul></li>
<li><p>The next steps are:</p>
<ul>
<li>Check for duplicates</li>
<li>Continue with institutional author normalization</li>
<li>Ask which collection the items with type Brochure, Journal Item, and Thesis should be mapped to</li>
</ul></li>
</ul>
<!-- vim: set sw=2 ts=2: -->


@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod> <lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod> <lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod> <lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-09/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod> <lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-09-20T12:55:11+03:00</lastmod> <lastmod>2019-09-20T13:25:59+03:00</lastmod>
</url>
<url>
<url> <url>