diff --git a/content/posts/2019-09.md b/content/posts/2019-09.md
index f73f26360..6bc4ca2eb 100644
--- a/content/posts/2019-09.md
+++ b/content/posts/2019-09.md
@@ -236,6 +236,7 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
 
 ## 2019-09-20
 
+- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
 - Skype with Carol and Francesca to discuss the Bioveristy migration to CGSpace
   - They want to do some enrichment of the metadata to add countries and regions
   - Also, they noticed that some items have a blank ISSN in the citation like "ISSN:"
@@ -248,4 +249,46 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
 $ perl-rename -n 's/_{2,3}/_/g' *.pdf
 ```
 
+- I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
+  - There are a *few dozen* that have completely fucked up names due to some encoding error
+  - To make matters worse, when I tried to download them, some of the links in the "URL" column that Francesca included are wrong, so I had to go to the permalink and get a link that worked
+  - After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:
+
+```
+$ rename -v 's/___/_/g' *.pdf
+$ rename -v 's/__/_/g' *.pdf
+```
+
+- I'm still waiting to hear what Carol and Francesca want to do with the `1195.pdf.LCK` file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
+- I wrote two fairly long GREL expressions to clean up the institutional author names in the `dc.contributor.author` and `dc.identifier.citation` fields using OpenRefine
+  - The first targets acronyms in parentheses like "International Livestock Research Institute (ILRI)":
+
+```
+value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
+```
+  - The second targets cities and countries after names like "International Livestock Research Institute, Kenya":
+
+```
+value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
+```
+
+- I imported the 1,427 Bioversity records with bitstreams to a new collection called [2019-09-20 Bioversity Migration Test](https://dspacetest.cgiar.org/handle/10568/103688) on DSpace Test (after splitting them in two batches of about 700 each):
+
+```
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
+```
+
+- After that I exported the collection again and started doing some quality checks and cleanups:
+  - Change all DOIs to use https://doi.org format
+  - Change all bioversityinternational.org links to use https://
+  - Fix ten authors with invalid names like "Orth,." by checking the correct name in the citation
+  - Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!
+  - Fix some citations that were using "ISSN" instead of ISBN
+- The next steps are:
+  - Check for duplicates
+  - Continue with institutional author normalization
+  - Ask which collection items of type Brochure, Journal Item, and Thesis should be mapped to
+
diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html
index c9a220154..916e1a1b4 100644
--- a/docs/2019-09/index.html
+++ b/docs/2019-09/index.html
@@ -40,7 +40,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
-
+
@@ -85,9 +85,9 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
   "@type": "BlogPosting",
   "headline": "September, 2019",
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-09\/",
-  "wordCount": "1778",
+  "wordCount": "2166",
   "datePublished": "2019-09-01T10:17:51\x2b03:00",
-  "dateModified": "2019-09-20T12:55:11\x2b03:00",
+  "dateModified": "2019-09-20T13:25:59\x2b03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -438,6 +438,8 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
Skype with Carol and Francesca to discuss the Bioveristy migration to CGSpace
$ perl-rename -n 's/_{2,3}/_/g' *.pdf
I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
+ +After downloading everything I had to use Ubuntu’s version of rename to get rid of all the double and triple underscores:
+ +$ rename -v 's/___/_/g' *.pdf
+$ rename -v 's/__/_/g' *.pdf
+
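For reference, the same cleanup could be sketched in Python with a single pattern that collapses any run of two or more underscores in one pass. This is only an illustration of the idea, not what was run on the server; the function name and the sample file names are made up.

```python
import re


def collapse_underscores(name: str) -> str:
    """Collapse every run of two or more underscores into a single one."""
    return re.sub(r"_{2,}", "_", name)


if __name__ == "__main__":
    # Hypothetical file names, not taken from the Bioversity batch
    for name in ["Annual__Report___2010.pdf", "no_change.pdf"]:
        print(name, "->", collapse_underscores(name))
```

Using `_{2,}` rather than separate triple and double passes also covers longer runs, which the two-step approach can leave partly collapsed.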
I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK
file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)
I wrote two fairly long GREL expressions to clean up the institutional author names in the dc.contributor.author
and dc.identifier.citation
fields using OpenRefine
The first targets acronyms in parentheses like “International Livestock Research Institute (ILRI)”:
+ +value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
+
The second targets cities and countries after names like “International Livestock Research Institute, Kenya”:
+ +value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
+
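To sanity check this kind of pattern outside OpenRefine, a rough Python equivalent might look like the sketch below. The alternations are shortened and purely illustrative, and the place-name pattern is deliberately stricter than the GREL above: it assumes the place follows a comma at the end of the value, so that names like "University of Nairobi" are left alone. That stricter form is my assumption, not what the expressions above do.

```python
import re

# Shortened, illustrative alternations; the real expressions list many more
ACRONYMS = re.compile(r",? ?\((ILRI|FAO|CIAT|IPGRI)\)")

# Assumption: require a comma and anchor at the end of the string, which is
# stricter than the original pattern with its optional comma and no anchor
PLACES = re.compile(r", ?(Nairobi|Rome|Lima|New Delhi)$")


def clean_institution(name: str) -> str:
    """Strip a parenthesised acronym, then a trailing ', Place' suffix."""
    name = ACRONYMS.sub("", name)
    return PLACES.sub("", name)


if __name__ == "__main__":
    samples = [
        "International Livestock Research Institute (ILRI)",
        "Food and Agriculture Organization (FAO), Rome",
        "University of Nairobi",
    ]
    for sample in samples:
        print(repr(sample), "->", repr(clean_institution(sample)))
```

With an optional comma and no anchor, short alternatives can also match inside longer names, so it is probably worth faceting on the changed values before applying the transformation to the whole column.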
I imported the 1,427 Bioversity records with bitstreams to a new collection called 2019-09-20 Bioversity Migration Test on DSpace Test (after splitting them in two batches of about 700 each):
+ +$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
+
After that I exported the collection again and started doing some quality checks and cleanups:
+ +The next steps are:
+ +
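The DOI cleanup from the quality checks could be scripted along these lines. This is a minimal sketch that assumes the values are either bare DOIs, `doi:`-prefixed, or `doi.org`/`dx.doi.org` URLs; the function name and the sample DOIs are hypothetical, and anything that does not look like a DOI is returned unchanged.

```python
import re

# Prefixes to strip before rebuilding the canonical https://doi.org/ form
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Return the DOI as an https://doi.org/ URL, or the input if it is not a DOI."""
    value = value.strip()
    bare = DOI_PREFIX.sub("", value)
    if not bare.startswith("10."):
        # Not recognisably a DOI, so leave it for manual review
        return value
    return "https://doi.org/" + bare


if __name__ == "__main__":
    # Hypothetical values, not taken from the exported CSV
    for raw in ["http://dx.doi.org/10.1000/xyz", "doi:10.1000/xyz", "10.1000/xyz"]:
        print(raw, "->", normalize_doi(raw))
```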