From 77bc2f3b6bf1ca6b3290b38a6eda1540f7c76d27 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Sat, 21 Sep 2019 02:25:19 +0300
Subject: [PATCH] Update notes for 2019-09-20

---
 content/posts/2019-09.md | 43 ++++++++++++++++++++++++++++
 docs/2019-09/index.html  | 62 ++++++++++++++++++++++++++++++++++++++--
 docs/sitemap.xml         | 10 +++---
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/content/posts/2019-09.md b/content/posts/2019-09.md
index f73f26360..6bc4ca2eb 100644
--- a/content/posts/2019-09.md
+++ b/content/posts/2019-09.md
@@ -236,6 +236,7 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut

 ## 2019-09-20

+- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
 - Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
   - They want to do some enrichment of the metadata to add countries and regions
   - Also, they noticed that some items have a blank ISSN in the citation like "ISSN:"
@@ -248,4 +249,46 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
 $ perl-rename -n 's/_{2,3}/_/g' *.pdf
 ```

+- I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
+  - There are a *few dozen* that have completely mangled names due to some encoding error
+  - To make matters worse, when I tried to download them, some of the links in the "URL" column that Francesca included are wrong, so I had to go to the permalink and get a link that worked
+  - After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:
+
+```
+$ rename -v 's/___/_/g' *.pdf
+$ rename -v 's/__/_/g' *.pdf
+```
+
+- I'm still waiting to hear what Carol and Francesca want to do with the `1195.pdf.LCK` file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
+- I wrote two fairly long GREL expressions to clean up the institutional author names in the `dc.contributor.author` and `dc.identifier.citation` fields using OpenRefine
+  - The first targets acronyms in parentheses like "International Livestock Research Institute (ILRI)":
+
+```
+value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
+```
+  - The second targets cities and countries after names like "International Livestock Research Institute, Kenya":
+
+```
+replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
+```
+
+- I imported the 1,427 Bioversity records with bitstreams to a new collection called [2019-09-20 Bioversity Migration Test](https://dspacetest.cgiar.org/handle/10568/103688) on DSpace Test (after splitting them in two batches of about 700 each):
+
+```
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
+$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
+```
+
+- After that I exported the collection again and started doing some quality checks and cleanups:
+  - Change all DOIs to use https://doi.org format
+  - Change all bioversityinternational.org links to use https://
+  - Fix ten authors with invalid names like "Orth,." by checking the correct name in the citation
+  - Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!
+  - Fix some citations that were using "ISSN" instead of ISBN
+- The next steps are:
+  - Check for duplicates
+  - Continue with institutional author normalization
+  - Ask which collection to map items with type Brochure, Journal Item, and Thesis?
+
diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html
index c9a220154..916e1a1b4 100644
--- a/docs/2019-09/index.html
+++ b/docs/2019-09/index.html
@@ -40,7 +40,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
-
+
@@ -85,9 +85,9 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
   "@type": "BlogPosting",
   "headline": "September, 2019",
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-09\/",
-  "wordCount": "1778",
+  "wordCount": "2166",
   "datePublished": "2019-09-01T10:17:51\x2b03:00",
-  "dateModified": "2019-09-20T12:55:11\x2b03:00",
+  "dateModified": "2019-09-20T13:25:59\x2b03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -438,6 +438,8 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s

2019-09-20

diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index c24179f62..b2883d2f6 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/
- 2019-09-20T12:55:11+03:00
+ 2019-09-20T13:25:59+03:00
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2019-09-20T12:55:11+03:00
+ 2019-09-20T13:25:59+03:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2019-09-20T12:55:11+03:00
+ 2019-09-20T13:25:59+03:00
 https://alanorth.github.io/cgspace-notes/2019-09/
- 2019-09-20T12:55:11+03:00
+ 2019-09-20T13:25:59+03:00
 https://alanorth.github.io/cgspace-notes/tags/
- 2019-09-20T12:55:11+03:00
+ 2019-09-20T13:25:59+03:00
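
Reviewer note on the 2019-09-20 entry: the two GREL clean-ups for institutional author names could also be prototyped outside OpenRefine. The sketch below is a rough Python approximation and not the expressions actually used in the notes. The `ACRONYMS` and `PLACES` lists are short illustrative subsets of the alternations in the patch, and the function names `strip_acronym` and `strip_place` are hypothetical.

```python
import re

# Illustrative subsets only; the GREL expressions in the notes list many more.
ACRONYMS = ["CIAT", "FAO", "ILRI", "IPGRI", "LI-BIRD"]
PLACES = ["Lima", "Nairobi", "New Delhi", "Rome"]

# Optional comma and space, then an acronym wrapped in parentheses.
ACRONYM_RE = re.compile(
    r",? ?\((?:" + "|".join(re.escape(a) for a in ACRONYMS) + r")\)"
)

# Optional comma and space, then a bare city or country name.
# Like the GREL original this is unanchored, so it matches anywhere in the value.
PLACE_RE = re.compile(
    r",? ?(?:" + "|".join(re.escape(p) for p in PLACES) + r")"
)


def strip_acronym(name: str) -> str:
    """Remove a trailing parenthesised acronym such as ' (ILRI)'."""
    return ACRONYM_RE.sub("", name)


def strip_place(name: str) -> str:
    """Remove a city or country suffix such as ', Lima'."""
    return PLACE_RE.sub("", name)


if __name__ == "__main__":
    print(strip_acronym("International Livestock Research Institute (ILRI)"))
    print(strip_place("International Potato Center, Lima"))
```

Because neither pattern is anchored to the end of the value, a place name that appears inside a longer institution name would also be removed, so the results are worth spot-checking before they are written back to the metadata.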