mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2019-09-20
@ -236,6 +236,7 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut

## 2019-09-20

- Deploy a fresh snapshot of CGSpace's PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations
- Skype with Carol and Francesca to discuss the Bioversity migration to CGSpace
  - They want to do some enrichment of the metadata to add countries and regions
  - Also, they noticed that some items have a blank ISSN in the citation like "ISSN:"
@ -248,4 +249,46 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contribut
$ perl-rename -n 's/_{2,3}/_/g' *.pdf
```

- I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine against the ones on DSpace Test (where I had downloaded them last month)
  - There are a *few dozen* that have completely fucked up names due to some encoding error
  - To make matters worse, when I tried to download them, some of the links in the "URL" column that Francesca included were wrong, so I had to go to the permalink and get a link that worked
  - After downloading everything I had to use Ubuntu's version of rename to get rid of all the double and triple underscores:

```
$ rename -v 's/___/_/g' *.pdf
$ rename -v 's/__/_/g' *.pdf
```

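For machines without `perl-rename` or Ubuntu's `rename`, the same underscore cleanup could be sketched in Python. This is only an illustration, not what was actually run: `collapse_underscores` and `plan_renames` are hypothetical helper names, and the pattern `_{2,}` collapses runs of any length, which is slightly broader than the two-pass `___` then `__` substitution above.

```python
import re
from pathlib import Path


def collapse_underscores(name: str) -> str:
    """Replace every run of two or more underscores with a single one."""
    return re.sub(r"_{2,}", "_", name)


def plan_renames(directory: str) -> list[tuple[Path, Path]]:
    """Return (old, new) pairs for PDFs whose names would change (dry run only)."""
    pairs = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        new_name = collapse_underscores(pdf.name)
        if new_name != pdf.name:
            pairs.append((pdf, pdf.with_name(new_name)))
    return pairs


if __name__ == "__main__":
    # Print the plan only; call old.rename(new) on each pair to apply it.
    for old, new in plan_renames("."):
        print(f"{old.name} -> {new.name}")
```

Keeping the planning step separate from the actual rename mirrors the `-n` (dry run) habit used with `perl-rename`.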
- I'm still waiting to hear what Carol and Francesca want to do with the `1195.pdf.LCK` file (for now I've removed it from the CSV, but for future reference it has the number 630 in its permalink)
- I wrote two fairly long GREL expressions to clean up the institutional author names in the `dc.contributor.author` and `dc.identifier.citation` fields using OpenRefine
- The first targets acronyms in parentheses like "International Livestock Research Institute (ILRI)":

```
value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
```
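To check this kind of pattern outside OpenRefine, the expression can be approximated with Python's `re` module. This is a minimal sketch under assumptions: `ACRONYMS` is a deliberately shortened subset of the alternation above, and `strip_parenthesized` is a hypothetical helper name.

```python
import re

# Shortened alternation for illustration; the full list is in the GREL expression above.
ACRONYMS = ["CIAT", "FAO", "ILRI", "IPGRI", "Kenya", "United Kingdom"]
ACRONYM_RE = re.compile(r",? ?\((" + "|".join(map(re.escape, ACRONYMS)) + r")\)")


def strip_parenthesized(value: str) -> str:
    """Remove ' (ACRONYM)' or ', (ACRONYM)' wherever it appears in value."""
    return ACRONYM_RE.sub("", value)


if __name__ == "__main__":
    print(strip_parenthesized("International Livestock Research Institute (ILRI)"))
```

Because the optional `,? ?` prefix is part of the match, the space or comma before the parenthesis is removed together with the acronym, so no trailing whitespace is left behind.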
- The second targets cities and countries after names like "International Livestock Research Institute, Kenya":

```
value.replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
```

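One thing to watch with this second pattern is that it is not anchored, so a place name is removed wherever it occurs in the value, not only at the end. The Python sketch below illustrates the difference. It rests on assumptions: `PLACES` is a small subset of the alternation above, the anchored variant with `$` is a suggestion rather than something the notes use, and the institution name in the usage example is made up.

```python
import re

# Subset of the place names from the GREL expression above, for illustration only.
PLACES = ["Nairobi", "Peru", "Rome", "Uppsala"]
ALTERNATION = "|".join(map(re.escape, PLACES))

# Same shape as the GREL pattern: matches anywhere in the string.
UNANCHORED_RE = re.compile(r",? ?(" + ALTERNATION + r")")
# Assumption: only strip a place name when it is the last thing in the value.
ANCHORED_RE = re.compile(r",? ?(" + ALTERNATION + r")$")


def strip_place(value: str, anchored: bool = True) -> str:
    """Remove a trailing (or, if anchored=False, any) place name from value."""
    pattern = ANCHORED_RE if anchored else UNANCHORED_RE
    return pattern.sub("", value)


if __name__ == "__main__":
    name = "Rome Botanical Institute, Rome"  # hypothetical institution name
    print(repr(strip_place(name, anchored=False)))
    print(repr(strip_place(name, anchored=True)))
```

The unanchored call also removes the leading "Rome", while the anchored call only removes the trailing ", Rome".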
- I imported the 1,427 Bioversity records with bitstreams to a new collection called [2019-09-20 Bioversity Migration Test](https://dspacetest.cgiar.org/handle/10568/103688) on DSpace Test (after splitting them in two batches of about 700 each):

```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
$ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
```

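The notes don't record how the records were split into two batches. One possible approach is to divide the list of item directories into contiguous chunks of near-equal size, as in the sketch below. `split_batches` is a hypothetical helper, and moving the directories into `bioversity1` and `bioversity2` is left out.

```python
def split_batches(items: list, n_batches: int) -> list[list]:
    """Split items into n_batches contiguous chunks whose sizes differ by at most one."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches each take one additional item.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


if __name__ == "__main__":
    # Stand-in for 1,427 item directory names.
    records = [f"item_{n:04d}" for n in range(1427)]
    print([len(batch) for batch in split_batches(records, 2)])
```

Contiguous chunks keep the items in their original order, which makes it easier to match each `.map` file back to a range of rows in the source CSV.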
- After that I exported the collection again and started doing some quality checks and cleanups:
  - Change all DOIs to use https://doi.org format
  - Change all bioversityinternational.org links to use https://
  - Fix ten authors with invalid names like "Orth,." by checking the correct name in the citation
  - Fix several invalid ISBNs, but there are several more that contain incorrect ISBNs in their PDFs!
  - Fix some citations that were using "ISSN" instead of ISBN
- The next steps are:
  - Check for duplicates
  - Continue with institutional author normalization
  - Ask which collection items with type Brochure, Journal Item, and Thesis should be mapped to

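Some of the cleanups above are mechanical enough to script. The sketch below is one way to do the DOI, link, and ISBN checks in Python, and it is an illustration under assumptions rather than how the cleanup was actually done. The helper names are hypothetical, the DOI pattern is a common heuristic rather than a full validator, and `is_valid_isbn13` only verifies the ISBN-13 check digit (it does not handle ISBN-10).

```python
import re

DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def normalize_doi(value: str) -> str:
    """Return the DOI as an https://doi.org/ URL, or the input unchanged if not recognized."""
    match = DOI_RE.match(value.strip())
    return f"https://doi.org/{match.group(1)}" if match else value


def force_https(url: str) -> str:
    """Upgrade http:// links on bioversityinternational.org (and subdomains) to https://."""
    return re.sub(
        r"^http://((?:[\w-]+\.)*bioversityinternational\.org)",
        r"https://\1",
        url,
        flags=re.IGNORECASE,
    )


def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 check digit: weights alternate 1 and 3, total must be divisible by 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0
```

Values that don't look like a DOI are returned unchanged by `normalize_doi`, so they can be reviewed by hand instead of being silently rewritten.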
<!-- vim: set sw=2 ts=2: -->