From f37fb890929e8bdbd934e795bdcfd1209ac8515c Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 4 Sep 2018 17:08:34 +0300
Subject: [PATCH] Update notes for 2018-09-04

---
 content/posts/2018-09.md | 10 +++++++++-
 docs/2018-05/index.html  |  8 ++++----
 docs/2018-09/index.html  | 16 ++++++++++++----
 docs/sitemap.xml         | 12 ++++++------
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/content/posts/2018-09.md b/content/posts/2018-09.md
index 9e81e2581..aa7d1ec86 100644
--- a/content/posts/2018-09.md
+++ b/content/posts/2018-09.md
@@ -54,7 +54,15 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
 - I'm looking over the latest round of IITA records from Sisay: [Mercy1806_August_29](https://dspacetest.cgiar.org/handle/10568/104230)
   - All fields are split with multiple columns like `cg.authorship.types` and `cg.authorship.types[]`
   - This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)
-  - Five issue dates had values like `2013-5` so I corrected them to be `2013-05`
+  - Five items had `dc.date.issued` values like `2013-5` so I corrected them to be `2013-05`
   - Several metadata fields had values with newlines in them (even in some titles!), which I fixed by trimming the consecutive whitespaces in Open Refine
+  - Many (196!) items from before 2011 are indicated as having a CRP, but CRPs didn't exist then so this is impossible
+  - I got all items that were from 2011 and onwards using a custom facet with this GREL on the `dc.date.issued` column: `isNotNull(value.match(/201[1-8].*/))` and then blanking their CRPs
+  - Some affiliations with only one separator (|) for multiple values
+  - I replaced smart quotes like `’` with plain ones
+  - Some inconsistencies in `cg.subject.iita` like COWPEA and COWPEAS, and YAM and YAMS, etc, as well as some spelling mistakes like IMPACT ASSESSMENTN
+  - Some values in the `dc.identifier.isbn` are actually ISSNs so I moved them to the `dc.identifier.issn` column
+  - I found one invalid ISSN using a custom text facet with the regex from the [ISSN page on Wikipedia](https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format): `isNotBlank(value.match(/^\d{4}-\d{3}[\dxX]$/))`
+  - One invalid value for `dc.type`
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index 8438c60d4..71376ee77 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -22,7 +22,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 " />
-
+
 Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:
-
$ lein run /tmp/crps.csv id
+
$ lein run /tmp/crps.csv name id
 
diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
index cf86cefaa..d52e532df 100644
--- a/docs/2018-09/index.html
+++ b/docs/2018-09/index.html
@@ -18,7 +18,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
 " />
-
+
  • All fields are split with multiple columns like cg.authorship.types and cg.authorship.types[]
  • This makes it super annoying to do the checks and cleanup, so I will merge them (also time consuming)
  • -
  • Five issue dates had values like 2013-5 so I corrected them to be 2013-05
  • +
  • Five items had dc.date.issued values like 2013-5 so I corrected them to be 2013-05
  • Several metadata fields had values with newlines in them (even in some titles!), which I fixed by trimming the consecutive whitespaces in Open Refine
  • +
  • Many (196!) items from before 2011 are indicated as having a CRP, but CRPs didn’t exist then so this is impossible
  • +
  • I got all items that were from 2011 and onwards using a custom facet with this GREL on the dc.date.issued column: isNotNull(value.match(/201[1-8].*/)) and then blanking their CRPs
  • +
  • Some affiliations with only one separator (|) for multiple values
  • +
  • I replaced smart quotes like ’ with plain ones
  • +
  • Some inconsistencies in cg.subject.iita like COWPEA and COWPEAS, and YAM and YAMS, etc, as well as some spelling mistakes like IMPACT ASSESSMENTN
  • +
  • Some values in the dc.identifier.isbn are actually ISSNs so I moved them to the dc.identifier.issn column
  • +
  • I found one invalid ISSN using a custom text facet with the regex from the ISSN page on Wikipedia: isNotBlank(value.match(/^\d{4}-\d{3}[\dxX]$/))
  • +
  • One invalid value for dc.type
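For reference, the date, CRP, and ISSN checks in the list above could be scripted outside Open Refine. This is only a minimal Python sketch, not the GREL workflow the notes actually used: the sample records, the `crp` key, and the function names are hypothetical, and it assumes the intent was to blank CRPs on items issued before 2011 (that is, on the rows the `201[1-8].*` facet does not match).

```python
import re

# Same pattern as the GREL text facet: four digits, a hyphen,
# three digits, then a final digit or X
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dxX]$")


def normalize_issued_date(value):
    """Zero-pad month and day so that 2013-5 becomes 2013-05."""
    parts = value.strip().split("-")
    return "-".join([parts[0]] + [part.zfill(2) for part in parts[1:]])


def is_crp_era(issued):
    """True for dates from 2011 to 2018, mirroring the 201[1-8].* facet."""
    return re.fullmatch(r"201[1-8].*", issued) is not None


def looks_like_issn(value):
    """True when the value has the NNNN-NNNC shape of an ISSN."""
    return ISSN_PATTERN.match(value) is not None


def clean(record):
    """Return a cleaned copy of one record (a plain dict of column values)."""
    cleaned = dict(record)
    cleaned["dc.date.issued"] = normalize_issued_date(record["dc.date.issued"])
    # Assumption: CRPs are blanked on items issued before 2011
    if not is_crp_era(cleaned["dc.date.issued"]):
        cleaned["crp"] = ""
    return cleaned


# Hypothetical rows for illustration only, not real IITA records
records = [
    {"dc.date.issued": "2013-5", "crp": "Example CRP", "dc.identifier.issn": "0378-5955"},
    {"dc.date.issued": "2009-11", "crp": "Example CRP", "dc.identifier.issn": "978-3-16"},
]

for record in records:
    cleaned = clean(record)
    flag = "" if looks_like_issn(cleaned["dc.identifier.issn"]) else " (ISSN format looks invalid)"
    print(cleaned["dc.date.issued"], repr(cleaned["crp"]) + flag)
```

The ISSN check here only tests the format, like the facet in the notes; it does not verify the check digit.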
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index b71ab02d5..496215727 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-09/
-2018-09-03T16:47:24+03:00
+2018-09-04T13:25:13+03:00
@@ -24,7 +24,7 @@
 https://alanorth.github.io/cgspace-notes/2018-05/
-2018-05-31T15:53:12-07:00
+2018-09-04T16:15:26+03:00
@@ -184,7 +184,7 @@
 https://alanorth.github.io/cgspace-notes/
-2018-09-03T16:47:24+03:00
+2018-09-04T13:25:13+03:00
 0
@@ -195,7 +195,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
-2018-09-03T16:47:24+03:00
+2018-09-04T13:25:13+03:00
 0
@@ -207,13 +207,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
-2018-09-03T16:47:24+03:00
+2018-09-04T13:25:13+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
-2018-09-03T16:47:24+03:00
+2018-09-04T13:25:13+03:00
 0