cgspace-notes/content/posts/2019-10.md

1.9 KiB
Raw Blame History

title date author tags
October, 2019 2019-10-01T13:20:51+03:00 Alan Orth
Notes

2019-10-01

  • Udana from IWMI asked me for a CSV export of their community on CGSpace
    • I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data
    • I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script's "unneccesary Unicode" fix:
$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
  • Then I replace them in vim with :% s/\%u00a0/ /g because I can't figure out the correct sed syntax to do it directly from the pipe above
  • I uploaded those to CGSpace and then re-exported the metadata
  • Now that I think about it, I shouldn't be removing non-breaking spaces (U+00A0), I should be replacing them with normal spaces!
  • I modified the script so it replaces the non-breaking spaces instead of removing them
  • Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):
$ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u

2019-10-03

  • Upload the 117 IITA records that we had been working on last month (aka 20196th.xls aka Sept 6) to CGSpace