README.md: Add note about missing regions
continuous-integration/drone/push Build is passing Details

This commit is contained in:
Alan Orth 2022-07-28 16:58:01 +03:00
parent da87531779
commit 032a1db392
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
1 changed files with 5 additions and 1 deletions

View File

@ -28,6 +28,7 @@ If you use the DSpace CSV metadata quality checker please cite:
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
- Check for countries with missing regions (and attempt to fix with `--unsafe-fixes`)
- Remove duplicate metadata values
- Check for duplicate items, using the title, type, and date issued as an indicator
@ -70,7 +71,7 @@ While it is *theoretically* possible for a single `|` character to be used legit
This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
## Unsafe Fixes
You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines, perform Unicode normalization, and attempt to fix "mojibake" characters.
You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines, perform Unicode normalization, attempt to fix "mojibake" characters, and add missing UN M.49 regions.
### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
@ -91,6 +92,9 @@ Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-
Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid it in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
### Countries With Missing Regions
When an input file has both country and region columns we can check to see if the ISO 3166 country names have matching UN M.49 regions and add them when they are missing.
## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields: