From a04dbc50dbda79024642e9b92eeda895536d9aeb Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Fri, 19 Mar 2021 11:48:27 +0200
Subject: [PATCH] Add notes about checking and fixing mojibake

---
 CHANGELOG.md | 4 ++++
 README.md    | 9 +++++++++
 2 files changed, 13 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index a77a7ac..d374f1a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased
+### Added
+- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
+
 ## [0.4.7] - 2021-03-17
 ### Changed
 - Fixing invalid multi-value separators like `|` and `|||` is no longer class-
diff --git a/README.md b/README.md
index 84ecfc5..c88603d 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,7 @@ If you use the DSpace CSV metadata quality checker please cite:
 - Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
+- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
 - Remove duplicate metadata values
 - Check for duplicate items, using the title, type, and date issued as an indicator
@@ -75,6 +76,14 @@ This is considered "unsafe" because some systems give special importance to vert
 Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
+### Encoding Issues aka "Mojibake"
+[Mojibake](https://en.wikipedia.org/wiki/Mojibake) is a phenomenon that occurs when text is decoded using an unintended character encoding. This usually presents itself in the form of strange, garbled characters in the text. Enabling "unsafe" fixes will attempt to correct these, for example:
+
+- CIAT PublicaÃ§ao → CIAT Publicaçao
+- CIAT PublicaciÃ³n → CIAT Publicación
+
+Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid them in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
+
 ## AGROVOC Validation
 You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields:
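
A note on the "Mojibake" section added by this patch: the garbling it describes can be reproduced with the Python standard library alone. The sketch below is only an illustration and is not how the checker works (per the changelog entry, the tool relies on ftfy). The function names `garble` and `naive_fix` are hypothetical, and the sketch assumes the common case of UTF-8 bytes that were mistakenly decoded as Windows-1252.

```python
def garble(text: str) -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode("cp1252")


def naive_fix(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up.

    If the text does not round-trip cleanly it is assumed to be fine
    already and is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = garble("CIAT Publicación")
print(broken)             # garbled form, as in the README example
print(naive_fix(broken))  # original text restored
```

A single-codec reversal like this handles only one kind of mix-up, which is one reason a dedicated library is the more robust choice, and why the resulting file should still be reviewed by hand.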