mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2024-11-22 13:55:03 +01:00
Add notes about checking and fixing mojibake
parent 28335ed159
commit a04dbc50db
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Added
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
## [0.4.7] - 2021-03-17
### Changed
- Fixing invalid multi-value separators like `|` and `|||` is no longer class-
@@ -20,6 +20,7 @@ If you use the DSpace CSV metadata quality checker please cite:
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
- Remove duplicate metadata values
- Check for duplicate items, using the title, type, and date issued as an indicator
@@ -75,6 +76,14 @@ This is considered "unsafe" because some systems give special importance to vert
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
### Encoding Issues aka "Mojibake"
[Mojibake](https://en.wikipedia.org/wiki/Mojibake) is a phenomenon that occurs when text is decoded using an unintended character encoding. This usually presents itself in the form of strange, garbled characters in the text. Enabling "unsafe" fixes will attempt to correct these, for example:
- CIAT PublicaÃ§ao → CIAT Publicaçao
- CIAT PublicaciÃ³n → CIAT Publicación
Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid them in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
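
As a rough illustration of what such a fix involves, the sketch below reverses the most common kind of mojibake (UTF-8 text that was mistakenly decoded as Windows-1252) using only the Python standard library. It is not how this tool or ftfy implements the fix: `fix_mojibake` is a hypothetical helper written for this example, and ftfy covers many more encodings and is more careful about false positives.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Hypothetical helper for illustration only; it is not part of
    csv-metadata-quality or ftfy.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this kind of mojibake, so leave it untouched.
        return text


for value in ["CIAT PublicaÃ§ao", "CIAT PublicaciÃ³n", "CIAT Publicación"]:
    print(f"{value} → {fix_mojibake(value)}")
```

Text that is already correct, such as the third value, fails the round trip and is returned unchanged. That fallback is why a fix like this can be attempted on every value, although the result should still be reviewed by hand.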
## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields: