mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2025-05-09 22:56:01 +02:00
Compare commits
21 Commits
v0.4.7
...
c8f5539d21
| Author | SHA1 | Date |
|---|---|---|
| | c8f5539d21 | |
| | 382d0d6aed | |
| | b8f4be9ebb | |
| | 4e2eab68b0 | |
| | 55165cb4ce | |
| | 93d3eabfba | |
| | a8fe623f4c | |
| | dbc0437d59 | |
| | 96ce1daa90 | |
| | 3adb52d7c0 | |
| | f958d1879f | |
| | bd8943f36a | |
| | 28f9026286 | |
| | cfe09f7126 | |
| | 8eddb76aab | |
| | a04dbc50db | |
| | 28335ed159 | |
| | 773a0a2695 | |
| | 39a4b1a487 | |
| | 898bb412c3 | |
| | e92ec5d371 | |
@ -4,6 +4,13 @@ All notable changes to this project will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## Unreleased
|
||||
### Added
|
||||
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
|
||||
|
||||
### Updated
|
||||
- Python dependencies
|
||||
|
||||
## [0.4.7] - 2021-03-17
|
||||
### Changed
|
||||
- Fixing invalid multi-value separators like `|` and `|||` is no longer class-
|
||||
|
16
README.md
@ -20,7 +20,9 @@ If you use the DSpace CSV metadata quality checker please cite:
|
||||
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
|
||||
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
|
||||
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
|
||||
- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
|
||||
- Remove duplicate metadata values
|
||||
- Check for duplicate items, using the title, type, and date issued as an indicator
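A minimal sketch of the duplicate check described in the last bullet, assuming each item has already been reduced to its title, type, and date issued. The function name `find_possible_duplicates` and the tuple layout are hypothetical; the project's own `check.duplicate_items` operates on a pandas DataFrame and may differ in detail.

```python
def find_possible_duplicates(items):
    """Return titles of items whose (title, type, date issued) was already seen.

    `items` is an iterable of (title, item_type, date_issued) tuples; this
    layout is an assumption made for the example.
    """
    seen = set()
    duplicates = []

    for title, item_type, date_issued in items:
        key = (title, item_type, date_issued)

        if key in seen:
            # Same title, type, and date as an earlier item
            duplicates.append(title)
        else:
            seen.add(key)

    return duplicates


rows = [
    ("Duplicate Title", "Report", "2021-03-17"),
    ("Duplicate Title", "Report", "2021-03-17"),
    ("Duplicate Title", "Presentation", "2021-03-17"),
]

print(find_possible_duplicates(rows))
```

Comparing all three fields, rather than the title alone, keeps a Report and a Presentation that share a title from being flagged as duplicates.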
|
||||
|
||||
## Installation
|
||||
The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
|
||||
@ -61,7 +63,7 @@ While it is *theoretically* possible for a single `|` character to be used legit
|
||||
This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
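As a rough illustration of removing trailing separators, the sketch below splits a field on `||` and drops empty values. The helper name `drop_empty_values` is hypothetical and this is not the project's `fix.separators` implementation; in particular, it does not repair invalid separators such as `|` or `|||`.

```python
def drop_empty_values(field, separator="||"):
    """Remove empty values left behind by unnecessary separators."""
    # Splitting "Kenya||Tanzania||" yields a trailing empty string, which
    # is filtered out before the values are joined again.
    values = [value for value in field.split(separator) if value.strip()]

    return separator.join(values)


print(drop_empty_values("Kenya||Tanzania||"))
```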
|
||||
|
||||
## Unsafe Fixes
|
||||
-You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines and perform Unicode normalization.
+You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines, perform Unicode normalization, and attempt to fix "mojibake" characters.
|
||||
|
||||
### Newlines
|
||||
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
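A minimal sketch of stripping Unix line feeds, assuming the fix simply deletes every U+000A character. The helper name `strip_line_feeds` is hypothetical, and the project's own fix may report or handle matches differently.

```python
def strip_line_feeds(field):
    """Remove Unix line feeds (U+000A) from a metadata value."""
    # Only "\n" is removed; carriage returns and other whitespace are
    # left untouched in this sketch.
    return field.replace("\n", "")


print(strip_line_feeds("Ken\nya"))
```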
|
||||
@ -74,6 +76,14 @@ This is considered "unsafe" because some systems give special importance to vert
|
||||
|
||||
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
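To make the effect of normalization concrete, the sketch below composes a decomposed "é" (the letter "e" followed by U+0301, a combining acute accent) into the single code point U+00E9 using Python's standard `unicodedata` module. The `is_nfc` helper mirrors the comparison used in the project's `util.py`; the sample name comes from the test suite.

```python
from unicodedata import normalize


def is_nfc(field):
    """Check whether a string is already in NFC form."""
    return field == normalize("NFC", field)


# "e" followed by a combining acute accent (decomposed form)
decomposed = "Oue\u0301draogo, Mathieu"
# NFC composes the pair into the single character U+00E9
composed = normalize("NFC", decomposed)

print(is_nfc(decomposed))
print(is_nfc(composed))
```

The two strings render identically but compare as unequal, which is why unnormalized values can defeat exact-match lookups and duplicate checks.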
|
||||
|
||||
### Encoding Issues aka "Mojibake"
|
||||
[Mojibake](https://en.wikipedia.org/wiki/Mojibake) is a phenomenon that occurs when text is decoded using an unintended character encoding. This usually presents itself in the form of strange, garbled characters in the text. Enabling "unsafe" fixes will attempt to correct these, for example:
|
||||
|
||||
- CIAT PublicaÃ§ao → CIAT Publicaçao
|
||||
- CIAT PublicaciÃ³n → CIAT Publicación
|
||||
|
||||
Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid them in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
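The sketch below shows how this kind of mojibake arises and how it can be reversed, using only Python's standard codecs. It assumes the simplest case, UTF-8 bytes decoded once as CP-1252. The project itself relies on ftfy, which handles many more cases than this single round trip.

```python
original = "CIAT Publicaçao"

# UTF-8 encodes "ç" as two bytes (0xC3 0xA7); decoding those bytes as
# CP-1252 turns each byte into its own character.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```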
|
||||
|
||||
## AGROVOC Validation
|
||||
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields:
|
||||
|
||||
@ -116,10 +126,6 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
|
||||
- Warn if item is Open Access, but missing a license
|
||||
- Warn if item has an ISSN but no journal title
|
||||
- Update journal titles from ISSN
|
||||
- Check for duplicates
|
||||
- If I check titles only, then I might miss if one is a Report and another is a Presentation
|
||||
- I could just check each item against each other item, but that sounds slow...
|
||||
- Perhaps I could check for the number of unique values in a few rows, like title and doi, and see if it is the same as the total number of items
|
||||
|
||||
## License
|
||||
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
from sys import argv
|
||||
|
||||
from csv_metadata_quality import app
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import argparse
|
||||
import re
|
||||
import signal
|
||||
@ -107,6 +109,13 @@ def run(argv):
|
||||
# Check: suspicious characters
|
||||
df[column].apply(check.suspicious_characters, field_name=column)
|
||||
|
||||
# Check: mojibake
|
||||
df[column].apply(check.mojibake, field_name=column)
|
||||
|
||||
# Fix: mojibake
|
||||
if args.unsafe_fixes:
|
||||
df[column] = df[column].apply(fix.mojibake, field_name=column)
|
||||
|
||||
# Fix: invalid and unnecessary multi-value separators
|
||||
df[column] = df[column].apply(fix.separators, field_name=column)
|
||||
# Run whitespace fix again after fixing invalid separators
|
||||
@ -155,13 +164,16 @@ def run(argv):
|
||||
|
||||
# Check: duplicate items
|
||||
# We extract just the title, type, and date issued columns to analyze
|
||||
-duplicates_df = df.filter(
-    regex=r"dcterms\.title|dc\.title|dcterms\.type|dc\.type|dcterms\.issued|dc\.date\.issued"
-)
-check.duplicate_items(duplicates_df)
+try:
+    duplicates_df = df.filter(
+        regex=r"dcterms\.title|dc\.title|dcterms\.type|dc\.type|dcterms\.issued|dc\.date\.issued"
+    )
+    check.duplicate_items(duplicates_df)

-# Delete the temporary duplicates DataFrame
-del duplicates_df
+    # Delete the temporary duplicates DataFrame
+    del duplicates_df
+except IndexError:
+    pass
|
||||
|
||||
##
|
||||
# Perform some checks on rows so we can consider items as a whole rather
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import os
|
||||
import re
|
||||
from datetime import datetime, timedelta
|
||||
@ -11,6 +13,8 @@ from pycountry import languages
|
||||
from stdnum import isbn as stdnum_isbn
|
||||
from stdnum import issn as stdnum_issn
|
||||
|
||||
from csv_metadata_quality.util import is_mojibake
|
||||
|
||||
|
||||
def issn(field):
|
||||
"""Check if an ISSN is valid.
|
||||
@ -174,13 +178,9 @@ def language(field):
|
||||
if len(value) == 2:
|
||||
if not languages.get(alpha_2=value):
|
||||
print(f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}")
|
||||
|
||||
pass
|
||||
elif len(value) == 3:
|
||||
if not languages.get(alpha_3=value):
|
||||
print(f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}")
|
||||
|
||||
pass
|
||||
else:
|
||||
print(f"{Fore.RED}Invalid language: {Fore.RESET}{value}")
|
||||
|
||||
@ -216,7 +216,7 @@ def agrovoc(field, field_name):
|
||||
)
|
||||
|
||||
# prune old cache entries
|
||||
-requests_cache.core.remove_expired_responses()
+requests_cache.remove_expired_responses()
|
||||
|
||||
# Try to split multi-value field on "||" separator
|
||||
for value in field.split("||"):
|
||||
@ -301,8 +301,6 @@ def spdx_license_identifier(field):
|
||||
if value not in spdx_license_list.LICENSES:
|
||||
print(f"{Fore.YELLOW}Non-SPDX license identifier: {Fore.RESET}{value}")
|
||||
|
||||
pass
|
||||
|
||||
return
|
||||
|
||||
|
||||
@ -345,3 +343,22 @@ def duplicate_items(df):
|
||||
)
|
||||
else:
|
||||
items.append(item_title_type_date)
|
||||
|
||||
|
||||
def mojibake(field, field_name):
|
||||
"""Check for mojibake (text that was encoded in one encoding and decoded in
|
||||
another, perhaps multiple times). See util.py.
|
||||
|
||||
Prints the string if it contains suspected mojibake.
|
||||
"""
|
||||
|
||||
# Skip fields with missing values
|
||||
if pd.isna(field):
|
||||
return
|
||||
|
||||
if is_mojibake(field):
|
||||
print(
|
||||
f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}"
|
||||
)
|
||||
|
||||
return
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import re
|
||||
|
||||
import langid
|
||||
|
@ -1,10 +1,13 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import re
|
||||
from unicodedata import normalize
|
||||
|
||||
import pandas as pd
|
||||
from colorama import Fore
|
||||
from ftfy import fix_text
|
||||
|
||||
-from csv_metadata_quality.util import is_nfc
+from csv_metadata_quality.util import is_mojibake, is_nfc
|
||||
|
||||
|
||||
def whitespace(field, field_name):
|
||||
@ -253,3 +256,22 @@ def normalize_unicode(field, field_name):
|
||||
field = normalize("NFC", field)
|
||||
|
||||
return field
|
||||
|
||||
|
||||
def mojibake(field, field_name):
|
||||
"""Attempts to fix mojibake (text that was encoded in one encoding and
decoded in another, perhaps multiple times). See util.py.
|
||||
|
||||
Return fixed string.
|
||||
"""
|
||||
|
||||
# Skip fields with missing values
|
||||
if pd.isna(field):
|
||||
return field
|
||||
|
||||
if is_mojibake(field):
|
||||
print(f"{Fore.GREEN}Fixing encoding issue ({field_name}): {Fore.RESET}{field}")
|
||||
|
||||
return fix_text(field)
|
||||
else:
|
||||
return field
|
||||
|
@ -1,3 +1,8 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
from ftfy.badness import sequence_weirdness
|
||||
|
||||
|
||||
def is_nfc(field):
|
||||
"""Utility function to check whether a string is using normalized Unicode.
|
||||
Python's built-in unicodedata library has the is_normalized() function, but
|
||||
@ -12,3 +17,35 @@ def is_nfc(field):
|
||||
from unicodedata import normalize
|
||||
|
||||
return field == normalize("NFC", field)
|
||||
|
||||
|
||||
def is_mojibake(field):
|
||||
"""Determines whether a string contains mojibake.
|
||||
|
||||
We commonly deal with CSV files that were *encoded* in UTF-8, but decoded
|
||||
as something else like CP-1252 (Windows Latin). This manifests in the form
|
||||
of "mojibake", for example:
|
||||
|
||||
- CIAT PublicaÃ§ao
|
||||
- CIAT PublicaciÃ³n
|
||||
|
||||
This uses the excellent "fixes text for you" (ftfy) library to determine
|
||||
whether a string contains characters that have been encoded in one encoding
|
||||
and decoded in another.
|
||||
|
||||
Inspired by this code snippet from Martijn Pieters on StackOverflow:
|
||||
https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
|
||||
|
||||
Return boolean.
|
||||
"""
|
||||
if not sequence_weirdness(field):
|
||||
# Nothing weird, should be okay
|
||||
return False
|
||||
try:
|
||||
field.encode("sloppy-windows-1252")
|
||||
except UnicodeEncodeError:
|
||||
# Not CP-1252 encodable, probably fine
|
||||
return False
|
||||
else:
|
||||
# Encodable as CP-1252, Mojibake alert level high
|
||||
return True
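The second stage of `is_mojibake` can be approximated with the standard library alone. This sketch uses Python's built-in `cp1252` codec in place of ftfy's `sloppy-windows-1252` and does not reproduce the `sequence_weirdness` check; the helper name `is_cp1252_encodable` is hypothetical.

```python
def is_cp1252_encodable(field):
    """Check whether every character in a string exists in CP-1252."""
    try:
        field.encode("cp1252")
    except UnicodeEncodeError:
        # At least one character has no CP-1252 representation
        return False

    return True


print(is_cp1252_encodable("CIAT PublicaÃ§ao"))
print(is_cp1252_encodable("CIAT Publicaçao"))
print(is_cp1252_encodable("東京"))
```

Both the garbled and the correct spelling are encodable as CP-1252, so this test alone cannot separate them. That is why the function first asks ftfy whether the string looks weird and only then treats CP-1252 encodability as a sign of mojibake.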
|
||||
|
@ -1 +1,3 @@
|
||||
-VERSION = "0.4.7"
+# SPDX-License-Identifier: GPL-3.0-only
+
+VERSION = "0.4.8-dev"
|
||||
|
@ -32,3 +32,4 @@ Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,,,
|
||||
Invalid SPDX license identifier,2021-03-11,,,,,,,CC-BY,
|
||||
Duplicate Title,2021-03-17,,,,,,,,Report
|
||||
Duplicate Title,2021-03-17,,,,,,,,Report
|
||||
Mojibake,2021-03-18,,,,CIAT PublicaÃ§ao,,,,Report
|
||||
|
|
734
poetry.lock
generated
File diff suppressed because it is too large
@ -1,6 +1,6 @@
|
||||
[tool.poetry]
|
||||
name = "csv-metadata-quality"
|
||||
-version = "0.4.7"
+version = "0.4.8-dev"
|
||||
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
|
||||
authors = ["Alan Orth <alan.orth@gmail.com>"]
|
||||
license="GPL-3.0-only"
|
||||
@ -16,18 +16,19 @@ pandas = "^1.0.4"
|
||||
python-stdnum = "^1.13"
|
||||
xlrd = "^1.2.0"
|
||||
requests = "^2.23.0"
|
||||
-requests-cache = "^0.5.2"
+requests-cache = "~0.6.4"
|
||||
pycountry = "^19.8.18"
|
||||
langid = "^1.1.6"
|
||||
colorama = "^0.4.4"
|
||||
spdx-license-list = "^0.5.2"
|
||||
ftfy = "^5.9"
|
||||
|
||||
[tool.poetry.dev-dependencies]
|
||||
pytest = "^6.1.1"
|
||||
ipython = { version = "^7.18.1", python = "^3.7" }
|
||||
flake8 = "^3.8.4"
|
||||
-pytest-clarity = "^0.3.0-alpha.0"
-black = "20.8b1"
+pytest-clarity = "^1.0.1"
+black = "^21.6b0"
|
||||
isort = "^5.5.4"
|
||||
csvkit = "^1.0.5"
|
||||
|
||||
|
@ -2,74 +2,80 @@ agate-dbf==0.2.2
|
||||
agate-excel==0.2.3
|
||||
agate-sql==0.5.6
|
||||
agate==1.6.2
|
||||
appdirs==1.4.4; python_version >= "3.6"
|
||||
appdirs==1.4.4; python_full_version >= "3.6.2"
|
||||
appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
|
||||
atomicwrites==1.4.0; python_version >= "3.6" and python_full_version < "3.0.0" and sys_platform == "win32" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") or sys_platform == "win32" and python_version >= "3.6" and python_full_version >= "3.4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
|
||||
attrs==20.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
babel==2.9.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
|
||||
attrs==21.2.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
babel==2.9.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
|
||||
backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
|
||||
black==20.8b1; python_version >= "3.6"
|
||||
certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
click==7.1.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
black==21.6b0; python_full_version >= "3.6.2"
|
||||
certifi==2021.5.30; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
chardet==4.0.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
click==8.0.1; python_version >= "3.6" and python_full_version >= "3.6.2"
|
||||
colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
commonmark==0.9.1; python_version >= "3.6" and python_version < "4.0" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0")
|
||||
csvkit==1.0.5
|
||||
dbfread==2.0.7
|
||||
decorator==4.4.2; python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "4.0" or python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.2.0"
|
||||
et-xmlfile==1.0.1; python_version >= "3.6"
|
||||
flake8==3.9.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
greenlet==1.0.0; python_version >= "3" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3"
|
||||
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
importlib-metadata==3.7.3; python_version < "3.8" and python_version >= "3.6" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.5.0" and python_version < "3.8" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.6" and python_version < "3.8") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version < "3.8" and python_version >= "3.6")
|
||||
decorator==5.0.9; python_version >= "3.7" and python_version < "4.0"
|
||||
et-xmlfile==1.1.0; python_version >= "3.6"
|
||||
flake8==3.9.2; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
ftfy==5.9; python_version >= "3.5"
|
||||
greenlet==1.1.0; python_version >= "3" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3"
|
||||
idna==2.10; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
importlib-metadata==4.6.1; python_version < "3.8" and python_version >= "3.6" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.5.0" and python_version < "3.8" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.6" and python_version < "3.8") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") and python_full_version >= "3.6.2" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version < "3.8" and python_version >= "3.6")
|
||||
iniconfig==1.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
|
||||
ipython==7.21.0; python_version >= "3.7" and python_version < "4.0"
|
||||
ipython==7.25.0; python_version >= "3.7" and python_version < "4.0"
|
||||
isodate==0.6.0
|
||||
isort==5.7.0; python_version >= "3.6" and python_version < "4.0"
|
||||
isort==5.9.1; python_full_version >= "3.6.1" and python_version < "4.0"
|
||||
itsdangerous==2.0.1; python_version >= "3.6"
|
||||
jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
|
||||
langid==1.1.6
|
||||
leather==0.3.3
|
||||
matplotlib-inline==0.1.2; python_version >= "3.7" and python_version < "4.0"
|
||||
mccabe==0.6.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
mypy-extensions==0.4.3; python_version >= "3.6"
|
||||
numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
|
||||
mypy-extensions==0.4.3; python_full_version >= "3.6.2"
|
||||
numpy==1.21.0; python_version >= "3.7" and python_full_version >= "3.7.1"
|
||||
openpyxl==3.0.7; python_version >= "3.6"
|
||||
packaging==20.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
pandas==1.2.3; python_full_version >= "3.7.1"
|
||||
packaging==21.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
pandas==1.3.0; python_full_version >= "3.7.1"
|
||||
parsedatetime==2.6
|
||||
parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
|
||||
pathspec==0.8.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
parso==0.8.2; python_version >= "3.7" and python_version < "4.0"
|
||||
pathspec==0.8.1; python_full_version >= "3.6.2"
|
||||
pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
|
||||
pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
|
||||
pluggy==0.13.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
prompt-toolkit==3.0.17; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
|
||||
pprintpp==0.4.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
|
||||
prompt-toolkit==3.0.19; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
|
||||
ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
|
||||
py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
pycodestyle==2.7.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
pycountry==19.8.18
|
||||
pyflakes==2.3.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
pygments==2.8.1; python_version >= "3.7" and python_version < "4.0"
|
||||
pyicu==2.6
|
||||
pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
pytest-clarity==0.3.0a0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
|
||||
pytest==6.2.2; python_version >= "3.6"
|
||||
pyflakes==2.3.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
pygments==2.9.0; python_version >= "3.7" and python_version < "4.0" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0")
|
||||
pyicu==2.7.4
|
||||
pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.6"
|
||||
pytest-clarity==1.0.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
|
||||
pytest==6.2.4; python_version >= "3.6"
|
||||
python-dateutil==2.8.1; python_full_version >= "3.7.1"
|
||||
python-slugify==4.0.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
python-slugify==5.0.2; python_version >= "3.6"
|
||||
python-stdnum==1.16
|
||||
pytimeparse==1.1.8
|
||||
pytz==2021.1; python_full_version >= "3.7.1"
|
||||
regex==2020.11.13; python_version >= "3.6"
|
||||
requests-cache==0.5.2
|
||||
regex==2021.7.6; python_full_version >= "3.6.2"
|
||||
requests-cache==0.6.4; python_version >= "3.6"
|
||||
requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
six==1.15.0; python_full_version >= "3.7.1"
|
||||
rich==10.5.0; python_version >= "3.6" and python_version < "4.0" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0")
|
||||
six==1.16.0; python_full_version >= "3.7.1" and python_version >= "3.6" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.3.0")
|
||||
spdx-license-list==0.5.2
|
||||
sqlalchemy==1.4.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
|
||||
termcolor==1.1.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
|
||||
text-unidecode==1.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
toml==0.10.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
sqlalchemy==1.4.20; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
|
||||
text-unidecode==1.3; python_version >= "3.6"
|
||||
toml==0.10.2; python_full_version >= "3.6.2" and python_version >= "3.6" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
|
||||
traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
|
||||
typed-ast==1.4.2; python_version >= "3.6"
|
||||
typing-extensions==3.7.4.3; python_version < "3.8" and python_version >= "3.6"
|
||||
urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
|
||||
typed-ast==1.4.3; python_version < "3.8" and python_full_version >= "3.6.2"
|
||||
typing-extensions==3.10.0.0; python_version < "3.8" and python_full_version >= "3.6.2" and python_version >= "3.6" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0")
|
||||
url-normalize==1.4.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
|
||||
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4" and python_version >= "3.6"
|
||||
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
|
||||
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
|
||||
zipp==3.4.1; python_version < "3.8" and python_version >= "3.6"
|
||||
zipp==3.5.0; python_version < "3.8" and python_version >= "3.6"
|
||||
|
@ -1,17 +1,21 @@
|
||||
certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
certifi==2021.5.30; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
chardet==4.0.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
ftfy==5.9; python_version >= "3.5"
|
||||
idna==2.10; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
|
||||
itsdangerous==2.0.1; python_version >= "3.6"
|
||||
langid==1.1.6
|
||||
numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
|
||||
pandas==1.2.3; python_full_version >= "3.7.1"
|
||||
numpy==1.21.0; python_version >= "3.7" and python_full_version >= "3.7.1"
|
||||
pandas==1.3.0; python_full_version >= "3.7.1"
|
||||
pycountry==19.8.18
|
||||
python-dateutil==2.8.1; python_full_version >= "3.7.1"
|
||||
python-stdnum==1.16
|
||||
pytz==2021.1; python_full_version >= "3.7.1"
|
||||
requests-cache==0.5.2
|
||||
requests-cache==0.6.4; python_version >= "3.6"
|
||||
requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
six==1.15.0; python_full_version >= "3.7.1"
|
||||
six==1.16.0; python_full_version >= "3.7.1" and python_version >= "3.6"
|
||||
spdx-license-list==0.5.2
|
||||
urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
|
||||
url-normalize==1.4.3; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3.6"
|
||||
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4" and python_version >= "3.6"
|
||||
wcwidth==0.2.5; python_version >= "3.5"
|
||||
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
|
||||
|
2
setup.py
@ -14,7 +14,7 @@ install_requires = [
|
||||
|
||||
setuptools.setup(
|
||||
name="csv-metadata-quality",
|
||||
-version="0.4.7",
+version="0.4.8-dev",
|
||||
author="Alan Orth",
|
||||
author_email="aorth@mjanja.ch",
|
||||
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import pandas as pd
|
||||
from colorama import Fore
|
||||
|
||||
@ -339,3 +341,29 @@ def test_check_duplicate_item(capsys):
|
||||
captured.out
|
||||
== f"{Fore.YELLOW}Possible duplicate (dc.title): {Fore.RESET}{item_title}\n"
|
||||
)
|
||||
|
||||
|
||||
def test_check_no_mojibake():
|
||||
"""Test string with no mojibake."""
|
||||
|
||||
field = "CIAT Publicaçao"
|
||||
field_name = "dcterms.isPartOf"
|
||||
|
||||
result = check.mojibake(field, field_name)
|
||||
|
||||
assert result is None
|
||||
|
||||
|
||||
def test_check_mojibake(capsys):
|
||||
"""Test string with mojibake."""
|
||||
|
||||
field = "CIAT PublicaÃ§ao"
|
||||
field_name = "dcterms.isPartOf"
|
||||
|
||||
result = check.mojibake(field, field_name)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert (
|
||||
captured.out
|
||||
== f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}\n"
|
||||
)
|
||||
|
@ -1,3 +1,5 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-only
|
||||
|
||||
import csv_metadata_quality.fix as fix
|
||||
|
||||
|
||||
@ -108,3 +110,12 @@ def test_fix_decomposed_unicode():
|
||||
field_name = "dc.contributor.author"
|
||||
|
||||
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
|
||||
|
||||
|
||||
def test_fix_mojibake():
|
||||
"""Test string with mojibake."""

field = "CIAT PublicaÃ§ao"
|
||||
field_name = "dcterms.isPartOf"
|
||||
|
||||
assert fix.mojibake(field, field_name) == "CIAT Publicaçao"
|
||||
|