1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-10-25 10:51:14 +02:00

115 Commits

Author SHA1 Message Date
d76e72532a Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00
9aaaa62461 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-21 13:10:52 +02:00
a7fc5a246c Colorize output
Some checks failed
continuous-integration/drone/push Build is failing
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
7fb8acb866 Add colorama for colored output
Red for errors, yellow for warnings or information, and green for
fixes.
2021-02-21 13:00:31 +02:00
9f5d2c2c4f poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-15 15:13:12 +02:00
202abf140c CHANGELOG.md: Add note about poetry
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-04 21:48:12 +02:00
0cd6d3dfe6 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-04 21:46:49 +02:00
a458beac55 poetry.lock: Run poetry update 2021-02-04 21:45:30 +02:00
e62ecb0a8f CHANGELOG.md: Add note about new date format 2021-02-04 21:43:44 +02:00
de92f32ab6 csv_metadata_quality/check.py: More date formats
We should also allow ISO 8601 extended in combined date and time
format. DSpace does not have a problem with dates in this format
and I have found some metadata that uses this date format.

For example: 2020-08-31T11:04:56Z

See: https://en.wikipedia.org/wiki/ISO_8601
2021-02-04 21:39:14 +02:00
dbbbc0944a README.md: Add handle to citation
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-27 10:33:37 +02:00
d17bf3033c README.md: Add citation 2021-01-27 10:32:26 +02:00
2ec52f1b73 README.md: Update description
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-26 15:43:41 +02:00
aa1abf15a7 README.md: Adjust title 2021-01-26 15:35:21 +02:00
cbf94490f2 Version 0.4.3 2021-01-26 15:22:40 +02:00
f3d0d5ef07 setup.py: Remove Python 3.6
I actually removed Python 3.6 support a few weeks ago after updating
to Pandas 1.2.0, but forgot to update this.
2021-01-26 15:22:08 +02:00
4b7b99c94c CHANGELOG.md: Add note about multi-value separators 2021-01-26 15:20:22 +02:00
df670e81b9 README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
ae357d8c6c Revert "Update requirements"
This reverts commit ca80340f7a.

Nope, we still need the --without-hashes because this still fails
on Python 3.7, but not 3.8 or 3.9. From looking around it seems
that nobody can agree whether poetry should handle this, pip should
handle it, or upstream projects should pin their dependencies.
2021-01-26 14:15:31 +02:00
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
cb07d357d4 Version 0.4.2 2020-07-06 14:04:34 +03:00
65cd48a26f CHANGELOG.md: Update changes 2020-07-06 14:00:21 +03:00
0f883f640c Remove pipenv 2020-07-06 13:59:49 +03:00
f4c5c5781e README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
6aa784ad8c Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-07-06 13:57:07 +03:00
7b8da94f41 poetry.lock: Update Python dependencies 2020-07-06 13:56:31 +03:00
2a1566af62 csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
5fcaa63bd5 csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
aa9e23b46c pyproject.toml: Update license specifier
We need to use valid SPDX license identifiers.
2020-06-09 14:22:53 +03:00
73acb1661f Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-05-31 17:51:16 +03:00
2a068fddc4 .build.yml: Fix test 2020-05-31 17:44:37 +03:00
c6c2f13e88 .build.yml: Fix poetry install invocation
Poetry apparently installs dev dependencies by default.
2020-05-31 17:37:09 +03:00
56f16e37ed .build.yml: Use poetry in SourceHut CI 2020-05-31 17:35:04 +03:00
0c44b967b6 Add poetry project file and lock
I want to try to use poetry instead of pipenv because pipenv takes
forever to do dependency resolution sometimes. Also, I have had a
few issues with Python modules like black that don't have releases
other than pre-releases, and even including the project itself in
the dependencies (pip install -e . ...?). My initial experience is
that poetry handles this better.
2020-05-31 17:33:40 +03:00
8a267bb40b .travis.yml: Try to build with Python 3.8-dev
But allow failures.
2020-03-29 16:40:11 +03:00
8fda8f1ef1 Pipfile.lock: Run pipenv update
All tests still passing.
2020-03-20 16:22:04 +02:00
5e471813e8 CHANGELOG.md: Add note about python dependencies 2020-01-29 12:41:43 +02:00
79244b9ac3 Pipfile.lock: Run pipenv update 2020-01-29 12:39:12 +02:00
5e81a33482 CHANGELOG.md: Add note about field names 2020-01-16 12:37:11 +02:00
28b5996aa6 Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
40ba9bae6c README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
0b2d211455 Version 0.4.1 2020-01-15 12:19:42 +02:00
7f1df0b47c Support Python 3.6 and 3.7 again 2020-01-15 12:19:17 +02:00
365ecda324 Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
550ce7fb7e .travis.yml: Only test Python 3.8
The Unicode normalization feature requires Python 3.8 because the
unicodedata.is_normalized() function only appears there. If I find
another way to check if a string is normalized without normalizing
it first I will drop the requirements back down to Python 3.6.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:57:21 +02:00
705127fd28 Version 0.4.0 2020-01-15 11:44:56 +02:00
894e0a196d setup.py: Change Python requirements
The `unicodedata.is_normalized()` function requires Python 3.8.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:43:25 +02:00
87181bc7b8 Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
8de5d862b6 CHANGELOG.md: Add note about Unicode normalization 2020-01-15 11:40:40 +02:00
49e3543878 Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
403b253762 CHANGELOG.md: Update python library versions 2020-01-15 10:58:44 +02:00
c5fbaf407a Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2020-01-15 10:51:58 +02:00
4f81f6c83c Pipfile.lock: Run pipenv update 2020-01-15 10:51:19 +02:00
4b9d1e060f setup.py: Add Python 3.8 classifier 2019-12-14 12:56:11 +02:00
c8a71e3143 Pipfile.lock: Run pipenv update 2019-12-14 12:53:39 +02:00
7964d98ca5 Pipfile: Specify exact version of black
Black only releases pre-release versions, which causes issues with
pipenv. Instead of always running pipenv with "--pre" and potenti-
ally letting in some other pre-release versions for other depende-
ncies, I would rather specify the latest black version explicitly.

See: https://github.com/psf/black/issues/517
See: https://github.com/microsoft/vscode-python/issues/5171
2019-12-14 12:41:28 +02:00
64ffc2f1da .travis.yml: Install packages from requirements.txt too 2019-11-14 23:42:28 +02:00
7b1bc29a92 .travis.yml: Try using pip instead of pipenv
The Pipfile knows it was created with Python 3.8, yet we're running
with multiple Python versions on Travis. I'm curious if would work
better to use pip to install dependencies instead of pipenv in this
case.
2019-11-14 23:37:25 +02:00
f0110d8e74 CHANGELOG.md: Add note about requirements 2019-11-14 23:30:26 +02:00
86498deee8 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-11-14 23:28:42 +02:00
251647a15f CHANGELOG.md: Add TravisCI changes 2019-11-14 23:24:08 +02:00
0bd28e22ec .travis.yml: Test Python 3.8 2019-11-14 23:22:37 +02:00
63fdce7d13 .travis.yml: Use Ubuntu 18.04 "Bionic" 2019-11-14 23:22:19 +02:00
f068c0e16a CHANGELOG.md: Use Python 3.8.0 for pipenv 2019-11-14 23:11:43 +02:00
79b8f62a85 Use Python 3.8 for pipenv
Python 3.8.0 entered Arch Linux core repositories now and all tests
pass with Python 3.8.0 so it's time...
2019-11-14 23:10:20 +02:00
6c1e132531 CHANGELOG.md: Add unreleased changes 2019-11-14 09:19:19 +02:00
c0f3c866bd Pipfile.lock: Run pipenv update
Updates the following dependencies:

- numpy 1.17.2→1.17.4
- pandas 0.25.1→0.25.3
- flake8 3.7.8→3.7.9
- pytest 5.1.3→5.2.2
- black 19.3b0→19.10b0
2019-11-14 09:17:31 +02:00
36d0474b95 CHANGELOG.md: Move unreleased changes to v0.3.1 2019-10-01 17:11:52 +03:00
efdc3a841a Version 0.3.1 2019-10-01 17:11:13 +03:00
fd2ba6845d CHANGELOG.md: Update unreleased notes 2019-10-01 17:10:23 +03:00
e55380b4d5 csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
85ae16d9b7 CHANGELOG.md: Add note about non-breaking spaces 2019-10-01 16:56:37 +03:00
c42f8b4812 csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal sp-
aces instead of removing them.
2019-10-01 16:55:04 +03:00
1c75608d54 README.md: Update introduction text
We should mention that this is not DSpace specific. Rather, it is
much more realistically Dublin Core specific.
2019-09-26 14:19:13 +03:00
0b15a8ed3b README.md: Remove TODO about lack of space after comma
This was added as an automatic global fix a few weeks ago.
2019-09-26 14:16:33 +03:00
9ca266f5f0 data/test.csv: Change birthdate column to dc.date.issued
More accurately reflects actual data we will be validating.
2019-09-26 14:15:48 +03:00
0d3f948708 CHANGELOG.md: Update comment about language validation 2019-09-26 14:14:57 +03:00
c04207fcfc CHANGELOG.md: Fix header formatting 2019-09-26 14:13:50 +03:00
9d4eceddc7 .build.yml: Enable experimental CLI checks on SourceHut 2019-09-26 14:11:35 +03:00
23 changed files with 1750 additions and 793 deletions

View File

@@ -1,19 +0,0 @@
image: archlinux
packages:
- python-pipenv
sources:
- https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
- setup: |
cd csv-metadata-quality
pipenv install --dev
- pytest: |
cd csv-metadata-quality
pipenv run pytest
- testcli: |
cd csv-metadata-quality
pipenv run pip install .
pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -u --agrovoc-fields dc.subject,cg.coverage.country
environment:
PIPENV_NOSPIN: 'True'
PIPENV_HIDE_EMOJIS: 'True'

49
.drone.yml Normal file
View File

@@ -0,0 +1,49 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
image: python:3.9-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
image: python:3.8-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
image: python:3.7-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
# vim: ts=2 sw=2 et

41
.github/workflows/python-app.yml vendored Normal file
View File

@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
- name: Test CLI
run: |
python setup.py install
csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country

View File

@@ -1,11 +0,0 @@
dist: xenial
language: python
python:
- "3.6"
- "3.7"
install:
- "pip install pipenv --upgrade-strategy=only-if-needed"
- "pipenv install --dev"
script: pytest
# vim: ts=2 sw=2 et

View File

@@ -4,14 +4,68 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.4.4] - 2021-02-21
### Added
- Accept dates formatted in ISO 8601 extended with combined date and time, for
example: 2020-08-31T11:04:56Z
- Colorized output: red for errors, yellow for warnings and information, green
for changes
### Updated
- Run `poetry update` to update project dependencies
## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
that only works in Python >= 3.8
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)
### Updated
- Update python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt
### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds
## [0.3.1] - 2019-10-01
## Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them
- Harmonize language of script output when fixing various issues
## [0.3.0] - 2019-09-26 ## [0.3.0] - 2019-09-26
### Updated ### Updated
- Update python dependencies to latest versions, including numpy 1.17.2, pandas - Update python dependencies to latest versions, including numpy 1.17.2, pandas
0.25.1, pytest 5.1.3, and requests-cache 0.5.2 0.25.1, pytest 5.1.3, and requests-cache 0.5.2
## Added ### Added
- csvkit to dev requirements (csvcut etc are useful during development) - csvkit to dev requirements (csvcut etc are useful during development)
- Experimental language validation using `-e` (see README.md) - Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)
### Changed ### Changed
- Re-formatted code with black and isort - Re-formatted code with black and isort

29
Pipfile
View File

@@ -1,29 +0,0 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
pytest = "*"
ipython = "*"
flake8 = "*"
pytest-clarity = "*"
black = "*"
isort = "*"
csvkit = "*"
[packages]
pandas = "*"
python-stdnum = "*"
xlrd = "*"
requests = "*"
requests-cache = "*"
pycountry = "*"
csv-metadata-quality = {editable = true,path = "."}
langid = "*"
[requires]
python_version = "3.7"
[pipenv]
allow_prereleases = true

555
Pipfile.lock generated
View File

@@ -1,555 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "59562d8c59eb09e23b49475d6901687edbf605f5b84e283e90cc8e2de518641f"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"certifi": {
"hashes": [
"sha256:e4f3620cfea4f83eedc95b24abd9cd56f3c4b146dd0177e83a21b4eb49e21e50",
"sha256:fd7c7c74727ddcf00e9acd26bba8da604ffec95bf1c2144e67aff7a8b50e6cef"
],
"version": "==2019.9.11"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"csv-metadata-quality": {
"editable": true,
"path": "."
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"langid": {
"hashes": [
"sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293"
],
"index": "pypi",
"version": "==1.1.6"
},
"numpy": {
"hashes": [
"sha256:05dbfe72684cc14b92568de1bc1f41e5f62b00f714afc9adee42f6311738091f",
"sha256:0d82cb7271a577529d07bbb05cb58675f2deb09772175fab96dc8de025d8ac05",
"sha256:10132aa1fef99adc85a905d82e8497a580f83739837d7cbd234649f2e9b9dc58",
"sha256:12322df2e21f033a60c80319c25011194cd2a21294cc66fee0908aeae2c27832",
"sha256:16f19b3aa775dddc9814e02a46b8e6ae6a54ed8cf143962b4e53f0471dbd7b16",
"sha256:3d0b0989dd2d066db006158de7220802899a1e5c8cf622abe2d0bd158fd01c2c",
"sha256:438a3f0e7b681642898fd7993d38e2bf140a2d1eafaf3e89bb626db7f50db355",
"sha256:5fd214f482ab53f2cea57414c5fb3e58895b17df6e6f5bca5be6a0bb6aea23bb",
"sha256:73615d3edc84dd7c4aeb212fa3748fb83217e00d201875a47327f55363cef2df",
"sha256:7bd355ad7496f4ce1d235e9814ec81ee3d28308d591c067ce92e49f745ba2c2f",
"sha256:7d077f2976b8f3de08a0dcf5d72083f4af5411e8fddacd662aae27baa2601196",
"sha256:a4092682778dc48093e8bda8d26ee8360153e2047826f95a3f5eae09f0ae3abf",
"sha256:b458de8624c9f6034af492372eb2fee41a8e605f03f4732f43fc099e227858b2",
"sha256:e70fc8ff03a961f13363c2c95ef8285e0cf6a720f8271836f852cc0fa64e97c8",
"sha256:ee8e9d7cad5fe6dde50ede0d2e978d81eafeaa6233fb0b8719f60214cf226578",
"sha256:f4a4f6aba148858a5a5d546a99280f71f5ee6ec8182a7d195af1a914195b21a2"
],
"version": "==1.17.2"
},
"pandas": {
"hashes": [
"sha256:18d91a9199d1dfaa01ad645f7540370ba630bdcef09daaf9edf45b4b1bca0232",
"sha256:3f26e5da310a0c0b83ea50da1fd397de2640b02b424aa69be7e0784228f656c9",
"sha256:4182e32f4456d2c64619e97c58571fa5ca0993d1e8c2d9ca44916185e1726e15",
"sha256:426e590e2eb0e60f765271d668a30cf38b582eaae5ec9b31229c8c3c10c5bc21",
"sha256:5eb934a8f0dc358f0e0cdf314072286bbac74e4c124b64371395e94644d5d919",
"sha256:717928808043d3ea55b9bcde636d4a52d2236c246f6df464163a66ff59980ad8",
"sha256:8145f97c5ed71827a6ec98ceaef35afed1377e2d19c4078f324d209ff253ecb5",
"sha256:8744c84c914dcc59cbbb2943b32b7664df1039d99e834e1034a3372acb89ea4d",
"sha256:c1ac1d9590d0c9314ebf01591bd40d4c03d710bfc84a3889e5263c97d7891dee",
"sha256:cb2e197b7b0687becb026b84d3c242482f20cbb29a9981e43604eb67576da9f6",
"sha256:d4001b71ad2c9b84ff18b182cea22b7b6cbf624216da3ea06fb7af28d1f93165",
"sha256:d8930772adccb2882989ab1493fa74bd87d47c8ac7417f5dd3dd834ba8c24dc9",
"sha256:dfbb0173ee2399bc4ed3caf2d236e5c0092f948aafd0a15fbe4a0e77ee61a958",
"sha256:eebfbba048f4fa8ac711b22c78516e16ff8117d05a580e7eeef6b0c2be554c18",
"sha256:f1b21bc5cf3dbea53d33615d1ead892dfdae9d7052fa8898083bec88be20dcd2"
],
"index": "pypi",
"version": "==0.25.1"
},
"pycountry": {
"hashes": [
"sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb"
],
"index": "pypi",
"version": "==19.8.18"
},
"python-dateutil": {
"hashes": [
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
],
"version": "==2.8.0"
},
"python-stdnum": {
"hashes": [
"sha256:d5f0af1bee9ddd9a20b398b46ce062dbd4d41fcc9646940f2667256a44df3854",
"sha256:f445ec32bf5246c90389204cabba465f494545371c29a83fa2d30e6c872a6763"
],
"index": "pypi",
"version": "==1.11"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"requests": {
"hashes": [
"sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4",
"sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31"
],
"index": "pypi",
"version": "==2.22.0"
},
"requests-cache": {
"hashes": [
"sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb",
"sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a"
],
"index": "pypi",
"version": "==0.5.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"urllib3": {
"hashes": [
"sha256:3de946ffbed6e6746608990594d08faac602528ac7015ac28d33cee6a45b7398",
"sha256:9a107b99a5393caf59c7aa3c1249c16e6879447533d0887f4336dde834c7be86"
],
"version": "==1.25.6"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
},
"develop": {
"agate": {
"hashes": [
"sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c",
"sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3"
],
"version": "==1.6.1"
},
"agate-dbf": {
"hashes": [
"sha256:00c93c498ec9a04cc587bf63dd7340e67e2541f0df4c9a7259d7cb3dd4ce372f"
],
"version": "==0.2.1"
},
"agate-excel": {
"hashes": [
"sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52"
],
"version": "==0.2.3"
},
"agate-sql": {
"hashes": [
"sha256:9277490ba8b8e7c747a9ae3671f52fe486784b48d4a14e78ca197fb0e36f281b"
],
"version": "==0.5.4"
},
"appdirs": {
"hashes": [
"sha256:9e5896d1372858f8dd3344faf4e5014d21849c756c8d5701f78f8a103b372d92",
"sha256:d8b24664561d0d34ddfaec54636d502d7cea6e29c3eaf68f3df6180863e2166e"
],
"version": "==1.4.3"
},
"atomicwrites": {
"hashes": [
"sha256:03472c30eb2c5d1ba9227e4c2ca66ab8287fbfbbda3888aa93dc2e28fc6811b4",
"sha256:75a9445bac02d8d058d5e1fe689654ba5a6556a1dfd8ce6ec55a0ed79866cfa6"
],
"version": "==1.3.0"
},
"attrs": {
"hashes": [
"sha256:69c0dbf2ed392de1cb5ec704444b08a5ef81680a61cb899dc08127123af36a79",
"sha256:f0b870f674851ecbfbbbd364d6b5cbdff9dcedbc7f3f5e18a6891057f21fe399"
],
"version": "==19.1.0"
},
"babel": {
"hashes": [
"sha256:af92e6106cb7c55286b25b38ad7695f8b4efb36a90ba483d7f7a6628c46158ab",
"sha256:e86135ae101e31e2c8ec20a4e0c5220f4eed12487d5cf3f78be7e98d3a57fc28"
],
"version": "==2.7.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"black": {
"hashes": [
"sha256:09a9dcb7c46ed496a9850b76e4e825d6049ecd38b611f1224857a79bd985a8cf",
"sha256:68950ffd4d9169716bcb8719a56c07a2f4485354fec061cdd5910aa07369731c"
],
"index": "pypi",
"version": "==19.3b0"
},
"click": {
"hashes": [
"sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13",
"sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7"
],
"version": "==7.0"
},
"csvkit": {
"hashes": [
"sha256:1353a383531bee191820edfb88418c13dfe1cdfa9dd3dc46f431c05cd2a260a0"
],
"index": "pypi",
"version": "==1.0.4"
},
"dbfread": {
"hashes": [
"sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d",
"sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec"
],
"version": "==2.0.7"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"et-xmlfile": {
"hashes": [
"sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b"
],
"version": "==1.0.1"
},
"flake8": {
"hashes": [
"sha256:19241c1cbc971b9962473e4438a2ca19749a7dd002dd1a946eaba171b4114548",
"sha256:8e9dfa3cecb2400b3738a42c54c3043e821682b9c840b0448c0503f781130696"
],
"index": "pypi",
"version": "==3.7.8"
},
"future": {
"hashes": [
"sha256:67045236dcfd6816dc439556d009594abf643e5eb48992e36beac09c2ca659b8"
],
"version": "==0.17.1"
},
"importlib-metadata": {
"hashes": [
"sha256:aa18d7378b00b40847790e7c27e11673d7fed219354109d0e7b9e5b25dc3ad26",
"sha256:d5f18a79777f3aa179c145737780282e27b508fc8fd688cb17c7a813e8bd39af"
],
"markers": "python_version < '3.8'",
"version": "==0.23"
},
"ipython": {
"hashes": [
"sha256:c4ab005921641e40a68e405e286e7a1fcc464497e14d81b6914b4fd95e5dee9b",
"sha256:dd76831f065f17bddd7eaa5c781f5ea32de5ef217592cf019e34043b56895aa1"
],
"index": "pypi",
"version": "==7.8.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"isodate": {
"hashes": [
"sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8",
"sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81"
],
"version": "==0.6.0"
},
"isort": {
"hashes": [
"sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1",
"sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd"
],
"index": "pypi",
"version": "==4.3.21"
},
"jdcal": {
"hashes": [
"sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba",
"sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8"
],
"version": "==1.4.1"
},
"jedi": {
"hashes": [
"sha256:786b6c3d80e2f06fd77162a07fed81b8baa22dde5d62896a790a331d6ac21a27",
"sha256:ba859c74fa3c966a22f2aeebe1b74ee27e2a462f56d3f5f7ca4a59af61bfe42e"
],
"version": "==0.15.1"
},
"leather": {
"hashes": [
"sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988",
"sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2"
],
"version": "==0.3.3"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"more-itertools": {
"hashes": [
"sha256:409cd48d4db7052af495b09dec721011634af3753ae1ef92d2b32f73a745f832",
"sha256:92b8c4b06dac4f0611c0729b2f2ede52b2e1bac1ab48f089c7ddc12e26bb60c4"
],
"version": "==7.2.0"
},
"openpyxl": {
"hashes": [
"sha256:340a1ab2069764559b9d58027a43a24db18db0e25deb80f81ecb8ca7ee5253db"
],
"version": "==3.0.0"
},
"packaging": {
"hashes": [
"sha256:28b924174df7a2fa32c1953825ff29c61e2f5e082343165438812f00d3a7fc47",
"sha256:d9551545c6d761f3def1677baf08ab2a3ca17c56879e70fecba2fc4dde4ed108"
],
"version": "==19.2"
},
"parsedatetime": {
"hashes": [
"sha256:3d817c58fb9570d1eec1dd46fa9448cd644eeed4fb612684b02dfda3a79cb84b",
"sha256:9ee3529454bf35c40a77115f5a596771e59e1aee8c53306f346c461b8e913094"
],
"version": "==2.4"
},
"parso": {
"hashes": [
"sha256:63854233e1fadb5da97f2744b6b24346d2750b85965e7e399bec1620232797dc",
"sha256:666b0ee4a7a1220f65d367617f2cd3ffddff3e205f3f16a0284df30e774c2a9c"
],
"version": "==0.5.1"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"pluggy": {
"hashes": [
"sha256:0db4b7601aae1d35b4a033282da476845aa19185c1e6964b25cf324b5e4ec3e6",
"sha256:fa5fa1622fa6dd5c030e9cad086fa19ef6a0cf6d7a2d12318e10cb49d6d68f34"
],
"version": "==0.13.0"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"py": {
"hashes": [
"sha256:64f65755aee5b381cea27766a3a147c3f15b9b6b9ac88676de66ba2ae36793fa",
"sha256:dc639b046a6e2cff5bbe40194ad65936d6ba360b52b3c3fe1d08a82dd50b5e53"
],
"version": "==1.8.0"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:71e430bc85c88a430f000ac1d9b331d2407f681d6f6aec95e8bcfbc3df5b0127",
"sha256:881c4c157e45f30af185c1ffe8d549d48ac9127433f2c380c24b84572ad66297"
],
"version": "==2.4.2"
},
"pyparsing": {
"hashes": [
"sha256:6f98a7b9397e206d78cc01df10131398f1c8b8510a2f4d97d9abd82e1aacdd80",
"sha256:d9338df12903bbf5d65a0e4e87c2161968b10d2e489652bb47001d82a9b028b4"
],
"version": "==2.4.2"
},
"pytest": {
"hashes": [
"sha256:813b99704b22c7d377bbd756ebe56c35252bb710937b46f207100e843440b3c2",
"sha256:cc6620b96bc667a0c8d4fa592a8c9c94178a1bd6cc799dbb057dfd9286d31a31"
],
"index": "pypi",
"version": "==5.1.3"
},
"pytest-clarity": {
"hashes": [
"sha256:3f40d5ae7cb21cc95e622fc4f50d9466f80ae0f91460225b8c95c07afbf93e20"
],
"index": "pypi",
"version": "==0.2.0a1"
},
"python-slugify": {
"hashes": [
"sha256:575d03256a132fc1efb4c52966c6eb11c57a13b071618f0b26076057a23f6937"
],
"version": "==3.0.4"
},
"pytimeparse": {
"hashes": [
"sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd",
"sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a"
],
"version": "==1.1.8"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"sqlalchemy": {
"hashes": [
"sha256:2f8ff566a4d3a92246d367f2e9cd6ed3edeef670dcd6dda6dfdc9efed88bcd80"
],
"version": "==1.3.8"
},
"termcolor": {
"hashes": [
"sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
],
"version": "==1.1.0"
},
"text-unidecode": {
"hashes": [
"sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8",
"sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93"
],
"version": "==1.3"
},
"toml": {
"hashes": [
"sha256:229f81c57791a41d65e399fc06bf0848bab550a9dfd5ed66df18ce5f05e73d5c",
"sha256:235682dd292d5899d361a811df37e04a8828a5b1da3115886b73cf81ebc9100e"
],
"version": "==0.10.0"
},
"traitlets": {
"hashes": [
"sha256:262089114405f22f4833be96b31e143ab906d7764a22c04c71fee0bbda4787ba",
"sha256:6ad5b30dacd5e2424c46cc94a0aeab990d98ae17d181acea2cc4272ac3409fca"
],
"version": "==4.3.3.dev0"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
},
"zipp": {
"hashes": [
"sha256:3718b1cbcd963c7d4c5511a8240812904164b7f381b647143a89d3b98f9bcd8e",
"sha256:f06903e9f1f43b12d371004b4ac7b06ab39a44adc747266928ae6debfa7b3335"
],
"version": "==0.6.0"
}
}
}

View File

@@ -1,7 +1,11 @@
# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?) # DSpace CSV Metadata Quality Checker ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc. A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, unnecessary Unicode, AGROVOC terms, etc.
Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested. Requires Python 3.7 or greater (3.8 recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
If you use the DSpace CSV metadata quality checker please cite:
*Orth, A. 2019. DSpace CSV metadata quality checker. Nairobi, Kenya: ILRI. https://hdl.handle.net/10568/110997.*
## Functionality ## Functionality
@@ -10,23 +14,24 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
- Experimental validation of titles and abstracts against item's Dublin Core language field - Experimental validation of titles and abstracts against item's Dublin Core language field
- Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option) - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
- Fix leading, trailing, and excessive (ie, more than one) whitespace - Fix leading, trailing, and excessive (ie, more than one) whitespace
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes` - Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
- Fix problematic newlines (line feeds) using `--unsafe-fixes` - Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt" - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Remove duplicate metadata values - Remove duplicate metadata values
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
## Installation ## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv): The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
``` ```
$ git clone https://github.com/ilri/csv-metadata-quality.git $ git clone https://github.com/ilri/csv-metadata-quality.git
$ cd csv-metadata-quality $ cd csv-metadata-quality
$ pipenv install $ poetry install
$ pipenv shell $ poetry shell
``` ```
Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment: Otherwise, if you don't have poetry, you can use a vanilla Python virtual environment:
``` ```
$ git clone https://github.com/ilri/csv-metadata-quality.git $ git clone https://github.com/ilri/csv-metadata-quality.git
@@ -55,9 +60,19 @@ You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currentl
### Invalid Multi-Value Separators ### Invalid Multi-Value Separators
This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`. This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
### Newlines ### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A). This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
### Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are nottechnically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
- `é` is the Unicode code points `U+0065` + `U+0301`
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
## AGROVOC Validation ## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields: You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields:
@@ -92,8 +107,10 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006 - Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
- Warn if two items use the same file in `filename` column - Warn if two items use the same file in `filename` column
- Add an option to drop invalid AGROVOC subjects? - Add an option to drop invalid AGROVOC subjects?
- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
- Add tests for application invocation, ie `tests/test_app.py`? - Add tests for application invocation, ie `tests/test_app.py`?
- Validate ISSNs or journal titles against CrossRef API?
- Better ISO 8601 date parsing (currently only supports simple dates, perhaps we need to use dateutil.parser.parseiso())
- Fix lazy date check (assumes field name has "date" but could be dcterms.issued etc!)
## License ## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html). This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

View File

@@ -4,6 +4,7 @@ import signal
import sys import sys
import pandas as pd import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental import csv_metadata_quality.experimental as experimental
@@ -21,7 +22,8 @@ def parse_args(argv):
parser.add_argument( parser.add_argument(
"--experimental-checks", "--experimental-checks",
"-e", "-e",
help="Enable experimental checks like language detection", action="store_true" help="Enable experimental checks like language detection",
action="store_true",
) )
parser.add_argument( parser.add_argument(
"--input-file", "--input-file",
@@ -76,12 +78,12 @@ def run(argv):
if column == exclude and skip is False: if column == exclude and skip is False:
skip = True skip = True
if skip: if skip:
print(f"Skipping {column}") print(f"{Fore.YELLOW}Skipping {Fore.RESET}{column}")
continue continue
# Fix: whitespace # Fix: whitespace
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: newlines # Fix: newlines
if args.unsafe_fixes: if args.unsafe_fixes:
@@ -94,23 +96,28 @@ def run(argv):
if match is not None: if match is not None:
df[column] = df[column].apply(fix.comma_space, field_name=column) df[column] = df[column].apply(fix.comma_space, field_name=column)
# Fix: perform Unicode normalization (NFC) to convert decomposed
# characters into their canonical forms.
if args.unsafe_fixes:
df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
# Fix: unnecessary Unicode # Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode) df[column] = df[column].apply(fix.unnecessary_unicode)
# Check: invalid multi-value separator # Check: invalid and unnecessary multi-value separators
df[column] = df[column].apply(check.separators) df[column] = df[column].apply(check.separators, field_name=column)
# Check: suspicious characters # Check: suspicious characters
df[column] = df[column].apply(check.suspicious_characters, field_name=column) df[column] = df[column].apply(check.suspicious_characters, field_name=column)
# Fix: invalid multi-value separator # Fix: invalid and unnecessary multi-value separators
if args.unsafe_fixes: if args.unsafe_fixes:
df[column] = df[column].apply(fix.separators) df[column] = df[column].apply(fix.separators, field_name=column)
# Run whitespace fix again after fixing invalid separators # Run whitespace fix again after fixing invalid separators
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: duplicate metadata values # Fix: duplicate metadata values
df[column] = df[column].apply(fix.duplicates) df[column] = df[column].apply(fix.duplicates, field_name=column)
# Check: invalid AGROVOC subject # Check: invalid AGROVOC subject
if args.agrovoc_fields: if args.agrovoc_fields:

View File

@@ -1,4 +1,10 @@
from datetime import datetime, timedelta
import pandas as pd import pandas as pd
import requests
import requests_cache
from colorama import Fore
from pycountry import languages
def issn(field): def issn(field):
@@ -21,7 +27,7 @@ def issn(field):
for value in field.split("||"): for value in field.split("||"):
if not issn.is_valid(value): if not issn.is_valid(value):
print(f"Invalid ISSN: {value}") print(f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}")
return field return field
@@ -46,13 +52,17 @@ def isbn(field):
for value in field.split("||"): for value in field.split("||"):
if not isbn.is_valid(value): if not isbn.is_valid(value):
print(f"Invalid ISBN: {value}") print(f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}")
return field return field
def separators(field): def separators(field, field_name):
"""Check for invalid multi-value separators (ie "|" or "|||"). """Check for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator. Prints the field with the invalid multi-value separator.
""" """
@@ -65,12 +75,22 @@ def separators(field):
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
# Check if the current value is blank
if value == "":
print(
f"{Fore.RED}Unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
)
continue
# After splitting, see if there are any remaining "|" characters # After splitting, see if there are any remaining "|" characters
match = re.findall(r"^.*?\|.*$", value) match = re.findall(r"^.*?\|.*$", value)
# Check if there was a match
if match: if match:
print(f"Invalid multi-value separator: {field}") print(
f"{Fore.RED}Invalid multi-value separator ({field_name}): {Fore.RESET}{field}"
)
return field return field
@@ -85,10 +105,9 @@ def date(field, field_name):
Prints the date if invalid. Prints the date if invalid.
""" """
from datetime import datetime
if pd.isna(field): if pd.isna(field):
print(f"Missing date ({field_name}).") print(f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}")
return return
@@ -97,7 +116,9 @@ def date(field, field_name):
# We don't allow multi-value date fields # We don't allow multi-value date fields
if len(multiple_dates) > 1: if len(multiple_dates) > 1:
print(f"Multiple dates not allowed ({field_name}): {field}") print(
f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{field}"
)
return field return field
@@ -123,7 +144,15 @@ def date(field, field_name):
return field return field
except ValueError: except ValueError:
print(f"Invalid date ({field_name}): {field}") pass
try:
# Check if date is valid YYYY-MM-DDTHH:MM:SSZ format
datetime.strptime(field, "%Y-%m-%dT%H:%M:%SZ")
return field
except ValueError:
print(f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{field}")
return field return field
@@ -156,9 +185,7 @@ def suspicious_characters(field, field_name):
# character and spanning enough of the rest to give a preview, # character and spanning enough of the rest to give a preview,
# but not too much to cause the line to break in terminals with # but not too much to cause the line to break in terminals with
# a default of 80 characters width. # a default of 80 characters width.
suspicious_character_msg = ( suspicious_character_msg = f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}{field_subset}"
f"Suspicious character ({field_name}): {field_subset}"
)
print(f"{suspicious_character_msg:1.80}") print(f"{suspicious_character_msg:1.80}")
return field return field
@@ -170,8 +197,6 @@ def language(field):
Prints the value if it is invalid. Prints the value if it is invalid.
""" """
from pycountry import languages
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -185,16 +210,16 @@ def language(field):
# can check it against ISO 639-1 or ISO 639-3 accordingly. # can check it against ISO 639-1 or ISO 639-3 accordingly.
if len(value) == 2: if len(value) == 2:
if not languages.get(alpha_2=value): if not languages.get(alpha_2=value):
print(f"Invalid ISO 639-1 language: {value}") print(f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}")
pass pass
elif len(value) == 3: elif len(value) == 3:
if not languages.get(alpha_3=value): if not languages.get(alpha_3=value):
print(f"Invalid ISO 639-3 language: {value}") print(f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}")
pass pass
else: else:
print(f"Invalid language: {value}") print(f"{Fore.RED}Invalid language: {Fore.RESET}{value}")
return field return field
@@ -213,37 +238,30 @@ def agrovoc(field, field_name):
Prints a warning if the value is invalid. Prints a warning if the value is invalid.
""" """
from datetime import timedelta
import requests
import requests_cache
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30)
requests_cache.install_cache("agrovoc-response-cache", expire_after=expire_after)
# prune old cache entries
requests_cache.core.remove_expired_responses()
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
request_url = ( request_url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"
f"http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}" request_params = {"query": value}
)
# enable transparent request cache with thirty days expiry request = requests.get(request_url, params=request_params)
expire_after = timedelta(days=30)
requests_cache.install_cache(
"agrovoc-response-cache", expire_after=expire_after
)
request = requests.get(request_url)
# prune old cache entries
requests_cache.core.remove_expired_responses()
if request.status_code == requests.codes.ok: if request.status_code == requests.codes.ok:
data = request.json() data = request.json()
# check if there are any results # check if there are any results
if len(data["results"]) == 0: if len(data["results"]) == 0:
print(f"Invalid AGROVOC ({field_name}): {value}") print(f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}")
return field return field
@@ -296,6 +314,6 @@ def filename_extension(field):
break break
if filename_extension_match is False: if filename_extension_match is False:
print(f"Filename with uncommon extension: {value}") print(f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}")
return field return field

View File

@@ -1,4 +1,5 @@
import pandas as pd import pandas as pd
from colorama import Fore
def correct_language(row): def correct_language(row):
@@ -10,10 +11,11 @@ def correct_language(row):
language and returns the value in the language field if it does match. language and returns the value in the language field if it does match.
""" """
from pycountry import languages
import langid
import re import re
import langid
from pycountry import languages
# Initialize some variables at global scope so that we can set them in the # Initialize some variables at global scope so that we can set them in the
# loop scope below and still be able to access them afterwards. # loop scope below and still be able to access them afterwards.
language = "" language = ""
@@ -83,12 +85,12 @@ def correct_language(row):
detected_language = languages.get(alpha_2=langid_classification[0]) detected_language = languages.get(alpha_2=langid_classification[0])
if len(language) == 2 and language != detected_language.alpha_2: if len(language) == 2 and language != detected_language.alpha_2:
print( print(
f"Possibly incorrect language {language} (detected {detected_language.alpha_2}): {title}" f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_2}): {Fore.RESET}{title}"
) )
elif len(language) == 3 and language != detected_language.alpha_3: elif len(language) == 3 and language != detected_language.alpha_3:
print( print(
f"Possibly incorrect language {language} (detected {detected_language.alpha_3}): {title}" f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_3}): {Fore.RESET}{title}"
) )
else: else:

View File

@@ -1,9 +1,13 @@
import re import re
from unicodedata import normalize
import pandas as pd import pandas as pd
from colorama import Fore
from csv_metadata_quality.util import is_nfc
def whitespace(field): def whitespace(field, field_name):
"""Fix whitespace issues. """Fix whitespace issues.
Return string with leading, trailing, and consecutive whitespace trimmed. Return string with leading, trailing, and consecutive whitespace trimmed.
@@ -26,7 +30,9 @@ def whitespace(field):
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f"Excessive whitespace: {value}") print(
f"{Fore.GREEN}Removing excessive whitespace ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, " ", value) value = re.sub(pattern, " ", value)
# Save cleaned value # Save cleaned value
@@ -38,8 +44,15 @@ def whitespace(field):
return new_field return new_field
def separators(field): def separators(field, field_name):
"""Fix for invalid multi-value separators (ie "|").""" """Fix for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
@@ -50,12 +63,22 @@ def separators(field):
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
# Check if the value is blank and skip it
if value == "":
print(
f"{Fore.GREEN}Fixing unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
)
continue
# After splitting, see if there are any remaining "|" characters # After splitting, see if there are any remaining "|" characters
pattern = re.compile(r"\|") pattern = re.compile(r"\|")
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f"Fixing invalid multi-value separator: {value}") print(
f"{Fore.RED}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, "||", value) value = re.sub(pattern, "||", value)
@@ -74,10 +97,10 @@ def unnecessary_unicode(field):
Removes unnecessary Unicode characters like: Removes unnecessary Unicode characters like:
- Zero-width space (U+200B) - Zero-width space (U+200B)
- Replacement character (U+FFFD) - Replacement character (U+FFFD)
- No-break space (U+00A0)
Replaces unnecessary Unicode characters like: Replaces unnecessary Unicode characters like:
- Soft hyphen (U+00AD) → hyphen - Soft hyphen (U+00AD) → hyphen
- No-break space (U+00A0) → space
Return string with characters removed or replaced. Return string with characters removed or replaced.
""" """
@@ -91,7 +114,7 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+200B): {field}") print(f"{Fore.GREEN}Removing unnecessary Unicode (U+200B): {Fore.RESET}{field}")
field = re.sub(pattern, "", field) field = re.sub(pattern, "", field)
# Check for replacement characters (U+FFFD) # Check for replacement characters (U+FFFD)
@@ -99,7 +122,7 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+FFFD): {field}") print(f"{Fore.GREEN}Removing unnecessary Unicode (U+FFFD): {Fore.RESET}{field}")
field = re.sub(pattern, "", field) field = re.sub(pattern, "", field)
# Check for no-break spaces (U+00A0) # Check for no-break spaces (U+00A0)
@@ -107,21 +130,25 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+00A0): {field}") print(
field = re.sub(pattern, "", field) f"{Fore.GREEN}Replacing unnecessary Unicode (U+00A0): {Fore.RESET}{field}"
)
field = re.sub(pattern, " ", field)
# Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen # Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen
pattern = re.compile(r"\u002D*?\u00AD") pattern = re.compile(r"\u002D*?\u00AD")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Replacing unnecessary Unicode (U+00AD): {field}") print(
f"{Fore.GREEN}Replacing unnecessary Unicode (U+00AD): {Fore.RESET}{field}"
)
field = re.sub(pattern, "-", field) field = re.sub(pattern, "-", field)
return field return field
def duplicates(field): def duplicates(field, field_name):
"""Remove duplicate metadata values.""" """Remove duplicate metadata values."""
# Skip fields with missing values # Skip fields with missing values
@@ -140,7 +167,9 @@ def duplicates(field):
if value not in new_values: if value not in new_values:
new_values.append(value) new_values.append(value)
else: else:
print(f"Dropping duplicate value: {value}") print(
f"{Fore.GREEN}Removing duplicate value ({field_name}): {Fore.RESET}{value}"
)
# Create a new field consisting of all values joined with "||" # Create a new field consisting of all values joined with "||"
new_field = "||".join(new_values) new_field = "||".join(new_values)
@@ -173,7 +202,7 @@ def newlines(field):
match = re.findall(r"\n", field) match = re.findall(r"\n", field)
if match: if match:
print(f"Removing newline: {field}") print(f"{Fore.GREEN}Removing newline: {Fore.RESET}{field}")
field = field.replace("\n", "") field = field.replace("\n", "")
return field return field
@@ -197,7 +226,30 @@ def comma_space(field, field_name):
match = re.findall(r",\w", field) match = re.findall(r",\w", field)
if match: if match:
print(f"Adding space after comma ({field_name}): {field}") print(
f"{Fore.GREEN}Adding space after comma ({field_name}): {Fore.RESET}{field}"
)
field = re.sub(r",(\w)", r", \1", field) field = re.sub(r",(\w)", r", \1", field)
return field return field
def normalize_unicode(field, field_name):
"""Fix occurrences of decomposed Unicode characters by normalizing them
with NFC to their canonical forms, for example:
Ouédraogo, Mathieu → Ouédraogo, Mathieu
Return normalized string.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_nfc(field):
print(f"{Fore.GREEN}Normalizing Unicode ({field_name}): {Fore.RESET}{field}")
field = normalize("NFC", field)
return field

View File

@@ -0,0 +1,14 @@
def is_nfc(field):
"""Utility function to check whether a string is using normalized Unicode.
Python's built-in unicodedata library has the is_normalized() function, but
it was only introduced in Python 3.8. By using a simple utility function we
are able to run on Python >= 3.6 again.
See: https://docs.python.org/3/library/unicodedata.html
Return boolean.
"""
from unicodedata import normalize
return field == normalize("NFC", field)

View File

@@ -1 +1 @@
VERSION = "0.3.0" VERSION = "0.4.4"

View File

@@ -1,4 +1,4 @@
dc.title,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename dc.title,dc.date.issued,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
Leading space,2019-07-29,,,,,, Leading space,2019-07-29,,,,,,
Trailing space ,2019-07-29,,,,,, Trailing space ,2019-07-29,,,,,,
Excessive space,2019-07-29,,,,,, Excessive space,2019-07-29,,,,,,
@@ -26,3 +26,6 @@ Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
"Missing space,after comma",2019-08-27,,,,,, "Missing space,after comma",2019-08-27,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,, Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,, Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
Composéd Unicode,2020-01-14,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,
Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,
1 dc.title birthdate dc.date.issued dc.identifier.issn dc.identifier.isbn dc.language.iso dc.subject cg.coverage.country filename
2 Leading space 2019-07-29
3 Trailing space 2019-07-29
4 Excessive space 2019-07-29
26 Incorrect ISO 639-1 language 2019-09-26 es
27 Incorrect ISO 639-3 language 2019-09-26 spa
28 Composéd Unicode 2020-01-14
29 Decomposéd Unicode 2020-01-14
30 Unnecessary multi-value separator 2021-01-03 0378-5955||
31

1185
poetry.lock generated Normal file

File diff suppressed because it is too large Load Diff

32
pyproject.toml Normal file
View File

@@ -0,0 +1,32 @@
[tool.poetry]
name = "csv-metadata-quality"
version = "0.4.4"
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
authors = ["Alan Orth <alan.orth@gmail.com>"]
license="GPL-3.0-only"
repository = "https://github.com/ilri/csv-metadata-quality"
homepage = "https://github.com/ilri/csv-metadata-quality"
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.0.4"
python-stdnum = "^1.13"
xlrd = "^1.2.0"
requests = "^2.23.0"
requests-cache = "^0.5.2"
pycountry = "^19.8.18"
langid = "^1.1.6"
colorama = "^0.4.4"
[tool.poetry.dev-dependencies]
pytest = "^6.1.1"
ipython = { version = "^7.18.1", python = "^3.7" }
flake8 = "^3.8.4"
pytest-clarity = "^0.3.0-alpha.0"
black = "20.8b1"
isort = "^5.5.4"
csvkit = "^1.0.5"
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

View File

@@ -1,5 +1,5 @@
[pytest] [pytest]
addopts= -rsxX -s -v --strict --capture=sys addopts= -rsxX -s -v --strict-markers --capture=sys
filterwarnings = filterwarnings =
error::UserWarning error::UserWarning
ignore:.*U.* is deprecated:DeprecationWarning ignore:.*U.* is deprecated:DeprecationWarning

View File

@@ -1,57 +1,71 @@
-i https://pypi.org/simple agate-dbf==0.2.2
agate-dbf==0.2.1
agate-excel==0.2.3 agate-excel==0.2.3
agate-sql==0.5.4 agate-sql==0.5.5
agate==1.6.1 agate==1.6.1
appdirs==1.4.3 appdirs==1.4.4; python_version >= "3.6"
atomicwrites==1.3.0 appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
attrs==19.1.0 atomicwrites==1.4.0; python_version >= "3.6" and python_full_version < "3.0.0" and sys_platform == "win32" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") or sys_platform == "win32" and python_version >= "3.6" and python_full_version >= "3.4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
babel==2.7.0 attrs==20.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
backcall==0.1.0 babel==2.9.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
black==19.3b0 backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
click==7.0 black==20.8b1; python_version >= "3.6"
csvkit==1.0.4 certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
csvkit==1.0.5
dbfread==2.0.7 dbfread==2.0.7
decorator==4.4.0 decorator==4.4.2; python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "4.0" or python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.2.0"
entrypoints==0.3 et-xmlfile==1.0.1; python_version >= "3.6"
et-xmlfile==1.0.1 flake8==3.8.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
flake8==3.7.8 idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
future==0.17.1 iniconfig==1.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
importlib-metadata==0.23 ; python_version < '3.8' ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
ipython-genutils==0.2.0 ipython==7.20.0; python_version >= "3.7" and python_version < "4.0"
ipython==7.8.0
isodate==0.6.0 isodate==0.6.0
isort==4.3.21 isort==5.7.0; python_version >= "3.6" and python_version < "4.0"
jdcal==1.4.1 jdcal==1.4.1; python_version >= "3.6"
jedi==0.15.1 jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
langid==1.1.6
leather==0.3.3 leather==0.3.3
mccabe==0.6.1 mccabe==0.6.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
more-itertools==7.2.0 mypy-extensions==0.4.3; python_version >= "3.6"
openpyxl==3.0.0 numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
packaging==19.2 openpyxl==3.0.6; python_version >= "3.6"
parsedatetime==2.4 packaging==20.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
parso==0.5.1 pandas==1.2.2; python_full_version >= "3.7.1"
pexpect==4.7.0 ; sys_platform != 'win32' parsedatetime==2.6
pickleshare==0.7.5 parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
pluggy==0.13.0 pathspec==0.8.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
prompt-toolkit==2.0.9 pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
ptyprocess==0.6.0 pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
py==1.8.0 pluggy==0.13.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pycodestyle==2.5.0 prompt-toolkit==3.0.16; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
pyflakes==2.1.1 ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
pygments==2.4.2 py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyparsing==2.4.2 pycodestyle==2.6.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
pytest-clarity==0.2.0a1 pycountry==19.8.18
pytest==5.1.3 pyflakes==2.2.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
python-slugify==3.0.4 pygments==2.8.0; python_version >= "3.7" and python_version < "4.0"
pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pytest-clarity==0.3.0a0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
pytest==6.2.2; python_version >= "3.6"
python-dateutil==2.8.1; python_full_version >= "3.7.1"
python-slugify==4.0.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
python-stdnum==1.16
pytimeparse==1.1.8 pytimeparse==1.1.8
pytz==2019.2 pytz==2021.1; python_full_version >= "3.7.1"
six==1.12.0 regex==2020.11.13; python_version >= "3.6"
sqlalchemy==1.3.8 requests-cache==0.5.2
termcolor==1.1.0 requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
text-unidecode==1.3 six==1.15.0; python_full_version >= "3.7.1"
toml==0.10.0 sqlalchemy==1.3.23; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
traitlets==4.3.3.dev0 termcolor==1.1.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
wcwidth==0.1.7 text-unidecode==1.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
xlrd==1.2.0 toml==0.10.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
zipp==0.6.0 traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
typed-ast==1.4.2; python_version >= "3.6"
typing-extensions==3.7.4.3; python_version >= "3.6"
urllib3==1.26.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")

View File

@@ -1,17 +1,16 @@
-i https://pypi.org/simple certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
-e . chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
certifi==2019.9.11 colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
chardet==3.0.4 idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
idna==2.8
langid==1.1.6 langid==1.1.6
numpy==1.17.2 numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
pandas==0.25.1 pandas==1.2.2; python_full_version >= "3.7.1"
pycountry==19.8.18 pycountry==19.8.18
python-dateutil==2.8.0 python-dateutil==2.8.1; python_full_version >= "3.7.1"
python-stdnum==1.11 python-stdnum==1.16
pytz==2019.2 pytz==2021.1; python_full_version >= "3.7.1"
requests-cache==0.5.2 requests-cache==0.5.2
requests==2.22.0 requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
six==1.12.0 six==1.15.0; python_full_version >= "3.7.1"
urllib3==1.25.6 urllib3==1.26.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
xlrd==1.2.0 xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")

View File

@@ -4,17 +4,17 @@ with open("README.md", "r") as fh:
long_description = fh.read() long_description = fh.read()
install_requires = [ install_requires = [
'pandas', "pandas",
'python-stdnum', "python-stdnum",
'requests', "requests",
'requests-cache', "requests-cache",
'pycountry', "pycountry",
'langid' "langid",
] ]
setuptools.setup( setuptools.setup(
name="csv-metadata-quality", name="csv-metadata-quality",
version="0.3.0", version="0.4.3",
author="Alan Orth", author="Alan Orth",
author_email="aorth@mjanja.ch", author_email="aorth@mjanja.ch",
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.", description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@@ -23,17 +23,16 @@ setuptools.setup(
long_description_content_type="text/markdown", long_description_content_type="text/markdown",
url="https://github.com/alanorth/csv-metadata-quality", url="https://github.com/alanorth/csv-metadata-quality",
classifiers=[ classifiers=[
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent", "Operating System :: OS Independent",
"Development Status :: 4 - Beta" "Development Status :: 4 - Beta",
], ],
packages=['csv_metadata_quality'], packages=["csv_metadata_quality"],
entry_points={ entry_points={
'console_scripts': [ "console_scripts": ["csv-metadata-quality = csv_metadata_quality.__main__:main"]
'csv-metadata-quality = csv_metadata_quality.__main__:main'
]
}, },
install_requires=install_requires install_requires=install_requires,
) )

View File

@@ -1,6 +1,8 @@
import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental import csv_metadata_quality.experimental as experimental
import pandas as pd
def test_check_invalid_issn(capsys): def test_check_invalid_issn(capsys):
@@ -11,7 +13,7 @@ def test_check_invalid_issn(capsys):
check.issn(value) check.issn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISSN: {value}\n" assert captured.out == f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}\n"
def test_check_valid_issn(): def test_check_valid_issn():
@@ -32,7 +34,7 @@ def test_check_invalid_isbn(capsys):
check.isbn(value) check.isbn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISBN: {value}\n" assert captured.out == f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}\n"
def test_check_valid_isbn(): def test_check_valid_isbn():
@@ -50,10 +52,31 @@ def test_check_invalid_separators(capsys):
value = "Alan|Orth" value = "Alan|Orth"
check.separators(value) field_name = "dc.contributor.author"
check.separators(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid multi-value separator: {value}\n" assert (
captured.out
== f"{Fore.RED}Invalid multi-value separator ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_unnecessary_separators(capsys):
"""Test checking unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
check.separators(field, field_name)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.RED}Unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}\n"
)
def test_check_valid_separators(): def test_check_valid_separators():
@@ -61,7 +84,9 @@ def test_check_valid_separators():
value = "Alan||Orth" value = "Alan||Orth"
result = check.separators(value) field_name = "dc.contributor.author"
result = check.separators(value, field_name)
assert result == value assert result == value
@@ -76,7 +101,7 @@ def test_check_missing_date(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Missing date ({field_name}).\n" assert captured.out == f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}\n"
def test_check_multiple_dates(capsys): def test_check_multiple_dates(capsys):
@@ -89,7 +114,10 @@ def test_check_multiple_dates(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Multiple dates not allowed ({field_name}): {value}\n" assert (
captured.out
== f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_invalid_date(capsys): def test_check_invalid_date(capsys):
@@ -102,7 +130,9 @@ def test_check_invalid_date(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid date ({field_name}): {value}\n" assert (
captured.out == f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_date(): def test_check_valid_date():
@@ -127,7 +157,10 @@ def test_check_suspicious_characters(capsys):
check.suspicious_characters(value, field_name) check.suspicious_characters(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Suspicious character ({field_name}): ˆt\n" assert (
captured.out
== f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}ˆt\n"
)
def test_check_valid_iso639_1_language(): def test_check_valid_iso639_1_language():
@@ -158,7 +191,9 @@ def test_check_invalid_iso639_1_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISO 639-1 language: {value}\n" assert (
captured.out == f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_iso639_3_language(capsys): def test_check_invalid_iso639_3_language(capsys):
@@ -169,7 +204,9 @@ def test_check_invalid_iso639_3_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISO 639-3 language: {value}\n" assert (
captured.out == f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_language(capsys): def test_check_invalid_language(capsys):
@@ -180,7 +217,7 @@ def test_check_invalid_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid language: {value}\n" assert captured.out == f"{Fore.RED}Invalid language: {Fore.RESET}{value}\n"
def test_check_invalid_agrovoc(capsys): def test_check_invalid_agrovoc(capsys):
@@ -192,7 +229,10 @@ def test_check_invalid_agrovoc(capsys):
check.agrovoc(value, field_name) check.agrovoc(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid AGROVOC ({field_name}): {value}\n" assert (
captured.out
== f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_agrovoc(): def test_check_valid_agrovoc():
@@ -214,7 +254,10 @@ def test_check_uncommon_filename_extension(capsys):
check.filename_extension(value) check.filename_extension(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Filename with uncommon extension: {value}\n" assert (
captured.out
== f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}\n"
)
def test_check_common_filename_extension(): def test_check_common_filename_extension():
@@ -242,7 +285,7 @@ def test_check_incorrect_iso_639_1_language(capsys):
captured = capsys.readouterr() captured = capsys.readouterr()
assert ( assert (
captured.out captured.out
== f"Possibly incorrect language {language} (detected en): {title}\n" == f"{Fore.YELLOW}Possibly incorrect language {language} (detected en): {Fore.RESET}{title}\n"
) )
@@ -261,7 +304,7 @@ def test_check_incorrect_iso_639_3_language(capsys):
captured = capsys.readouterr() captured = capsys.readouterr()
assert ( assert (
captured.out captured.out
== f"Possibly incorrect language {language} (detected eng): {title}\n" == f"{Fore.YELLOW}Possibly incorrect language {language} (detected eng): {Fore.RESET}{title}\n"
) )

View File

@@ -6,7 +6,9 @@ def test_fix_leading_whitespace():
value = " Alan" value = " Alan"
assert fix.whitespace(value) == "Alan" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_trailing_whitespace(): def test_fix_trailing_whitespace():
@@ -14,7 +16,9 @@ def test_fix_trailing_whitespace():
value = "Alan " value = "Alan "
assert fix.whitespace(value) == "Alan" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_excessive_whitespace(): def test_fix_excessive_whitespace():
@@ -22,7 +26,9 @@ def test_fix_excessive_whitespace():
value = "Alan Orth" value = "Alan Orth"
assert fix.whitespace(value) == "Alan Orth" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan Orth"
def test_fix_invalid_separators(): def test_fix_invalid_separators():
@@ -30,7 +36,19 @@ def test_fix_invalid_separators():
value = "Alan|Orth" value = "Alan|Orth"
assert fix.separators(value) == "Alan||Orth" field_name = "dc.contributor.author"
assert fix.separators(value, field_name) == "Alan||Orth"
def test_fix_unnecessary_separators():
"""Test fixing unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
assert fix.separators(field, field_name) == "Alan||Orth"
def test_fix_unnecessary_unicode(): def test_fix_unnecessary_unicode():
@@ -46,7 +64,9 @@ def test_fix_duplicates():
value = "Kenya||Kenya" value = "Kenya||Kenya"
assert fix.duplicates(value) == "Kenya" field_name = "dc.contributor.author"
assert fix.duplicates(value, field_name) == "Kenya"
def test_fix_newlines(): def test_fix_newlines():
@@ -66,3 +86,25 @@ def test_fix_comma_space():
field_name = "dc.contributor.author" field_name = "dc.contributor.author"
assert fix.comma_space(value, field_name) == "Orth, Alan S." assert fix.comma_space(value, field_name) == "Orth, Alan S."
def test_fix_normalized_unicode():
"""Test fixing a string that is already in its normalized (NFC) Unicode form."""
# string using the normalized canonical form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_decomposed_unicode():
"""Test fixing a string that contains Unicode string."""
# string using the decomposed form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"