1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2026-01-08 05:13:31 +01:00

83 Commits

Author SHA1 Message Date
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
cb07d357d4 Version 0.4.2 2020-07-06 14:04:34 +03:00
65cd48a26f CHANGELOG.md: Update changes 2020-07-06 14:00:21 +03:00
0f883f640c Remove pipenv 2020-07-06 13:59:49 +03:00
f4c5c5781e README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
6aa784ad8c Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-07-06 13:57:07 +03:00
7b8da94f41 poetry.lock: Update Python dependencies 2020-07-06 13:56:31 +03:00
2a1566af62 csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
5fcaa63bd5 csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
aa9e23b46c pyproject.toml: Update license specifier
We need to use valid SPDX license identifiers.
2020-06-09 14:22:53 +03:00
73acb1661f Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-05-31 17:51:16 +03:00
2a068fddc4 .build.yml: Fix test 2020-05-31 17:44:37 +03:00
c6c2f13e88 .build.yml: Fix poetry install invocation
Poetry apparently installs dev dependencies by default.
2020-05-31 17:37:09 +03:00
56f16e37ed .build.yml: Use poetry in SourceHut CI 2020-05-31 17:35:04 +03:00
0c44b967b6 Add poetry project file and lock
I want to try to use poetry instead of pipenv because pipenv takes
forever to do dependency resolution sometimes. Also, I have had a
few issues with Python modules like black that don't have releases
other than pre-releases, and even including the project itself in
the dependencies (pip install -e . ...?). My initial experience is
that poetry handles this better.
2020-05-31 17:33:40 +03:00
8a267bb40b .travis.yml: Try to build with Python 3.8-dev
But allow failures.
2020-03-29 16:40:11 +03:00
8fda8f1ef1 Pipfile.lock: Run pipenv update
All tests still passing.
2020-03-20 16:22:04 +02:00
5e471813e8 CHANGELOG.md: Add note about python dependencies 2020-01-29 12:41:43 +02:00
79244b9ac3 Pipfile.lock: Run pipenv update 2020-01-29 12:39:12 +02:00
5e81a33482 CHANGELOG.md: Add note about field names 2020-01-16 12:37:11 +02:00
28b5996aa6 Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
40ba9bae6c README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
0b2d211455 Version 0.4.1 2020-01-15 12:19:42 +02:00
7f1df0b47c Support Python 3.6 and 3.7 again 2020-01-15 12:19:17 +02:00
365ecda324 Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
550ce7fb7e .travis.yml: Only test Python 3.8
The Unicode normalization feature requires Python 3.8 because the
unicodedata.is_normalized() function only appears there. If I find
another way to check if a string is normalized without normalizing
it first I will drop the requirements back down to Python 3.6.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:57:21 +02:00
705127fd28 Version 0.4.0 2020-01-15 11:44:56 +02:00
894e0a196d setup.py: Change Python requirements
The `unicodedata.is_normalized()` function requires Python 3.8.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:43:25 +02:00
87181bc7b8 Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
8de5d862b6 CHANGELOG.md: Add note about Unicode normalization 2020-01-15 11:40:40 +02:00
49e3543878 Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
403b253762 CHANGELOG.md: Update python library versions 2020-01-15 10:58:44 +02:00
c5fbaf407a Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2020-01-15 10:51:58 +02:00
4f81f6c83c Pipfile.lock: Run pipenv update 2020-01-15 10:51:19 +02:00
4b9d1e060f setup.py: Add Python 3.8 classifier 2019-12-14 12:56:11 +02:00
c8a71e3143 Pipfile.lock: Run pipenv update 2019-12-14 12:53:39 +02:00
7964d98ca5 Pipfile: Specify exact version of black
Black only releases pre-release versions, which causes issues with
pipenv. Instead of always running pipenv with "--pre" and potenti-
ally letting in some other pre-release versions for other depende-
ncies, I would rather specify the latest black version explicitly.

See: https://github.com/psf/black/issues/517
See: https://github.com/microsoft/vscode-python/issues/5171
2019-12-14 12:41:28 +02:00
64ffc2f1da .travis.yml: Install packages from requirements.txt too 2019-11-14 23:42:28 +02:00
7b1bc29a92 .travis.yml: Try using pip instead of pipenv
The Pipfile knows it was created with Python 3.8, yet we're running
with multiple Python versions on Travis. I'm curious if would work
better to use pip to install dependencies instead of pipenv in this
case.
2019-11-14 23:37:25 +02:00
f0110d8e74 CHANGELOG.md: Add note about requirements 2019-11-14 23:30:26 +02:00
86498deee8 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-11-14 23:28:42 +02:00
251647a15f CHANGELOG.md: Add TravisCI changes 2019-11-14 23:24:08 +02:00
0bd28e22ec .travis.yml: Test Python 3.8 2019-11-14 23:22:37 +02:00
63fdce7d13 .travis.yml: Use Ubuntu 18.04 "Bionic" 2019-11-14 23:22:19 +02:00
f068c0e16a CHANGELOG.md: Use Python 3.8.0 for pipenv 2019-11-14 23:11:43 +02:00
79b8f62a85 Use Python 3.8 for pipenv
Python 3.8.0 entered Arch Linux core repositories now and all tests
pass with Python 3.8.0 so it's time...
2019-11-14 23:10:20 +02:00
6c1e132531 CHANGELOG.md: Add unreleased changes 2019-11-14 09:19:19 +02:00
c0f3c866bd Pipfile.lock: Run pipenv update
Updates the following dependencies:

- numpy 1.17.2→1.17.4
- pandas 0.25.1→0.25.3
- flake8 3.7.8→3.7.9
- pytest 5.1.3→5.2.2
- black 19.3b0→19.10b0
2019-11-14 09:17:31 +02:00
22 changed files with 2022 additions and 757 deletions

View File

@@ -1,19 +0,0 @@
image: archlinux
packages:
- python-pipenv
sources:
- https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
- setup: |
cd csv-metadata-quality
pipenv install --dev
- pytest: |
cd csv-metadata-quality
pipenv run pytest
- testcli: |
cd csv-metadata-quality
pipenv run pip install .
pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
environment:
PIPENV_NOSPIN: 'True'
PIPENV_HIDE_EMOJIS: 'True'

49
.drone.yml Normal file
View File

@@ -0,0 +1,49 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
image: python:3.9-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
image: python:3.8-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
image: python:3.7-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
# vim: ts=2 sw=2 et

41
.github/workflows/python-app.yml vendored Normal file
View File

@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
- name: Test CLI
run: |
python setup.py install
csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country

View File

@@ -1,11 +0,0 @@
dist: xenial
language: python
python:
- "3.6"
- "3.7"
install:
- "pip install pipenv --upgrade-strategy=only-if-needed"
- "pipenv install --dev"
script: pytest
# vim: ts=2 sw=2 et

View File

@@ -4,6 +4,43 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
that only works in Python >= 3.8
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)
### Updated
- Update python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt
### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds
## [0.3.1] - 2019-10-01
## Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them

29
Pipfile
View File

@@ -1,29 +0,0 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
pytest = "*"
ipython = "*"
flake8 = "*"
pytest-clarity = "*"
black = "*"
isort = "*"
csvkit = "*"
[packages]
pandas = "*"
python-stdnum = "*"
xlrd = "*"
requests = "*"
requests-cache = "*"
pycountry = "*"
csv-metadata-quality = {editable = true,path = "."}
langid = "*"
[requires]
python_version = "3.7"
[pipenv]
allow_prereleases = true

555
Pipfile.lock generated
View File

@@ -1,555 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "59562d8c59eb09e23b49475d6901687edbf605f5b84e283e90cc8e2de518641f"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"certifi": {
"hashes": [
"sha256:e4f3620cfea4f83eedc95b24abd9cd56f3c4b146dd0177e83a21b4eb49e21e50",
"sha256:fd7c7c74727ddcf00e9acd26bba8da604ffec95bf1c2144e67aff7a8b50e6cef"
],
"version": "==2019.9.11"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"csv-metadata-quality": {
"editable": true,
"path": "."
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"langid": {
"hashes": [
"sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293"
],
"index": "pypi",
"version": "==1.1.6"
},
"numpy": {
"hashes": [
"sha256:05dbfe72684cc14b92568de1bc1f41e5f62b00f714afc9adee42f6311738091f",
"sha256:0d82cb7271a577529d07bbb05cb58675f2deb09772175fab96dc8de025d8ac05",
"sha256:10132aa1fef99adc85a905d82e8497a580f83739837d7cbd234649f2e9b9dc58",
"sha256:12322df2e21f033a60c80319c25011194cd2a21294cc66fee0908aeae2c27832",
"sha256:16f19b3aa775dddc9814e02a46b8e6ae6a54ed8cf143962b4e53f0471dbd7b16",
"sha256:3d0b0989dd2d066db006158de7220802899a1e5c8cf622abe2d0bd158fd01c2c",
"sha256:438a3f0e7b681642898fd7993d38e2bf140a2d1eafaf3e89bb626db7f50db355",
"sha256:5fd214f482ab53f2cea57414c5fb3e58895b17df6e6f5bca5be6a0bb6aea23bb",
"sha256:73615d3edc84dd7c4aeb212fa3748fb83217e00d201875a47327f55363cef2df",
"sha256:7bd355ad7496f4ce1d235e9814ec81ee3d28308d591c067ce92e49f745ba2c2f",
"sha256:7d077f2976b8f3de08a0dcf5d72083f4af5411e8fddacd662aae27baa2601196",
"sha256:a4092682778dc48093e8bda8d26ee8360153e2047826f95a3f5eae09f0ae3abf",
"sha256:b458de8624c9f6034af492372eb2fee41a8e605f03f4732f43fc099e227858b2",
"sha256:e70fc8ff03a961f13363c2c95ef8285e0cf6a720f8271836f852cc0fa64e97c8",
"sha256:ee8e9d7cad5fe6dde50ede0d2e978d81eafeaa6233fb0b8719f60214cf226578",
"sha256:f4a4f6aba148858a5a5d546a99280f71f5ee6ec8182a7d195af1a914195b21a2"
],
"version": "==1.17.2"
},
"pandas": {
"hashes": [
"sha256:18d91a9199d1dfaa01ad645f7540370ba630bdcef09daaf9edf45b4b1bca0232",
"sha256:3f26e5da310a0c0b83ea50da1fd397de2640b02b424aa69be7e0784228f656c9",
"sha256:4182e32f4456d2c64619e97c58571fa5ca0993d1e8c2d9ca44916185e1726e15",
"sha256:426e590e2eb0e60f765271d668a30cf38b582eaae5ec9b31229c8c3c10c5bc21",
"sha256:5eb934a8f0dc358f0e0cdf314072286bbac74e4c124b64371395e94644d5d919",
"sha256:717928808043d3ea55b9bcde636d4a52d2236c246f6df464163a66ff59980ad8",
"sha256:8145f97c5ed71827a6ec98ceaef35afed1377e2d19c4078f324d209ff253ecb5",
"sha256:8744c84c914dcc59cbbb2943b32b7664df1039d99e834e1034a3372acb89ea4d",
"sha256:c1ac1d9590d0c9314ebf01591bd40d4c03d710bfc84a3889e5263c97d7891dee",
"sha256:cb2e197b7b0687becb026b84d3c242482f20cbb29a9981e43604eb67576da9f6",
"sha256:d4001b71ad2c9b84ff18b182cea22b7b6cbf624216da3ea06fb7af28d1f93165",
"sha256:d8930772adccb2882989ab1493fa74bd87d47c8ac7417f5dd3dd834ba8c24dc9",
"sha256:dfbb0173ee2399bc4ed3caf2d236e5c0092f948aafd0a15fbe4a0e77ee61a958",
"sha256:eebfbba048f4fa8ac711b22c78516e16ff8117d05a580e7eeef6b0c2be554c18",
"sha256:f1b21bc5cf3dbea53d33615d1ead892dfdae9d7052fa8898083bec88be20dcd2"
],
"index": "pypi",
"version": "==0.25.1"
},
"pycountry": {
"hashes": [
"sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb"
],
"index": "pypi",
"version": "==19.8.18"
},
"python-dateutil": {
"hashes": [
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
],
"version": "==2.8.0"
},
"python-stdnum": {
"hashes": [
"sha256:d5f0af1bee9ddd9a20b398b46ce062dbd4d41fcc9646940f2667256a44df3854",
"sha256:f445ec32bf5246c90389204cabba465f494545371c29a83fa2d30e6c872a6763"
],
"index": "pypi",
"version": "==1.11"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"requests": {
"hashes": [
"sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4",
"sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31"
],
"index": "pypi",
"version": "==2.22.0"
},
"requests-cache": {
"hashes": [
"sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb",
"sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a"
],
"index": "pypi",
"version": "==0.5.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"urllib3": {
"hashes": [
"sha256:3de946ffbed6e6746608990594d08faac602528ac7015ac28d33cee6a45b7398",
"sha256:9a107b99a5393caf59c7aa3c1249c16e6879447533d0887f4336dde834c7be86"
],
"version": "==1.25.6"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
},
"develop": {
"agate": {
"hashes": [
"sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c",
"sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3"
],
"version": "==1.6.1"
},
"agate-dbf": {
"hashes": [
"sha256:00c93c498ec9a04cc587bf63dd7340e67e2541f0df4c9a7259d7cb3dd4ce372f"
],
"version": "==0.2.1"
},
"agate-excel": {
"hashes": [
"sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52"
],
"version": "==0.2.3"
},
"agate-sql": {
"hashes": [
"sha256:9277490ba8b8e7c747a9ae3671f52fe486784b48d4a14e78ca197fb0e36f281b"
],
"version": "==0.5.4"
},
"appdirs": {
"hashes": [
"sha256:9e5896d1372858f8dd3344faf4e5014d21849c756c8d5701f78f8a103b372d92",
"sha256:d8b24664561d0d34ddfaec54636d502d7cea6e29c3eaf68f3df6180863e2166e"
],
"version": "==1.4.3"
},
"atomicwrites": {
"hashes": [
"sha256:03472c30eb2c5d1ba9227e4c2ca66ab8287fbfbbda3888aa93dc2e28fc6811b4",
"sha256:75a9445bac02d8d058d5e1fe689654ba5a6556a1dfd8ce6ec55a0ed79866cfa6"
],
"version": "==1.3.0"
},
"attrs": {
"hashes": [
"sha256:69c0dbf2ed392de1cb5ec704444b08a5ef81680a61cb899dc08127123af36a79",
"sha256:f0b870f674851ecbfbbbd364d6b5cbdff9dcedbc7f3f5e18a6891057f21fe399"
],
"version": "==19.1.0"
},
"babel": {
"hashes": [
"sha256:af92e6106cb7c55286b25b38ad7695f8b4efb36a90ba483d7f7a6628c46158ab",
"sha256:e86135ae101e31e2c8ec20a4e0c5220f4eed12487d5cf3f78be7e98d3a57fc28"
],
"version": "==2.7.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"black": {
"hashes": [
"sha256:09a9dcb7c46ed496a9850b76e4e825d6049ecd38b611f1224857a79bd985a8cf",
"sha256:68950ffd4d9169716bcb8719a56c07a2f4485354fec061cdd5910aa07369731c"
],
"index": "pypi",
"version": "==19.3b0"
},
"click": {
"hashes": [
"sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13",
"sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7"
],
"version": "==7.0"
},
"csvkit": {
"hashes": [
"sha256:1353a383531bee191820edfb88418c13dfe1cdfa9dd3dc46f431c05cd2a260a0"
],
"index": "pypi",
"version": "==1.0.4"
},
"dbfread": {
"hashes": [
"sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d",
"sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec"
],
"version": "==2.0.7"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"et-xmlfile": {
"hashes": [
"sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b"
],
"version": "==1.0.1"
},
"flake8": {
"hashes": [
"sha256:19241c1cbc971b9962473e4438a2ca19749a7dd002dd1a946eaba171b4114548",
"sha256:8e9dfa3cecb2400b3738a42c54c3043e821682b9c840b0448c0503f781130696"
],
"index": "pypi",
"version": "==3.7.8"
},
"future": {
"hashes": [
"sha256:67045236dcfd6816dc439556d009594abf643e5eb48992e36beac09c2ca659b8"
],
"version": "==0.17.1"
},
"importlib-metadata": {
"hashes": [
"sha256:aa18d7378b00b40847790e7c27e11673d7fed219354109d0e7b9e5b25dc3ad26",
"sha256:d5f18a79777f3aa179c145737780282e27b508fc8fd688cb17c7a813e8bd39af"
],
"markers": "python_version < '3.8'",
"version": "==0.23"
},
"ipython": {
"hashes": [
"sha256:c4ab005921641e40a68e405e286e7a1fcc464497e14d81b6914b4fd95e5dee9b",
"sha256:dd76831f065f17bddd7eaa5c781f5ea32de5ef217592cf019e34043b56895aa1"
],
"index": "pypi",
"version": "==7.8.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"isodate": {
"hashes": [
"sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8",
"sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81"
],
"version": "==0.6.0"
},
"isort": {
"hashes": [
"sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1",
"sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd"
],
"index": "pypi",
"version": "==4.3.21"
},
"jdcal": {
"hashes": [
"sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba",
"sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8"
],
"version": "==1.4.1"
},
"jedi": {
"hashes": [
"sha256:786b6c3d80e2f06fd77162a07fed81b8baa22dde5d62896a790a331d6ac21a27",
"sha256:ba859c74fa3c966a22f2aeebe1b74ee27e2a462f56d3f5f7ca4a59af61bfe42e"
],
"version": "==0.15.1"
},
"leather": {
"hashes": [
"sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988",
"sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2"
],
"version": "==0.3.3"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"more-itertools": {
"hashes": [
"sha256:409cd48d4db7052af495b09dec721011634af3753ae1ef92d2b32f73a745f832",
"sha256:92b8c4b06dac4f0611c0729b2f2ede52b2e1bac1ab48f089c7ddc12e26bb60c4"
],
"version": "==7.2.0"
},
"openpyxl": {
"hashes": [
"sha256:340a1ab2069764559b9d58027a43a24db18db0e25deb80f81ecb8ca7ee5253db"
],
"version": "==3.0.0"
},
"packaging": {
"hashes": [
"sha256:28b924174df7a2fa32c1953825ff29c61e2f5e082343165438812f00d3a7fc47",
"sha256:d9551545c6d761f3def1677baf08ab2a3ca17c56879e70fecba2fc4dde4ed108"
],
"version": "==19.2"
},
"parsedatetime": {
"hashes": [
"sha256:3d817c58fb9570d1eec1dd46fa9448cd644eeed4fb612684b02dfda3a79cb84b",
"sha256:9ee3529454bf35c40a77115f5a596771e59e1aee8c53306f346c461b8e913094"
],
"version": "==2.4"
},
"parso": {
"hashes": [
"sha256:63854233e1fadb5da97f2744b6b24346d2750b85965e7e399bec1620232797dc",
"sha256:666b0ee4a7a1220f65d367617f2cd3ffddff3e205f3f16a0284df30e774c2a9c"
],
"version": "==0.5.1"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"pluggy": {
"hashes": [
"sha256:0db4b7601aae1d35b4a033282da476845aa19185c1e6964b25cf324b5e4ec3e6",
"sha256:fa5fa1622fa6dd5c030e9cad086fa19ef6a0cf6d7a2d12318e10cb49d6d68f34"
],
"version": "==0.13.0"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"py": {
"hashes": [
"sha256:64f65755aee5b381cea27766a3a147c3f15b9b6b9ac88676de66ba2ae36793fa",
"sha256:dc639b046a6e2cff5bbe40194ad65936d6ba360b52b3c3fe1d08a82dd50b5e53"
],
"version": "==1.8.0"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:71e430bc85c88a430f000ac1d9b331d2407f681d6f6aec95e8bcfbc3df5b0127",
"sha256:881c4c157e45f30af185c1ffe8d549d48ac9127433f2c380c24b84572ad66297"
],
"version": "==2.4.2"
},
"pyparsing": {
"hashes": [
"sha256:6f98a7b9397e206d78cc01df10131398f1c8b8510a2f4d97d9abd82e1aacdd80",
"sha256:d9338df12903bbf5d65a0e4e87c2161968b10d2e489652bb47001d82a9b028b4"
],
"version": "==2.4.2"
},
"pytest": {
"hashes": [
"sha256:813b99704b22c7d377bbd756ebe56c35252bb710937b46f207100e843440b3c2",
"sha256:cc6620b96bc667a0c8d4fa592a8c9c94178a1bd6cc799dbb057dfd9286d31a31"
],
"index": "pypi",
"version": "==5.1.3"
},
"pytest-clarity": {
"hashes": [
"sha256:3f40d5ae7cb21cc95e622fc4f50d9466f80ae0f91460225b8c95c07afbf93e20"
],
"index": "pypi",
"version": "==0.2.0a1"
},
"python-slugify": {
"hashes": [
"sha256:575d03256a132fc1efb4c52966c6eb11c57a13b071618f0b26076057a23f6937"
],
"version": "==3.0.4"
},
"pytimeparse": {
"hashes": [
"sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd",
"sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a"
],
"version": "==1.1.8"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"sqlalchemy": {
"hashes": [
"sha256:2f8ff566a4d3a92246d367f2e9cd6ed3edeef670dcd6dda6dfdc9efed88bcd80"
],
"version": "==1.3.8"
},
"termcolor": {
"hashes": [
"sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
],
"version": "==1.1.0"
},
"text-unidecode": {
"hashes": [
"sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8",
"sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93"
],
"version": "==1.3"
},
"toml": {
"hashes": [
"sha256:229f81c57791a41d65e399fc06bf0848bab550a9dfd5ed66df18ce5f05e73d5c",
"sha256:235682dd292d5899d361a811df37e04a8828a5b1da3115886b73cf81ebc9100e"
],
"version": "==0.10.0"
},
"traitlets": {
"hashes": [
"sha256:262089114405f22f4833be96b31e143ab906d7764a22c04c71fee0bbda4787ba",
"sha256:6ad5b30dacd5e2424c46cc94a0aeab990d98ae17d181acea2cc4272ac3409fca"
],
"version": "==4.3.3.dev0"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
},
"zipp": {
"hashes": [
"sha256:3718b1cbcd963c7d4c5511a8240812904164b7f381b647143a89d3b98f9bcd8e",
"sha256:f06903e9f1f43b12d371004b4ac7b06ab39a44adc747266928ae6debfa7b3335"
],
"version": "==0.6.0"
}
}
}

View File

@@ -1,7 +1,7 @@
# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
# CSV Metadata Quality ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
Requires Python 3.7 or greater (3.8 recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
## Functionality
@@ -10,23 +10,24 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
- Experimental validation of titles and abstracts against item's Dublin Core language field
- Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
- Fix leading, trailing, and excessive (ie, more than one) whitespace
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
- Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
- Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Remove duplicate metadata values
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
$ cd csv-metadata-quality
$ pipenv install
$ pipenv shell
$ poetry install
$ poetry shell
```
Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
Otherwise, if you don't have poetry, you can use a vanilla Python virtual environment:
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
@@ -55,9 +56,19 @@ You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currentl
### Invalid Multi-Value Separators
This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
### Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are nottechnically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
- `é` is the Unicode code points `U+0065` + `U+0301`
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present AGROVOC. Enable this validation by specifying a comma-separated list of fields:
@@ -93,6 +104,9 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
- Warn if two items use the same file in `filename` column
- Add an option to drop invalid AGROVOC subjects?
- Add tests for application invocation, ie `tests/test_app.py`?
- Validate ISSNs or journal titles against CrossRef API?
- Better ISO 8601 date parsing (currently only supports simple dates, perhaps we need to use dateutil.parser.parseiso())
- Fix lazy date check (assumes field name has "date" but could be dcterms.issued etc!)
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

View File

@@ -21,7 +21,8 @@ def parse_args(argv):
parser.add_argument(
"--experimental-checks",
"-e",
help="Enable experimental checks like language detection", action="store_true"
help="Enable experimental checks like language detection",
action="store_true",
)
parser.add_argument(
"--input-file",
@@ -81,7 +82,7 @@ def run(argv):
continue
# Fix: whitespace
df[column] = df[column].apply(fix.whitespace)
df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: newlines
if args.unsafe_fixes:
@@ -94,23 +95,28 @@ def run(argv):
if match is not None:
df[column] = df[column].apply(fix.comma_space, field_name=column)
# Fix: perform Unicode normalization (NFC) to convert decomposed
# characters into their canonical forms.
if args.unsafe_fixes:
df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
# Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode)
# Check: invalid multi-value separator
df[column] = df[column].apply(check.separators)
# Check: invalid and unnecessary multi-value separators
df[column] = df[column].apply(check.separators, field_name=column)
# Check: suspicious characters
df[column] = df[column].apply(check.suspicious_characters, field_name=column)
# Fix: invalid multi-value separator
# Fix: invalid and unnecessary multi-value separators
if args.unsafe_fixes:
df[column] = df[column].apply(fix.separators)
df[column] = df[column].apply(fix.separators, field_name=column)
# Run whitespace fix again after fixing invalid separators
df[column] = df[column].apply(fix.whitespace)
df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: duplicate metadata values
df[column] = df[column].apply(fix.duplicates)
df[column] = df[column].apply(fix.duplicates, field_name=column)
# Check: invalid AGROVOC subject
if args.agrovoc_fields:

View File

@@ -1,4 +1,9 @@
from datetime import datetime, timedelta
import pandas as pd
import requests
import requests_cache
from pycountry import languages
def issn(field):
@@ -51,8 +56,12 @@ def isbn(field):
return field
def separators(field):
"""Check for invalid multi-value separators (ie "|" or "|||").
def separators(field, field_name):
"""Check for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
@@ -65,12 +74,18 @@ def separators(field):
# Try to split multi-value field on "||" separator
for value in field.split("||"):
# Check if the current value is blank
if value == "":
print(f"Unnecessary multi-value separator ({field_name}): {field}")
continue
# After splitting, see if there are any remaining "|" characters
match = re.findall(r"^.*?\|.*$", value)
# Check if there was a match
if match:
print(f"Invalid multi-value separator: {field}")
print(f"Invalid multi-value separator ({field_name}): {field}")
return field
@@ -85,7 +100,6 @@ def date(field, field_name):
Prints the date if invalid.
"""
from datetime import datetime
if pd.isna(field):
print(f"Missing date ({field_name}).")
@@ -170,8 +184,6 @@ def language(field):
Prints the value if it is invalid.
"""
from pycountry import languages
# Skip fields with missing values
if pd.isna(field):
return
@@ -213,30 +225,23 @@ def agrovoc(field, field_name):
Prints a warning if the value is invalid.
"""
from datetime import timedelta
import requests
import requests_cache
# Skip fields with missing values
if pd.isna(field):
return
# enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30)
requests_cache.install_cache("agrovoc-response-cache", expire_after=expire_after)
# prune old cache entries
requests_cache.core.remove_expired_responses()
# Try to split multi-value field on "||" separator
for value in field.split("||"):
request_url = (
f"http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}"
)
request_url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"
request_params = {"query": value}
# enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30)
requests_cache.install_cache(
"agrovoc-response-cache", expire_after=expire_after
)
request = requests.get(request_url)
# prune old cache entries
requests_cache.core.remove_expired_responses()
request = requests.get(request_url, params=request_params)
if request.status_code == requests.codes.ok:
data = request.json()

View File

@@ -1,9 +1,12 @@
import re
from unicodedata import normalize
import pandas as pd
from csv_metadata_quality.util import is_nfc
def whitespace(field):
def whitespace(field, field_name):
"""Fix whitespace issues.
Return string with leading, trailing, and consecutive whitespace trimmed.
@@ -26,7 +29,7 @@ def whitespace(field):
match = re.findall(pattern, value)
if match:
print(f"Removing excessive whitespace: {value}")
print(f"Removing excessive whitespace ({field_name}): {value}")
value = re.sub(pattern, " ", value)
# Save cleaned value
@@ -38,8 +41,15 @@ def whitespace(field):
return new_field
def separators(field):
"""Fix for invalid multi-value separators (ie "|")."""
def separators(field, field_name):
"""Fix for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
# Skip fields with missing values
if pd.isna(field):
@@ -50,12 +60,18 @@ def separators(field):
# Try to split multi-value field on "||" separator
for value in field.split("||"):
# Check if the value is blank and skip it
if value == "":
print(f"Fixing unnecessary multi-value separator ({field_name}): {field}")
continue
# After splitting, see if there are any remaining "|" characters
pattern = re.compile(r"\|")
match = re.findall(pattern, value)
if match:
print(f"Fixing invalid multi-value separator: {value}")
print(f"Fixing invalid multi-value separator ({field_name}): {value}")
value = re.sub(pattern, "||", value)
@@ -121,7 +137,7 @@ def unnecessary_unicode(field):
return field
def duplicates(field):
def duplicates(field, field_name):
"""Remove duplicate metadata values."""
# Skip fields with missing values
@@ -140,7 +156,7 @@ def duplicates(field):
if value not in new_values:
new_values.append(value)
else:
print(f"Removing duplicate value: {value}")
print(f"Removing duplicate value ({field_name}): {value}")
# Create a new field consisting of all values joined with "||"
new_field = "||".join(new_values)
@@ -201,3 +217,24 @@ def comma_space(field, field_name):
field = re.sub(r",(\w)", r", \1", field)
return field
def normalize_unicode(field, field_name):
"""Fix occurrences of decomposed Unicode characters by normalizing them
with NFC to their canonical forms, for example:
Ouédraogo, Mathieu → Ouédraogo, Mathieu
Return normalized string.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_nfc(field):
print(f"Normalizing Unicode ({field_name}): {field}")
field = normalize("NFC", field)
return field

View File

@@ -0,0 +1,14 @@
def is_nfc(field):
"""Utility function to check whether a string is using normalized Unicode.
Python's built-in unicodedata library has the is_normalized() function, but
it was only introduced in Python 3.8. By using a simple utility function we
are able to run on Python >= 3.6 again.
See: https://docs.python.org/3/library/unicodedata.html
Return boolean.
"""
from unicodedata import normalize
return field == normalize("NFC", field)

View File

@@ -1 +1 @@
VERSION = "0.3.1"
VERSION = "0.4.2"

View File

@@ -26,3 +26,6 @@ Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
"Missing space,after comma",2019-08-27,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
Composéd Unicode,2020-01-14,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,
Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,
1 dc.title dc.date.issued dc.identifier.issn dc.identifier.isbn dc.language.iso dc.subject cg.coverage.country filename
26 Incorrect ISO 639-1 language 2019-09-26 es
27 Incorrect ISO 639-3 language 2019-09-26 spa
28 Composéd Unicode 2020-01-14
29 Decomposéd Unicode 2020-01-14
30 Unnecessary multi-value separator 2021-01-03 0378-5955||
31

1211
poetry.lock generated Normal file

File diff suppressed because it is too large Load Diff

31
pyproject.toml Normal file
View File

@@ -0,0 +1,31 @@
[tool.poetry]
name = "csv-metadata-quality"
version = "0.4.2"
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
authors = ["Alan Orth <alan.orth@gmail.com>"]
license="GPL-3.0-only"
repository = "https://github.com/ilri/csv-metadata-quality"
homepage = "https://github.com/ilri/csv-metadata-quality"
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.0.4"
python-stdnum = "^1.13"
xlrd = "^1.2.0"
requests = "^2.23.0"
requests-cache = "^0.5.2"
pycountry = "^19.8.18"
langid = "^1.1.6"
[tool.poetry.dev-dependencies]
pytest = "^6.1.1"
ipython = { version = "^7.18.1", python = "^3.7" }
flake8 = "^3.8.4"
pytest-clarity = "^0.3.0-alpha.0"
black = "20.8b1"
isort = "^5.5.4"
csvkit = "^1.0.5"
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

View File

@@ -1,5 +1,5 @@
[pytest]
addopts= -rsxX -s -v --strict --capture=sys
addopts= -rsxX -s -v --strict-markers --capture=sys
filterwarnings =
error::UserWarning
ignore:.*U.* is deprecated:DeprecationWarning

View File

@@ -1,57 +1,353 @@
-i https://pypi.org/simple
agate-dbf==0.2.1
agate-excel==0.2.3
agate-sql==0.5.4
agate==1.6.1
appdirs==1.4.3
atomicwrites==1.3.0
attrs==19.1.0
babel==2.7.0
backcall==0.1.0
black==19.3b0
click==7.0
csvkit==1.0.4
dbfread==2.0.7
decorator==4.4.0
entrypoints==0.3
et-xmlfile==1.0.1
flake8==3.7.8
future==0.17.1
importlib-metadata==0.23 ; python_version < '3.8'
ipython-genutils==0.2.0
ipython==7.8.0
isodate==0.6.0
isort==4.3.21
jdcal==1.4.1
jedi==0.15.1
leather==0.3.3
mccabe==0.6.1
more-itertools==7.2.0
openpyxl==3.0.0
packaging==19.2
parsedatetime==2.4
parso==0.5.1
pexpect==4.7.0 ; sys_platform != 'win32'
pickleshare==0.7.5
pluggy==0.13.0
prompt-toolkit==2.0.9
ptyprocess==0.6.0
py==1.8.0
pycodestyle==2.5.0
pyflakes==2.1.1
pygments==2.4.2
pyparsing==2.4.2
pytest-clarity==0.2.0a1
pytest==5.1.3
python-slugify==3.0.4
pytimeparse==1.1.8
pytz==2019.2
six==1.12.0
sqlalchemy==1.3.8
termcolor==1.1.0
text-unidecode==1.3
toml==0.10.0
traitlets==4.3.3.dev0
wcwidth==0.1.7
xlrd==1.2.0
zipp==0.6.0
agate==1.6.1 \
--hash=sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c \
--hash=sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3
agate-dbf==0.2.2 \
--hash=sha256:589682b78c5c03f2dc8511e6e3edb659fb7336cd118e248896bb0b44c2f1917b
agate-excel==0.2.3 \
--hash=sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52
agate-sql==0.5.5 \
--hash=sha256:50a39754babef6cd0d1b1e75763324a49593394fe46ab1ea9546791b5e6b69a7
appdirs==1.4.4 \
--hash=sha256:a841dacd6b99318a741b166adb07e19ee71a274450e68237b4650ca1055ab128 \
--hash=sha256:7d5d0167b2b1ba821647616af46a749d1c653740dd0d2415100fe26e27afdf41
appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin" \
--hash=sha256:93aa393e9d6c54c5cd570ccadd8edad61ea0c4b9ea7a01409020c9aa019eb442 \
--hash=sha256:dd83cd4b5b460958838f6eb3000c660b1f9caf2a5b1de4264e941512f603258a
atomicwrites==1.4.0; sys_platform == "win32" \
--hash=sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197 \
--hash=sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a
attrs==20.3.0 \
--hash=sha256:31b2eced602aa8423c2aea9c76a724617ed67cf9513173fd3a4f03e3a929c7e6 \
--hash=sha256:832aa3cde19744e49938b91fea06d69ecb9e649c93ba974535d08ad92164f700
babel==2.9.0 \
--hash=sha256:9d35c22fcc79893c3ecc85ac4a56cde1ecf3f19c540bba0922308a6c06ca6fa5 \
--hash=sha256:da031ab54472314f210b0adcff1588ee5d1d1d0ba4dbd07b94dba82bde791e05
backcall==0.2.0; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:fbbce6a29f263178a1f7915c1940bde0ec2b2a967566fe1c65c1dfb7422bd255 \
--hash=sha256:5cbdbf27be5e7cfadb448baf0aa95508f91f2bbc6c6437cd9cd06e2a4c215e1e
black==20.8b1 \
--hash=sha256:1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea
certifi==2020.12.5 \
--hash=sha256:719a74fb9e33b9bd44cc7f3a8d94bc35e4049deebe19ba7d8e108280cfd59830 \
--hash=sha256:1a4995114262bffbc2413b159f2a1a480c969de6e6eb13ee966d470af86af59c
chardet==4.0.0 \
--hash=sha256:f864054d66fd9118f2e67044ac8981a54775ec5b67aed0441892edb553d21da5 \
--hash=sha256:0d6f53a15db4120f2b08c94f11e7d93d2c911ee118b6b30a04ec3ee8310179fa
click==7.1.2 \
--hash=sha256:dacca89f4bfadd5de3d7489b7c8a566eee0d3676333fbb50030263894c38c0dc \
--hash=sha256:d2b5255c7c6349bc1bd1e59e08cd12acbbd63ce649f2588755783aa94dfb6b1a
colorama==0.4.4; python_version >= "3.7" and python_version < "4.0" and sys_platform == "win32" or sys_platform == "win32" \
--hash=sha256:9f47eda37229f68eee03b24b9748937c7dc3868f906e8ba69fbcbdd3bc5dc3e2 \
--hash=sha256:5941b2b48a20143d2267e95b1c2a7603ce057ee39fd88e7329b0c292aa16869b
csvkit==1.0.5 \
--hash=sha256:7bd390f4d300e45dc9ed67a32af762a916bae7d9a85087a10fd4f64ce65fd5b9
dbfread==2.0.7 \
--hash=sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec \
--hash=sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d
decorator==4.4.2; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:41fa54c2a0cc4ba648be4fd43cff00aedf5b9465c9bf18d64325bc225f08f760 \
--hash=sha256:e3a62f0520172440ca0dcc823749319382e377f37f140a0b99ef45fecb84bfe7
et-xmlfile==1.0.1 \
--hash=sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b
flake8==3.8.4 \
--hash=sha256:749dbbd6bfd0cf1318af27bf97a14e28e5ff548ef8e5b1566ccfb25a11e7c839 \
--hash=sha256:aadae8761ec651813c24be05c6f7b4680857ef6afaae4651a4eccaef97ce6c3b
idna==2.10 \
--hash=sha256:b97d804b1e9b523befed77c48dacec60e6dcb0b5391d57af6a65a312a90648c0 \
--hash=sha256:b307872f855b18632ce0c21c5e45be78c0ea7ae4c15c828c20788b26921eb3f6
iniconfig==1.1.1 \
--hash=sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3 \
--hash=sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32
ipython==7.19.0; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:c987e8178ced651532b3b1ff9965925bfd445c279239697052561a9ab806d28f \
--hash=sha256:cbb2ef3d5961d44e6a963b9817d4ea4e1fa2eb589c371a470fed14d8d40cbd6a
ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8 \
--hash=sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8
isodate==0.6.0 \
--hash=sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81 \
--hash=sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8
isort==5.7.0 \
--hash=sha256:fff4f0c04e1825522ce6949973e83110a6e907750cd92d128b0d14aaaadbffdc \
--hash=sha256:c729845434366216d320e936b8ad6f9d681aab72dc7cbc2d51bedc3582f3ad1e
jdcal==1.4.1 \
--hash=sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba \
--hash=sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8
jedi==0.18.0; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:18456d83f65f400ab0c2d3319e48520420ef43b23a086fdc05dff34132f0fb93 \
--hash=sha256:92550a404bad8afed881a137ec9a461fed49eca661414be45059329614ed0707
langid==1.1.6 \
--hash=sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293
leather==0.3.3 \
--hash=sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2 \
--hash=sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988
mccabe==0.6.1 \
--hash=sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42 \
--hash=sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f
mypy-extensions==0.4.3 \
--hash=sha256:090fedd75945a69ae91ce1303b5824f428daf5a028d2f6ab8a299250a846f15d \
--hash=sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8
numpy==1.19.5 \
--hash=sha256:cc6bd4fd593cb261332568485e20a0712883cf631f6f5e8e86a52caa8b2b50ff \
--hash=sha256:aeb9ed923be74e659984e321f609b9ba54a48354bfd168d21a2b072ed1e833ea \
--hash=sha256:8b5e972b43c8fc27d56550b4120fe6257fdc15f9301914380b27f74856299fea \
--hash=sha256:43d4c81d5ffdff6bae58d66a3cd7f54a7acd9a0e7b18d97abb255defc09e3140 \
--hash=sha256:a4646724fba402aa7504cd48b4b50e783296b5e10a524c7a6da62e4a8ac9698d \
--hash=sha256:2e55195bc1c6b705bfd8ad6f288b38b11b1af32f3c8289d6c50d47f950c12e76 \
--hash=sha256:39b70c19ec771805081578cc936bbe95336798b7edf4732ed102e7a43ec5c07a \
--hash=sha256:dbd18bcf4889b720ba13a27ec2f2aac1981bd41203b3a3b27ba7a33f88ae4827 \
--hash=sha256:603aa0706be710eea8884af807b1b3bc9fb2e49b9f4da439e76000f3b3c6ff0f \
--hash=sha256:cae865b1cae1ec2663d8ea56ef6ff185bad091a5e33ebbadd98de2cfa3fa668f \
--hash=sha256:36674959eed6957e61f11c912f71e78857a8d0604171dfd9ce9ad5cbf41c511c \
--hash=sha256:06fab248a088e439402141ea04f0fffb203723148f6ee791e9c75b3e9e82f080 \
--hash=sha256:6149a185cece5ee78d1d196938b2a8f9d09f5a5ebfbba66969302a778d5ddd1d \
--hash=sha256:50a4a0ad0111cc1b71fa32dedd05fa239f7fb5a43a40663269bb5dc7877cfd28 \
--hash=sha256:d051ec1c64b85ecc69531e1137bb9751c6830772ee5c1c426dbcfe98ef5788d7 \
--hash=sha256:a12ff4c8ddfee61f90a1633a4c4afd3f7bcb32b11c52026c92a12e1325922d0d \
--hash=sha256:cf2402002d3d9f91c8b01e66fbb436a4ed01c6498fffed0e4c7566da1d40ee1e \
--hash=sha256:1ded4fce9cfaaf24e7a0ab51b7a87be9038ea1ace7f34b841fe3b6894c721d1c \
--hash=sha256:012426a41bc9ab63bb158635aecccc7610e3eff5d31d1eb43bc099debc979d94 \
--hash=sha256:759e4095edc3c1b3ac031f34d9459fa781777a93ccc633a472a5468587a190ff \
--hash=sha256:a9d17f2be3b427fbb2bce61e596cf555d6f8a56c222bd2ca148baeeb5e5c783c \
--hash=sha256:99abf4f353c3d1a0c7a5f27699482c987cf663b1eac20db59b8c7b061eabd7fc \
--hash=sha256:384ec0463d1c2671170901994aeb6dce126de0a95ccc3976c43b0038a37329c2 \
--hash=sha256:811daee36a58dc79cf3d8bdd4a490e4277d0e4b7d103a001a4e73ddb48e7e6aa \
--hash=sha256:c843b3f50d1ab7361ca4f0b3639bf691569493a56808a0b0c54a051d260b7dbd \
--hash=sha256:d6631f2e867676b13026e2846180e2c13c1e11289d67da08d71cacb2cd93d4aa \
--hash=sha256:7fb43004bce0ca31d8f13a6eb5e943fa73371381e53f7074ed21a4cb786c32f8 \
--hash=sha256:2ea52bd92ab9f768cc64a4c3ef8f4b2580a17af0a5436f6126b08efbd1838371 \
--hash=sha256:400580cbd3cff6ffa6293df2278c75aef2d58d8d93d3c5614cd67981dae68ceb \
--hash=sha256:df609c82f18c5b9f6cb97271f03315ff0dbe481a2a02e56aeb1b1a985ce38e60 \
--hash=sha256:ab83f24d5c52d60dbc8cd0528759532736b56db58adaa7b5f1f76ad551416a1e \
--hash=sha256:0eef32ca3132a48e43f6a0f5a82cb508f22ce5a3d6f67a8329c81c8e226d3f6e \
--hash=sha256:a0d53e51a6cb6f0d9082decb7a4cb6dfb33055308c4c44f53103c073f649af73 \
--hash=sha256:a76f502430dd98d7546e1ea2250a7360c065a5fdea52b2dffe8ae7180909b6f4
openpyxl==3.0.6 \
--hash=sha256:1a4b3869c2500b5c713e8e28341cdada49ecfcff1b10cd9006945f5bcefc090d \
--hash=sha256:b229112b46e158b910a5d1b270b212c42773d39cab24e8db527f775b82afc041
packaging==20.8 \
--hash=sha256:24e0da08660a87484d1602c30bb4902d74816b6985b93de36926f5bc95741858 \
--hash=sha256:78598185a7008a470d64526a8059de9aaa449238f280fc9eb6b13ba6c4109093
pandas==1.2.1 \
--hash=sha256:50e6c0a17ef7f831b5565fd0394dbf9bfd5d615ee4dd4bb60a3d8c9d2e872323 \
--hash=sha256:324e60bea729cf3b55c1bf9e88fe8b9932c26f8669d13b928e3c96b3a1453dff \
--hash=sha256:37443199f451f8badfe0add666e43cdb817c59fa36bceedafd9c543a42f236ca \
--hash=sha256:23ac77a3a222d9304cb2a7934bb7b4805ff43d513add7a42d1a22dc7df14edd2 \
--hash=sha256:496fcc29321e9a804d56d5aa5d7ec1320edfd1898eee2f451aa70171cf1d5a29 \
--hash=sha256:30e9e8bc8c5c17c03d943e8d6f778313efff59e413b8dbdd8214c2ed9aa165f6 \
--hash=sha256:055647e7f4c5e66ba92c2a7dcae6c2c57898b605a3fb007745df61cc4015937f \
--hash=sha256:9d45f58b03af1fea4b48e44aa38a819a33dccb9821ef9e1d68f529995f8a632f \
--hash=sha256:b26e2dabda73d347c7af3e6fed58483161c7b87a886a4e06d76ccfe55a044aa9 \
--hash=sha256:47ec0808a8357ab3890ce0eca39a63f79dcf941e2e7f494470fe1c9ec43f6091 \
--hash=sha256:57d5c7ac62925a8d2ab43ea442b297a56cc8452015e71e24f4aa7e4ed6be3d77 \
--hash=sha256:d7cca42dba13bfee369e2944ae31f6549a55831cba3117e17636955176004088 \
--hash=sha256:cfd237865d878da9b65cfee883da5e0067f5e2ff839e459466fb90565a77bda3 \
--hash=sha256:050ed2c9d825ef36738e018454e6d055c63d947c1d52010fbadd7584f09df5db \
--hash=sha256:fe7de6fed43e7d086e3d947651ec89e55ddf00102f9dd5758763d56d182f0564 \
--hash=sha256:2de012a36cc507debd9c3351b4d757f828d5a784a5fc4e6766eafc2b56e4b0f5 \
--hash=sha256:5527c5475d955c0bc9689c56865aaa2a7b13c504d6c44f0aadbf57b565af5ebd
parsedatetime==2.6 \
--hash=sha256:cb96edd7016872f58479e35879294258c71437195760746faffedb692aef000b \
--hash=sha256:4cb368fbb18a0b7231f4d76119165451c8d2e35951455dfee97c62a87b04d455
parso==0.8.1; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:15b00182f472319383252c18d5913b69269590616c947747bc50bf4ac768f410 \
--hash=sha256:8519430ad07087d4c997fda3a7918f7cfa27cb58972a8c89c2a0295a1c940e9e
pathspec==0.8.1 \
--hash=sha256:aa0cb481c4041bf52ffa7b0d8fa6cd3e88a2ca4879c533c9153882ee2556790d \
--hash=sha256:86379d6b86d75816baba717e64b1a3a3469deb93bb76d613c9ce79edc5cb68fd
pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32" \
--hash=sha256:0b48a55dcb3c05f3329815901ea4fc1537514d6ba867a152b581d69ae3710937 \
--hash=sha256:fc65a43959d153d0114afe13997d439c22823a27cefceb5ff35c2178c6784c0c
pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56 \
--hash=sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca
pluggy==0.13.1 \
--hash=sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d \
--hash=sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0
prompt-toolkit==3.0.14; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:c96b30925025a7635471dc083ffb6af0cc67482a00611bd81aeaeeeb7e5a5e12 \
--hash=sha256:7e966747c18ececaec785699626b771c1ba8344c8d31759a1915d6b12fad6525
ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32" \
--hash=sha256:4b41f3967fce3af57cc7e94b888626c18bf37a083e3651ca8feeb66d492fef35 \
--hash=sha256:5c5d0a3b48ceee0b48485e0c26037c0acd7d29765ca3fbb5cb3831d347423220
py==1.10.0 \
--hash=sha256:3b80836aa6d1feeaa108e046da6423ab8f6ceda6468545ae8d02d9d58d18818a \
--hash=sha256:21b81bda15b66ef5e1a777a21c4dcd9c20ad3efd0b3f817e7a809035269e1bd3
pycodestyle==2.6.0 \
--hash=sha256:2295e7b2f6b5bd100585ebcb1f616591b652db8a741695b3d8f5d28bdc934367 \
--hash=sha256:c58a7d2815e0e8d7972bf1803331fb0152f867bd89adf8a01dfd55085434192e
pycountry==19.8.18 \
--hash=sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb
pyflakes==2.2.0 \
--hash=sha256:0d94e0e05a19e57a99444b6ddcf9a6eb2e5c68d3ca1e98e90707af8152c90a92 \
--hash=sha256:35b2d75ee967ea93b55750aa9edbbf72813e06a66ba54438df2cfac9e3c27fc8
pygments==2.7.4; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:bc9591213a8f0e0ca1a5e68a479b4887fdc3e75d0774e5c71c31920c427de435 \
--hash=sha256:df49d09b498e83c1a73128295860250b0b7edd4c723a32e9bc0d295c7c2ec337
pyparsing==2.4.7 \
--hash=sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b \
--hash=sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1
pytest==6.2.2 \
--hash=sha256:b574b57423e818210672e07ca1fa90aaf194a4f63f3ab909a2c67ebb22913839 \
--hash=sha256:9d1edf9e7d0b84d72ea3dbcdfd22b35fb543a5e8f2a60092dd578936bf63d7f9
pytest-clarity==0.3.0a0 \
--hash=sha256:5cc99e3d9b7969dfe17e5f6072d45a917c59d363b679686d3c958a1ded2e4dcf
python-dateutil==2.8.1 \
--hash=sha256:73ebfe9dbf22e832286dafa60473e4cd239f8592f699aa5adaf10050e6e1823c \
--hash=sha256:75bb3f31ea686f1197762692a9ee6a7550b59fc6ca3a1f4b5d7e32fb98e2da2a
python-slugify==4.0.1 \
--hash=sha256:69a517766e00c1268e5bbfc0d010a0a8508de0b18d30ad5a1ff357f8ae724270
python-stdnum==1.15 \
--hash=sha256:ff0e4d2b8731711cf2a73d22964139208088b4bd1a3e3b0c4f78d6e823dd8923 \
--hash=sha256:46c05dff2d5ff89b12cc63ab54cca1e4788e4e3087c163a1b7434a53fa926f28
pytimeparse==1.1.8 \
--hash=sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd \
--hash=sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a
pytz==2020.5 \
--hash=sha256:16962c5fb8db4a8f63a26646d8886e9d769b6c511543557bc84e9569fb9a9cb4 \
--hash=sha256:180befebb1927b16f6b57101720075a984c019ac16b1b7575673bea42c6c3da5
regex==2020.11.13 \
--hash=sha256:8b882a78c320478b12ff024e81dc7d43c1462aa4a3341c754ee65d857a521f85 \
--hash=sha256:a63f1a07932c9686d2d416fb295ec2c01ab246e89b4d58e5fa468089cab44b70 \
--hash=sha256:6e4b08c6f8daca7d8f07c8d24e4331ae7953333dbd09c648ed6ebd24db5a10ee \
--hash=sha256:bba349276b126947b014e50ab3316c027cac1495992f10e5682dc677b3dfa0c5 \
--hash=sha256:56e01daca75eae420bce184edd8bb341c8eebb19dd3bce7266332258f9fb9dd7 \
--hash=sha256:6a8ce43923c518c24a2579fda49f093f1397dad5d18346211e46f134fc624e31 \
--hash=sha256:1ab79fcb02b930de09c76d024d279686ec5d532eb814fd0ed1e0051eb8bd2daa \
--hash=sha256:9801c4c1d9ae6a70aeb2128e5b4b68c45d4f0af0d1535500884d644fa9b768c6 \
--hash=sha256:49cae022fa13f09be91b2c880e58e14b6da5d10639ed45ca69b85faf039f7a4e \
--hash=sha256:749078d1eb89484db5f34b4012092ad14b327944ee7f1c4f74d6279a6e4d1884 \
--hash=sha256:b2f4007bff007c96a173e24dcda236e5e83bde4358a557f9ccf5e014439eae4b \
--hash=sha256:38c8fd190db64f513fe4e1baa59fed086ae71fa45083b6936b52d34df8f86a88 \
--hash=sha256:5862975b45d451b6db51c2e654990c1820523a5b07100fc6903e9c86575202a0 \
--hash=sha256:262c6825b309e6485ec2493ffc7e62a13cf13fb2a8b6d212f72bd53ad34118f1 \
--hash=sha256:bafb01b4688833e099d79e7efd23f99172f501a15c44f21ea2118681473fdba0 \
--hash=sha256:e32f5f3d1b1c663af7f9c4c1e72e6ffe9a78c03a31e149259f531e0fed826512 \
--hash=sha256:3bddc701bdd1efa0d5264d2649588cbfda549b2899dc8d50417e47a82e1387ba \
--hash=sha256:02951b7dacb123d8ea6da44fe45ddd084aa6777d4b2454fa0da61d569c6fa538 \
--hash=sha256:0d08e71e70c0237883d0bef12cad5145b84c3705e9c6a588b2a9c7080e5af2a4 \
--hash=sha256:1fa7ee9c2a0e30405e21031d07d7ba8617bc590d391adfc2b7f1e8b99f46f444 \
--hash=sha256:baf378ba6151f6e272824b86a774326f692bc2ef4cc5ce8d5bc76e38c813a55f \
--hash=sha256:e3faaf10a0d1e8e23a9b51d1900b72e1635c2d5b0e1bea1c18022486a8e2e52d \
--hash=sha256:2a11a3e90bd9901d70a5b31d7dd85114755a581a5da3fc996abfefa48aee78af \
--hash=sha256:d1ebb090a426db66dd80df8ca85adc4abfcbad8a7c2e9a5ec7513ede522e0a8f \
--hash=sha256:b2b1a5ddae3677d89b686e5c625fc5547c6e492bd755b520de5332773a8af06b \
--hash=sha256:2c99e97d388cd0a8d30f7c514d67887d8021541b875baf09791a3baad48bb4f8 \
--hash=sha256:c084582d4215593f2f1d28b65d2a2f3aceff8342aa85afd7be23a9cad74a0de5 \
--hash=sha256:a3d748383762e56337c39ab35c6ed4deb88df5326f97a38946ddd19028ecce6b \
--hash=sha256:7913bd25f4ab274ba37bc97ad0e21c31004224ccb02765ad984eef43e04acc6c \
--hash=sha256:6c54ce4b5d61a7129bad5c5dc279e222afd00e721bf92f9ef09e4fae28755683 \
--hash=sha256:1862a9d9194fae76a7aaf0150d5f2a8ec1da89e8b55890b1786b8f88a0f619dc \
--hash=sha256:4902e6aa086cbb224241adbc2f06235927d5cdacffb2425c73e6570e8d862364 \
--hash=sha256:7a25fcbeae08f96a754b45bdc050e1fb94b95cab046bf56b016c25e9ab127b3e \
--hash=sha256:d2d8ce12b7c12c87e41123997ebaf1a5767a5be3ec545f64675388970f415e2e \
--hash=sha256:f7d29a6fc4760300f86ae329e3b6ca28ea9c20823df123a2ea8693e967b29917 \
--hash=sha256:717881211f46de3ab130b58ec0908267961fadc06e44f974466d1887f865bd5b \
--hash=sha256:3128e30d83f2e70b0bed9b2a34e92707d0877e460b402faca908c6667092ada9 \
--hash=sha256:8f6a2229e8ad946e36815f2a03386bb8353d4bde368fdf8ca5f0cb97264d3b5c \
--hash=sha256:f8f295db00ef5f8bae530fc39af0b40486ca6068733fb860b42115052206466f \
--hash=sha256:a15f64ae3a027b64496a71ab1f722355e570c3fac5ba2801cafce846bf5af01d \
--hash=sha256:83d6b356e116ca119db8e7c6fc2983289d87b27b3fac238cfe5dca529d884562
requests==2.25.1 \
--hash=sha256:c210084e36a42ae6b9219e00e48287def368a26d03a048ddad7bfee44f75871e \
--hash=sha256:27973dd4a904a4f13b263a19c866c13b92a39ed1c964655f025f3f8d3d75b804
requests-cache==0.5.2 \
--hash=sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb \
--hash=sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a
six==1.15.0 \
--hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced \
--hash=sha256:30639c035cdb23534cd4aa2dd52c3bf48f06e5f4a941509c8bafd8ce11080259
sqlalchemy==1.3.22 \
--hash=sha256:61628715931f4962e0cdb2a7c87ff39eea320d2aa96bd471a3c293d146f90394 \
--hash=sha256:81d8d099a49f83111cce55ec03cc87eef45eec0d90f9842b4fc674f860b857b0 \
--hash=sha256:d055ff750fcab69ca4e57b656d9c6ad33682e9b8d564f2fbe667ab95c63591b0 \
--hash=sha256:9bf572e4f5aa23f88dd902f10bb103cb5979022a38eec684bfa6d61851173fec \
--hash=sha256:7d4b8de6bb0bc736161cb0bbd95366b11b3eb24dd6b814a143d8375e75af9990 \
--hash=sha256:4a84c7c7658dd22a33dab2e2aa2d17c18cb004a42388246f2e87cb4085ef2811 \
--hash=sha256:f1e88b30da8163215eab643962ae9d9252e47b4ea53404f2c4f10f24e70ddc62 \
--hash=sha256:f115150cc4361dd46153302a640c7fa1804ac207f9cc356228248e351a8b4676 \
--hash=sha256:6aaa13ee40c4552d5f3a59f543f0db6e31712cc4009ec7385407be4627259d41 \
--hash=sha256:3ab5b44a07b8c562c6dcb7433c6a6c6e03266d19d64f87b3333eda34e3b9936b \
--hash=sha256:426ece890153ccc52cc5151a1a0ed540a5a7825414139bb4c95a868d8da54a52 \
--hash=sha256:bd4b1af45fd322dcd1fb2a9195b4f93f570d1a5902a842e3e6051385fac88f9c \
--hash=sha256:62285607a5264d1f91590abd874d6a498e229d5840669bd7d9f654cfaa599bd0 \
--hash=sha256:314f5042c0b047438e19401d5f29757a511cfc2f0c40d28047ca0e4c95eabb5b \
--hash=sha256:62fb881ba51dbacba9af9b779211cf9acff3442d4f2993142015b22b3cd1f92a \
--hash=sha256:bde677047305fe76c7ee3e4492b545e0018918e44141cc154fe39e124e433991 \
--hash=sha256:0c6406a78a714a540d980a680b86654feadb81c8d0eecb59f3d6c554a4c69f19 \
--hash=sha256:95bde07d19c146d608bccb9b16e144ec8f139bcfe7fd72331858698a71c9b4f5 \
--hash=sha256:888d5b4b5aeed0d3449de93ea80173653e939e916cc95fe8527079e50235c1d2 \
--hash=sha256:d53f59744b01f1440a1b0973ed2c3a7de204135c593299ee997828aad5191693 \
--hash=sha256:70121f0ae48b25ef3e56e477b88cd0b0af0e1f3a53b5554071aa6a93ef378a03 \
--hash=sha256:54da615e5b92c339e339fe8536cce99fe823b6ed505d4ea344852aefa1c205fb \
--hash=sha256:68428818cf80c60dc04aa0f38da20ad39b28aba4d4d199f949e7d6e04444ea86 \
--hash=sha256:17610d573e698bf395afbbff946544fbce7c5f4ee77b5bcb1f821b36345fae7a \
--hash=sha256:216ba5b4299c95ed179b58f298bda885a476b16288ab7243e89f29f6aeced7e0 \
--hash=sha256:0c72b90988be749e04eff0342dcc98c18a14461eb4b2ad59d611b57b31120f90 \
--hash=sha256:491fe48adc07d13e020a8b07ef82eefc227003a046809c121bea81d3dbf1832d \
--hash=sha256:f8191fef303025879e6c3548ecd8a95aafc0728c764ab72ec51a0bdf0c91a341 \
--hash=sha256:108580808803c7732f34798eb4a329d45b04c562ed83ee90f09f6a184a42b766 \
--hash=sha256:bab5a1e15b9466a25c96cda19139f3beb3e669794373b9ce28c4cf158c6e841d \
--hash=sha256:318b5b727e00662e5fc4b4cd2bf58a5116d7c1b4dd56ffaa7d68f43458a8d1ed \
--hash=sha256:1418f5e71d6081aa1095a1d6b567a562d2761996710bdce9b6e6ba20a03d0864 \
--hash=sha256:5a7f224cdb7233182cec2a45d4c633951268d6a9bcedac37abbf79dd07012aea \
--hash=sha256:715b34578cc740b743361f7c3e5f584b04b0f1344f45afc4e87fbac4802eb0a0 \
--hash=sha256:2ff132a379838b1abf83c065be54cef32b47c987aedd06b82fc76476c85225eb \
--hash=sha256:c389d7cc2b821853fb018c85457da3e7941db64f4387720a329bc7ff06a27963 \
--hash=sha256:04f995fcbf54e46cddeb4f75ce9dfc17075d6ae04ac23b2bacb44b3bc6f6bf11 \
--hash=sha256:758fc8c4d6c0336e617f9f6919f9daea3ab6bb9b07005eda9a1a682e24a6cacc
termcolor==1.1.0 \
--hash=sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b
text-unidecode==1.3 \
--hash=sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93 \
--hash=sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8
toml==0.10.2 \
--hash=sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b \
--hash=sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f
traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:69ff3f9d5351f31a7ad80443c2674b7099df13cc41fc5fa6e2f6d3b0330b0426 \
--hash=sha256:178f4ce988f69189f7e523337a3e11d91c786ded9360174a3d9ca83e79bc5396
typed-ast==1.4.2 \
--hash=sha256:7703620125e4fb79b64aa52427ec192822e9f45d37d4b6625ab37ef403e1df70 \
--hash=sha256:c9aadc4924d4b5799112837b226160428524a9a45f830e0d0f184b19e4090487 \
--hash=sha256:9ec45db0c766f196ae629e509f059ff05fc3148f9ffd28f3cfe75d4afb485412 \
--hash=sha256:85f95aa97a35bdb2f2f7d10ec5bbdac0aeb9dafdaf88e17492da0504de2e6400 \
--hash=sha256:9044ef2df88d7f33692ae3f18d3be63dec69c4fb1b5a4a9ac950f9b4ba571606 \
--hash=sha256:c1c876fd795b36126f773db9cbb393f19808edd2637e00fd6caba0e25f2c7b64 \
--hash=sha256:5dcfc2e264bd8a1db8b11a892bd1647154ce03eeba94b461effe68790d8b8e07 \
--hash=sha256:8db0e856712f79c45956da0c9a40ca4246abc3485ae0d7ecc86a20f5e4c09abc \
--hash=sha256:d003156bb6a59cda9050e983441b7fa2487f7800d76bdc065566b7d728b4581a \
--hash=sha256:4c790331247081ea7c632a76d5b2a265e6d325ecd3179d06e9cf8d46d90dd151 \
--hash=sha256:d175297e9533d8d37437abc14e8a83cbc68af93cc9c1c59c2c292ec59a0697a3 \
--hash=sha256:cf54cfa843f297991b7388c281cb3855d911137223c6b6d2dd82a47ae5125a41 \
--hash=sha256:b4fcdcfa302538f70929eb7b392f536a237cbe2ed9cba88e3bf5027b39f5f77f \
--hash=sha256:987f15737aba2ab5f3928c617ccf1ce412e2e321c77ab16ca5a293e7bbffd581 \
--hash=sha256:37f48d46d733d57cc70fd5f30572d11ab8ed92da6e6b28e024e4a3edfb456e37 \
--hash=sha256:36d829b31ab67d6fcb30e185ec996e1f72b892255a745d3a82138c97d21ed1cd \
--hash=sha256:8368f83e93c7156ccd40e49a783a6a6850ca25b556c0fa0240ed0f659d2fe496 \
--hash=sha256:963c80b583b0661918718b095e02303d8078950b26cc00b5e5ea9ababe0de1fc \
--hash=sha256:e683e409e5c45d5c9082dc1daf13f6374300806240719f95dc783d1fc942af10 \
--hash=sha256:84aa6223d71012c68d577c83f4e7db50d11d6b1399a9c779046d75e24bed74ea \
--hash=sha256:a38878a223bdd37c9709d07cd357bb79f4c760b29210e14ad0fb395294583787 \
--hash=sha256:a2c927c49f2029291fbabd673d51a2180038f8cd5a5b2f290f78c4516be48be2 \
--hash=sha256:c0c74e5579af4b977c8b932f40a5464764b2f86681327410aa028a22d2f54937 \
--hash=sha256:07d49388d5bf7e863f7fa2f124b1b1d89d8aa0e2f7812faff0a5658c01c59aa1 \
--hash=sha256:240296b27397e4e37874abb1df2a608a92df85cf3e2a04d0d4d61055c8305ba6 \
--hash=sha256:d746a437cdbca200622385305aedd9aef68e8a645e385cc483bdc5e488f07166 \
--hash=sha256:14bf1522cdee369e8f5581238edac09150c765ec1cb33615855889cf33dcb92d \
--hash=sha256:cc7b98bf58167b7f2db91a4327da24fb93368838eb84a44c472283778fc2446b \
--hash=sha256:7147e2a76c75f0f64c4319886e7639e490fee87c9d25cb1d4faef1d8cf83a440 \
--hash=sha256:9fc0b3cb5d1720e7141d103cf4819aea239f7d136acf9ee4a69b047b7986175a
typing-extensions==3.7.4.3 \
--hash=sha256:dafc7639cde7f1b6e1acc0f457842a83e722ccca8eef5270af2d74792619a89f \
--hash=sha256:7cb407020f00f7bfc3cb3e7881628838e69d8f3fcab2f64742a5e76b2f841918 \
--hash=sha256:99d4073b617d30288f569d3f13d2bd7548c3a7e4c8de87db09a9d29bb3a4a60c
urllib3==1.26.2 \
--hash=sha256:d8ff90d979214d7b4f8ce956e80f4028fc6860e4431f731ea4a8c08f23f99473 \
--hash=sha256:19188f96923873c92ccb987120ec4acaa12f0461fa9ce5d3d0772bc965a39e08
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" \
--hash=sha256:beb4802a9cebb9144e99086eff703a642a13d6a0052920003a230f3294bbe784 \
--hash=sha256:c4d647b99872929fdb7bdcaa4fbe7f01413ed3d98077df798530e5b04f116c83
xlrd==1.2.0 \
--hash=sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde \
--hash=sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2

View File

@@ -1,17 +1,90 @@
-i https://pypi.org/simple
-e .
certifi==2019.9.11
chardet==3.0.4
idna==2.8
langid==1.1.6
numpy==1.17.2
pandas==0.25.1
pycountry==19.8.18
python-dateutil==2.8.0
python-stdnum==1.11
pytz==2019.2
requests-cache==0.5.2
requests==2.22.0
six==1.12.0
urllib3==1.25.6
xlrd==1.2.0
certifi==2020.12.5 \
--hash=sha256:719a74fb9e33b9bd44cc7f3a8d94bc35e4049deebe19ba7d8e108280cfd59830 \
--hash=sha256:1a4995114262bffbc2413b159f2a1a480c969de6e6eb13ee966d470af86af59c
chardet==4.0.0 \
--hash=sha256:f864054d66fd9118f2e67044ac8981a54775ec5b67aed0441892edb553d21da5 \
--hash=sha256:0d6f53a15db4120f2b08c94f11e7d93d2c911ee118b6b30a04ec3ee8310179fa
idna==2.10 \
--hash=sha256:b97d804b1e9b523befed77c48dacec60e6dcb0b5391d57af6a65a312a90648c0 \
--hash=sha256:b307872f855b18632ce0c21c5e45be78c0ea7ae4c15c828c20788b26921eb3f6
langid==1.1.6 \
--hash=sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293
numpy==1.19.5 \
--hash=sha256:cc6bd4fd593cb261332568485e20a0712883cf631f6f5e8e86a52caa8b2b50ff \
--hash=sha256:aeb9ed923be74e659984e321f609b9ba54a48354bfd168d21a2b072ed1e833ea \
--hash=sha256:8b5e972b43c8fc27d56550b4120fe6257fdc15f9301914380b27f74856299fea \
--hash=sha256:43d4c81d5ffdff6bae58d66a3cd7f54a7acd9a0e7b18d97abb255defc09e3140 \
--hash=sha256:a4646724fba402aa7504cd48b4b50e783296b5e10a524c7a6da62e4a8ac9698d \
--hash=sha256:2e55195bc1c6b705bfd8ad6f288b38b11b1af32f3c8289d6c50d47f950c12e76 \
--hash=sha256:39b70c19ec771805081578cc936bbe95336798b7edf4732ed102e7a43ec5c07a \
--hash=sha256:dbd18bcf4889b720ba13a27ec2f2aac1981bd41203b3a3b27ba7a33f88ae4827 \
--hash=sha256:603aa0706be710eea8884af807b1b3bc9fb2e49b9f4da439e76000f3b3c6ff0f \
--hash=sha256:cae865b1cae1ec2663d8ea56ef6ff185bad091a5e33ebbadd98de2cfa3fa668f \
--hash=sha256:36674959eed6957e61f11c912f71e78857a8d0604171dfd9ce9ad5cbf41c511c \
--hash=sha256:06fab248a088e439402141ea04f0fffb203723148f6ee791e9c75b3e9e82f080 \
--hash=sha256:6149a185cece5ee78d1d196938b2a8f9d09f5a5ebfbba66969302a778d5ddd1d \
--hash=sha256:50a4a0ad0111cc1b71fa32dedd05fa239f7fb5a43a40663269bb5dc7877cfd28 \
--hash=sha256:d051ec1c64b85ecc69531e1137bb9751c6830772ee5c1c426dbcfe98ef5788d7 \
--hash=sha256:a12ff4c8ddfee61f90a1633a4c4afd3f7bcb32b11c52026c92a12e1325922d0d \
--hash=sha256:cf2402002d3d9f91c8b01e66fbb436a4ed01c6498fffed0e4c7566da1d40ee1e \
--hash=sha256:1ded4fce9cfaaf24e7a0ab51b7a87be9038ea1ace7f34b841fe3b6894c721d1c \
--hash=sha256:012426a41bc9ab63bb158635aecccc7610e3eff5d31d1eb43bc099debc979d94 \
--hash=sha256:759e4095edc3c1b3ac031f34d9459fa781777a93ccc633a472a5468587a190ff \
--hash=sha256:a9d17f2be3b427fbb2bce61e596cf555d6f8a56c222bd2ca148baeeb5e5c783c \
--hash=sha256:99abf4f353c3d1a0c7a5f27699482c987cf663b1eac20db59b8c7b061eabd7fc \
--hash=sha256:384ec0463d1c2671170901994aeb6dce126de0a95ccc3976c43b0038a37329c2 \
--hash=sha256:811daee36a58dc79cf3d8bdd4a490e4277d0e4b7d103a001a4e73ddb48e7e6aa \
--hash=sha256:c843b3f50d1ab7361ca4f0b3639bf691569493a56808a0b0c54a051d260b7dbd \
--hash=sha256:d6631f2e867676b13026e2846180e2c13c1e11289d67da08d71cacb2cd93d4aa \
--hash=sha256:7fb43004bce0ca31d8f13a6eb5e943fa73371381e53f7074ed21a4cb786c32f8 \
--hash=sha256:2ea52bd92ab9f768cc64a4c3ef8f4b2580a17af0a5436f6126b08efbd1838371 \
--hash=sha256:400580cbd3cff6ffa6293df2278c75aef2d58d8d93d3c5614cd67981dae68ceb \
--hash=sha256:df609c82f18c5b9f6cb97271f03315ff0dbe481a2a02e56aeb1b1a985ce38e60 \
--hash=sha256:ab83f24d5c52d60dbc8cd0528759532736b56db58adaa7b5f1f76ad551416a1e \
--hash=sha256:0eef32ca3132a48e43f6a0f5a82cb508f22ce5a3d6f67a8329c81c8e226d3f6e \
--hash=sha256:a0d53e51a6cb6f0d9082decb7a4cb6dfb33055308c4c44f53103c073f649af73 \
--hash=sha256:a76f502430dd98d7546e1ea2250a7360c065a5fdea52b2dffe8ae7180909b6f4
pandas==1.2.1 \
--hash=sha256:50e6c0a17ef7f831b5565fd0394dbf9bfd5d615ee4dd4bb60a3d8c9d2e872323 \
--hash=sha256:324e60bea729cf3b55c1bf9e88fe8b9932c26f8669d13b928e3c96b3a1453dff \
--hash=sha256:37443199f451f8badfe0add666e43cdb817c59fa36bceedafd9c543a42f236ca \
--hash=sha256:23ac77a3a222d9304cb2a7934bb7b4805ff43d513add7a42d1a22dc7df14edd2 \
--hash=sha256:496fcc29321e9a804d56d5aa5d7ec1320edfd1898eee2f451aa70171cf1d5a29 \
--hash=sha256:30e9e8bc8c5c17c03d943e8d6f778313efff59e413b8dbdd8214c2ed9aa165f6 \
--hash=sha256:055647e7f4c5e66ba92c2a7dcae6c2c57898b605a3fb007745df61cc4015937f \
--hash=sha256:9d45f58b03af1fea4b48e44aa38a819a33dccb9821ef9e1d68f529995f8a632f \
--hash=sha256:b26e2dabda73d347c7af3e6fed58483161c7b87a886a4e06d76ccfe55a044aa9 \
--hash=sha256:47ec0808a8357ab3890ce0eca39a63f79dcf941e2e7f494470fe1c9ec43f6091 \
--hash=sha256:57d5c7ac62925a8d2ab43ea442b297a56cc8452015e71e24f4aa7e4ed6be3d77 \
--hash=sha256:d7cca42dba13bfee369e2944ae31f6549a55831cba3117e17636955176004088 \
--hash=sha256:cfd237865d878da9b65cfee883da5e0067f5e2ff839e459466fb90565a77bda3 \
--hash=sha256:050ed2c9d825ef36738e018454e6d055c63d947c1d52010fbadd7584f09df5db \
--hash=sha256:fe7de6fed43e7d086e3d947651ec89e55ddf00102f9dd5758763d56d182f0564 \
--hash=sha256:2de012a36cc507debd9c3351b4d757f828d5a784a5fc4e6766eafc2b56e4b0f5 \
--hash=sha256:5527c5475d955c0bc9689c56865aaa2a7b13c504d6c44f0aadbf57b565af5ebd
pycountry==19.8.18 \
--hash=sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb
python-dateutil==2.8.1 \
--hash=sha256:73ebfe9dbf22e832286dafa60473e4cd239f8592f699aa5adaf10050e6e1823c \
--hash=sha256:75bb3f31ea686f1197762692a9ee6a7550b59fc6ca3a1f4b5d7e32fb98e2da2a
python-stdnum==1.15 \
--hash=sha256:ff0e4d2b8731711cf2a73d22964139208088b4bd1a3e3b0c4f78d6e823dd8923 \
--hash=sha256:46c05dff2d5ff89b12cc63ab54cca1e4788e4e3087c163a1b7434a53fa926f28
pytz==2020.5 \
--hash=sha256:16962c5fb8db4a8f63a26646d8886e9d769b6c511543557bc84e9569fb9a9cb4 \
--hash=sha256:180befebb1927b16f6b57101720075a984c019ac16b1b7575673bea42c6c3da5
requests==2.25.1 \
--hash=sha256:c210084e36a42ae6b9219e00e48287def368a26d03a048ddad7bfee44f75871e \
--hash=sha256:27973dd4a904a4f13b263a19c866c13b92a39ed1c964655f025f3f8d3d75b804
requests-cache==0.5.2 \
--hash=sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb \
--hash=sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a
six==1.15.0 \
--hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced \
--hash=sha256:30639c035cdb23534cd4aa2dd52c3bf48f06e5f4a941509c8bafd8ce11080259
urllib3==1.26.2 \
--hash=sha256:d8ff90d979214d7b4f8ce956e80f4028fc6860e4431f731ea4a8c08f23f99473 \
--hash=sha256:19188f96923873c92ccb987120ec4acaa12f0461fa9ce5d3d0772bc965a39e08
xlrd==1.2.0 \
--hash=sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde \
--hash=sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2

View File

@@ -4,17 +4,17 @@ with open("README.md", "r") as fh:
long_description = fh.read()
install_requires = [
'pandas',
'python-stdnum',
'requests',
'requests-cache',
'pycountry',
'langid'
"pandas",
"python-stdnum",
"requests",
"requests-cache",
"pycountry",
"langid",
]
setuptools.setup(
name="csv-metadata-quality",
version="0.3.1",
version="0.4.2",
author="Alan Orth",
author_email="aorth@mjanja.ch",
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@@ -25,15 +25,15 @@ setuptools.setup(
classifiers=[
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 4 - Beta"
"Development Status :: 4 - Beta",
],
packages=['csv_metadata_quality'],
packages=["csv_metadata_quality"],
entry_points={
'console_scripts': [
'csv-metadata-quality = csv_metadata_quality.__main__:main'
]
"console_scripts": ["csv-metadata-quality = csv_metadata_quality.__main__:main"]
},
install_requires=install_requires
install_requires=install_requires,
)

View File

@@ -1,6 +1,7 @@
import pandas as pd
import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental
import pandas as pd
def test_check_invalid_issn(capsys):
@@ -50,10 +51,27 @@ def test_check_invalid_separators(capsys):
value = "Alan|Orth"
check.separators(value)
field_name = "dc.contributor.author"
check.separators(value, field_name)
captured = capsys.readouterr()
assert captured.out == f"Invalid multi-value separator: {value}\n"
assert captured.out == f"Invalid multi-value separator ({field_name}): {value}\n"
def test_check_unnecessary_separators(capsys):
"""Test checking unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
check.separators(field, field_name)
captured = capsys.readouterr()
assert (
captured.out == f"Unnecessary multi-value separator ({field_name}): {field}\n"
)
def test_check_valid_separators():
@@ -61,7 +79,9 @@ def test_check_valid_separators():
value = "Alan||Orth"
result = check.separators(value)
field_name = "dc.contributor.author"
result = check.separators(value, field_name)
assert result == value

View File

@@ -6,7 +6,9 @@ def test_fix_leading_whitespace():
value = " Alan"
assert fix.whitespace(value) == "Alan"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_trailing_whitespace():
@@ -14,7 +16,9 @@ def test_fix_trailing_whitespace():
value = "Alan "
assert fix.whitespace(value) == "Alan"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_excessive_whitespace():
@@ -22,7 +26,9 @@ def test_fix_excessive_whitespace():
value = "Alan Orth"
assert fix.whitespace(value) == "Alan Orth"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan Orth"
def test_fix_invalid_separators():
@@ -30,7 +36,19 @@ def test_fix_invalid_separators():
value = "Alan|Orth"
assert fix.separators(value) == "Alan||Orth"
field_name = "dc.contributor.author"
assert fix.separators(value, field_name) == "Alan||Orth"
def test_fix_unnecessary_separators():
"""Test fixing unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
assert fix.separators(field, field_name) == "Alan||Orth"
def test_fix_unnecessary_unicode():
@@ -46,7 +64,9 @@ def test_fix_duplicates():
value = "Kenya||Kenya"
assert fix.duplicates(value) == "Kenya"
field_name = "dc.contributor.author"
assert fix.duplicates(value, field_name) == "Kenya"
def test_fix_newlines():
@@ -66,3 +86,25 @@ def test_fix_comma_space():
field_name = "dc.contributor.author"
assert fix.comma_space(value, field_name) == "Orth, Alan S."
def test_fix_normalized_unicode():
"""Test fixing a string that is already in its normalized (NFC) Unicode form."""
# string using the normalized canonical form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_decomposed_unicode():
"""Test fixing a string that contains Unicode string."""
# string using the decomposed form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"