1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-05-09 22:56:01 +02:00

66 Commits

Author SHA1 Message Date
cbf94490f2 Version 0.4.3 2021-01-26 15:22:40 +02:00
f3d0d5ef07 setup.py: Remove Python 3.6
I actually removed Python 3.6 support a few weeks ago after updating
to Pandas 1.2.0, but forgot to update this.
2021-01-26 15:22:08 +02:00
4b7b99c94c CHANGELOG.md: Add note about multi-value separators 2021-01-26 15:20:22 +02:00
df670e81b9 README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
ae357d8c6c Revert "Update requirements"
This reverts commit ca80340f7a717a9c8cd1780ba12e5bd366c9900d.

Nope, we still need the --without-hashes because this still fails
on Python 3.7, but not 3.8 or 3.9. From looking around it seems
that nobody can agree whether poetry should handle this, pip should
handle it, or upstream projects should pin their dependencies.
2021-01-26 14:15:31 +02:00
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
cb07d357d4 Version 0.4.2 2020-07-06 14:04:34 +03:00
65cd48a26f CHANGELOG.md: Update changes 2020-07-06 14:00:21 +03:00
0f883f640c Remove pipenv 2020-07-06 13:59:49 +03:00
f4c5c5781e README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
6aa784ad8c Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-07-06 13:57:07 +03:00
7b8da94f41 poetry.lock: Update Python dependencies 2020-07-06 13:56:31 +03:00
2a1566af62 csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
5fcaa63bd5 csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
aa9e23b46c pyproject.toml: Update license specifier
We need to use valid SPDX license identifiers.
2020-06-09 14:22:53 +03:00
73acb1661f Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-05-31 17:51:16 +03:00
2a068fddc4 .build.yml: Fix test 2020-05-31 17:44:37 +03:00
c6c2f13e88 .build.yml: Fix poetry install invocation
Poetry apparently installs dev dependencies by default.
2020-05-31 17:37:09 +03:00
56f16e37ed .build.yml: Use poetry in SourceHut CI 2020-05-31 17:35:04 +03:00
0c44b967b6 Add poetry project file and lock
I want to try to use poetry instead of pipenv because pipenv takes
forever to do dependency resolution sometimes. Also, I have had a
few issues with Python modules like black that don't have releases
other than pre-releases, and even including the project itself in
the dependencies (pip install -e . ...?). My initial experience is
that poetry handles this better.
2020-05-31 17:33:40 +03:00
8a267bb40b .travis.yml: Try to build with Python 3.8-dev
But allow failures.
2020-03-29 16:40:11 +03:00
8fda8f1ef1 Pipfile.lock: Run pipenv update
All tests still passing.
2020-03-20 16:22:04 +02:00
5e471813e8 CHANGELOG.md: Add note about python dependencies 2020-01-29 12:41:43 +02:00
79244b9ac3 Pipfile.lock: Run pipenv update 2020-01-29 12:39:12 +02:00
5e81a33482 CHANGELOG.md: Add note about field names 2020-01-16 12:37:11 +02:00
28b5996aa6 Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
40ba9bae6c README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
0b2d211455 Version 0.4.1 2020-01-15 12:19:42 +02:00
7f1df0b47c Support Python 3.6 and 3.7 again 2020-01-15 12:19:17 +02:00
365ecda324 Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
550ce7fb7e .travis.yml: Only test Python 3.8
The Unicode normalization feature requires Python 3.8 because the
unicodedata.is_normalized() function only appears there. If I find
another way to check if a string is normalized without normalizing
it first I will drop the requirements back down to Python 3.6.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:57:21 +02:00
22 changed files with 1565 additions and 765 deletions

View File

@ -1,19 +0,0 @@
image: archlinux
packages:
- python-pipenv
sources:
- https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
- setup: |
cd csv-metadata-quality
pipenv install --dev
- pytest: |
cd csv-metadata-quality
pipenv run pytest
- testcli: |
cd csv-metadata-quality
pipenv run pip install .
pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
environment:
PIPENV_NOSPIN: 'True'
PIPENV_HIDE_EMOJIS: 'True'

49
.drone.yml Normal file
View File

@ -0,0 +1,49 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
image: python:3.9-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
image: python:3.8-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
image: python:3.7-slim
commands:
- id
- python -V
- pip install -r requirements-dev.txt
- pytest
- python setup.py install
- csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
# vim: ts=2 sw=2 et

41
.github/workflows/python-app.yml vendored Normal file
View File

@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest
- name: Test CLI
run: |
python setup.py install
csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country

View File

@ -1,12 +0,0 @@
dist: bionic
language: python
python:
- "3.6"
- "3.7"
- "3.8"
install:
- "pip install -r requirements.txt"
- "pip install -r requirements-dev.txt"
script: pytest
# vim: ts=2 sw=2 et

View File

@ -4,6 +4,31 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
that only works in Python >= 3.8
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)

29
Pipfile
View File

@ -1,29 +0,0 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
pytest = "*"
ipython = "*"
flake8 = "*"
pytest-clarity = "*"
black = "==19.10b0"
isort = "*"
csvkit = "*"
[packages]
pandas = "*"
python-stdnum = "*"
xlrd = "*"
requests = "*"
requests-cache = "*"
pycountry = "*"
csv-metadata-quality = {editable = true,path = "."}
langid = "*"
[requires]
python_version = "3.8"
[pipenv]
allow_prereleases = true

589
Pipfile.lock generated
View File

@ -1,589 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "bc933a2deb26ed095c46d6ccddf0f305f84157bdd95548c6b6a4356537951890"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.8"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"certifi": {
"hashes": [
"sha256:017c25db2a153ce562900032d5bc68e9f191e44e9a0f762f373977de9df1fbb3",
"sha256:25b64c7da4cd7479594d035c08c2d809eb4aab3a26e5a990ea98cc450c320f1f"
],
"version": "==2019.11.28"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"csv-metadata-quality": {
"editable": true,
"path": "."
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"langid": {
"hashes": [
"sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293"
],
"index": "pypi",
"version": "==1.1.6"
},
"numpy": {
"hashes": [
"sha256:1786a08236f2c92ae0e70423c45e1e62788ed33028f94ca99c4df03f5be6b3c6",
"sha256:17aa7a81fe7599a10f2b7d95856dc5cf84a4eefa45bc96123cbbc3ebc568994e",
"sha256:20b26aaa5b3da029942cdcce719b363dbe58696ad182aff0e5dcb1687ec946dc",
"sha256:2d75908ab3ced4223ccba595b48e538afa5ecc37405923d1fea6906d7c3a50bc",
"sha256:39d2c685af15d3ce682c99ce5925cc66efc824652e10990d2462dfe9b8918c6a",
"sha256:56bc8ded6fcd9adea90f65377438f9fea8c05fcf7c5ba766bef258d0da1554aa",
"sha256:590355aeade1a2eaba17617c19edccb7db8d78760175256e3cf94590a1a964f3",
"sha256:70a840a26f4e61defa7bdf811d7498a284ced303dfbc35acb7be12a39b2aa121",
"sha256:77c3bfe65d8560487052ad55c6998a04b654c2fbc36d546aef2b2e511e760971",
"sha256:9537eecf179f566fd1c160a2e912ca0b8e02d773af0a7a1120ad4f7507cd0d26",
"sha256:9acdf933c1fd263c513a2df3dceecea6f3ff4419d80bf238510976bf9bcb26cd",
"sha256:ae0975f42ab1f28364dcda3dde3cf6c1ddab3e1d4b2909da0cb0191fa9ca0480",
"sha256:b3af02ecc999c8003e538e60c89a2b37646b39b688d4e44d7373e11c2debabec",
"sha256:b6ff59cee96b454516e47e7721098e6ceebef435e3e21ac2d6c3b8b02628eb77",
"sha256:b765ed3930b92812aa698a455847141869ef755a87e099fddd4ccf9d81fffb57",
"sha256:c98c5ffd7d41611407a1103ae11c8b634ad6a43606eca3e2a5a269e5d6e8eb07",
"sha256:cf7eb6b1025d3e169989416b1adcd676624c2dbed9e3bcb7137f51bfc8cc2572",
"sha256:d92350c22b150c1cae7ebb0ee8b5670cc84848f6359cf6b5d8f86617098a9b73",
"sha256:e422c3152921cece8b6a2fb6b0b4d73b6579bd20ae075e7d15143e711f3ca2ca",
"sha256:e840f552a509e3380b0f0ec977e8124d0dc34dc0e68289ca28f4d7c1d0d79474",
"sha256:f3d0a94ad151870978fb93538e95411c83899c9dc63e6fb65542f769568ecfa5"
],
"version": "==1.18.1"
},
"pandas": {
"hashes": [
"sha256:0f52d8a2358de840eca388f50bcab137d9d2f161f55c9c32e888387ac2e4505b",
"sha256:111d77cac6c0e2d8bb76bdad75b3a416729f5f31f705276becbf8035b26ac5e0",
"sha256:223f97e52a4d82cf918da5dcbdc92c69ab00686e2b6adeb3012326ace3dc1aee",
"sha256:3b09cae3d39e71187fcc6817c3f60a8c9bad5f503e6aa8d72e4cbb2e1cd7a585",
"sha256:4a37ab58d7c3017d71650a7d9b44d056005c1d0d9be931d8af9c8b2ca2c8a8b8",
"sha256:57628cd142f09165bca3ce0b2f82f14568ae14a6c2c125a29d167c9b9df6f76e",
"sha256:5c42b463d25780d5d5addc79b1cfb1b8d8db44d4184186da8e2a25f2c794ad43",
"sha256:656443bf914f5e9307fcc694d5f400d19e616d7aafa4faf57711e0449093272f",
"sha256:7e5dc9137b9fc2e3ccd00df092fa3af6e01430dcba747f5f063b33ea1ed0999c",
"sha256:8305fb7b2817e3da6071f0032b6ca1402cbe303094ab5594f552d7052782b8de",
"sha256:a98b46eec0e245fd3dc0d11012109f41aa37c96066aa642d65f4a4c332d193c1",
"sha256:b254f0c4308ff0c8c896a9de980642a55b716dff4d1fc8a730657e6d4711e35d",
"sha256:cce070caeb357ef89267482c7dd1a9adaa57444be5663ea294675ab0cdb5f033",
"sha256:f4e74a38cc48453bceda51c0d13122c38f0a49dd4c737f8091b8cdc88f47eb8c"
],
"index": "pypi",
"version": "==1.0.0rc0"
},
"pycountry": {
"hashes": [
"sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb"
],
"index": "pypi",
"version": "==19.8.18"
},
"python-dateutil": {
"hashes": [
"sha256:73ebfe9dbf22e832286dafa60473e4cd239f8592f699aa5adaf10050e6e1823c",
"sha256:75bb3f31ea686f1197762692a9ee6a7550b59fc6ca3a1f4b5d7e32fb98e2da2a"
],
"version": "==2.8.1"
},
"python-stdnum": {
"hashes": [
"sha256:4c1347c414d7bdffb454924998f62c04d907a5c01faff0e35df659b0b52acba5",
"sha256:bb58877dafc2e590dbfddc63fa04876ab2005c3f35c8356a2dd01f62a9bdc4d6"
],
"index": "pypi",
"version": "==1.12"
},
"pytz": {
"hashes": [
"sha256:1c557d7d0e871de1f5ccd5833f60fb2550652da6be2693c1e02300743d21500d",
"sha256:b02c06db6cf09c12dd25137e563b31700d3b80fcc4ad23abb7a315f2789819be"
],
"version": "==2019.3"
},
"requests": {
"hashes": [
"sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4",
"sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31"
],
"index": "pypi",
"version": "==2.22.0"
},
"requests-cache": {
"hashes": [
"sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb",
"sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a"
],
"index": "pypi",
"version": "==0.5.2"
},
"six": {
"hashes": [
"sha256:1f1b7d42e254082a9db6279deae68afb421ceba6158efa6131de7b3003ee93fd",
"sha256:30f610279e8b2578cab6db20741130331735c781b56053c59c4076da27f06b66"
],
"version": "==1.13.0"
},
"urllib3": {
"hashes": [
"sha256:a8a318824cc77d1fd4b2bec2ded92646630d7fe8619497b142c84a9e6f5a7293",
"sha256:f3c5fd51747d450d4dcf6f923c81f78f811aab8205fda64b0aba34a4e48b0745"
],
"version": "==1.25.7"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
},
"develop": {
"agate": {
"hashes": [
"sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c",
"sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3"
],
"version": "==1.6.1"
},
"agate-dbf": {
"hashes": [
"sha256:00c93c498ec9a04cc587bf63dd7340e67e2541f0df4c9a7259d7cb3dd4ce372f"
],
"version": "==0.2.1"
},
"agate-excel": {
"hashes": [
"sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52"
],
"version": "==0.2.3"
},
"agate-sql": {
"hashes": [
"sha256:9277490ba8b8e7c747a9ae3671f52fe486784b48d4a14e78ca197fb0e36f281b"
],
"version": "==0.5.4"
},
"appdirs": {
"hashes": [
"sha256:9e5896d1372858f8dd3344faf4e5014d21849c756c8d5701f78f8a103b372d92",
"sha256:d8b24664561d0d34ddfaec54636d502d7cea6e29c3eaf68f3df6180863e2166e"
],
"version": "==1.4.3"
},
"attrs": {
"hashes": [
"sha256:08a96c641c3a74e44eb59afb61a24f2cb9f4d7188748e76ba4bb5edfa3cb7d1c",
"sha256:f7b7ce16570fe9965acd6d30101a28f62fb4a7f9e926b3bbc9b61f8b04247e72"
],
"version": "==19.3.0"
},
"babel": {
"hashes": [
"sha256:1aac2ae2d0d8ea368fa90906567f5c08463d98ade155c0c4bfedd6a0f7160e38",
"sha256:d670ea0b10f8b723672d3a6abeb87b565b244da220d76b4dba1b66269ec152d4"
],
"version": "==2.8.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"black": {
"hashes": [
"sha256:1b30e59be925fafc1ee4565e5e08abef6b03fe455102883820fe5ee2e4734e0b",
"sha256:c2edb73a08e9e0e6f65a0e6af18b059b8b1cdd5bef997d7a0b181df93dc81539"
],
"index": "pypi",
"version": "==19.10b0"
},
"click": {
"hashes": [
"sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13",
"sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7"
],
"version": "==7.0"
},
"csvkit": {
"hashes": [
"sha256:1353a383531bee191820edfb88418c13dfe1cdfa9dd3dc46f431c05cd2a260a0"
],
"index": "pypi",
"version": "==1.0.4"
},
"dbfread": {
"hashes": [
"sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d",
"sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec"
],
"version": "==2.0.7"
},
"decorator": {
"hashes": [
"sha256:54c38050039232e1db4ad7375cfce6748d7b41c29e95a081c8a6d2c30364a2ce",
"sha256:5d19b92a3c8f7f101c8dd86afd86b0f061a8ce4540ab8cd401fa2542756bce6d"
],
"version": "==4.4.1"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"et-xmlfile": {
"hashes": [
"sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b"
],
"version": "==1.0.1"
},
"flake8": {
"hashes": [
"sha256:45681a117ecc81e870cbf1262835ae4af5e7a8b08e40b944a8a6e6b895914cfb",
"sha256:49356e766643ad15072a789a20915d3c91dc89fd313ccd71802303fd67e4deca"
],
"index": "pypi",
"version": "==3.7.9"
},
"ipython": {
"hashes": [
"sha256:0f4bcf18293fb666df8511feec0403bdb7e061a5842ea6e88a3177b0ceb34ead",
"sha256:387686dd7fc9caf29d2fddcf3116c4b07a11d9025701d220c589a430b0171d8a"
],
"index": "pypi",
"version": "==7.11.1"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"isodate": {
"hashes": [
"sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8",
"sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81"
],
"version": "==0.6.0"
},
"isort": {
"hashes": [
"sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1",
"sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd"
],
"index": "pypi",
"version": "==4.3.21"
},
"jdcal": {
"hashes": [
"sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba",
"sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8"
],
"version": "==1.4.1"
},
"jedi": {
"hashes": [
"sha256:1349c1e8c107095a55386628bb3b2a79422f3a2cab8381e34ce19909e0cf5064",
"sha256:e909527104a903606dd63bea6e8e888833f0ef087057829b89a18364a856f807"
],
"version": "==0.15.2"
},
"leather": {
"hashes": [
"sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988",
"sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2"
],
"version": "==0.3.3"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"more-itertools": {
"hashes": [
"sha256:1a2a32c72400d365000412fe08eb4a24ebee89997c18d3d147544f70f5403b39",
"sha256:c468adec578380b6281a114cb8a5db34eb1116277da92d7c46f904f0b52d3288"
],
"version": "==8.1.0"
},
"openpyxl": {
"hashes": [
"sha256:547a9fc6aafcf44abe358b89ed4438d077e9d92e4f182c87e2dc294186dc4b64"
],
"version": "==3.0.3"
},
"packaging": {
"hashes": [
"sha256:aec3fdbb8bc9e4bb65f0634b9f551ced63983a529d6a8931817d52fdd0816ddb",
"sha256:fe1d8331dfa7cc0a883b49d75fc76380b2ab2734b220fbb87d774e4fd4b851f8"
],
"version": "==20.0"
},
"parsedatetime": {
"hashes": [
"sha256:3b835fc54e472c17ef447be37458b400e3fefdf14bb1ffdedb5d2c853acf4ba1",
"sha256:d2e9ddb1e463de871d32088a3f3cea3dc8282b1b2800e081bd0ef86900451667"
],
"version": "==2.5"
},
"parso": {
"hashes": [
"sha256:55cf25df1a35fd88b878715874d2c4dc1ad3f0eebd1e0266a67e1f55efccfbe1",
"sha256:5c1f7791de6bd5dbbeac8db0ef5594b36799de198b3f7f7014643b0c5536b9d3"
],
"version": "==0.5.2"
},
"pathspec": {
"hashes": [
"sha256:163b0632d4e31cef212976cf57b43d9fd6b0bac6e67c26015d611a647d5e7424",
"sha256:562aa70af2e0d434367d9790ad37aed893de47f1693e4201fd1d3dca15d19b96"
],
"version": "==0.7.0"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"pluggy": {
"hashes": [
"sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0",
"sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d"
],
"version": "==0.13.1"
},
"prompt-toolkit": {
"hashes": [
"sha256:0278d2f51b5ceba6ea8da39f76d15684e84c996b325475f6e5720edc584326a7",
"sha256:63daee79aa8366c8f1c637f1a4876b890da5fc92a19ebd2f7080ebacb901e990"
],
"version": "==3.0.2"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"py": {
"hashes": [
"sha256:5e27081401262157467ad6e7f851b7aa402c5852dbcb3dae06768434de5752aa",
"sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0"
],
"version": "==1.8.1"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:2a3fe295e54a20164a9df49c75fa58526d3be48e14aceba6d6b1e8ac0bfd6f1b",
"sha256:98c8aa5a9f778fcd1026a17361ddaf7330d1b7c62ae97c3bb0ae73e0b9b6b0fe"
],
"version": "==2.5.2"
},
"pyparsing": {
"hashes": [
"sha256:4c830582a84fb022400b85429791bc551f1f4871c33f23e44f353119e92f969f",
"sha256:c342dccb5250c08d45fd6f8b4a559613ca603b57498511740e65cd11a2e7dcec"
],
"version": "==2.4.6"
},
"pytest": {
"hashes": [
"sha256:6b571215b5a790f9b41f19f3531c53a45cf6bb8ef2988bc1ff9afb38270b25fa",
"sha256:e41d489ff43948babd0fad7ad5e49b8735d5d55e26628a58673c39ff61d95de4"
],
"index": "pypi",
"version": "==5.3.2"
},
"pytest-clarity": {
"hashes": [
"sha256:3f40d5ae7cb21cc95e622fc4f50d9466f80ae0f91460225b8c95c07afbf93e20"
],
"index": "pypi",
"version": "==0.2.0a1"
},
"python-slugify": {
"hashes": [
"sha256:a8fc3433821140e8f409a9831d13ae5deccd0b033d4744d94b31fea141bdd84c"
],
"version": "==4.0.0"
},
"pytimeparse": {
"hashes": [
"sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd",
"sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a"
],
"version": "==1.1.8"
},
"pytz": {
"hashes": [
"sha256:1c557d7d0e871de1f5ccd5833f60fb2550652da6be2693c1e02300743d21500d",
"sha256:b02c06db6cf09c12dd25137e563b31700d3b80fcc4ad23abb7a315f2789819be"
],
"version": "==2019.3"
},
"regex": {
"hashes": [
"sha256:07b39bf943d3d2fe63d46281d8504f8df0ff3fe4c57e13d1656737950e53e525",
"sha256:0932941cdfb3afcbc26cc3bcf7c3f3d73d5a9b9c56955d432dbf8bbc147d4c5b",
"sha256:0e182d2f097ea8549a249040922fa2b92ae28be4be4895933e369a525ba36576",
"sha256:10671601ee06cf4dc1bc0b4805309040bb34c9af423c12c379c83d7895622bb5",
"sha256:23e2c2c0ff50f44877f64780b815b8fd2e003cda9ce817a7fd00dea5600c84a0",
"sha256:26ff99c980f53b3191d8931b199b29d6787c059f2e029b2b0c694343b1708c35",
"sha256:27429b8d74ba683484a06b260b7bb00f312e7c757792628ea251afdbf1434003",
"sha256:3e77409b678b21a056415da3a56abfd7c3ad03da71f3051bbcdb68cf44d3c34d",
"sha256:4e8f02d3d72ca94efc8396f8036c0d3bcc812aefc28ec70f35bb888c74a25161",
"sha256:4eae742636aec40cf7ab98171ab9400393360b97e8f9da67b1867a9ee0889b26",
"sha256:6a6ae17bf8f2d82d1e8858a47757ce389b880083c4ff2498dba17c56e6c103b9",
"sha256:6a6ba91b94427cd49cd27764679024b14a96874e0dc638ae6bdd4b1a3ce97be1",
"sha256:7bcd322935377abcc79bfe5b63c44abd0b29387f267791d566bbb566edfdd146",
"sha256:98b8ed7bb2155e2cbb8b76f627b2fd12cf4b22ab6e14873e8641f266e0fb6d8f",
"sha256:bd25bb7980917e4e70ccccd7e3b5740614f1c408a642c245019cff9d7d1b6149",
"sha256:d0f424328f9822b0323b3b6f2e4b9c90960b24743d220763c7f07071e0778351",
"sha256:d58e4606da2a41659c84baeb3cfa2e4c87a74cec89a1e7c56bee4b956f9d7461",
"sha256:e3cd21cc2840ca67de0bbe4071f79f031c81418deb544ceda93ad75ca1ee9f7b",
"sha256:e6c02171d62ed6972ca8631f6f34fa3281d51db8b326ee397b9c83093a6b7242",
"sha256:e7c7661f7276507bce416eaae22040fd91ca471b5b33c13f8ff21137ed6f248c",
"sha256:ecc6de77df3ef68fee966bb8cb4e067e84d4d1f397d0ef6fce46913663540d77"
],
"version": "==2020.1.8"
},
"six": {
"hashes": [
"sha256:1f1b7d42e254082a9db6279deae68afb421ceba6158efa6131de7b3003ee93fd",
"sha256:30f610279e8b2578cab6db20741130331735c781b56053c59c4076da27f06b66"
],
"version": "==1.13.0"
},
"sqlalchemy": {
"hashes": [
"sha256:bfb8f464a5000b567ac1d350b9090cf081180ec1ab4aa87e7bca12dab25320ec"
],
"version": "==1.3.12"
},
"termcolor": {
"hashes": [
"sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
],
"version": "==1.1.0"
},
"text-unidecode": {
"hashes": [
"sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8",
"sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93"
],
"version": "==1.3"
},
"toml": {
"hashes": [
"sha256:229f81c57791a41d65e399fc06bf0848bab550a9dfd5ed66df18ce5f05e73d5c",
"sha256:235682dd292d5899d361a811df37e04a8828a5b1da3115886b73cf81ebc9100e"
],
"version": "==0.10.0"
},
"traitlets": {
"hashes": [
"sha256:70b4c6a1d9019d7b4f6846832288f86998aa3b9207c6821f3578a6a6a467fe44",
"sha256:d023ee369ddd2763310e4c3eae1ff649689440d4ae59d7485eb4cfbbe3e359f7"
],
"version": "==4.3.3"
},
"typed-ast": {
"hashes": [
"sha256:0666aa36131496aed8f7be0410ff974562ab7eeac11ef351def9ea6fa28f6355",
"sha256:0c2c07682d61a629b68433afb159376e24e5b2fd4641d35424e462169c0a7919",
"sha256:249862707802d40f7f29f6e1aad8d84b5aa9e44552d2cc17384b209f091276aa",
"sha256:24995c843eb0ad11a4527b026b4dde3da70e1f2d8806c99b7b4a7cf491612652",
"sha256:269151951236b0f9a6f04015a9004084a5ab0d5f19b57de779f908621e7d8b75",
"sha256:4083861b0aa07990b619bd7ddc365eb7fa4b817e99cf5f8d9cf21a42780f6e01",
"sha256:498b0f36cc7054c1fead3d7fc59d2150f4d5c6c56ba7fb150c013fbc683a8d2d",
"sha256:4e3e5da80ccbebfff202a67bf900d081906c358ccc3d5e3c8aea42fdfdfd51c1",
"sha256:6daac9731f172c2a22ade6ed0c00197ee7cc1221aa84cfdf9c31defeb059a907",
"sha256:715ff2f2df46121071622063fc7543d9b1fd19ebfc4f5c8895af64a77a8c852c",
"sha256:73d785a950fc82dd2a25897d525d003f6378d1cb23ab305578394694202a58c3",
"sha256:8c8aaad94455178e3187ab22c8b01a3837f8ee50e09cf31f1ba129eb293ec30b",
"sha256:8ce678dbaf790dbdb3eba24056d5364fb45944f33553dd5869b7580cdbb83614",
"sha256:aaee9905aee35ba5905cfb3c62f3e83b3bec7b39413f0a7f19be4e547ea01ebb",
"sha256:bcd3b13b56ea479b3650b82cabd6b5343a625b0ced5429e4ccad28a8973f301b",
"sha256:c9e348e02e4d2b4a8b2eedb48210430658df6951fa484e59de33ff773fbd4b41",
"sha256:d205b1b46085271b4e15f670058ce182bd1199e56b317bf2ec004b6a44f911f6",
"sha256:d43943ef777f9a1c42bf4e552ba23ac77a6351de620aa9acf64ad54933ad4d34",
"sha256:d5d33e9e7af3b34a40dc05f498939f0ebf187f07c385fd58d591c533ad8562fe",
"sha256:fc0fea399acb12edbf8a628ba8d2312f583bdbdb3335635db062fa98cf71fca4",
"sha256:fe460b922ec15dd205595c9b5b99e2f056fd98ae8f9f56b888e7a17dc2b757e7"
],
"version": "==1.4.1"
},
"wcwidth": {
"hashes": [
"sha256:8fd29383f539be45b20bd4df0dc29c20ba48654a41e661925e612311e9f3c603",
"sha256:f28b3e8a6483e5d49e7f8949ac1a78314e740333ae305b4ba5defd3e74fb37a8"
],
"version": "==0.1.8"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
}
}

View File

@ -1,7 +1,7 @@
# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
# CSV Metadata Quality ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
Requires Python 3.7 or greater (3.8 recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
## Functionality
@ -10,7 +10,7 @@ Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](ht
- Experimental validation of titles and abstracts against item's Dublin Core language field
- Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
- Fix leading, trailing, and excessive (ie, more than one) whitespace
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
- Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
- Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
@ -18,16 +18,16 @@ Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](ht
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
$ cd csv-metadata-quality
$ pipenv install
$ pipenv shell
$ poetry install
$ poetry shell
```
Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
Otherwise, if you don't have poetry, you can use a vanilla Python virtual environment:
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
@ -56,10 +56,12 @@ You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currentl
### Invalid Multi-Value Separators
This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
## Unicode Normalization
### Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are nottechnically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
@ -102,6 +104,9 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
- Warn if two items use the same file in `filename` column
- Add an option to drop invalid AGROVOC subjects?
- Add tests for application invocation, ie `tests/test_app.py`?
- Validate ISSNs or journal titles against CrossRef API?
- Better ISO 8601 date parsing (currently only supports simple dates, perhaps we need to use dateutil.parser.parseiso())
- Fix lazy date check (assumes field name has "date" but could be dcterms.issued etc!)
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

View File

@ -82,7 +82,7 @@ def run(argv):
continue
# Fix: whitespace
df[column] = df[column].apply(fix.whitespace)
df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: newlines
if args.unsafe_fixes:
@ -103,20 +103,20 @@ def run(argv):
# Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode)
# Check: invalid multi-value separator
df[column] = df[column].apply(check.separators)
# Check: invalid and unnecessary multi-value separators
df[column] = df[column].apply(check.separators, field_name=column)
# Check: suspicious characters
df[column] = df[column].apply(check.suspicious_characters, field_name=column)
# Fix: invalid multi-value separator
# Fix: invalid and unnecessary multi-value separators
if args.unsafe_fixes:
df[column] = df[column].apply(fix.separators)
df[column] = df[column].apply(fix.separators, field_name=column)
# Run whitespace fix again after fixing invalid separators
df[column] = df[column].apply(fix.whitespace)
df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: duplicate metadata values
df[column] = df[column].apply(fix.duplicates)
df[column] = df[column].apply(fix.duplicates, field_name=column)
# Check: invalid AGROVOC subject
if args.agrovoc_fields:

View File

@ -1,4 +1,9 @@
from datetime import datetime, timedelta
import pandas as pd
import requests
import requests_cache
from pycountry import languages
def issn(field):
@ -51,8 +56,12 @@ def isbn(field):
return field
def separators(field):
"""Check for invalid multi-value separators (ie "|" or "|||").
def separators(field, field_name):
"""Check for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
@ -65,12 +74,18 @@ def separators(field):
# Try to split multi-value field on "||" separator
for value in field.split("||"):
# Check if the current value is blank
if value == "":
print(f"Unnecessary multi-value separator ({field_name}): {field}")
continue
# After splitting, see if there are any remaining "|" characters
match = re.findall(r"^.*?\|.*$", value)
# Check if there was a match
if match:
print(f"Invalid multi-value separator: {field}")
print(f"Invalid multi-value separator ({field_name}): {field}")
return field
@ -85,7 +100,6 @@ def date(field, field_name):
Prints the date if invalid.
"""
from datetime import datetime
if pd.isna(field):
print(f"Missing date ({field_name}).")
@ -170,8 +184,6 @@ def language(field):
Prints the value if it is invalid.
"""
from pycountry import languages
# Skip fields with missing values
if pd.isna(field):
return
@ -213,30 +225,23 @@ def agrovoc(field, field_name):
Prints a warning if the value is invalid.
"""
from datetime import timedelta
import requests
import requests_cache
# Skip fields with missing values
if pd.isna(field):
return
# enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30)
requests_cache.install_cache("agrovoc-response-cache", expire_after=expire_after)
# prune old cache entries
requests_cache.core.remove_expired_responses()
# Try to split multi-value field on "||" separator
for value in field.split("||"):
request_url = (
f"http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}"
)
request_url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"
request_params = {"query": value}
# enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30)
requests_cache.install_cache(
"agrovoc-response-cache", expire_after=expire_after
)
request = requests.get(request_url)
# prune old cache entries
requests_cache.core.remove_expired_responses()
request = requests.get(request_url, params=request_params)
if request.status_code == requests.codes.ok:
data = request.json()

View File

@ -1,9 +1,12 @@
import re
from unicodedata import normalize
import pandas as pd
from csv_metadata_quality.util import is_nfc
def whitespace(field):
def whitespace(field, field_name):
"""Fix whitespace issues.
Return string with leading, trailing, and consecutive whitespace trimmed.
@ -26,7 +29,7 @@ def whitespace(field):
match = re.findall(pattern, value)
if match:
print(f"Removing excessive whitespace: {value}")
print(f"Removing excessive whitespace ({field_name}): {value}")
value = re.sub(pattern, " ", value)
# Save cleaned value
@ -38,8 +41,15 @@ def whitespace(field):
return new_field
def separators(field):
"""Fix for invalid multi-value separators (ie "|")."""
def separators(field, field_name):
"""Fix for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
# Skip fields with missing values
if pd.isna(field):
@ -50,12 +60,18 @@ def separators(field):
# Try to split multi-value field on "||" separator
for value in field.split("||"):
# Check if the value is blank and skip it
if value == "":
print(f"Fixing unnecessary multi-value separator ({field_name}): {field}")
continue
# After splitting, see if there are any remaining "|" characters
pattern = re.compile(r"\|")
match = re.findall(pattern, value)
if match:
print(f"Fixing invalid multi-value separator: {value}")
print(f"Fixing invalid multi-value separator ({field_name}): {value}")
value = re.sub(pattern, "||", value)
@ -121,7 +137,7 @@ def unnecessary_unicode(field):
return field
def duplicates(field):
def duplicates(field, field_name):
"""Remove duplicate metadata values."""
# Skip fields with missing values
@ -140,7 +156,7 @@ def duplicates(field):
if value not in new_values:
new_values.append(value)
else:
print(f"Removing duplicate value: {value}")
print(f"Removing duplicate value ({field_name}): {value}")
# Create a new field consisting of all values joined with "||"
new_field = "||".join(new_values)
@ -212,15 +228,12 @@ def normalize_unicode(field, field_name):
Return normalized string.
"""
from unicodedata import is_normalized
from unicodedata import normalize
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_normalized("NFC", field):
if not is_nfc(field):
print(f"Normalizing Unicode ({field_name}): {field}")
field = normalize("NFC", field)

View File

@ -0,0 +1,14 @@
def is_nfc(field):
"""Utility function to check whether a string is using normalized Unicode.
Python's built-in unicodedata library has the is_normalized() function, but
it was only introduced in Python 3.8. By using a simple utility function we
are able to run on Python >= 3.6 again.
See: https://docs.python.org/3/library/unicodedata.html
Return boolean.
"""
from unicodedata import normalize
return field == normalize("NFC", field)

View File

@ -1 +1 @@
VERSION = "0.4.0"
VERSION = "0.4.3"

View File

@ -28,3 +28,4 @@ Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
Composéd Unicode,2020-01-14,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,
Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,

1 dc.title dc.date.issued dc.identifier.issn dc.identifier.isbn dc.language.iso dc.subject cg.coverage.country filename
28 Composéd Unicode 2020-01-14
29 Decomposéd Unicode 2020-01-14
30 Unnecessary multi-value separator 2021-01-03 0378-5955||
31

1211
poetry.lock generated Normal file

File diff suppressed because it is too large Load Diff

31
pyproject.toml Normal file
View File

@ -0,0 +1,31 @@
[tool.poetry]
name = "csv-metadata-quality"
version = "0.4.3"
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
authors = ["Alan Orth <alan.orth@gmail.com>"]
license="GPL-3.0-only"
repository = "https://github.com/ilri/csv-metadata-quality"
homepage = "https://github.com/ilri/csv-metadata-quality"
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.0.4"
python-stdnum = "^1.13"
xlrd = "^1.2.0"
requests = "^2.23.0"
requests-cache = "^0.5.2"
pycountry = "^19.8.18"
langid = "^1.1.6"
[tool.poetry.dev-dependencies]
pytest = "^6.1.1"
ipython = { version = "^7.18.1", python = "^3.7" }
flake8 = "^3.8.4"
pytest-clarity = "^0.3.0-alpha.0"
black = "20.8b1"
isort = "^5.5.4"
csvkit = "^1.0.5"
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

View File

@ -1,5 +1,5 @@
[pytest]
addopts= -rsxX -s -v --strict --capture=sys
addopts= -rsxX -s -v --strict-markers --capture=sys
filterwarnings =
error::UserWarning
ignore:.*U.* is deprecated:DeprecationWarning

View File

@ -1,56 +1,71 @@
-i https://pypi.org/simple
agate-dbf==0.2.1
agate-excel==0.2.3
agate-sql==0.5.4
agate==1.6.1
appdirs==1.4.3
attrs==19.3.0
babel==2.8.0
backcall==0.1.0
black==19.10b0
click==7.0
csvkit==1.0.4
agate-dbf==0.2.2
agate-excel==0.2.3
agate-sql==0.5.5
appdirs==1.4.4
appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
atomicwrites==1.4.0; sys_platform == "win32"
attrs==20.3.0
babel==2.9.0
backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
black==20.8b1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
colorama==0.4.4; python_version >= "3.7" and python_version < "4.0" and sys_platform == "win32" or sys_platform == "win32"
csvkit==1.0.5
dbfread==2.0.7
decorator==4.4.1
entrypoints==0.3
decorator==4.4.2; python_version >= "3.7" and python_version < "4.0"
et-xmlfile==1.0.1
flake8==3.7.9
ipython-genutils==0.2.0
ipython==7.11.1
flake8==3.8.4
idna==2.10
iniconfig==1.1.1
ipython==7.19.0; python_version >= "3.7" and python_version < "4.0"
ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
isodate==0.6.0
isort==4.3.21
isort==5.7.0
jdcal==1.4.1
jedi==0.15.2
jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
langid==1.1.6
leather==0.3.3
mccabe==0.6.1
more-itertools==8.1.0
openpyxl==3.0.3
packaging==20.0
parsedatetime==2.5
parso==0.5.2
pathspec==0.7.0
pexpect==4.7.0 ; sys_platform != 'win32'
pickleshare==0.7.5
mypy-extensions==0.4.3
numpy==1.19.5
openpyxl==3.0.6
packaging==20.8
pandas==1.2.1
parsedatetime==2.6
parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
pathspec==0.8.1
pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
pluggy==0.13.1
prompt-toolkit==3.0.2
ptyprocess==0.6.0
py==1.8.1
pycodestyle==2.5.0
pyflakes==2.1.1
pygments==2.5.2
pyparsing==2.4.6
pytest-clarity==0.2.0a1
pytest==5.3.2
python-slugify==4.0.0
prompt-toolkit==3.0.14; python_version >= "3.7" and python_version < "4.0"
ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
py==1.10.0
pycodestyle==2.6.0
pycountry==19.8.18
pyflakes==2.2.0
pygments==2.7.4; python_version >= "3.7" and python_version < "4.0"
pyparsing==2.4.7
pytest==6.2.2
pytest-clarity==0.3.0a0
python-dateutil==2.8.1
python-slugify==4.0.1
python-stdnum==1.15
pytimeparse==1.1.8
pytz==2019.3
regex==2020.1.8
six==1.13.0
sqlalchemy==1.3.12
pytz==2020.5
regex==2020.11.13
requests==2.25.1
requests-cache==0.5.2
six==1.15.0
sqlalchemy==1.3.22
termcolor==1.1.0
text-unidecode==1.3
toml==0.10.0
traitlets==4.3.3
typed-ast==1.4.1
wcwidth==0.1.8
toml==0.10.2
traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
typed-ast==1.4.2
typing-extensions==3.7.4.3
urllib3==1.26.2
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0"
xlrd==1.2.0

View File

@ -1,17 +1,15 @@
-i https://pypi.org/simple
-e .
certifi==2019.11.28
chardet==3.0.4
idna==2.8
certifi==2020.12.5
chardet==4.0.0
idna==2.10
langid==1.1.6
numpy==1.18.1
pandas==1.0.0rc0
numpy==1.19.5
pandas==1.2.1
pycountry==19.8.18
python-dateutil==2.8.1
python-stdnum==1.12
pytz==2019.3
python-stdnum==1.15
pytz==2020.5
requests==2.25.1
requests-cache==0.5.2
requests==2.22.0
six==1.13.0
urllib3==1.25.7
six==1.15.0
urllib3==1.26.2
xlrd==1.2.0

View File

@ -14,7 +14,7 @@ install_requires = [
setuptools.setup(
name="csv-metadata-quality",
version="0.4.0",
version="0.4.3",
author="Alan Orth",
author_email="aorth@mjanja.ch",
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@ -23,7 +23,9 @@ setuptools.setup(
long_description_content_type="text/markdown",
url="https://github.com/alanorth/csv-metadata-quality",
classifiers=[
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 4 - Beta",

View File

@ -51,10 +51,27 @@ def test_check_invalid_separators(capsys):
value = "Alan|Orth"
check.separators(value)
field_name = "dc.contributor.author"
check.separators(value, field_name)
captured = capsys.readouterr()
assert captured.out == f"Invalid multi-value separator: {value}\n"
assert captured.out == f"Invalid multi-value separator ({field_name}): {value}\n"
def test_check_unnecessary_separators(capsys):
"""Test checking unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
check.separators(field, field_name)
captured = capsys.readouterr()
assert (
captured.out == f"Unnecessary multi-value separator ({field_name}): {field}\n"
)
def test_check_valid_separators():
@ -62,7 +79,9 @@ def test_check_valid_separators():
value = "Alan||Orth"
result = check.separators(value)
field_name = "dc.contributor.author"
result = check.separators(value, field_name)
assert result == value

View File

@ -6,7 +6,9 @@ def test_fix_leading_whitespace():
value = " Alan"
assert fix.whitespace(value) == "Alan"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_trailing_whitespace():
@ -14,7 +16,9 @@ def test_fix_trailing_whitespace():
value = "Alan "
assert fix.whitespace(value) == "Alan"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_excessive_whitespace():
@ -22,7 +26,9 @@ def test_fix_excessive_whitespace():
value = "Alan Orth"
assert fix.whitespace(value) == "Alan Orth"
field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan Orth"
def test_fix_invalid_separators():
@ -30,7 +36,19 @@ def test_fix_invalid_separators():
value = "Alan|Orth"
assert fix.separators(value) == "Alan||Orth"
field_name = "dc.contributor.author"
assert fix.separators(value, field_name) == "Alan||Orth"
def test_fix_unnecessary_separators():
"""Test fixing unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
assert fix.separators(field, field_name) == "Alan||Orth"
def test_fix_unnecessary_unicode():
@ -46,7 +64,9 @@ def test_fix_duplicates():
value = "Kenya||Kenya"
assert fix.duplicates(value) == "Kenya"
field_name = "dc.contributor.author"
assert fix.duplicates(value, field_name) == "Kenya"
def test_fix_newlines():