mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-10-25 19:01:13 +02:00

59 Commits

Author SHA1 Message Date
27b2d81ca8 CHANGELOG.md: Add note about dcterms.issued
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-28 15:14:39 +02:00
91ebd0f606 README.md: Update TODOs
A few of these date things have been addressed.
2021-02-28 15:13:36 +02:00
dd2cfae047 csv_metadata_quality/app.py: Match dcterms.issued for dates
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
2021-02-28 15:11:06 +02:00
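A minimal sketch of the column matching this commit describes, using the pattern from the `csv_metadata_quality/app.py` diff further down. The helper name `is_date_column` is hypothetical and only here for illustration; the project applies the regex inline.

```python
import re

# Pattern taken from the app.py diff: match any column whose name contains
# "date" (DSpace's default dc.date.* fields) or "dcterms.issued".
DATE_COLUMN = re.compile(r"^.*?(date|dcterms\.issued).*$")


def is_date_column(column: str) -> bool:
    """Return True if the column name should go through the date check.

    Hypothetical helper, not part of the project.
    """
    return DATE_COLUMN.match(column) is not None


if __name__ == "__main__":
    for name in ("dc.date.issued", "dcterms.issued", "dc.title"):
        print(name, is_date_column(name))
```

Because the pattern allows arbitrary text on both sides, a column with a language qualifier such as `dcterms.issued[en_US]` matches as well.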
d76e72532a Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00
9aaaa62461 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-21 13:10:52 +02:00
a7fc5a246c Colorize output
Some checks failed
continuous-integration/drone/push Build is failing
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
7fb8acb866 Add colorama for colored output
Red for errors, yellow for warnings or information, and green for
fixes.
2021-02-21 13:00:31 +02:00
9f5d2c2c4f poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-15 15:13:12 +02:00
202abf140c CHANGELOG.md: Add note about poetry
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-04 21:48:12 +02:00
0cd6d3dfe6 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-04 21:46:49 +02:00
a458beac55 poetry.lock: Run poetry update 2021-02-04 21:45:30 +02:00
e62ecb0a8f CHANGELOG.md: Add note about new date format 2021-02-04 21:43:44 +02:00
de92f32ab6 csv_metadata_quality/check.py: More date formats
We should also allow ISO 8601 extended in combined date and time
format. DSpace does not have a problem with dates in this format
and I have found some metadata that uses this date format.

For example: 2020-08-31T11:04:56Z

See: https://en.wikipedia.org/wiki/ISO_8601
2021-02-04 21:39:14 +02:00
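A rough sketch of how trying several date formats in turn could look. Only the combined date and time format string comes from the `check.py` diff below; the other three formats (YYYY, YYYY-MM, YYYY-MM-DD) and the helper name `is_valid_date` are assumptions for illustration.

```python
from datetime import datetime

# The last entry is the format added in this commit. The first three are
# assumed here and may not match what the project actually accepts.
DATE_FORMATS = (
    "%Y",
    "%Y-%m",
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%SZ",
)


def is_valid_date(value: str) -> bool:
    """Return True if value parses with any of the known formats.

    Hypothetical helper, not part of the project.
    """
    for date_format in DATE_FORMATS:
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            # Not this format, try the next one
            continue
    return False


if __name__ == "__main__":
    print(is_valid_date("2020-08-31T11:04:56Z"))
    print(is_valid_date("31/08/2020"))
```

Note that the `Z` in the format string is a literal character, so a timestamp with a numeric offset such as `2020-08-31T11:04:56+02:00` would not pass this sketch.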
dbbbc0944a README.md: Add handle to citation
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-27 10:33:37 +02:00
d17bf3033c README.md: Add citation 2021-01-27 10:32:26 +02:00
2ec52f1b73 README.md: Update description
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-26 15:43:41 +02:00
aa1abf15a7 README.md: Adjust title 2021-01-26 15:35:21 +02:00
cbf94490f2 Version 0.4.3 2021-01-26 15:22:40 +02:00
f3d0d5ef07 setup.py: Remove Python 3.6
I actually removed Python 3.6 support a few weeks ago after updating
to Pandas 1.2.0, but forgot to update this.
2021-01-26 15:22:08 +02:00
4b7b99c94c CHANGELOG.md: Add note about multi-value separators 2021-01-26 15:20:22 +02:00
df670e81b9 README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
ae357d8c6c Revert "Update requirements"
This reverts commit ca80340f7a.

Nope, we still need the --without-hashes because this still fails
on Python 3.7, but not 3.8 or 3.9. From looking around it seems
that nobody can agree whether poetry should handle this, pip should
handle it, or upstream projects should pin their dependencies.
2021-01-26 14:15:31 +02:00
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
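A small sketch of the idea, assuming the same split on "||" that the `check.py` diff below uses. Both function names are hypothetical. The `normalize_separators` helper collapses any run of pipes in one pass, which is a simplification of my own and not how the project's `fix.separators` is written.

```python
import re


def check_separators(field: str) -> list:
    """Return the kinds of separator problems found in a multi-value field.

    Hypothetical helper mirroring the check logic: a blank value after
    splitting means an unnecessary separator, a leftover "|" means an
    invalid one.
    """
    problems = []
    for value in field.split("||"):
        if value == "":
            problems.append("unnecessary")
            continue
        if "|" in value:
            problems.append("invalid")
    return problems


def normalize_separators(field: str) -> str:
    """Split on any run of pipes, drop blank values, and rejoin with "||".

    Simplified stand-in for the fix, not the project's implementation.
    """
    values = [value for value in re.split(r"\|+", field) if value != ""]
    return "||".join(values)


if __name__ == "__main__":
    print(check_separators("Kenya||Tanzania||"))
    print(normalize_separators("Kenya||Tanzania||"))
```

As with the real fix, something like this is only safe if a single `|` never appears legitimately inside a metadata value.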
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
20 changed files with 898 additions and 946 deletions


@@ -1,15 +0,0 @@
image: archlinux
packages:
  - python-poetry
sources:
  - https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
  - setup: |
      cd csv-metadata-quality
      poetry install
  - pytest: |
      cd csv-metadata-quality
      poetry run pytest
  - testcli: |
      cd csv-metadata-quality
      poetry run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country

.drone.yml Normal file

@@ -0,0 +1,49 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
  image: python:3.9-slim
  commands:
  - id
  - python -V
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
  image: python:3.8-slim
  commands:
  - id
  - python -V
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
  image: python:3.7-slim
  commands:
  - id
  - python -V
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
# vim: ts=2 sw=2 et

.github/workflows/python-app.yml vendored Normal file

@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
    - name: Test CLI
      run: |
        python setup.py install
        csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country


@@ -1,16 +0,0 @@
dist: bionic
language: python
python:
  - "3.6"
  - "3.7"
  - "3.8"
  - "3.8-dev" # 3.8 development branch
jobs:
  allow_failures:
    - python: "3.8-dev"
install:
  - "pip install -r requirements.txt"
  - "pip install -r requirements-dev.txt"
script: pytest
# vim: ts=2 sw=2 et


@@ -4,6 +4,31 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## Unreleased changes
+### Added
+- Check dates in dcterms.issued field as well, not just fields that have the
+word "date" in them
+## [0.4.4] - 2021-02-21
+### Added
+- Accept dates formatted in ISO 8601 extended with combined date and time, for
+example: 2020-08-31T11:04:56Z
+- Colorized output: red for errors, yellow for warnings and information, green
+for changes
+### Updated
+- Run `poetry update` to update project dependencies
+## [0.4.3] - 2021-01-26
+### Changed
+- Reformat with black
+- Requires Python 3.7+ for pandas 1.2.0
+### Updated
+- Run `poetry update`
+- Expand check/fix for multi-value separators to include metadata with invalid
+separators at the end, for example "Kenya||Tanzania||"
 ## [0.4.2] - 2020-07-06
 ### Changed
 - Add field name to the output for more fixes and checks to help identify where


@@ -1,7 +1,11 @@
-# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
+# DSpace CSV Metadata Quality Checker ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
-A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
+A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, unnecessary Unicode, AGROVOC terms, etc.
-Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+Requires Python 3.7 or greater (3.8 recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+If you use the DSpace CSV metadata quality checker please cite:
+*Orth, A. 2019. DSpace CSV metadata quality checker. Nairobi, Kenya: ILRI. https://hdl.handle.net/10568/110997.*
 ## Functionality
@@ -10,7 +14,7 @@ Requires Python 3.8 or greater. CSV and Excel support comes from the [Pandas](ht
 - Experimental validation of titles and abstracts against item's Dublin Core language field
 - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
 - Fix leading, trailing, and excessive (ie, more than one) whitespace
-- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
+- Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
 - Fix problematic newlines (line feeds) using `--unsafe-fixes`
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
@@ -56,6 +60,8 @@ You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currentl
 ### Invalid Multi-Value Separators
 This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
+This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
 ### Newlines
 This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
@@ -102,6 +108,7 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
 - Warn if two items use the same file in `filename` column
 - Add an option to drop invalid AGROVOC subjects?
 - Add tests for application invocation, ie `tests/test_app.py`?
+- Validate ISSNs or journal titles against CrossRef API?
 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).


@@ -4,6 +4,7 @@ import signal
 import sys
 import pandas as pd
+from colorama import Fore
 import csv_metadata_quality.check as check
 import csv_metadata_quality.experimental as experimental
@@ -77,7 +78,7 @@ def run(argv):
                 if column == exclude and skip is False:
                     skip = True
             if skip:
-                print(f"Skipping {column}")
+                print(f"{Fore.YELLOW}Skipping {Fore.RESET}{column}")
                 continue
@@ -103,13 +104,13 @@ def run(argv):
         # Fix: unnecessary Unicode
         df[column] = df[column].apply(fix.unnecessary_unicode)
-        # Check: invalid multi-value separator
+        # Check: invalid and unnecessary multi-value separators
         df[column] = df[column].apply(check.separators, field_name=column)
         # Check: suspicious characters
         df[column] = df[column].apply(check.suspicious_characters, field_name=column)
-        # Fix: invalid multi-value separator
+        # Fix: invalid and unnecessary multi-value separators
         if args.unsafe_fixes:
             df[column] = df[column].apply(fix.separators, field_name=column)
             # Run whitespace fix again after fixing invalid separators
@@ -141,7 +142,7 @@ def run(argv):
             df[column] = df[column].apply(check.isbn)
         # Check: invalid date
-        match = re.match(r"^.*?date.*$", column)
+        match = re.match(r"^.*?(date|dcterms\.issued).*$", column)
         if match is not None:
             df[column] = df[column].apply(check.date, field_name=column)


@@ -1,4 +1,10 @@
+from datetime import datetime, timedelta
 import pandas as pd
+import requests
+import requests_cache
+from colorama import Fore
+from pycountry import languages
 def issn(field):
@@ -21,7 +27,7 @@ def issn(field):
     for value in field.split("||"):
         if not issn.is_valid(value):
-            print(f"Invalid ISSN: {value}")
+            print(f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}")
     return field
@@ -46,13 +52,17 @@ def isbn(field):
     for value in field.split("||"):
         if not isbn.is_valid(value):
-            print(f"Invalid ISBN: {value}")
+            print(f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}")
     return field
 def separators(field, field_name):
-    """Check for invalid multi-value separators (ie "|" or "|||").
+    """Check for invalid and unnecessary multi-value separators, for example:
+        value|value
+        value|||value
+        value||value||
     Prints the field with the invalid multi-value separator.
     """
@@ -65,12 +75,22 @@ def separators(field, field_name):
     # Try to split multi-value field on "||" separator
     for value in field.split("||"):
+        # Check if the current value is blank
+        if value == "":
+            print(
+                f"{Fore.RED}Unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
+            )
+            continue
         # After splitting, see if there are any remaining "|" characters
         match = re.findall(r"^.*?\|.*$", value)
+        # Check if there was a match
         if match:
-            print(f"Invalid multi-value separator ({field_name}): {field}")
+            print(
+                f"{Fore.RED}Invalid multi-value separator ({field_name}): {Fore.RESET}{field}"
+            )
     return field
@@ -85,10 +105,9 @@ def date(field, field_name):
     Prints the date if invalid.
     """
-    from datetime import datetime
     if pd.isna(field):
-        print(f"Missing date ({field_name}).")
+        print(f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}")
         return
@@ -97,7 +116,9 @@ def date(field, field_name):
     # We don't allow multi-value date fields
     if len(multiple_dates) > 1:
-        print(f"Multiple dates not allowed ({field_name}): {field}")
+        print(
+            f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{field}"
+        )
         return field
@@ -123,7 +144,15 @@ def date(field, field_name):
         return field
     except ValueError:
-        print(f"Invalid date ({field_name}): {field}")
+        pass
+    try:
+        # Check if date is valid YYYY-MM-DDTHH:MM:SSZ format
+        datetime.strptime(field, "%Y-%m-%dT%H:%M:%SZ")
+        return field
+    except ValueError:
+        print(f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{field}")
         return field
@@ -156,9 +185,7 @@ def suspicious_characters(field, field_name):
             # character and spanning enough of the rest to give a preview,
             # but not too much to cause the line to break in terminals with
             # a default of 80 characters width.
-            suspicious_character_msg = (
-                f"Suspicious character ({field_name}): {field_subset}"
-            )
+            suspicious_character_msg = f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}{field_subset}"
             print(f"{suspicious_character_msg:1.80}")
     return field
@@ -170,8 +197,6 @@ def language(field):
     Prints the value if it is invalid.
     """
-    from pycountry import languages
     # Skip fields with missing values
     if pd.isna(field):
         return
@@ -185,16 +210,16 @@ def language(field):
         # can check it against ISO 639-1 or ISO 639-3 accordingly.
         if len(value) == 2:
             if not languages.get(alpha_2=value):
-                print(f"Invalid ISO 639-1 language: {value}")
+                print(f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}")
                 pass
         elif len(value) == 3:
             if not languages.get(alpha_3=value):
-                print(f"Invalid ISO 639-3 language: {value}")
+                print(f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}")
                 pass
         else:
-            print(f"Invalid language: {value}")
+            print(f"{Fore.RED}Invalid language: {Fore.RESET}{value}")
     return field
@@ -213,19 +238,13 @@ def agrovoc(field, field_name):
     Prints a warning if the value is invalid.
     """
-    from datetime import timedelta
-    import requests
-    import requests_cache
     # Skip fields with missing values
     if pd.isna(field):
         return
     # enable transparent request cache with thirty days expiry
     expire_after = timedelta(days=30)
-    requests_cache.install_cache(
-        "agrovoc-response-cache", expire_after=expire_after
-    )
+    requests_cache.install_cache("agrovoc-response-cache", expire_after=expire_after)
     # prune old cache entries
     requests_cache.core.remove_expired_responses()
@@ -242,7 +261,7 @@ def agrovoc(field, field_name):
             # check if there are any results
             if len(data["results"]) == 0:
-                print(f"Invalid AGROVOC ({field_name}): {value}")
+                print(f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}")
     return field
@@ -295,6 +314,6 @@ def filename_extension(field):
                 break
         if filename_extension_match is False:
-            print(f"Filename with uncommon extension: {value}")
+            print(f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}")
     return field


@@ -1,4 +1,5 @@
 import pandas as pd
+from colorama import Fore
 def correct_language(row):
@@ -10,10 +11,11 @@ def correct_language(row):
     language and returns the value in the language field if it does match.
     """
-    from pycountry import languages
-    import langid
     import re
+    import langid
+    from pycountry import languages
     # Initialize some variables at global scope so that we can set them in the
     # loop scope below and still be able to access them afterwards.
     language = ""
@@ -83,12 +85,12 @@ def correct_language(row):
         detected_language = languages.get(alpha_2=langid_classification[0])
         if len(language) == 2 and language != detected_language.alpha_2:
             print(
-                f"Possibly incorrect language {language} (detected {detected_language.alpha_2}): {title}"
+                f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_2}): {Fore.RESET}{title}"
             )
         elif len(language) == 3 and language != detected_language.alpha_3:
             print(
-                f"Possibly incorrect language {language} (detected {detected_language.alpha_3}): {title}"
+                f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_3}): {Fore.RESET}{title}"
             )
         else:


@@ -1,6 +1,10 @@
import re import re
from unicodedata import normalize
import pandas as pd import pandas as pd
from colorama import Fore
from csv_metadata_quality.util import is_nfc
def whitespace(field, field_name): def whitespace(field, field_name):
@@ -26,7 +30,9 @@ def whitespace(field, field_name):
         match = re.findall(pattern, value)
         if match:
-            print(f"Removing excessive whitespace ({field_name}): {value}")
+            print(
+                f"{Fore.GREEN}Removing excessive whitespace ({field_name}): {Fore.RESET}{value}"
+            )
             value = re.sub(pattern, " ", value)
         # Save cleaned value
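The hunks in this file colour only the message prefix and reset the colour before the reported value: green where something is fixed, yellow for the language warning, red for the invalid separator. A minimal sketch of that pattern follows. It uses raw ANSI escape codes as stand-ins for colorama's `Fore` constants (assumed here to expand to these codes), and `colorize` is a hypothetical helper that is not part of the repository.

```python
# Hypothetical stand-ins for colorama's Fore.* constants (assumed ANSI codes).
GREEN = "\x1b[32m"
YELLOW = "\x1b[33m"
RED = "\x1b[31m"
RESET = "\x1b[39m"  # resets the foreground colour only


def colorize(color: str, message: str, value: str) -> str:
    """Colour the message prefix and leave the reported value uncoloured."""
    return f"{color}{message}: {RESET}{value}"


line = colorize(GREEN, "Removing excessive whitespace (dc.title)", "Some  title")
print(line)
```

Resetting before the value keeps the metadata itself readable in the terminal's default colour, which seems to be the intent of placing `{Fore.RESET}` ahead of `{value}` in the diff.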
@@ -39,7 +45,14 @@ def whitespace(field, field_name):
 def separators(field, field_name):
-    """Fix for invalid multi-value separators (ie "|")."""
+    """Fix for invalid and unnecessary multi-value separators, for example:
+    value|value
+    value|||value
+    value||value||
+    Prints the field with the invalid multi-value separator.
+    """
     # Skip fields with missing values
     if pd.isna(field):
@@ -50,12 +63,22 @@ def separators(field, field_name):
     # Try to split multi-value field on "||" separator
     for value in field.split("||"):
+        # Check if the value is blank and skip it
+        if value == "":
+            print(
+                f"{Fore.GREEN}Fixing unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
+            )
+            continue
         # After splitting, see if there are any remaining "|" characters
         pattern = re.compile(r"\|")
         match = re.findall(pattern, value)
         if match:
-            print(f"Fixing invalid multi-value separator ({field_name}): {value}")
+            print(
+                f"{Fore.RED}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
+            )
             value = re.sub(pattern, "||", value)
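The separator logic in this hunk can be sketched on its own, without printing. `fix_separators` is a hypothetical name, and the final join on `"||"` is an assumption, since that step falls outside the lines shown in the hunk.

```python
import re


def fix_separators(field: str) -> str:
    """Hypothetical, print-free rendition of the separator fix shown above."""
    cleaned = []
    for value in field.split("||"):
        # A blank value comes from a leading, trailing, or doubled "||".
        if value == "":
            continue
        # Any "|" left after splitting is treated as an invalid separator.
        cleaned.append(re.sub(r"\|", "||", value))
    # Assumption: the cleaned values are joined back with "||".
    return "||".join(cleaned)


trailing = fix_separators("0378-5955||")
single = fix_separators("Alan|Orth")
```

The first call mirrors the "Unnecessary multi-value separator" test row added in this commit; the second covers the plain `value|value` case from the docstring.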
@@ -91,7 +114,7 @@ def unnecessary_unicode(field):
     match = re.findall(pattern, field)
     if match:
-        print(f"Removing unnecessary Unicode (U+200B): {field}")
+        print(f"{Fore.GREEN}Removing unnecessary Unicode (U+200B): {Fore.RESET}{field}")
         field = re.sub(pattern, "", field)
     # Check for replacement characters (U+FFFD)
@@ -99,7 +122,7 @@ def unnecessary_unicode(field):
     match = re.findall(pattern, field)
     if match:
-        print(f"Removing unnecessary Unicode (U+FFFD): {field}")
+        print(f"{Fore.GREEN}Removing unnecessary Unicode (U+FFFD): {Fore.RESET}{field}")
         field = re.sub(pattern, "", field)
     # Check for no-break spaces (U+00A0)
@@ -107,7 +130,9 @@ def unnecessary_unicode(field):
     match = re.findall(pattern, field)
     if match:
-        print(f"Replacing unnecessary Unicode (U+00A0): {field}")
+        print(
+            f"{Fore.GREEN}Replacing unnecessary Unicode (U+00A0): {Fore.RESET}{field}"
+        )
         field = re.sub(pattern, " ", field)
     # Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen
@@ -115,7 +140,9 @@ def unnecessary_unicode(field):
     match = re.findall(pattern, field)
     if match:
-        print(f"Replacing unnecessary Unicode (U+00AD): {field}")
+        print(
+            f"{Fore.GREEN}Replacing unnecessary Unicode (U+00AD): {Fore.RESET}{field}"
+        )
         field = re.sub(pattern, "-", field)
     return field
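The four replacements in `unnecessary_unicode` can be sketched as follows. The actual `pattern` definitions sit outside the lines shown in these hunks, so the regular expressions below are assumptions inferred from the comments and messages, and `strip_unnecessary_unicode` is a hypothetical name.

```python
import re


def strip_unnecessary_unicode(field: str) -> str:
    """Hypothetical sketch; every pattern here is an assumption."""
    field = re.sub("\u200b", "", field)  # zero-width space: removed
    field = re.sub("\ufffd", "", field)  # replacement character: removed
    field = re.sub("\u00a0", " ", field)  # no-break space: plain space
    # Soft hyphen, optionally preceded by a normal hyphen: one plain hyphen.
    field = re.sub("-?\u00ad", "-", field)
    return field


cleaned = strip_unnecessary_unicode("co-\u00adop\u00a0farm\u200b")
```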
@@ -140,7 +167,9 @@ def duplicates(field, field_name):
         if value not in new_values:
             new_values.append(value)
         else:
-            print(f"Removing duplicate value ({field_name}): {value}")
+            print(
+                f"{Fore.GREEN}Removing duplicate value ({field_name}): {Fore.RESET}{value}"
+            )
     # Create a new field consisting of all values joined with "||"
     new_field = "||".join(new_values)
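As a rough illustration of the de-duplication step, the sketch below keeps the first occurrence of each value and preserves order. `remove_duplicates` is a hypothetical name, and splitting the field on `"||"` is an assumption based on how the values are joined back together in the hunk.

```python
def remove_duplicates(field: str) -> str:
    """Hypothetical, print-free sketch of the duplicate removal above."""
    new_values = []
    # Assumption: the field is split on the same "||" separator used to join.
    for value in field.split("||"):
        if value not in new_values:
            new_values.append(value)
    # Create a new field consisting of all values joined with "||".
    return "||".join(new_values)


deduplicated = remove_duplicates("Kenya||Uganda||Kenya")
```

A list is used rather than a set so that the original order of the values survives.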
@@ -173,7 +202,7 @@ def newlines(field):
     match = re.findall(r"\n", field)
     if match:
-        print(f"Removing newline: {field}")
+        print(f"{Fore.GREEN}Removing newline: {Fore.RESET}{field}")
         field = field.replace("\n", "")
     return field
@@ -197,7 +226,9 @@ def comma_space(field, field_name):
     match = re.findall(r",\w", field)
     if match:
-        print(f"Adding space after comma ({field_name}): {field}")
+        print(
+            f"{Fore.GREEN}Adding space after comma ({field_name}): {Fore.RESET}{field}"
+        )
         field = re.sub(r",(\w)", r", \1", field)
     return field
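The substitution in `comma_space` is easiest to see in isolation. The sketch below reuses the exact pattern and replacement from the hunk; only the wrapper name `add_space_after_comma` is hypothetical.

```python
import re


def add_space_after_comma(field: str) -> str:
    """Insert a space after any comma directly followed by a word character."""
    return re.sub(r",(\w)", r", \1", field)


fixed = add_space_after_comma("Orth,Alan")
unchanged = add_space_after_comma("Orth, Alan")
number = add_space_after_comma("1,000")
```

Because `\w` also matches digits, a value such as `1,000` is rewritten to `1, 000` by this pattern, which may matter for fields that hold numbers.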
@@ -212,16 +243,13 @@ def normalize_unicode(field, field_name):
     Return normalized string.
     """
-    from csv_metadata_quality.util import is_nfc
-    from unicodedata import normalize
     # Skip fields with missing values
     if pd.isna(field):
         return
     # Check if the current string is using normalized Unicode (NFC)
     if not is_nfc(field):
-        print(f"Normalizing Unicode ({field_name}): {field}")
+        print(f"{Fore.GREEN}Normalizing Unicode ({field_name}): {Fore.RESET}{field}")
         field = normalize("NFC", field)
     return field
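A small, self-contained sketch of the NFC check and fix is shown below. The project's own `is_nfc` helper is not part of this diff, so the standard library's `unicodedata.is_normalized` (Python 3.8 and later) stands in for it here as an assumption; `to_nfc` is a hypothetical name.

```python
from unicodedata import is_normalized, normalize


def to_nfc(field: str) -> str:
    """Return the field in NFC form, normalizing only when needed."""
    # Assumption: is_normalized() plays the role of the project's is_nfc().
    if not is_normalized("NFC", field):
        field = normalize("NFC", field)
    return field


decomposed = "Decompose\u0301d"  # "e" followed by a combining acute accent
composed = to_nfc(decomposed)
```

After normalization the base letter and the combining accent collapse into the single code point U+00E9, so the string is one character shorter.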


@@ -1 +1 @@
-VERSION = "0.4.2"
+VERSION = "0.4.4"


@@ -28,3 +28,4 @@ Incorrect ISO 639-1 language,2019-09-26,,,es,,,
 Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
 Composéd Unicode,2020-01-14,,,,,,
 Decomposéd Unicode,2020-01-14,,,,,,
+Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,
Columns: dc.title, dc.date.issued, dc.identifier.issn, dc.identifier.isbn, dc.language.iso, dc.subject, cg.coverage.country, filename

973
poetry.lock generated

File diff suppressed because it is too large


@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "csv-metadata-quality"
-version = "0.4.2"
+version = "0.4.4"
 description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
 authors = ["Alan Orth <alan.orth@gmail.com>"]
 license="GPL-3.0-only"
@@ -16,14 +16,15 @@ requests = "^2.23.0"
 requests-cache = "^0.5.2"
 pycountry = "^19.8.18"
 langid = "^1.1.6"
+colorama = "^0.4.4"
 [tool.poetry.dev-dependencies]
-pytest = "^5.4.2"
-ipython = "^7.15.0"
-flake8 = "^3.8.2"
+pytest = "^6.1.1"
+ipython = { version = "^7.18.1", python = "^3.7" }
+flake8 = "^3.8.4"
 pytest-clarity = "^0.3.0-alpha.0"
-black = "^19.10b0"
-isort = "^4.3.21"
+black = "20.8b1"
+isort = "^5.5.4"
 csvkit = "^1.0.5"
 [build-system]


@@ -1,5 +1,5 @@
 [pytest]
-addopts= -rsxX -s -v --strict --capture=sys
+addopts= -rsxX -s -v --strict-markers --capture=sys
 filterwarnings =
     error::UserWarning
     ignore:.*U.* is deprecated:DeprecationWarning


@@ -1,300 +1,71 @@
agate==1.6.1 \ agate-dbf==0.2.2
--hash=sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c \ agate-excel==0.2.3
--hash=sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3 agate-sql==0.5.5
agate-dbf==0.2.1 \ agate==1.6.1
--hash=sha256:00c93c498ec9a04cc587bf63dd7340e67e2541f0df4c9a7259d7cb3dd4ce372f \ appdirs==1.4.4; python_version >= "3.6"
--hash=sha256:f618fadb413d41468c90d72fca945681d82d9e4d1b3d89f9bda52e607b828c0b appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
agate-excel==0.2.3 \ atomicwrites==1.4.0; python_version >= "3.6" and python_full_version < "3.0.0" and sys_platform == "win32" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") or sys_platform == "win32" and python_version >= "3.6" and python_full_version >= "3.4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
--hash=sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52 attrs==20.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
agate-sql==0.5.4 \ babel==2.9.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
--hash=sha256:9277490ba8b8e7c747a9ae3671f52fe486784b48d4a14e78ca197fb0e36f281b backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
appdirs==1.4.4 \ black==20.8b1; python_version >= "3.6"
--hash=sha256:a841dacd6b99318a741b166adb07e19ee71a274450e68237b4650ca1055ab128 \ certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
--hash=sha256:7d5d0167b2b1ba821647616af46a749d1c653740dd0d2415100fe26e27afdf41 chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
appnope==0.1.0; sys_platform == "darwin" \ click==7.1.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
--hash=sha256:5b26757dc6f79a3b7dc9fab95359328d5747fcb2409d331ea66d0272b90ab2a0 \ colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
--hash=sha256:8b995ffe925347a2138d7ac0fe77155e4311a0ea6d6da4f5128fe4b3cbe5ed71 csvkit==1.0.5
atomicwrites==1.4.0; sys_platform == "win32" \ dbfread==2.0.7
--hash=sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197 \ decorator==4.4.2; python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "4.0" or python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.2.0"
--hash=sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a et-xmlfile==1.0.1; python_version >= "3.6"
attrs==19.3.0 \ flake8==3.8.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
--hash=sha256:08a96c641c3a74e44eb59afb61a24f2cb9f4d7188748e76ba4bb5edfa3cb7d1c \ idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
--hash=sha256:f7b7ce16570fe9965acd6d30101a28f62fb4a7f9e926b3bbc9b61f8b04247e72 iniconfig==1.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
babel==2.8.0 \ ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:d670ea0b10f8b723672d3a6abeb87b565b244da220d76b4dba1b66269ec152d4 \ ipython==7.20.0; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:1aac2ae2d0d8ea368fa90906567f5c08463d98ade155c0c4bfedd6a0f7160e38 isodate==0.6.0
backcall==0.2.0 \ isort==5.7.0; python_version >= "3.6" and python_version < "4.0"
--hash=sha256:fbbce6a29f263178a1f7915c1940bde0ec2b2a967566fe1c65c1dfb7422bd255 \ jdcal==1.4.1; python_version >= "3.6"
--hash=sha256:5cbdbf27be5e7cfadb448baf0aa95508f91f2bbc6c6437cd9cd06e2a4c215e1e jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
black==19.10b0 \ langid==1.1.6
--hash=sha256:1b30e59be925fafc1ee4565e5e08abef6b03fe455102883820fe5ee2e4734e0b \ leather==0.3.3
--hash=sha256:c2edb73a08e9e0e6f65a0e6af18b059b8b1cdd5bef997d7a0b181df93dc81539 mccabe==0.6.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
certifi==2020.6.20 \ mypy-extensions==0.4.3; python_version >= "3.6"
--hash=sha256:8fc0819f1f30ba15bdb34cceffb9ef04d99f420f68eb75d901e9560b8749fc41 \ numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
--hash=sha256:5930595817496dd21bb8dc35dad090f1c2cd0adfaf21204bf6732ca5d8ee34d3 openpyxl==3.0.6; python_version >= "3.6"
chardet==3.0.4 \ packaging==20.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
--hash=sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691 \ pandas==1.2.2; python_full_version >= "3.7.1"
--hash=sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae parsedatetime==2.6
click==7.1.2 \ parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:dacca89f4bfadd5de3d7489b7c8a566eee0d3676333fbb50030263894c38c0dc \ pathspec==0.8.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
--hash=sha256:d2b5255c7c6349bc1bd1e59e08cd12acbbd63ce649f2588755783aa94dfb6b1a pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
colorama==0.4.3; sys_platform == "win32" \ pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:7d73d2a99753107a36ac6b455ee49046802e59d9d076ef8e47b61499fa29afff \ pluggy==0.13.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
--hash=sha256:e96da0d330793e2cb9485e9ddfd918d456036c7149416295932478192f4436a1 prompt-toolkit==3.0.16; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
csvkit==1.0.5 \ ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
--hash=sha256:7bd390f4d300e45dc9ed67a32af762a916bae7d9a85087a10fd4f64ce65fd5b9 py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
dbfread==2.0.7 \ pycodestyle==2.6.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
--hash=sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec \ pycountry==19.8.18
--hash=sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d pyflakes==2.2.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
decorator==4.4.2 \ pygments==2.8.0; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:41fa54c2a0cc4ba648be4fd43cff00aedf5b9465c9bf18d64325bc225f08f760 \ pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
--hash=sha256:e3a62f0520172440ca0dcc823749319382e377f37f140a0b99ef45fecb84bfe7 pytest-clarity==0.3.0a0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
et-xmlfile==1.0.1 \ pytest==6.2.2; python_version >= "3.6"
--hash=sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b python-dateutil==2.8.1; python_full_version >= "3.7.1"
flake8==3.8.3 \ python-slugify==4.0.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
--hash=sha256:15e351d19611c887e482fb960eae4d44845013cc142d42896e9862f775d8cf5c \ python-stdnum==1.16
--hash=sha256:f04b9fcbac03b0a3e58c0ab3a0ecc462e023a9faf046d57794184028123aa208 pytimeparse==1.1.8
idna==2.10 \ pytz==2021.1; python_full_version >= "3.7.1"
--hash=sha256:b97d804b1e9b523befed77c48dacec60e6dcb0b5391d57af6a65a312a90648c0 \ regex==2020.11.13; python_version >= "3.6"
--hash=sha256:b307872f855b18632ce0c21c5e45be78c0ea7ae4c15c828c20788b26921eb3f6 requests-cache==0.5.2
ipython==7.16.1 \ requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
--hash=sha256:2dbcc8c27ca7d3cfe4fcdff7f45b27f9a8d3edfa70ff8024a71c7a8eb5f09d64 \ six==1.15.0; python_full_version >= "3.7.1"
--hash=sha256:9f4fcb31d3b2c533333893b9172264e4821c1ac91839500f31bd43f2c59b3ccf sqlalchemy==1.3.23; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
ipython-genutils==0.2.0 \ termcolor==1.1.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
--hash=sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8 \ text-unidecode==1.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
--hash=sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8 toml==0.10.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
isodate==0.6.0 \ traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
--hash=sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81 \ typed-ast==1.4.2; python_version >= "3.6"
--hash=sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8 typing-extensions==3.7.4.3; python_version >= "3.6"
isort==4.3.21 \ urllib3==1.26.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
--hash=sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd \ wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
--hash=sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1 xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
jdcal==1.4.1 \
--hash=sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba \
--hash=sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8
jedi==0.17.1 \
--hash=sha256:1ddb0ec78059e8e27ec9eb5098360b4ea0a3dd840bedf21415ea820c21b40a22 \
--hash=sha256:807d5d4f96711a2bcfdd5dfa3b1ae6d09aa53832b182090b222b5efb81f52f63
langid==1.1.6 \
--hash=sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293
leather==0.3.3 \
--hash=sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2 \
--hash=sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988
mccabe==0.6.1 \
--hash=sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42 \
--hash=sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f
more-itertools==8.4.0 \
--hash=sha256:68c70cc7167bdf5c7c9d8f6954a7837089c6a36bf565383919bb595efb8a17e5 \
--hash=sha256:b78134b2063dd214000685165d81c154522c3ee0a1c0d4d113c80361c234c5a2
numpy==1.19.0 \
--hash=sha256:63d971bb211ad3ca37b2adecdd5365f40f3b741a455beecba70fd0dde8b2a4cb \
--hash=sha256:b6aaeadf1e4866ca0fdf7bb4eed25e521ae21a7947c59f78154b24fc7abbe1dd \
--hash=sha256:13af0184177469192d80db9bd02619f6fa8b922f9f327e077d6f2a6acb1ce1c0 \
--hash=sha256:356f96c9fbec59974a592452ab6a036cd6f180822a60b529a975c9467fcd5f23 \
--hash=sha256:fa1fe75b4a9e18b66ae7f0b122543c42debcf800aaafa0212aaff3ad273c2596 \
--hash=sha256:cbe326f6d364375a8e5a8ccb7e9cd73f4b2f6dc3b2ed205633a0db8243e2a96a \
--hash=sha256:a2e3a39f43f0ce95204beb8fe0831199542ccab1e0c6e486a0b4947256215632 \
--hash=sha256:7b852817800eb02e109ae4a9cef2beda8dd50d98b76b6cfb7b5c0099d27b52d4 \
--hash=sha256:d97a86937cf9970453c3b62abb55a6475f173347b4cde7f8dcdb48c8e1b9952d \
--hash=sha256:a86c962e211f37edd61d6e11bb4df7eddc4a519a38a856e20a6498c319efa6b0 \
--hash=sha256:d34fbb98ad0d6b563b95de852a284074514331e6b9da0a9fc894fb1cdae7a79e \
--hash=sha256:658624a11f6e1c252b2cd170d94bf28c8f9410acab9f2fd4369e11e1cd4e1aaf \
--hash=sha256:4d054f013a1983551254e2379385e359884e5af105e3efe00418977d02f634a7 \
--hash=sha256:26a45798ca2a4e168d00de75d4a524abf5907949231512f372b217ede3429e98 \
--hash=sha256:3c40c827d36c6d1c3cf413694d7dc843d50997ebffbc7c87d888a203ed6403a7 \
--hash=sha256:be62aeff8f2f054eff7725f502f6228298891fd648dc2630e03e44bf63e8cee0 \
--hash=sha256:dd53d7c4a69e766e4900f29db5872f5824a06827d594427cf1a4aa542818b796 \
--hash=sha256:30a59fb41bb6b8c465ab50d60a1b298d1cd7b85274e71f38af5a75d6c475d2d2 \
--hash=sha256:df1889701e2dfd8ba4dc9b1a010f0a60950077fb5242bb92c8b5c7f1a6f2668a \
--hash=sha256:33c623ef9ca5e19e05991f127c1be5aeb1ab5cdf30cb1c5cf3960752e58b599b \
--hash=sha256:26f509450db547e4dfa3ec739419b31edad646d21fb8d0ed0734188b35ff6b27 \
--hash=sha256:7b57f26e5e6ee2f14f960db46bd58ffdca25ca06dd997729b1b179fddd35f5a3 \
--hash=sha256:a8705c5073fe3fcc297fb8e0b31aa794e05af6a329e81b7ca4ffecab7f2b95ef \
--hash=sha256:c2edbb783c841e36ca0fa159f0ae97a88ce8137fb3a6cd82eae77349ba4b607b \
--hash=sha256:8cde829f14bd38f6da7b2954be0f2837043e8b8d7a9110ec5e318ae6bf706610 \
--hash=sha256:76766cc80d6128750075378d3bb7812cf146415bd29b588616f72c943c00d598
openpyxl==3.0.4 \
--hash=sha256:6e62f058d19b09b95d20ebfbfb04857ad08d0833190516c1660675f699c6186f \
--hash=sha256:d88dd1480668019684c66cfff3e52a5de4ed41e9df5dd52e008cbf27af0dbf87
packaging==20.4 \
--hash=sha256:998416ba6962ae7fbd6596850b80e17859a5753ba17c32284f67bfff33784181 \
--hash=sha256:4357f74f47b9c12db93624a82154e9b120fa8293699949152b22065d556079f8
pandas==1.0.5 \
--hash=sha256:faa42a78d1350b02a7d2f0dbe3c80791cf785663d6997891549d0f86dc49125e \
--hash=sha256:9c31d52f1a7dd2bb4681d9f62646c7aa554f19e8e9addc17e8b1b20011d7522d \
--hash=sha256:8778a5cc5a8437a561e3276b85367412e10ae9fff07db1eed986e427d9a674f8 \
--hash=sha256:9871ef5ee17f388f1cb35f76dc6106d40cb8165c562d573470672f4cdefa59ef \
--hash=sha256:35b670b0abcfed7cad76f2834041dcf7ae47fd9b22b63622d67cdc933d79f453 \
--hash=sha256:c9410ce8a3dee77653bc0684cfa1535a7f9c291663bd7ad79e39f5ab58f67ab3 \
--hash=sha256:02f1e8f71cd994ed7fcb9a35b6ddddeb4314822a0e09a9c5b2d278f8cb5d4096 \
--hash=sha256:b3c4f93fcb6e97d993bf87cdd917883b7dab7d20c627699f360a8fb49e9e0b91 \
--hash=sha256:5759edf0b686b6f25a5d4a447ea588983a33afc8a0081a0954184a4a87fd0dd7 \
--hash=sha256:ab8173a8efe5418bbe50e43f321994ac6673afc5c7c4839014cf6401bbdd0705 \
--hash=sha256:13f75fb18486759da3ff40f5345d9dd20e7d78f2a39c5884d013456cec9876f0 \
--hash=sha256:5a7cf6044467c1356b2b49ef69e50bf4d231e773c3ca0558807cdba56b76820b \
--hash=sha256:ae961f1f0e270f1e4e2273f6a539b2ea33248e0e3a11ffb479d757918a5e03a9 \
--hash=sha256:f69e0f7b7c09f1f612b1f8f59e2df72faa8a6b41c5a436dde5b615aaf948f107 \
--hash=sha256:4c73f373b0800eb3062ffd13d4a7a2a6d522792fa6eb204d67a4fad0a40f03dc \
--hash=sha256:69c5d920a0b2a9838e677f78f4dde506b95ea8e4d30da25859db6469ded84fa8
parsedatetime==2.6 \
--hash=sha256:cb96edd7016872f58479e35879294258c71437195760746faffedb692aef000b \
--hash=sha256:4cb368fbb18a0b7231f4d76119165451c8d2e35951455dfee97c62a87b04d455
parso==0.7.0 \
--hash=sha256:158c140fc04112dc45bca311633ae5033c2c2a7b732fa33d0955bad8152a8dd0 \
--hash=sha256:908e9fae2144a076d72ae4e25539143d40b8e3eafbaeae03c1bfe226f4cdf12c
pathspec==0.8.0 \
--hash=sha256:7d91249d21749788d07a2d0f94147accd8f845507400749ea19c1ec9054a12b0 \
--hash=sha256:da45173eb3a6f2a5a487efba21f050af2b41948be6ab52b6a1e3ff22bb8b7061
pexpect==4.8.0; sys_platform != "win32" \
--hash=sha256:0b48a55dcb3c05f3329815901ea4fc1537514d6ba867a152b581d69ae3710937 \
--hash=sha256:fc65a43959d153d0114afe13997d439c22823a27cefceb5ff35c2178c6784c0c
pickleshare==0.7.5 \
--hash=sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56 \
--hash=sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca
pluggy==0.13.1 \
--hash=sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d \
--hash=sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0
prompt-toolkit==3.0.5 \
--hash=sha256:df7e9e63aea609b1da3a65641ceaf5bc7d05e0a04de5bd45d05dbeffbabf9e04 \
--hash=sha256:563d1a4140b63ff9dd587bda9557cffb2fe73650205ab6f4383092fb882e7dc8
ptyprocess==0.6.0; sys_platform != "win32" \
--hash=sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f \
--hash=sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0
py==1.9.0 \
--hash=sha256:366389d1db726cd2fcfc79732e75410e5fe4d31db13692115529d34069a043c2 \
--hash=sha256:9ca6883ce56b4e8da7e79ac18787889fa5206c79dcc67fb065376cd2fe03f342
pycodestyle==2.6.0 \
--hash=sha256:2295e7b2f6b5bd100585ebcb1f616591b652db8a741695b3d8f5d28bdc934367 \
--hash=sha256:c58a7d2815e0e8d7972bf1803331fb0152f867bd89adf8a01dfd55085434192e
pycountry==19.8.18 \
--hash=sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb
pyflakes==2.2.0 \
--hash=sha256:0d94e0e05a19e57a99444b6ddcf9a6eb2e5c68d3ca1e98e90707af8152c90a92 \
--hash=sha256:35b2d75ee967ea93b55750aa9edbbf72813e06a66ba54438df2cfac9e3c27fc8
pygments==2.6.1 \
--hash=sha256:ff7a40b4860b727ab48fad6360eb351cc1b33cbf9b15a0f689ca5353e9463324 \
--hash=sha256:647344a061c249a3b74e230c739f434d7ea4d8b1d5f3721bc0f3558049b38f44
pyparsing==2.4.7 \
--hash=sha256:ef9d7589ef3c200abe66653d3f1ab1033c3c419ae9b9bdb1240a85b024efc88b \
--hash=sha256:c203ec8783bf771a155b207279b9bccb8dea02d8f0c9e5f8ead507bc3246ecc1
pytest==5.4.3 \
--hash=sha256:5c0db86b698e8f170ba4582a492248919255fcd4c79b1ee64ace34301fb589a1 \
--hash=sha256:7979331bfcba207414f5e1263b5a0f8f521d0f457318836a7355531ed1a4c7d8
pytest-clarity==0.3.0a0 \
--hash=sha256:5cc99e3d9b7969dfe17e5f6072d45a917c59d363b679686d3c958a1ded2e4dcf
python-dateutil==2.8.1 \
--hash=sha256:73ebfe9dbf22e832286dafa60473e4cd239f8592f699aa5adaf10050e6e1823c \
--hash=sha256:75bb3f31ea686f1197762692a9ee6a7550b59fc6ca3a1f4b5d7e32fb98e2da2a
python-slugify==4.0.1 \
--hash=sha256:69a517766e00c1268e5bbfc0d010a0a8508de0b18d30ad5a1ff357f8ae724270
python-stdnum==1.13 \
--hash=sha256:120f83d33fb8b8be1b282f20dd755a892d5facf84f54fa21f75bbd2633128160 \
--hash=sha256:3d5d4430579cba88211d3ba4855a16faff235352a25a01d6ab70024686a75823
pytimeparse==1.1.8 \
--hash=sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd \
--hash=sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a
pytz==2020.1 \
--hash=sha256:a494d53b6d39c3c6e44c3bec237336e14305e4f29bbf800b599253057fbb79ed \
--hash=sha256:c35965d010ce31b23eeb663ed3cc8c906275d6be1a34393a1d73a41febf4a048
regex==2020.6.8 \
--hash=sha256:fbff901c54c22425a5b809b914a3bfaf4b9570eee0e5ce8186ac71eb2025191c \
--hash=sha256:112e34adf95e45158c597feea65d06a8124898bdeac975c9087fe71b572bd938 \
--hash=sha256:92d8a043a4241a710c1cf7593f5577fbb832cf6c3a00ff3fc1ff2052aff5dd89 \
--hash=sha256:bae83f2a56ab30d5353b47f9b2a33e4aac4de9401fb582b55c42b132a8ac3868 \
--hash=sha256:b2ba0f78b3ef375114856cbdaa30559914d081c416b431f2437f83ce4f8b7f2f \
--hash=sha256:95fa7726d073c87141f7bbfb04c284901f8328e2d430eeb71b8ffdd5742a5ded \
--hash=sha256:e3cdc9423808f7e1bb9c2e0bdb1c9dc37b0607b30d646ff6faf0d4e41ee8fee3 \
--hash=sha256:c78e66a922de1c95a208e4ec02e2e5cf0bb83a36ceececc10a72841e53fbf2bd \
--hash=sha256:08997a37b221a3e27d68ffb601e45abfb0093d39ee770e4257bd2f5115e8cb0a \
--hash=sha256:2f6f211633ee8d3f7706953e9d3edc7ce63a1d6aad0be5dcee1ece127eea13ae \
--hash=sha256:55b4c25cbb3b29f8d5e63aeed27b49fa0f8476b0d4e1b3171d85db891938cc3a \
--hash=sha256:89cda1a5d3e33ec9e231ece7307afc101b5217523d55ef4dc7fb2abd6de71ba3 \
--hash=sha256:690f858d9a94d903cf5cada62ce069b5d93b313d7d05456dbcd99420856562d9 \
--hash=sha256:1700419d8a18c26ff396b3b06ace315b5f2a6e780dad387e4c48717a12a22c29 \
--hash=sha256:654cb773b2792e50151f0e22be0f2b6e1c3a04c5328ff1d9d59c0398d37ef610 \
--hash=sha256:52e1b4bef02f4040b2fd547357a170fc1146e60ab310cdbdd098db86e929b387 \
--hash=sha256:cf59bbf282b627130f5ba68b7fa3abdb96372b24b66bdf72a4920e8153fc7910 \
--hash=sha256:5aaa5928b039ae440d775acea11d01e42ff26e1561c0ffcd3d805750973c6baf \
--hash=sha256:97712e0d0af05febd8ab63d2ef0ab2d0cd9deddf4476f7aa153f76feef4b2754 \
--hash=sha256:6ad8663c17db4c5ef438141f99e291c4d4edfeaacc0ce28b5bba2b0bf273d9b5 \
--hash=sha256:e9b64e609d37438f7d6e68c2546d2cb8062f3adb27e6336bc129b51be20773ac
requests==2.24.0 \
--hash=sha256:fe75cc94a9443b9246fc7049224f75604b113c36acb93f87b80ed42c44cbb898 \
--hash=sha256:b3559a131db72c33ee969480840fff4bb6dd111de7dd27c8ee1f820f4f00231b
requests-cache==0.5.2 \
--hash=sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb \
--hash=sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a
six==1.15.0 \
--hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced \
--hash=sha256:30639c035cdb23534cd4aa2dd52c3bf48f06e5f4a941509c8bafd8ce11080259
sqlalchemy==1.3.18 \
--hash=sha256:f11c2437fb5f812d020932119ba02d9e2bc29a6eca01a055233a8b449e3e1e7d \
--hash=sha256:0ec575db1b54909750332c2e335c2bb11257883914a03bc5a3306a4488ecc772 \
--hash=sha256:f57be5673e12763dd400fea568608700a63ce1c6bd5bdbc3cc3a2c5fdb045274 \
--hash=sha256:8cac7bb373a5f1423e28de3fd5fc8063b9c8ffe8957dc1b1a59cb90453db6da1 \
--hash=sha256:adad60eea2c4c2a1875eb6305a0b6e61a83163f8e233586a4d6a55221ef984fe \
--hash=sha256:57aa843b783179ab72e863512e14bdcba186641daf69e4e3a5761d705dcc35b1 \
--hash=sha256:621f58cd921cd71ba6215c42954ffaa8a918eecd8c535d97befa1a8acad986dd \
--hash=sha256:fc728ece3d5c772c196fd338a99798e7efac7a04f9cb6416299a3638ee9a94cd \
--hash=sha256:736d41cfebedecc6f159fc4ac0769dc89528a989471dc1d378ba07d29a60ba1c \
--hash=sha256:427273b08efc16a85aa2b39892817e78e3ed074fcb89b2a51c4979bae7e7ba98 \
--hash=sha256:cbe1324ef52ff26ccde2cb84b8593c8bf930069dfc06c1e616f1bfd4e47f48a3 \
--hash=sha256:8fd452dc3d49b3cc54483e033de6c006c304432e6f84b74d7b2c68afa2569ae5 \
--hash=sha256:e89e0d9e106f8a9180a4ca92a6adde60c58b1b0299e1b43bd5e0312f535fbf33 \
--hash=sha256:6ac2558631a81b85e7fb7a44e5035347938b0a73f5fdc27a8566777d0792a6a4 \
--hash=sha256:87fad64529cde4f1914a5b9c383628e1a8f9e3930304c09cf22c2ae118a1280e \
--hash=sha256:e4624d7edb2576cd72bb83636cd71c8ce544d8e272f308bd80885056972ca299 \
--hash=sha256:89494df7f93b1836cae210c42864b292f9b31eeabca4810193761990dc689cce \
--hash=sha256:716754d0b5490bdcf68e1e4925edc02ac07209883314ad01a137642ddb2056f1 \
--hash=sha256:50c4ee32f0e1581828843267d8de35c3298e86ceecd5e9017dc45788be70a864 \
--hash=sha256:d98bc827a1293ae767c8f2f18be3bb5151fd37ddcd7da2a5f9581baeeb7a3fa1 \
--hash=sha256:0942a3a0df3f6131580eddd26d99071b48cfe5aaf3eab2783076fbc5a1c1882e \
--hash=sha256:16593fd748944726540cd20f7e83afec816c2ac96b082e26ae226e8f7e9688cf \
--hash=sha256:c26f95e7609b821b5f08a72dab929baa0d685406b953efd7c89423a511d5c413 \
--hash=sha256:512a85c3c8c3995cc91af3e90f38f460da5d3cade8dc3a229c8e0879037547c9 \
--hash=sha256:d05c4adae06bd0c7f696ae3ec8d993ed8ffcc4e11a76b1b35a5af8a099bd2284 \
--hash=sha256:109581ccc8915001e8037b73c29590e78ce74be49ca0a3630a23831f9e3ed6c7 \
--hash=sha256:8619b86cb68b185a778635be5b3e6018623c0761dde4df2f112896424aa27bd8 \
--hash=sha256:da2fb75f64792c1fc64c82313a00c728a7c301efe6a60b7a9fe35b16b4368ce7
termcolor==1.1.0 \
--hash=sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b
text-unidecode==1.3 \
--hash=sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93 \
--hash=sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8
toml==0.10.1 \
--hash=sha256:bda89d5935c2eac546d648028b9901107a595863cb36bae0c73ac804a9b4ce88 \
--hash=sha256:926b612be1e5ce0634a2ca03470f95169cf16f939018233a670519cb4ac58b0f
traitlets==4.3.3 \
--hash=sha256:70b4c6a1d9019d7b4f6846832288f86998aa3b9207c6821f3578a6a6a467fe44 \
--hash=sha256:d023ee369ddd2763310e4c3eae1ff649689440d4ae59d7485eb4cfbbe3e359f7
typed-ast==1.4.1 \
--hash=sha256:73d785a950fc82dd2a25897d525d003f6378d1cb23ab305578394694202a58c3 \
--hash=sha256:aaee9905aee35ba5905cfb3c62f3e83b3bec7b39413f0a7f19be4e547ea01ebb \
--hash=sha256:0c2c07682d61a629b68433afb159376e24e5b2fd4641d35424e462169c0a7919 \
--hash=sha256:4083861b0aa07990b619bd7ddc365eb7fa4b817e99cf5f8d9cf21a42780f6e01 \
--hash=sha256:269151951236b0f9a6f04015a9004084a5ab0d5f19b57de779f908621e7d8b75 \
--hash=sha256:24995c843eb0ad11a4527b026b4dde3da70e1f2d8806c99b7b4a7cf491612652 \
--hash=sha256:fe460b922ec15dd205595c9b5b99e2f056fd98ae8f9f56b888e7a17dc2b757e7 \
--hash=sha256:4e3e5da80ccbebfff202a67bf900d081906c358ccc3d5e3c8aea42fdfdfd51c1 \
--hash=sha256:249862707802d40f7f29f6e1aad8d84b5aa9e44552d2cc17384b209f091276aa \
--hash=sha256:8ce678dbaf790dbdb3eba24056d5364fb45944f33553dd5869b7580cdbb83614 \
--hash=sha256:c9e348e02e4d2b4a8b2eedb48210430658df6951fa484e59de33ff773fbd4b41 \
--hash=sha256:bcd3b13b56ea479b3650b82cabd6b5343a625b0ced5429e4ccad28a8973f301b \
--hash=sha256:d5d33e9e7af3b34a40dc05f498939f0ebf187f07c385fd58d591c533ad8562fe \
--hash=sha256:0666aa36131496aed8f7be0410ff974562ab7eeac11ef351def9ea6fa28f6355 \
--hash=sha256:d205b1b46085271b4e15f670058ce182bd1199e56b317bf2ec004b6a44f911f6 \
--hash=sha256:6daac9731f172c2a22ade6ed0c00197ee7cc1221aa84cfdf9c31defeb059a907 \
--hash=sha256:498b0f36cc7054c1fead3d7fc59d2150f4d5c6c56ba7fb150c013fbc683a8d2d \
--hash=sha256:715ff2f2df46121071622063fc7543d9b1fd19ebfc4f5c8895af64a77a8c852c \
--hash=sha256:fc0fea399acb12edbf8a628ba8d2312f583bdbdb3335635db062fa98cf71fca4 \
--hash=sha256:d43943ef777f9a1c42bf4e552ba23ac77a6351de620aa9acf64ad54933ad4d34 \
--hash=sha256:8c8aaad94455178e3187ab22c8b01a3837f8ee50e09cf31f1ba129eb293ec30b
urllib3==1.25.9 \
--hash=sha256:88206b0eb87e6d677d424843ac5209e3fb9d0190d0ee169599165ec25e9d9115 \
--hash=sha256:3018294ebefce6572a474f0604c2021e33b3fd8006ecd11d62107a5d2a963527
wcwidth==0.2.5 \
--hash=sha256:beb4802a9cebb9144e99086eff703a642a13d6a0052920003a230f3294bbe784 \
--hash=sha256:c4d647b99872929fdb7bdcaa4fbe7f01413ed3d98077df798530e5b04f116c83
xlrd==1.2.0 \
--hash=sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde \
--hash=sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2


@@ -1,81 +1,16 @@
-certifi==2020.6.20 \
-    --hash=sha256:8fc0819f1f30ba15bdb34cceffb9ef04d99f420f68eb75d901e9560b8749fc41 \
-    --hash=sha256:5930595817496dd21bb8dc35dad090f1c2cd0adfaf21204bf6732ca5d8ee34d3
-chardet==3.0.4 \
-    --hash=sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691 \
-    --hash=sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae
-idna==2.10 \
-    --hash=sha256:b97d804b1e9b523befed77c48dacec60e6dcb0b5391d57af6a65a312a90648c0 \
-    --hash=sha256:b307872f855b18632ce0c21c5e45be78c0ea7ae4c15c828c20788b26921eb3f6
-langid==1.1.6 \
-    --hash=sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293
-numpy==1.19.0 \
-    --hash=sha256:63d971bb211ad3ca37b2adecdd5365f40f3b741a455beecba70fd0dde8b2a4cb \
-    --hash=sha256:b6aaeadf1e4866ca0fdf7bb4eed25e521ae21a7947c59f78154b24fc7abbe1dd \
-    --hash=sha256:13af0184177469192d80db9bd02619f6fa8b922f9f327e077d6f2a6acb1ce1c0 \
-    --hash=sha256:356f96c9fbec59974a592452ab6a036cd6f180822a60b529a975c9467fcd5f23 \
-    --hash=sha256:fa1fe75b4a9e18b66ae7f0b122543c42debcf800aaafa0212aaff3ad273c2596 \
-    --hash=sha256:cbe326f6d364375a8e5a8ccb7e9cd73f4b2f6dc3b2ed205633a0db8243e2a96a \
-    --hash=sha256:a2e3a39f43f0ce95204beb8fe0831199542ccab1e0c6e486a0b4947256215632 \
-    --hash=sha256:7b852817800eb02e109ae4a9cef2beda8dd50d98b76b6cfb7b5c0099d27b52d4 \
-    --hash=sha256:d97a86937cf9970453c3b62abb55a6475f173347b4cde7f8dcdb48c8e1b9952d \
-    --hash=sha256:a86c962e211f37edd61d6e11bb4df7eddc4a519a38a856e20a6498c319efa6b0 \
-    --hash=sha256:d34fbb98ad0d6b563b95de852a284074514331e6b9da0a9fc894fb1cdae7a79e \
-    --hash=sha256:658624a11f6e1c252b2cd170d94bf28c8f9410acab9f2fd4369e11e1cd4e1aaf \
-    --hash=sha256:4d054f013a1983551254e2379385e359884e5af105e3efe00418977d02f634a7 \
-    --hash=sha256:26a45798ca2a4e168d00de75d4a524abf5907949231512f372b217ede3429e98 \
-    --hash=sha256:3c40c827d36c6d1c3cf413694d7dc843d50997ebffbc7c87d888a203ed6403a7 \
-    --hash=sha256:be62aeff8f2f054eff7725f502f6228298891fd648dc2630e03e44bf63e8cee0 \
-    --hash=sha256:dd53d7c4a69e766e4900f29db5872f5824a06827d594427cf1a4aa542818b796 \
-    --hash=sha256:30a59fb41bb6b8c465ab50d60a1b298d1cd7b85274e71f38af5a75d6c475d2d2 \
-    --hash=sha256:df1889701e2dfd8ba4dc9b1a010f0a60950077fb5242bb92c8b5c7f1a6f2668a \
-    --hash=sha256:33c623ef9ca5e19e05991f127c1be5aeb1ab5cdf30cb1c5cf3960752e58b599b \
-    --hash=sha256:26f509450db547e4dfa3ec739419b31edad646d21fb8d0ed0734188b35ff6b27 \
-    --hash=sha256:7b57f26e5e6ee2f14f960db46bd58ffdca25ca06dd997729b1b179fddd35f5a3 \
-    --hash=sha256:a8705c5073fe3fcc297fb8e0b31aa794e05af6a329e81b7ca4ffecab7f2b95ef \
-    --hash=sha256:c2edbb783c841e36ca0fa159f0ae97a88ce8137fb3a6cd82eae77349ba4b607b \
-    --hash=sha256:8cde829f14bd38f6da7b2954be0f2837043e8b8d7a9110ec5e318ae6bf706610 \
-    --hash=sha256:76766cc80d6128750075378d3bb7812cf146415bd29b588616f72c943c00d598
-pandas==1.0.5 \
-    --hash=sha256:faa42a78d1350b02a7d2f0dbe3c80791cf785663d6997891549d0f86dc49125e \
-    --hash=sha256:9c31d52f1a7dd2bb4681d9f62646c7aa554f19e8e9addc17e8b1b20011d7522d \
-    --hash=sha256:8778a5cc5a8437a561e3276b85367412e10ae9fff07db1eed986e427d9a674f8 \
-    --hash=sha256:9871ef5ee17f388f1cb35f76dc6106d40cb8165c562d573470672f4cdefa59ef \
-    --hash=sha256:35b670b0abcfed7cad76f2834041dcf7ae47fd9b22b63622d67cdc933d79f453 \
-    --hash=sha256:c9410ce8a3dee77653bc0684cfa1535a7f9c291663bd7ad79e39f5ab58f67ab3 \
-    --hash=sha256:02f1e8f71cd994ed7fcb9a35b6ddddeb4314822a0e09a9c5b2d278f8cb5d4096 \
-    --hash=sha256:b3c4f93fcb6e97d993bf87cdd917883b7dab7d20c627699f360a8fb49e9e0b91 \
-    --hash=sha256:5759edf0b686b6f25a5d4a447ea588983a33afc8a0081a0954184a4a87fd0dd7 \
-    --hash=sha256:ab8173a8efe5418bbe50e43f321994ac6673afc5c7c4839014cf6401bbdd0705 \
-    --hash=sha256:13f75fb18486759da3ff40f5345d9dd20e7d78f2a39c5884d013456cec9876f0 \
-    --hash=sha256:5a7cf6044467c1356b2b49ef69e50bf4d231e773c3ca0558807cdba56b76820b \
-    --hash=sha256:ae961f1f0e270f1e4e2273f6a539b2ea33248e0e3a11ffb479d757918a5e03a9 \
-    --hash=sha256:f69e0f7b7c09f1f612b1f8f59e2df72faa8a6b41c5a436dde5b615aaf948f107 \
-    --hash=sha256:4c73f373b0800eb3062ffd13d4a7a2a6d522792fa6eb204d67a4fad0a40f03dc \
-    --hash=sha256:69c5d920a0b2a9838e677f78f4dde506b95ea8e4d30da25859db6469ded84fa8
-pycountry==19.8.18 \
-    --hash=sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb
-python-dateutil==2.8.1 \
-    --hash=sha256:73ebfe9dbf22e832286dafa60473e4cd239f8592f699aa5adaf10050e6e1823c \
-    --hash=sha256:75bb3f31ea686f1197762692a9ee6a7550b59fc6ca3a1f4b5d7e32fb98e2da2a
-python-stdnum==1.13 \
-    --hash=sha256:120f83d33fb8b8be1b282f20dd755a892d5facf84f54fa21f75bbd2633128160 \
-    --hash=sha256:3d5d4430579cba88211d3ba4855a16faff235352a25a01d6ab70024686a75823
-pytz==2020.1 \
-    --hash=sha256:a494d53b6d39c3c6e44c3bec237336e14305e4f29bbf800b599253057fbb79ed \
-    --hash=sha256:c35965d010ce31b23eeb663ed3cc8c906275d6be1a34393a1d73a41febf4a048
-requests==2.24.0 \
-    --hash=sha256:fe75cc94a9443b9246fc7049224f75604b113c36acb93f87b80ed42c44cbb898 \
-    --hash=sha256:b3559a131db72c33ee969480840fff4bb6dd111de7dd27c8ee1f820f4f00231b
-requests-cache==0.5.2 \
-    --hash=sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb \
-    --hash=sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a
-six==1.15.0 \
-    --hash=sha256:8b74bedcbbbaca38ff6d7491d76f2b06b3592611af620f8426e82dddb04a5ced \
-    --hash=sha256:30639c035cdb23534cd4aa2dd52c3bf48f06e5f4a941509c8bafd8ce11080259
-urllib3==1.25.9 \
-    --hash=sha256:88206b0eb87e6d677d424843ac5209e3fb9d0190d0ee169599165ec25e9d9115 \
-    --hash=sha256:3018294ebefce6572a474f0604c2021e33b3fd8006ecd11d62107a5d2a963527
-xlrd==1.2.0 \
-    --hash=sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde \
-    --hash=sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2
+certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
+chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
+colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
+idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
+langid==1.1.6
+numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
+pandas==1.2.2; python_full_version >= "3.7.1"
+pycountry==19.8.18
+python-dateutil==2.8.1; python_full_version >= "3.7.1"
+python-stdnum==1.16
+pytz==2021.1; python_full_version >= "3.7.1"
+requests-cache==0.5.2
+requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
+six==1.15.0; python_full_version >= "3.7.1"
+urllib3==1.26.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
+xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")


@@ -14,7 +14,7 @@ install_requires = [
 setuptools.setup(
     name="csv-metadata-quality",
-    version="0.4.2",
+    version="0.4.3",
     author="Alan Orth",
     author_email="aorth@mjanja.ch",
     description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@@ -23,9 +23,9 @@ setuptools.setup(
     long_description_content_type="text/markdown",
     url="https://github.com/alanorth/csv-metadata-quality",
     classifiers=[
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
         "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
         "Operating System :: OS Independent",
         "Development Status :: 4 - Beta",


@@ -1,4 +1,5 @@
 import pandas as pd
+from colorama import Fore
 import csv_metadata_quality.check as check
 import csv_metadata_quality.experimental as experimental
@@ -12,7 +13,7 @@ def test_check_invalid_issn(capsys):
     check.issn(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid ISSN: {value}\n"
+    assert captured.out == f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}\n"
 def test_check_valid_issn():
@@ -33,7 +34,7 @@ def test_check_invalid_isbn(capsys):
     check.isbn(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid ISBN: {value}\n"
+    assert captured.out == f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}\n"
 def test_check_valid_isbn():
@@ -56,7 +57,26 @@ def test_check_invalid_separators(capsys):
     check.separators(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid multi-value separator ({field_name}): {value}\n"
+    assert (
+        captured.out
+        == f"{Fore.RED}Invalid multi-value separator ({field_name}): {Fore.RESET}{value}\n"
+    )
+def test_check_unnecessary_separators(capsys):
+    """Test checking unnecessary multi-value separators."""
+    field = "Alan||Orth||"
+    field_name = "dc.contributor.author"
+    check.separators(field, field_name)
+    captured = capsys.readouterr()
+    assert (
+        captured.out
+        == f"{Fore.RED}Unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}\n"
+    )
 def test_check_valid_separators():
@@ -81,7 +101,7 @@ def test_check_missing_date(capsys):
     check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Missing date ({field_name}).\n"
+    assert captured.out == f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}\n"
 def test_check_multiple_dates(capsys):
@@ -94,7 +114,10 @@ def test_check_multiple_dates(capsys):
     check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Multiple dates not allowed ({field_name}): {value}\n"
+    assert (
+        captured.out
+        == f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{value}\n"
+    )
 def test_check_invalid_date(capsys):
@@ -107,7 +130,9 @@ def test_check_invalid_date(capsys):
     check.date(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid date ({field_name}): {value}\n"
+    assert (
+        captured.out == f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{value}\n"
+    )
 def test_check_valid_date():
@@ -132,7 +157,10 @@ def test_check_suspicious_characters(capsys):
     check.suspicious_characters(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Suspicious character ({field_name}): ˆt\n"
+    assert (
+        captured.out
+        == f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}ˆt\n"
+    )
 def test_check_valid_iso639_1_language():
@@ -163,7 +191,9 @@ def test_check_invalid_iso639_1_language(capsys):
     check.language(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid ISO 639-1 language: {value}\n"
+    assert (
+        captured.out == f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}\n"
+    )
 def test_check_invalid_iso639_3_language(capsys):
@@ -174,7 +204,9 @@ def test_check_invalid_iso639_3_language(capsys):
     check.language(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid ISO 639-3 language: {value}\n"
+    assert (
+        captured.out == f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}\n"
+    )
 def test_check_invalid_language(capsys):
@@ -185,7 +217,7 @@ def test_check_invalid_language(capsys):
     check.language(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid language: {value}\n"
+    assert captured.out == f"{Fore.RED}Invalid language: {Fore.RESET}{value}\n"
 def test_check_invalid_agrovoc(capsys):
@@ -197,7 +229,10 @@ def test_check_invalid_agrovoc(capsys):
     check.agrovoc(value, field_name)
     captured = capsys.readouterr()
-    assert captured.out == f"Invalid AGROVOC ({field_name}): {value}\n"
+    assert (
+        captured.out
+        == f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}\n"
+    )
 def test_check_valid_agrovoc():
@@ -219,7 +254,10 @@ def test_check_uncommon_filename_extension(capsys):
     check.filename_extension(value)
     captured = capsys.readouterr()
-    assert captured.out == f"Filename with uncommon extension: {value}\n"
+    assert (
+        captured.out
+        == f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}\n"
+    )
 def test_check_common_filename_extension():
@@ -247,7 +285,7 @@ def test_check_incorrect_iso_639_1_language(capsys):
     captured = capsys.readouterr()
     assert (
         captured.out
-        == f"Possibly incorrect language {language} (detected en): {title}\n"
+        == f"{Fore.YELLOW}Possibly incorrect language {language} (detected en): {Fore.RESET}{title}\n"
     )
@@ -266,7 +304,7 @@ def test_check_incorrect_iso_639_3_language(capsys):
     captured = capsys.readouterr()
     assert (
         captured.out
-        == f"Possibly incorrect language {language} (detected eng): {title}\n"
+        == f"{Fore.YELLOW}Possibly incorrect language {language} (detected eng): {Fore.RESET}{title}\n"
     )
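
The expected strings in these tests suggest that each check prints a colored label followed by the uncolored value, with red used for errors and yellow for warnings. Below is a minimal sketch of that pattern for the separator check. It is an illustration only: `check_separators` is a hypothetical stand-in, not the project's `check.separators`, and the escape constants are written out by hand (they mirror colorama's `Fore.RED`, `Fore.YELLOW` and `Fore.RESET`) so the sketch runs without colorama installed.

```python
import re

# ANSI escape sequences, assumed equivalent to colorama's
# Fore.RED, Fore.YELLOW and Fore.RESET.
RED = "\x1b[31m"
YELLOW = "\x1b[33m"
RESET = "\x1b[39m"


def check_separators(field, field_name):
    """Hypothetical stand-in for check.separators (not the project's code).

    Prints a red message when a multi-value field contains an empty value
    (for example a trailing "||") or a stray single "|", then returns the
    field unchanged.
    """
    for value in field.split("||"):
        if value == "":
            # An empty value means a separator with nothing after or before it.
            print(
                f"{RED}Unnecessary multi-value separator ({field_name}): {RESET}{field}"
            )
            return field
        if re.search(r"\|", value):
            # A pipe left over after splitting on "||" is a malformed separator.
            print(
                f"{RED}Invalid multi-value separator ({field_name}): {RESET}{field}"
            )
            return field
    return field


if __name__ == "__main__":
    check_separators("Alan||Orth||", "dc.contributor.author")
    check_separators("Alan|Orth", "dc.contributor.author")
    check_separators("Alan||Orth", "dc.contributor.author")  # prints nothing
```

Resetting the color before the value, rather than at the end of the line, keeps the offending metadata in the terminal's default color, which matches the position of `Fore.RESET` in most of the assertions above.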


@@ -41,6 +41,16 @@ def test_fix_invalid_separators():
     assert fix.separators(value, field_name) == "Alan||Orth"
+def test_fix_unnecessary_separators():
+    """Test fixing unnecessary multi-value separators."""
+    field = "Alan||Orth||"
+    field_name = "dc.contributor.author"
+    assert fix.separators(field, field_name) == "Alan||Orth"
 def test_fix_unnecessary_unicode():
     """Test fixing unnecessary Unicode."""