mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-10-24 02:11:14 +02:00

175 Commits

Author SHA1 Message Date
8eddb76aab Bump version to 0.4.8-dev
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 11:53:56 +02:00
a04dbc50db Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
28335ed159 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-19 10:29:15 +02:00
773a0a2695 poetry.lock: Run poetry update 2021-03-19 10:28:55 +02:00
39a4b1a487 Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
898bb412c3 Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
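The encode/decode round trip described in this message can be reproduced with the standard library alone. This is a minimal sketch, not the code from this commit: the name `fix_mojibake` is hypothetical, and the sketch skips the ftfy "weirdness" check entirely. It only tests whether a string encodes as CP-1252 and then decodes cleanly as UTF-8.

```python
def fix_mojibake(value: str) -> str:
    """Undo one round of UTF-8 text that was mis-decoded as CP-1252.

    Hypothetical helper for illustration only. The input is returned
    unchanged when the round trip is not possible.
    """
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# Simulate the Excel scenario: UTF-8 bytes read back as CP-1252.
garbled = "CIAT Publicación".encode("utf-8").decode("cp1252")
repaired = fix_mojibake(garbled)
```

A string that is already correct fails the UTF-8 decode step and comes back untouched, but a blind round trip like this can still alter text that merely looks like mojibake, which is presumably why the fix is treated as unsafe.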
2021-03-19 10:22:21 +02:00
e92ec5d371 README.md: Add note about duplicate checking
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:12:03 +02:00
f816e17fe7 Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:00:34 +02:00
9061c7c79b setup.py: Remove beta tag
I think this is only used by pypi.org?
2021-03-17 10:00:09 +02:00
661d05b977 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-17 09:58:35 +02:00
652b7ea98c CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
65da6e9b05 poetry.lock: Run pipenv update 2021-03-17 09:57:31 +02:00
a313b7527a CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
51ee370697 data/test.csv: Add duplicate item 2021-03-17 09:54:14 +02:00
e8422bfa74 tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
9f2dc0a0f5 Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
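The "key" idea can be sketched in a few lines. This is an illustration under assumptions rather than the code from this commit: the function name `find_duplicates` and the column names `dc.title`, `dcterms.type`, and `dcterms.issued` are placeholders for whichever title, type, and date issued fields the data set uses.

```python
def find_duplicates(rows):
    """Return rows whose (title, type, date issued) key was seen before.

    Column names are assumptions chosen for illustration.
    """
    seen = set()
    duplicates = []
    for row in rows:
        key = (row["dc.title"], row["dcterms.type"], row["dcterms.issued"])
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"dc.title": "Report A", "dcterms.type": "Report", "dcterms.issued": "2021-03-17"},
    {"dc.title": "Report B", "dcterms.type": "Report", "dcterms.issued": "2021-03-17"},
    {"dc.title": "Report A", "dcterms.type": "Report", "dcterms.issued": "2021-03-17"},
]
duplicates = find_duplicates(rows)
```

Only the second occurrence of a key is reported, so the first copy of each item is treated as the original.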
2021-03-17 09:53:07 +02:00
14010896a5 csv_metadata_quality/experimental.py: Move all imports to top of file
All checks were successful
continuous-integration/drone/push Build is passing
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
ab3af2ec62 csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00
1aa2084230 CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
330a7b7b9c Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actually
need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a
value returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
9a5e3fd6ef README.md: Add TODO about detecting duplicates 2021-03-16 14:03:26 +02:00
ed084da08c CHANGELOG.md: Add note about multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 21:04:19 +02:00
10612cf891 Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't
actually need to check for them; I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
3656e9f976 Update CI workflows to use DCTERMS instead of DC
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 15:52:51 +02:00
c9c277f8df csv_metadata_quality/app.py: Update help text
All checks were successful
continuous-integration/drone/push Build is passing
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
fb35afd937 CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
0e9176f0a6 csv_metadata_quality/check.py: requests cache
Allow overriding the directory for the requests cache. In the case
of csv-metadata-quality-web, which currently runs on Google's App
Engine, we can only write to /tmp.
2021-03-14 09:07:35 +02:00
1008acf35e Always fix invalid multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
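As a rough sketch of what such a fix might look like, assuming `||` is the valid separator and that a lone `|` or a run of three or more counts as invalid (the function name `fix_invalid_separators` is hypothetical and this is not the code from this commit):

```python
import re


def fix_invalid_separators(value: str) -> str:
    """Rewrite invalid multi-value separators as "||".

    Hypothetical helper: treats a single "|" and runs of three or more
    "|" as invalid, and leaves a correct "||" alone.
    """
    # Collapse runs of three or more pipes down to two.
    value = re.sub(r"\|{3,}", "||", value)
    # Double up any pipe that has no pipe on either side.
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


fixed_single = fix_invalid_separators("Kenya|Tanzania")
fixed_triple = fix_invalid_separators("Kenya|||Tanzania")
```

A value that already uses `||` passes through unchanged, so the fix is safe to apply to every field unconditionally.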
2021-03-13 12:59:45 +02:00
f00a07e2cd README.md: Reorganize unsafe functionality
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-13 11:56:52 +02:00
46098861ed poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 22:45:32 +02:00
fa84cfa440 Bump version to 0.4.6-dev 2021-03-11 22:44:36 +02:00
6cc1401f88 pyproject.toml: Minimum Python is technically 3.7.1
All checks were successful
continuous-integration/drone/push Build is passing
See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html
2021-03-11 13:41:58 +02:00
ad2cda8a41 README.md: Add note about SPDX license identifiers
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:21:34 +02:00
dc6920802e .github/workflows/python-app.yml: Use Python 3.9
I now use this version in my development environment. Eventually I
should add a matrix of versions to use, but I don't know the GitHub
Actions syntax well enough yet.
2021-03-11 12:17:57 +02:00
6ca449d8ed README.md: Update note about Python 3.8 to 3.8+
Currently the lower bound on Python version support is 3.7 because
of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.
2021-03-11 12:16:07 +02:00
1554cfd5c9 Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
b19d81abdd .drone.yml: We need some stuff to build pyicu now
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:07:28 +02:00
a0ea829f5c csv_metadata_quality/fix.py: Fixes should be green 2021-03-11 11:47:24 +02:00
0089efa914 tests/test_check.py: Use dcterms.subject instead of dc.subject
Trying to move some old DC fields to DCTERMS.
2021-03-11 11:45:25 +02:00
3dbe656f9f Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-11 11:11:19 +02:00
7ad821dcad CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
cd876c4fb3 poetry.lock: Run poetry update 2021-03-11 11:10:02 +02:00
d88ea56488 csv_metadata_quality/check.py: Move all imports to top of file
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-11 10:52:20 +02:00
e0e3ca6c58 CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
abae8ca4fb data/test.csv: Move some DC fields to DCTERMS
The original Dublin Core elements set was superseded by DCTERMS in
2008 and we have started using them in our DSpace repository so I
think it's good to update them in our test data. Old DC fields are
still checked and fixed in this tool, though.

It's worth noting that currently supported DSpace versions (4, 5,
and 6) all have hard-coded a few fields like dc.title internally so
we can't migrate those to their DCTERMS counterparts just yet.
2021-03-11 10:49:05 +02:00
d7d4d4efca CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
5318953150 tests/test_check.py: Add tests for licenses 2021-03-11 10:36:26 +02:00
3b17914002 data/test.csv: Add invalid SPDX license
Now we are checking dcterms.license against the list of SPDX license
identifiers using https://pypi.org/project/spdx-license-list/.
2021-03-11 10:34:58 +02:00
6e4b0e5c1b Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
b16fa9121f pyproject.toml: Add csv-metadata-quality as a script
All checks were successful
continuous-integration/drone/push Build is passing
For some reason I stopped having csv-metadata-quality available in
my poetry environment after install. It seems I need to add it as a
poetry tool script? I had already done this in setup.py years ago,
which works for regular python setup.py installs, but hadn't needed
to do it in poetry for a year or more that I've been using it, until
now.
2021-03-08 09:50:05 +02:00
202bda862a Bump version to 0.4.5
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-04 21:38:10 +02:00
7479310ac0 setup.py: Bump version to 0.4.4
I forgot to increase this when I actually released version 0.4.4 so
I will do it in a separate commit now before I bump the version to
0.4.5.
2021-03-04 21:35:08 +02:00
98a91bc9c2 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-04 21:33:33 +02:00
fc5bedcc5c CHANGELOG.md: Add poetry update 2021-03-04 21:32:46 +02:00
44d12d771a poetry.lock: Run poetry update 2021-03-04 21:32:21 +02:00
4a7000e975 README.md: Add more ideas to do 2021-03-04 21:26:53 +02:00
27b2d81ca8 CHANGELOG.md: Add note about dcterms.issued
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-28 15:14:39 +02:00
91ebd0f606 README.md: Update TODOs
A few of these date things have been addressed.
2021-02-28 15:13:36 +02:00
dd2cfae047 csv_metadata_quality/app.py: Match dcterms.issued for dates
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
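The matching rule can be sketched as a small predicate. This is an assumption about the shape of the logic, not the code from this commit, and the name `is_date_field` is hypothetical:

```python
def is_date_field(column: str) -> bool:
    """Decide whether a CSV column should receive date checks.

    Hypothetical helper: matches any column with "date" in its name,
    plus dcterms.issued (with or without a language suffix).
    """
    return "date" in column or column.startswith("dcterms.issued")


columns = ["dc.title", "dc.date.issued", "dcterms.issued", "dcterms.subject"]
date_columns = [c for c in columns if is_date_field(c)]
```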
2021-02-28 15:11:06 +02:00
d76e72532a Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00
9aaaa62461 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-21 13:10:52 +02:00
a7fc5a246c Colorize output
Some checks failed
continuous-integration/drone/push Build is failing
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
7fb8acb866 Add colorama for colored output
Red for errors, yellow for warnings or information, and green for
fixes.
2021-02-21 13:00:31 +02:00
9f5d2c2c4f poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-15 15:13:12 +02:00
202abf140c CHANGELOG.md: Add note about poetry
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-04 21:48:12 +02:00
0cd6d3dfe6 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-04 21:46:49 +02:00
a458beac55 poetry.lock: Run poetry update 2021-02-04 21:45:30 +02:00
e62ecb0a8f CHANGELOG.md: Add note about new date format 2021-02-04 21:43:44 +02:00
de92f32ab6 csv_metadata_quality/check.py: More date formats
We should also allow ISO 8601 extended in combined date and time
format. DSpace does not have a problem with dates in this format
and I have found some metadata that uses this date format.

For example: 2020-08-31T11:04:56Z

See: https://en.wikipedia.org/wiki/ISO_8601
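One way to accept several date formats is to try each in turn with `datetime.strptime`. The sketch below is illustrative only: the function name `is_valid_date` is hypothetical, and the first three formats in the tuple are assumptions about what the tool already accepted. Only the combined date and time format comes from this message.

```python
from datetime import datetime

# Assumed set of accepted formats; the last one is the ISO 8601
# extended combined date and time format, e.g. 2020-08-31T11:04:56Z.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ")


def is_valid_date(value: str) -> bool:
    """Return True if value parses with any of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False
```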
2021-02-04 21:39:14 +02:00
dbbbc0944a README.md: Add handle to citation
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-27 10:33:37 +02:00
d17bf3033c README.md: Add citation 2021-01-27 10:32:26 +02:00
2ec52f1b73 README.md: Update description
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-26 15:43:41 +02:00
aa1abf15a7 README.md: Adjust title 2021-01-26 15:35:21 +02:00
cbf94490f2 Version 0.4.3 2021-01-26 15:22:40 +02:00
f3d0d5ef07 setup.py: Remove Python 3.6
I actually removed Python 3.6 support a few weeks ago after updating
to Pandas 1.2.0, but forgot to update this.
2021-01-26 15:22:08 +02:00
4b7b99c94c CHANGELOG.md: Add note about multi-value separators 2021-01-26 15:20:22 +02:00
df670e81b9 README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
ae357d8c6c Revert "Update requirements"
This reverts commit ca80340f7a.

Nope, we still need the --without-hashes because this still fails
on Python 3.7, but not 3.8 or 3.9. From looking around it seems
that nobody can agree whether poetry should handle this, pip should
handle it, or upstream projects should pin their dependencies.
2021-01-26 14:15:31 +02:00
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
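A minimal sketch of the fix for this case, assuming `||` is the separator (the function name `drop_blank_values` is hypothetical and this is not the code from this commit): split on the separator, discard blank values, and join what remains.

```python
def drop_blank_values(value: str, separator: str = "||") -> str:
    """Remove blank values left behind by unnecessary separators.

    Hypothetical helper for illustration only.
    """
    parts = [part for part in value.split(separator) if part.strip()]
    return separator.join(parts)


cleaned = drop_blank_values("Kenya||Tanzania||")
```

Filtering the split parts also handles a leading separator or a doubled one in the middle of the field, not only the trailing case.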
2021-01-03 15:30:03 +02:00
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
cb07d357d4 Version 0.4.2 2020-07-06 14:04:34 +03:00
65cd48a26f CHANGELOG.md: Update changes 2020-07-06 14:00:21 +03:00
0f883f640c Remove pipenv 2020-07-06 13:59:49 +03:00
f4c5c5781e README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
6aa784ad8c Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-07-06 13:57:07 +03:00
7b8da94f41 poetry.lock: Update Python dependencies 2020-07-06 13:56:31 +03:00
2a1566af62 csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
5fcaa63bd5 csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
aa9e23b46c pyproject.toml: Update license specifier
We need to use valid SPDX license identifiers.
2020-06-09 14:22:53 +03:00
73acb1661f Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-05-31 17:51:16 +03:00
2a068fddc4 .build.yml: Fix test 2020-05-31 17:44:37 +03:00
c6c2f13e88 .build.yml: Fix poetry install invocation
Poetry apparently installs dev dependencies by default.
2020-05-31 17:37:09 +03:00
56f16e37ed .build.yml: Use poetry in SourceHut CI 2020-05-31 17:35:04 +03:00
0c44b967b6 Add poetry project file and lock
I want to try to use poetry instead of pipenv because pipenv takes
forever to do dependency resolution sometimes. Also, I have had a
few issues with Python modules like black that don't have releases
other than pre-releases, and even including the project itself in
the dependencies (pip install -e . ...?). My initial experience is
that poetry handles this better.
2020-05-31 17:33:40 +03:00
8a267bb40b .travis.yml: Try to build with Python 3.8-dev
But allow failures.
2020-03-29 16:40:11 +03:00
8fda8f1ef1 Pipfile.lock: Run pipenv update
All tests still passing.
2020-03-20 16:22:04 +02:00
5e471813e8 CHANGELOG.md: Add note about python dependencies 2020-01-29 12:41:43 +02:00
79244b9ac3 Pipfile.lock: Run pipenv update 2020-01-29 12:39:12 +02:00
5e81a33482 CHANGELOG.md: Add note about field names 2020-01-16 12:37:11 +02:00
28b5996aa6 Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
40ba9bae6c README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
0b2d211455 Version 0.4.1 2020-01-15 12:19:42 +02:00
7f1df0b47c Support Python 3.6 and 3.7 again 2020-01-15 12:19:17 +02:00
365ecda324 Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
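A fallback for older Python versions can be as simple as normalizing the string and comparing it with the original. This is a sketch under that assumption, not necessarily the utility added in this commit, and the name `is_nfc` is hypothetical:

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if value is already in NFC form.

    Hypothetical helper: normalizes and compares, which also works on
    Python versions without unicodedata.is_normalized().
    """
    return value == unicodedata.normalize("NFC", value)


composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent
```

Unlike `unicodedata.is_normalized()`, this approach always pays the cost of normalizing the string, even when the string is already in NFC form.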
2020-01-15 12:17:52 +02:00
550ce7fb7e .travis.yml: Only test Python 3.8
The Unicode normalization feature requires Python 3.8 because the
unicodedata.is_normalized() function only appears there. If I find
another way to check if a string is normalized without normalizing
it first I will drop the requirements back down to Python 3.6.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:57:21 +02:00
705127fd28 Version 0.4.0 2020-01-15 11:44:56 +02:00
894e0a196d setup.py: Change Python requirements
The `unicodedata.is_normalized()` function requires Python 3.8.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:43:25 +02:00
87181bc7b8 Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
8de5d862b6 CHANGELOG.md: Add note about Unicode normalization 2020-01-15 11:40:40 +02:00
49e3543878 Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
403b253762 CHANGELOG.md: Update python library versions 2020-01-15 10:58:44 +02:00
c5fbaf407a Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2020-01-15 10:51:58 +02:00
4f81f6c83c Pipfile.lock: Run pipenv update 2020-01-15 10:51:19 +02:00
4b9d1e060f setup.py: Add Python 3.8 classifier 2019-12-14 12:56:11 +02:00
c8a71e3143 Pipfile.lock: Run pipenv update 2019-12-14 12:53:39 +02:00
7964d98ca5 Pipfile: Specify exact version of black
Black only releases pre-release versions, which causes issues with
pipenv. Instead of always running pipenv with "--pre" and potentially
letting in some other pre-release versions for other dependencies,
I would rather specify the latest black version explicitly.

See: https://github.com/psf/black/issues/517
See: https://github.com/microsoft/vscode-python/issues/5171
2019-12-14 12:41:28 +02:00
64ffc2f1da .travis.yml: Install packages from requirements.txt too 2019-11-14 23:42:28 +02:00
7b1bc29a92 .travis.yml: Try using pip instead of pipenv
The Pipfile knows it was created with Python 3.8, yet we're running
with multiple Python versions on Travis. I'm curious if it would work
better to use pip to install dependencies instead of pipenv in this
case.
2019-11-14 23:37:25 +02:00
f0110d8e74 CHANGELOG.md: Add note about requirements 2019-11-14 23:30:26 +02:00
86498deee8 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-11-14 23:28:42 +02:00
251647a15f CHANGELOG.md: Add TravisCI changes 2019-11-14 23:24:08 +02:00
0bd28e22ec .travis.yml: Test Python 3.8 2019-11-14 23:22:37 +02:00
63fdce7d13 .travis.yml: Use Ubuntu 18.04 "Bionic" 2019-11-14 23:22:19 +02:00
f068c0e16a CHANGELOG.md: Use Python 3.8.0 for pipenv 2019-11-14 23:11:43 +02:00
79b8f62a85 Use Python 3.8 for pipenv
Python 3.8.0 entered Arch Linux core repositories now and all tests
pass with Python 3.8.0 so it's time...
2019-11-14 23:10:20 +02:00
6c1e132531 CHANGELOG.md: Add unreleased changes 2019-11-14 09:19:19 +02:00
c0f3c866bd Pipfile.lock: Run pipenv update
Updates the following dependencies:

- numpy 1.17.2→1.17.4
- pandas 0.25.1→0.25.3
- flake8 3.7.8→3.7.9
- pytest 5.1.3→5.2.2
- black 19.3b0→19.10b0
2019-11-14 09:17:31 +02:00
36d0474b95 CHANGELOG.md: Move unreleased changes to v0.3.1 2019-10-01 17:11:52 +03:00
efdc3a841a Version 0.3.1 2019-10-01 17:11:13 +03:00
fd2ba6845d CHANGELOG.md: Update unreleased notes 2019-10-01 17:10:23 +03:00
e55380b4d5 csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
85ae16d9b7 CHANGELOG.md: Add note about non-breaking spaces 2019-10-01 16:56:37 +03:00
c42f8b4812 csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal
spaces instead of removing them.
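The difference between the two behaviours is easy to see in a short sketch (the function name `replace_nbsp` is hypothetical and this is not the code from this commit):

```python
NBSP = "\u00a0"  # non-breaking space


def replace_nbsp(value: str) -> str:
    """Replace non-breaking spaces with normal spaces."""
    return value.replace(NBSP, " ")


value = "Alan\u00a0Orth"
replaced = replace_nbsp(value)    # words stay separated
removed = value.replace(NBSP, "")  # old behaviour: words run together
```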
2019-10-01 16:55:04 +03:00
1c75608d54 README.md: Update introduction text
We should mention that this is not DSpace specific. Rather, it is
much more realistically Dublin Core specific.
2019-09-26 14:19:13 +03:00
0b15a8ed3b README.md: Remove TODO about lack of space after comma
This was added as an automatic global fix a few weeks ago.
2019-09-26 14:16:33 +03:00
9ca266f5f0 data/test.csv: Change birthdate column to dc.date.issued
More accurately reflects actual data we will be validating.
2019-09-26 14:15:48 +03:00
0d3f948708 CHANGELOG.md: Update comment about language validation 2019-09-26 14:14:57 +03:00
c04207fcfc CHANGELOG.md: Fix header formatting 2019-09-26 14:13:50 +03:00
9d4eceddc7 .build.yml: Enable experimental CLI checks on SourceHut 2019-09-26 14:11:35 +03:00
23 changed files with 2236 additions and 909 deletions


@@ -1,19 +0,0 @@
image: archlinux
packages:
  - python-pipenv
sources:
  - https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
  - setup: |
      cd csv-metadata-quality
      pipenv install --dev
  - pytest: |
      cd csv-metadata-quality
      pipenv run pytest
  - testcli: |
      cd csv-metadata-quality
      pipenv run pip install .
      pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -u --agrovoc-fields dc.subject,cg.coverage.country
environment:
  PIPENV_NOSPIN: 'True'
  PIPENV_HIDE_EMOJIS: 'True'

.drone.yml Normal file

@@ -0,0 +1,52 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
  image: python:3.9-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
  image: python:3.8-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
  image: python:3.7-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
# vim: ts=2 sw=2 et

.github/workflows/python-app.yml vendored Normal file

@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
    - name: Test CLI
      run: |
        python setup.py install
        csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country


@@ -1,11 +0,0 @@
dist: xenial
language: python
python:
  - "3.6"
  - "3.7"
install:
  - "pip install pipenv --upgrade-strategy=only-if-needed"
  - "pipenv install --dev"
script: pytest
# vim: ts=2 sw=2 et


@@ -4,14 +4,114 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Added
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
## [0.4.7] - 2021-03-17
### Changed
- Fixing invalid multi-value separators like `|` and `|||` is no longer
classified as "unsafe" as I have yet to see a case where this was intentional
- Not user visible, but now checks only print a warning to the screen instead
of returning a value and re-writing the DataFrame, which should be faster and
use less memory
### Added
- Configurable directory for AGROVOC requests cache (to allow running the web
version from Google App Engine where we can only write to /tmp)
- Ability to check for duplicate items in the data set (uses a combination of
the title, type, and date issued to determine uniqueness)
### Removed
- Checks for invalid and unnecessary multi-value separators because now I fix
them whenever I see them, so there is no need to have checks for them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.6] - 2021-03-11
### Added
- Validation of dcterms.license field against SPDX license identifiers
### Changed
- Use DCTERMS fields where possible in `data/test.csv`
### Updated
- Run `poetry update` to update project dependencies
### Fixed
- Output for all fixes should be green, because it is good
## [0.4.5] - 2021-03-04
### Added
- Check dates in dcterms.issued field as well, not just fields that have the
word "date" in them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.4] - 2021-02-21
### Added
- Accept dates formatted in ISO 8601 extended with combined date and time, for
example: 2020-08-31T11:04:56Z
- Colorized output: red for errors, yellow for warnings and information, green
for changes
### Updated
- Run `poetry update` to update project dependencies
## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
that only works in Python >= 3.8
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)
### Updated
- Update python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt
### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds
## [0.3.1] - 2019-10-01
### Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them
- Harmonize language of script output when fixing various issues
## [0.3.0] - 2019-09-26 ## [0.3.0] - 2019-09-26
### Updated ### Updated
- Update python dependencies to latest versions, including numpy 1.17.2, pandas - Update python dependencies to latest versions, including numpy 1.17.2, pandas
0.25.1, pytest 5.1.3, and requests-cache 0.5.2 0.25.1, pytest 5.1.3, and requests-cache 0.5.2
## Added ### Added
- csvkit to dev requirements (csvcut etc are useful during development) - csvkit to dev requirements (csvcut etc are useful during development)
- Experimental language validation using `-e` (see README.md) - Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)
### Changed ### Changed
- Re-formatted code with black and isort - Re-formatted code with black and isort

Pipfile

@@ -1,29 +0,0 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
pytest = "*"
ipython = "*"
flake8 = "*"
pytest-clarity = "*"
black = "*"
isort = "*"
csvkit = "*"
[packages]
pandas = "*"
python-stdnum = "*"
xlrd = "*"
requests = "*"
requests-cache = "*"
pycountry = "*"
csv-metadata-quality = {editable = true,path = "."}
langid = "*"
[requires]
python_version = "3.7"
[pipenv]
allow_prereleases = true

Pipfile.lock generated

@@ -1,555 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "59562d8c59eb09e23b49475d6901687edbf605f5b84e283e90cc8e2de518641f"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"certifi": {
"hashes": [
"sha256:e4f3620cfea4f83eedc95b24abd9cd56f3c4b146dd0177e83a21b4eb49e21e50",
"sha256:fd7c7c74727ddcf00e9acd26bba8da604ffec95bf1c2144e67aff7a8b50e6cef"
],
"version": "==2019.9.11"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"csv-metadata-quality": {
"editable": true,
"path": "."
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"langid": {
"hashes": [
"sha256:044bcae1912dab85c33d8e98f2811b8f4ff1213e5e9a9e9510137b84da2cb293"
],
"index": "pypi",
"version": "==1.1.6"
},
"numpy": {
"hashes": [
"sha256:05dbfe72684cc14b92568de1bc1f41e5f62b00f714afc9adee42f6311738091f",
"sha256:0d82cb7271a577529d07bbb05cb58675f2deb09772175fab96dc8de025d8ac05",
"sha256:10132aa1fef99adc85a905d82e8497a580f83739837d7cbd234649f2e9b9dc58",
"sha256:12322df2e21f033a60c80319c25011194cd2a21294cc66fee0908aeae2c27832",
"sha256:16f19b3aa775dddc9814e02a46b8e6ae6a54ed8cf143962b4e53f0471dbd7b16",
"sha256:3d0b0989dd2d066db006158de7220802899a1e5c8cf622abe2d0bd158fd01c2c",
"sha256:438a3f0e7b681642898fd7993d38e2bf140a2d1eafaf3e89bb626db7f50db355",
"sha256:5fd214f482ab53f2cea57414c5fb3e58895b17df6e6f5bca5be6a0bb6aea23bb",
"sha256:73615d3edc84dd7c4aeb212fa3748fb83217e00d201875a47327f55363cef2df",
"sha256:7bd355ad7496f4ce1d235e9814ec81ee3d28308d591c067ce92e49f745ba2c2f",
"sha256:7d077f2976b8f3de08a0dcf5d72083f4af5411e8fddacd662aae27baa2601196",
"sha256:a4092682778dc48093e8bda8d26ee8360153e2047826f95a3f5eae09f0ae3abf",
"sha256:b458de8624c9f6034af492372eb2fee41a8e605f03f4732f43fc099e227858b2",
"sha256:e70fc8ff03a961f13363c2c95ef8285e0cf6a720f8271836f852cc0fa64e97c8",
"sha256:ee8e9d7cad5fe6dde50ede0d2e978d81eafeaa6233fb0b8719f60214cf226578",
"sha256:f4a4f6aba148858a5a5d546a99280f71f5ee6ec8182a7d195af1a914195b21a2"
],
"version": "==1.17.2"
},
"pandas": {
"hashes": [
"sha256:18d91a9199d1dfaa01ad645f7540370ba630bdcef09daaf9edf45b4b1bca0232",
"sha256:3f26e5da310a0c0b83ea50da1fd397de2640b02b424aa69be7e0784228f656c9",
"sha256:4182e32f4456d2c64619e97c58571fa5ca0993d1e8c2d9ca44916185e1726e15",
"sha256:426e590e2eb0e60f765271d668a30cf38b582eaae5ec9b31229c8c3c10c5bc21",
"sha256:5eb934a8f0dc358f0e0cdf314072286bbac74e4c124b64371395e94644d5d919",
"sha256:717928808043d3ea55b9bcde636d4a52d2236c246f6df464163a66ff59980ad8",
"sha256:8145f97c5ed71827a6ec98ceaef35afed1377e2d19c4078f324d209ff253ecb5",
"sha256:8744c84c914dcc59cbbb2943b32b7664df1039d99e834e1034a3372acb89ea4d",
"sha256:c1ac1d9590d0c9314ebf01591bd40d4c03d710bfc84a3889e5263c97d7891dee",
"sha256:cb2e197b7b0687becb026b84d3c242482f20cbb29a9981e43604eb67576da9f6",
"sha256:d4001b71ad2c9b84ff18b182cea22b7b6cbf624216da3ea06fb7af28d1f93165",
"sha256:d8930772adccb2882989ab1493fa74bd87d47c8ac7417f5dd3dd834ba8c24dc9",
"sha256:dfbb0173ee2399bc4ed3caf2d236e5c0092f948aafd0a15fbe4a0e77ee61a958",
"sha256:eebfbba048f4fa8ac711b22c78516e16ff8117d05a580e7eeef6b0c2be554c18",
"sha256:f1b21bc5cf3dbea53d33615d1ead892dfdae9d7052fa8898083bec88be20dcd2"
],
"index": "pypi",
"version": "==0.25.1"
},
"pycountry": {
"hashes": [
"sha256:3c57aa40adcf293d59bebaffbe60d8c39976fba78d846a018dc0c2ec9c6cb3cb"
],
"index": "pypi",
"version": "==19.8.18"
},
"python-dateutil": {
"hashes": [
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
],
"version": "==2.8.0"
},
"python-stdnum": {
"hashes": [
"sha256:d5f0af1bee9ddd9a20b398b46ce062dbd4d41fcc9646940f2667256a44df3854",
"sha256:f445ec32bf5246c90389204cabba465f494545371c29a83fa2d30e6c872a6763"
],
"index": "pypi",
"version": "==1.11"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"requests": {
"hashes": [
"sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4",
"sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31"
],
"index": "pypi",
"version": "==2.22.0"
},
"requests-cache": {
"hashes": [
"sha256:813023269686045f8e01e2289cc1e7e9ae5ab22ddd1e2849a9093ab3ab7270eb",
"sha256:81e13559baee64677a7d73b85498a5a8f0639e204517b5d05ff378e44a57831a"
],
"index": "pypi",
"version": "==0.5.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"urllib3": {
"hashes": [
"sha256:3de946ffbed6e6746608990594d08faac602528ac7015ac28d33cee6a45b7398",
"sha256:9a107b99a5393caf59c7aa3c1249c16e6879447533d0887f4336dde834c7be86"
],
"version": "==1.25.6"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
},
"develop": {
"agate": {
"hashes": [
"sha256:48d6f80b35611c1ba25a642cbc5b90fcbdeeb2a54711c4a8d062ee2809334d1c",
"sha256:c93aaa500b439d71e4a5cf088d0006d2ce2c76f1950960c8843114e5f361dfd3"
],
"version": "==1.6.1"
},
"agate-dbf": {
"hashes": [
"sha256:00c93c498ec9a04cc587bf63dd7340e67e2541f0df4c9a7259d7cb3dd4ce372f"
],
"version": "==0.2.1"
},
"agate-excel": {
"hashes": [
"sha256:8f255ef2c87c436b7132049e1dd86c8e08bf82d8c773aea86f3069b461a17d52"
],
"version": "==0.2.3"
},
"agate-sql": {
"hashes": [
"sha256:9277490ba8b8e7c747a9ae3671f52fe486784b48d4a14e78ca197fb0e36f281b"
],
"version": "==0.5.4"
},
"appdirs": {
"hashes": [
"sha256:9e5896d1372858f8dd3344faf4e5014d21849c756c8d5701f78f8a103b372d92",
"sha256:d8b24664561d0d34ddfaec54636d502d7cea6e29c3eaf68f3df6180863e2166e"
],
"version": "==1.4.3"
},
"atomicwrites": {
"hashes": [
"sha256:03472c30eb2c5d1ba9227e4c2ca66ab8287fbfbbda3888aa93dc2e28fc6811b4",
"sha256:75a9445bac02d8d058d5e1fe689654ba5a6556a1dfd8ce6ec55a0ed79866cfa6"
],
"version": "==1.3.0"
},
"attrs": {
"hashes": [
"sha256:69c0dbf2ed392de1cb5ec704444b08a5ef81680a61cb899dc08127123af36a79",
"sha256:f0b870f674851ecbfbbbd364d6b5cbdff9dcedbc7f3f5e18a6891057f21fe399"
],
"version": "==19.1.0"
},
"babel": {
"hashes": [
"sha256:af92e6106cb7c55286b25b38ad7695f8b4efb36a90ba483d7f7a6628c46158ab",
"sha256:e86135ae101e31e2c8ec20a4e0c5220f4eed12487d5cf3f78be7e98d3a57fc28"
],
"version": "==2.7.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"black": {
"hashes": [
"sha256:09a9dcb7c46ed496a9850b76e4e825d6049ecd38b611f1224857a79bd985a8cf",
"sha256:68950ffd4d9169716bcb8719a56c07a2f4485354fec061cdd5910aa07369731c"
],
"index": "pypi",
"version": "==19.3b0"
},
"click": {
"hashes": [
"sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13",
"sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7"
],
"version": "==7.0"
},
"csvkit": {
"hashes": [
"sha256:1353a383531bee191820edfb88418c13dfe1cdfa9dd3dc46f431c05cd2a260a0"
],
"index": "pypi",
"version": "==1.0.4"
},
"dbfread": {
"hashes": [
"sha256:07c8a9af06ffad3f6f03e8fe91ad7d2733e31a26d2b72c4dd4cfbae07ee3b73d",
"sha256:f604def58c59694fa0160d7be5d0b8d594467278d2bb6a47d46daf7162c84cec"
],
"version": "==2.0.7"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"et-xmlfile": {
"hashes": [
"sha256:614d9722d572f6246302c4491846d2c393c199cfa4edc9af593437691683335b"
],
"version": "==1.0.1"
},
"flake8": {
"hashes": [
"sha256:19241c1cbc971b9962473e4438a2ca19749a7dd002dd1a946eaba171b4114548",
"sha256:8e9dfa3cecb2400b3738a42c54c3043e821682b9c840b0448c0503f781130696"
],
"index": "pypi",
"version": "==3.7.8"
},
"future": {
"hashes": [
"sha256:67045236dcfd6816dc439556d009594abf643e5eb48992e36beac09c2ca659b8"
],
"version": "==0.17.1"
},
"importlib-metadata": {
"hashes": [
"sha256:aa18d7378b00b40847790e7c27e11673d7fed219354109d0e7b9e5b25dc3ad26",
"sha256:d5f18a79777f3aa179c145737780282e27b508fc8fd688cb17c7a813e8bd39af"
],
"markers": "python_version < '3.8'",
"version": "==0.23"
},
"ipython": {
"hashes": [
"sha256:c4ab005921641e40a68e405e286e7a1fcc464497e14d81b6914b4fd95e5dee9b",
"sha256:dd76831f065f17bddd7eaa5c781f5ea32de5ef217592cf019e34043b56895aa1"
],
"index": "pypi",
"version": "==7.8.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"isodate": {
"hashes": [
"sha256:2e364a3d5759479cdb2d37cce6b9376ea504db2ff90252a2e5b7cc89cc9ff2d8",
"sha256:aa4d33c06640f5352aca96e4b81afd8ab3b47337cc12089822d6f322ac772c81"
],
"version": "==0.6.0"
},
"isort": {
"hashes": [
"sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1",
"sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd"
],
"index": "pypi",
"version": "==4.3.21"
},
"jdcal": {
"hashes": [
"sha256:1abf1305fce18b4e8aa248cf8fe0c56ce2032392bc64bbd61b5dff2a19ec8bba",
"sha256:472872e096eb8df219c23f2689fc336668bdb43d194094b5cc1707e1640acfc8"
],
"version": "==1.4.1"
},
"jedi": {
"hashes": [
"sha256:786b6c3d80e2f06fd77162a07fed81b8baa22dde5d62896a790a331d6ac21a27",
"sha256:ba859c74fa3c966a22f2aeebe1b74ee27e2a462f56d3f5f7ca4a59af61bfe42e"
],
"version": "==0.15.1"
},
"leather": {
"hashes": [
"sha256:076d1603b5281488285718ce1a5ce78cf1027fe1e76adf9c548caf83c519b988",
"sha256:e0bb36a6d5f59fbf3c1a6e75e7c8bee29e67f06f5b48c0134407dde612eba5e2"
],
"version": "==0.3.3"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"more-itertools": {
"hashes": [
"sha256:409cd48d4db7052af495b09dec721011634af3753ae1ef92d2b32f73a745f832",
"sha256:92b8c4b06dac4f0611c0729b2f2ede52b2e1bac1ab48f089c7ddc12e26bb60c4"
],
"version": "==7.2.0"
},
"openpyxl": {
"hashes": [
"sha256:340a1ab2069764559b9d58027a43a24db18db0e25deb80f81ecb8ca7ee5253db"
],
"version": "==3.0.0"
},
"packaging": {
"hashes": [
"sha256:28b924174df7a2fa32c1953825ff29c61e2f5e082343165438812f00d3a7fc47",
"sha256:d9551545c6d761f3def1677baf08ab2a3ca17c56879e70fecba2fc4dde4ed108"
],
"version": "==19.2"
},
"parsedatetime": {
"hashes": [
"sha256:3d817c58fb9570d1eec1dd46fa9448cd644eeed4fb612684b02dfda3a79cb84b",
"sha256:9ee3529454bf35c40a77115f5a596771e59e1aee8c53306f346c461b8e913094"
],
"version": "==2.4"
},
"parso": {
"hashes": [
"sha256:63854233e1fadb5da97f2744b6b24346d2750b85965e7e399bec1620232797dc",
"sha256:666b0ee4a7a1220f65d367617f2cd3ffddff3e205f3f16a0284df30e774c2a9c"
],
"version": "==0.5.1"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"pluggy": {
"hashes": [
"sha256:0db4b7601aae1d35b4a033282da476845aa19185c1e6964b25cf324b5e4ec3e6",
"sha256:fa5fa1622fa6dd5c030e9cad086fa19ef6a0cf6d7a2d12318e10cb49d6d68f34"
],
"version": "==0.13.0"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"py": {
"hashes": [
"sha256:64f65755aee5b381cea27766a3a147c3f15b9b6b9ac88676de66ba2ae36793fa",
"sha256:dc639b046a6e2cff5bbe40194ad65936d6ba360b52b3c3fe1d08a82dd50b5e53"
],
"version": "==1.8.0"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:71e430bc85c88a430f000ac1d9b331d2407f681d6f6aec95e8bcfbc3df5b0127",
"sha256:881c4c157e45f30af185c1ffe8d549d48ac9127433f2c380c24b84572ad66297"
],
"version": "==2.4.2"
},
"pyparsing": {
"hashes": [
"sha256:6f98a7b9397e206d78cc01df10131398f1c8b8510a2f4d97d9abd82e1aacdd80",
"sha256:d9338df12903bbf5d65a0e4e87c2161968b10d2e489652bb47001d82a9b028b4"
],
"version": "==2.4.2"
},
"pytest": {
"hashes": [
"sha256:813b99704b22c7d377bbd756ebe56c35252bb710937b46f207100e843440b3c2",
"sha256:cc6620b96bc667a0c8d4fa592a8c9c94178a1bd6cc799dbb057dfd9286d31a31"
],
"index": "pypi",
"version": "==5.1.3"
},
"pytest-clarity": {
"hashes": [
"sha256:3f40d5ae7cb21cc95e622fc4f50d9466f80ae0f91460225b8c95c07afbf93e20"
],
"index": "pypi",
"version": "==0.2.0a1"
},
"python-slugify": {
"hashes": [
"sha256:575d03256a132fc1efb4c52966c6eb11c57a13b071618f0b26076057a23f6937"
],
"version": "==3.0.4"
},
"pytimeparse": {
"hashes": [
"sha256:04b7be6cc8bd9f5647a6325444926c3ac34ee6bc7e69da4367ba282f076036bd",
"sha256:e86136477be924d7e670646a98561957e8ca7308d44841e21f5ddea757556a0a"
],
"version": "==1.1.8"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"sqlalchemy": {
"hashes": [
"sha256:2f8ff566a4d3a92246d367f2e9cd6ed3edeef670dcd6dda6dfdc9efed88bcd80"
],
"version": "==1.3.8"
},
"termcolor": {
"hashes": [
"sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
],
"version": "==1.1.0"
},
"text-unidecode": {
"hashes": [
"sha256:1311f10e8b895935241623731c2ba64f4c455287888b18189350b67134a822e8",
"sha256:bad6603bb14d279193107714b288be206cac565dfa49aa5b105294dd5c4aab93"
],
"version": "==1.3"
},
"toml": {
"hashes": [
"sha256:229f81c57791a41d65e399fc06bf0848bab550a9dfd5ed66df18ce5f05e73d5c",
"sha256:235682dd292d5899d361a811df37e04a8828a5b1da3115886b73cf81ebc9100e"
],
"version": "==0.10.0"
},
"traitlets": {
"hashes": [
"sha256:262089114405f22f4833be96b31e143ab906d7764a22c04c71fee0bbda4787ba",
"sha256:6ad5b30dacd5e2424c46cc94a0aeab990d98ae17d181acea2cc4272ac3409fca"
],
"version": "==4.3.3.dev0"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
},
"zipp": {
"hashes": [
"sha256:3718b1cbcd963c7d4c5511a8240812904164b7f381b647143a89d3b98f9bcd8e",
"sha256:f06903e9f1f43b12d371004b4ac7b06ab39a44adc747266928ae6debfa7b3335"
],
"version": "==0.6.0"
}
}
}


@@ -1,7 +1,11 @@
# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?) # DSpace CSV Metadata Quality Checker ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc. A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, unnecessary Unicode, AGROVOC terms, etc.
Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested. Requires Python 3.7.1 or greater (3.8+ recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
If you use the DSpace CSV metadata quality checker please cite:
*Orth, A. 2019. DSpace CSV metadata quality checker. Nairobi, Kenya: ILRI. https://hdl.handle.net/10568/110997.*
## Functionality ## Functionality
@@ -9,24 +13,28 @@ Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](ht
- Validate languages against ISO 639-1 (alpha2) and ISO 639-3 (alpha3) - Validate languages against ISO 639-1 (alpha2) and ISO 639-3 (alpha3)
- Experimental validation of titles and abstracts against item's Dublin Core language field - Experimental validation of titles and abstracts against item's Dublin Core language field
- Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option) - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
- Validation of licenses against the list of [SPDX license identifiers](https://spdx.org/licenses)
- Fix leading, trailing, and excessive (ie, more than one) whitespace - Fix leading, trailing, and excessive (ie, more than one) whitespace
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes` - Fix invalid and unnecessary multi-value separators (`|`)
- Fix problematic newlines (line feeds) using `--unsafe-fixes` - Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt" - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
- Remove duplicate metadata values - Remove duplicate metadata values
- Check for duplicate items, using the title, type, and date issued as an indicator
## Installation ## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv): The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
``` ```
$ git clone https://github.com/ilri/csv-metadata-quality.git $ git clone https://github.com/ilri/csv-metadata-quality.git
$ cd csv-metadata-quality $ cd csv-metadata-quality
$ pipenv install $ poetry install
$ pipenv shell $ poetry shell
``` ```
Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment: Otherwise, if you don't have poetry, you can use a vanilla Python virtual environment:
``` ```
$ git clone https://github.com/ilri/csv-metadata-quality.git $ git clone https://github.com/ilri/csv-metadata-quality.git
@@ -49,15 +57,33 @@ To validate and clean a CSV file you must specify input and output files using t
$ csv-metadata-quality -i data/test.csv -o /tmp/test.csv $ csv-metadata-quality -i data/test.csv -o /tmp/test.csv
``` ```
## Unsafe Fixes ## Invalid Multi-Value Separators
You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will attempt to fix invalid multi-value separators and remove newlines. While it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. This utility will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
### Invalid Multi-Value Separators This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
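The separator clean-up described above can be sketched in a few lines of Python. This is only an illustration using the standard `re` module, not the code this project ships; the function name `fix_separators` is hypothetical.

```python
import re


def fix_separators(value: str) -> str:
    """Normalize multi-value separators in one metadata value.

    Hypothetical helper for illustration; assumes "||" is the only
    valid separator, as in DSpace CSVs.
    """
    # Remove unnecessary leading or trailing separators, e.g. "Kenya||Tanzania||"
    value = re.sub(r"^\|+|\|+$", "", value)
    # Turn any remaining run of pipes ("|", "|||", ...) into the standard "||"
    return re.sub(r"\|+", "||", value)


print(fix_separators("Kenya|Tanzania"))
print(fix_separators("Kenya||Tanzania||"))
```

Both calls print `Kenya||Tanzania`. A value that really needs a literal `|` would be altered by this sketch, which is the theoretical risk mentioned above.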
## Unsafe Fixes
You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines and perform Unicode normalization.
### Newlines ### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A). This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
### Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are not; technically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
- `é` is the Unicode code points `U+0065` + `U+0301`
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
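To see the difference for yourself, here is a small sketch using Python's standard `unicodedata` module. It assumes the NFC (composed) form as the normalization target; the form is my choice for the example and may differ from what the tool applies.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render identically but do not compare equal
print(composed == decomposed)

# After normalizing to NFC both are the single code point U+00E9
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
print(len(decomposed), len(normalized))
```

The first comparison is `False` and the second is `True`, and the length drops from 2 to 1 after normalization.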
### Encoding Issues aka "Mojibake"
[Mojibake](https://en.wikipedia.org/wiki/Mojibake) is a phenomenon that occurs when text is decoded using an unintended character encoding. This usually presents itself in the form of strange, garbled characters in the text. Enabling "unsafe" fixes will attempt to correct these, for example:
- CIAT PublicaÃ§ao → CIAT Publicaçao
- CIAT PublicaciÃ³n → CIAT Publicación
Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid them in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
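The following sketch reproduces the most common case using only the Python standard library: UTF-8 bytes that were decoded as CP-1252. It is a simplified illustration, not the ftfy-based detection this tool uses, and the function name `fix_cp1252_mojibake` is hypothetical.

```python
def fix_cp1252_mojibake(text: str) -> str:
    """Try to undo one round of UTF-8 text mis-decoded as CP-1252.

    Hypothetical helper for illustration. If the text does not
    round-trip cleanly it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return text


# Simulate the problem: UTF-8 bytes opened as CP-1252
garbled = "CIAT Publicación".encode("utf-8").decode("cp1252")
print(garbled)
print(fix_cp1252_mojibake(garbled))

# Text that is already correct is left alone
print(fix_cp1252_mojibake("CIAT Publicación"))
```

A blind round-trip like this can still change text that merely looks like mojibake, which is one reason to keep such a fix behind `--unsafe-fixes` and to review the output.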
## AGROVOC Validation ## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields: You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields:
@@ -88,12 +114,18 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
- Better logging, for example with INFO, WARN, and ERR levels
- Verbose, debug, or quiet options
- Warn if an author is shorter than 3 characters?
- Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
- Warn if two items use the same file in `filename` column
- Add an option to drop invalid AGROVOC subjects?
- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
- Add tests for application invocation, ie `tests/test_app.py`?
- Validate ISSNs or journal titles against CrossRef API?
- Add configurable field validation, like specify a field name and a validation file?
  - Perhaps like --validate=field.name,filename
- Add some row-based item sanity checks and fixes:
  - Warn if item is Open Access, but missing a filename or URL
  - Warn if item is Open Access, but missing a license
  - Warn if item has an ISSN but no journal title
  - Update journal titles from ISSN
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).


@@ -4,6 +4,7 @@ import signal
import sys import sys
import pandas as pd import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental import csv_metadata_quality.experimental as experimental
@@ -16,12 +17,13 @@ def parse_args(argv):
parser.add_argument( parser.add_argument(
"--agrovoc-fields", "--agrovoc-fields",
"-a", "-a",
help="Comma-separated list of fields to validate against AGROVOC, for example: dc.subject,cg.coverage.country", help="Comma-separated list of fields to validate against AGROVOC, for example: dcterms.subject,cg.coverage.country",
) )
parser.add_argument( parser.add_argument(
"--experimental-checks", "--experimental-checks",
"-e", "-e",
help="Enable experimental checks like language detection", action="store_true" help="Enable experimental checks like language detection",
action="store_true",
) )
parser.add_argument( parser.add_argument(
"--input-file", "--input-file",
@@ -46,7 +48,7 @@ def parse_args(argv):
parser.add_argument( parser.add_argument(
"--exclude-fields", "--exclude-fields",
"-x", "-x",
help="Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation", help="Comma-separated list of fields to skip, for example: dc.contributor.author,dcterms.bibliographicCitation",
) )
args = parser.parse_args() args = parser.parse_args()
@@ -76,12 +78,12 @@ def run(argv):
if column == exclude and skip is False: if column == exclude and skip is False:
skip = True skip = True
if skip: if skip:
print(f"Skipping {column}") print(f"{Fore.YELLOW}Skipping {Fore.RESET}{column}")
continue continue
# Fix: whitespace # Fix: whitespace
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: newlines # Fix: newlines
if args.unsafe_fixes: if args.unsafe_fixes:
@@ -94,54 +96,79 @@ def run(argv):
if match is not None: if match is not None:
df[column] = df[column].apply(fix.comma_space, field_name=column) df[column] = df[column].apply(fix.comma_space, field_name=column)
# Fix: perform Unicode normalization (NFC) to convert decomposed
# characters into their canonical forms.
if args.unsafe_fixes:
df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
# Fix: unnecessary Unicode # Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode) df[column] = df[column].apply(fix.unnecessary_unicode)
# Check: invalid multi-value separator
df[column] = df[column].apply(check.separators)
# Check: suspicious characters # Check: suspicious characters
df[column] = df[column].apply(check.suspicious_characters, field_name=column) df[column].apply(check.suspicious_characters, field_name=column)
# Fix: invalid multi-value separator # Check: mojibake
df[column].apply(check.mojibake, field_name=column)
# Fix: mojibake
if args.unsafe_fixes: if args.unsafe_fixes:
df[column] = df[column].apply(fix.separators) df[column] = df[column].apply(fix.mojibake, field_name=column)
# Fix: invalid and unnecessary multi-value separators
df[column] = df[column].apply(fix.separators, field_name=column)
# Run whitespace fix again after fixing invalid separators # Run whitespace fix again after fixing invalid separators
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: duplicate metadata values # Fix: duplicate metadata values
df[column] = df[column].apply(fix.duplicates) df[column] = df[column].apply(fix.duplicates, field_name=column)
# Check: invalid AGROVOC subject # Check: invalid AGROVOC subject
if args.agrovoc_fields: if args.agrovoc_fields:
# Identify fields the user wants to validate against AGROVOC # Identify fields the user wants to validate against AGROVOC
for field in args.agrovoc_fields.split(","): for field in args.agrovoc_fields.split(","):
if column == field: if column == field:
df[column] = df[column].apply(check.agrovoc, field_name=column) df[column].apply(check.agrovoc, field_name=column)
# Check: invalid language # Check: invalid language
match = re.match(r"^.*?language.*$", column) match = re.match(r"^.*?language.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.language) df[column].apply(check.language)
# Check: invalid ISSN # Check: invalid ISSN
match = re.match(r"^.*?issn.*$", column) match = re.match(r"^.*?issn.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.issn) df[column].apply(check.issn)
# Check: invalid ISBN # Check: invalid ISBN
match = re.match(r"^.*?isbn.*$", column) match = re.match(r"^.*?isbn.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.isbn) df[column].apply(check.isbn)
# Check: invalid date # Check: invalid date
match = re.match(r"^.*?date.*$", column) match = re.match(r"^.*?(date|dcterms\.issued).*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.date, field_name=column) df[column].apply(check.date, field_name=column)
# Check: filename extension # Check: filename extension
if column == "filename": if column == "filename":
df[column] = df[column].apply(check.filename_extension) df[column].apply(check.filename_extension)
# Check: SPDX license identifier
match = re.match(r"dcterms\.license.*$", column)
if match is not None:
df[column].apply(check.spdx_license_identifier)
### End individual column checks ###
# Check: duplicate items
# We extract just the title, type, and date issued columns to analyze
duplicates_df = df.filter(
regex=r"dcterms\.title|dc\.title|dcterms\.type|dc\.type|dcterms\.issued|dc\.date\.issued"
)
check.duplicate_items(duplicates_df)
# Delete the temporary duplicates DataFrame
del duplicates_df
## ##
# Perform some checks on rows so we can consider items as a whole rather # Perform some checks on rows so we can consider items as a whole rather


@@ -1,4 +1,17 @@
import os
import re
from datetime import datetime, timedelta
import pandas as pd import pandas as pd
import requests
import requests_cache
import spdx_license_list
from colorama import Fore
from pycountry import languages
from stdnum import isbn as stdnum_isbn
from stdnum import issn as stdnum_issn
from csv_metadata_quality.util import is_mojibake
def issn(field): def issn(field):
@@ -11,8 +24,6 @@ def issn(field):
See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid
""" """
from stdnum import issn
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -20,10 +31,10 @@ def issn(field):
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
if not issn.is_valid(value): if not stdnum_issn.is_valid(value):
print(f"Invalid ISSN: {value}") print(f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}")
return field return
def isbn(field): def isbn(field):
@@ -36,8 +47,6 @@ def isbn(field):
See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid
""" """
from stdnum import isbn
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -45,35 +54,11 @@ def isbn(field):
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
if not isbn.is_valid(value): if not stdnum_isbn.is_valid(value):
print(f"Invalid ISBN: {value}") print(f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}")
return field
def separators(field):
"""Check for invalid multi-value separators (ie "|" or "|||").
Prints the field with the invalid multi-value separator.
"""
import re
# Skip fields with missing values
if pd.isna(field):
return return
# Try to split multi-value field on "||" separator
for value in field.split("||"):
# After splitting, see if there are any remaining "|" characters
match = re.findall(r"^.*?\|.*$", value)
if match:
print(f"Invalid multi-value separator: {field}")
return field
def date(field, field_name): def date(field, field_name):
"""Check if a date is valid. """Check if a date is valid.
@@ -85,10 +70,9 @@ def date(field, field_name):
Prints the date if invalid. Prints the date if invalid.
""" """
from datetime import datetime
if pd.isna(field): if pd.isna(field):
print(f"Missing date ({field_name}).") print(f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}")
return return
@@ -97,15 +81,17 @@ def date(field, field_name):
# We don't allow multi-value date fields # We don't allow multi-value date fields
if len(multiple_dates) > 1: if len(multiple_dates) > 1:
print(f"Multiple dates not allowed ({field_name}): {field}") print(
f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{field}"
)
return field return
try: try:
# Check if date is valid YYYY format # Check if date is valid YYYY format
datetime.strptime(field, "%Y") datetime.strptime(field, "%Y")
return field return
except ValueError: except ValueError:
pass pass
@@ -113,7 +99,7 @@ def date(field, field_name):
# Check if date is valid YYYY-MM format # Check if date is valid YYYY-MM format
datetime.strptime(field, "%Y-%m") datetime.strptime(field, "%Y-%m")
return field return
except ValueError: except ValueError:
pass pass
@@ -121,11 +107,19 @@ def date(field, field_name):
# Check if date is valid YYYY-MM-DD format # Check if date is valid YYYY-MM-DD format
datetime.strptime(field, "%Y-%m-%d") datetime.strptime(field, "%Y-%m-%d")
return field return
except ValueError: except ValueError:
print(f"Invalid date ({field_name}): {field}") pass
return field try:
# Check if date is valid YYYY-MM-DDTHH:MM:SSZ format
datetime.strptime(field, "%Y-%m-%dT%H:%M:%SZ")
return
except ValueError:
print(f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{field}")
return
def suspicious_characters(field, field_name): def suspicious_characters(field, field_name):
@@ -156,12 +150,10 @@ def suspicious_characters(field, field_name):
# character and spanning enough of the rest to give a preview, # character and spanning enough of the rest to give a preview,
# but not too much to cause the line to break in terminals with # but not too much to cause the line to break in terminals with
# a default of 80 characters width. # a default of 80 characters width.
suspicious_character_msg = ( suspicious_character_msg = f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}{field_subset}"
f"Suspicious character ({field_name}): {field_subset}"
)
print(f"{suspicious_character_msg:1.80}") print(f"{suspicious_character_msg:1.80}")
return field return
def language(field): def language(field):
@@ -170,8 +162,6 @@ def language(field):
Prints the value if it is invalid. Prints the value if it is invalid.
""" """
from pycountry import languages
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -185,18 +175,18 @@ def language(field):
# can check it against ISO 639-1 or ISO 639-3 accordingly. # can check it against ISO 639-1 or ISO 639-3 accordingly.
if len(value) == 2: if len(value) == 2:
if not languages.get(alpha_2=value): if not languages.get(alpha_2=value):
print(f"Invalid ISO 639-1 language: {value}") print(f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}")
pass pass
elif len(value) == 3: elif len(value) == 3:
if not languages.get(alpha_3=value): if not languages.get(alpha_3=value):
print(f"Invalid ISO 639-3 language: {value}") print(f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}")
pass pass
else: else:
print(f"Invalid language: {value}") print(f"{Fore.RED}Invalid language: {Fore.RESET}{value}")
return field return
def agrovoc(field, field_name): def agrovoc(field, field_name):
@@ -213,39 +203,38 @@ def agrovoc(field, field_name):
Prints a warning if the value is invalid. Prints a warning if the value is invalid.
""" """
from datetime import timedelta
import requests
import requests_cache
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# Try to split multi-value field on "||" separator
for value in field.split("||"):
request_url = (
f"http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}"
)
# enable transparent request cache with thirty days expiry # enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30) expire_after = timedelta(days=30)
# Allow overriding the location of the requests cache, just in case we are
# running in an environment where we can't write to the current working
# directory (for example from csv-metadata-quality-web).
REQUESTS_CACHE_DIR = os.environ.get("REQUESTS_CACHE_DIR", ".")
requests_cache.install_cache( requests_cache.install_cache(
"agrovoc-response-cache", expire_after=expire_after f"{REQUESTS_CACHE_DIR}/agrovoc-response-cache", expire_after=expire_after
) )
request = requests.get(request_url)
# prune old cache entries # prune old cache entries
requests_cache.core.remove_expired_responses() requests_cache.core.remove_expired_responses()
# Try to split multi-value field on "||" separator
for value in field.split("||"):
request_url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"
request_params = {"query": value}
request = requests.get(request_url, params=request_params)
if request.status_code == requests.codes.ok: if request.status_code == requests.codes.ok:
data = request.json() data = request.json()
# check if there are any results # check if there are any results
if len(data["results"]) == 0: if len(data["results"]) == 0:
print(f"Invalid AGROVOC ({field_name}): {value}") print(f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}")
return field return
def filename_extension(field): def filename_extension(field):
@@ -259,8 +248,6 @@ def filename_extension(field):
than .pdf, .xls(x), .doc(x), ppt(x), case insensitive). than .pdf, .xls(x), .doc(x), ppt(x), case insensitive).
""" """
import re
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -296,6 +283,86 @@ def filename_extension(field):
break break
if filename_extension_match is False: if filename_extension_match is False:
print(f"Filename with uncommon extension: {value}") print(f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}")
return field return
def spdx_license_identifier(field):
"""Check if a license is a valid SPDX identifier.
Prints the value if it is invalid.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Try to split multi-value field on "||" separator
for value in field.split("||"):
if value not in spdx_license_list.LICENSES:
print(f"{Fore.YELLOW}Non-SPDX license identifier: {Fore.RESET}{value}")
pass
return
def duplicate_items(df):
"""Attempt to identify duplicate items.
First we check the total number of titles and compare it with the number of
unique titles. If there are fewer unique titles than total titles we expand
the search by creating a key (of sorts) for each item that includes their
title, type, and date issued, and compare it with all the others. If there
are multiple occurrences of the same title, type, date string then it's a
very good indicator that the items are duplicates.
"""
# Extract the names of the title, type, and date issued columns so we can
# reference them later. First we filter columns by likely patterns, then
# we extract the name from the first item of the resulting object, ie:
#
# Index(['dcterms.title[en_US]'], dtype='object')
#
title_column_name = df.filter(regex=r"dcterms\.title|dc\.title").columns[0]
type_column_name = df.filter(regex=r"dcterms\.type|dc\.type").columns[0]
date_column_name = df.filter(
regex=r"dcterms\.issued|dc\.date\.issued"
).columns[0]
items_count_total = df[title_column_name].count()
items_count_unique = df[title_column_name].nunique()
if items_count_unique < items_count_total:
# Create a list to hold our items while we check for duplicates
items = list()
for index, row in df.iterrows():
item_title_type_date = f"{row[title_column_name]}{row[type_column_name]}{row[date_column_name]}"
if item_title_type_date in items:
print(
f"{Fore.YELLOW}Possible duplicate ({title_column_name}): {Fore.RESET}{row[title_column_name]}"
)
else:
items.append(item_title_type_date)
def mojibake(field, field_name):
"""Check for mojibake (text that was encoded in one encoding and decoded in
another, perhaps multiple times). See util.py.
Prints the string if it contains suspected mojibake.
"""
# Skip fields with missing values
if pd.isna(field):
return
if is_mojibake(field):
print(
f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}"
)
return
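The duplicate item check above builds a title, type, and date key for each row and reports rows whose key has already been seen. The same idea can be sketched without pandas, as below. The function name and the dictionary keys are hypothetical and chosen only for illustration; this is not the project's implementation.

```python
def find_possible_duplicates(items):
    """Return titles of items whose (title, type, date) key was seen before.

    `items` is a list of dicts with the hypothetical keys "title", "type",
    and "date".
    """
    seen = set()
    duplicates = []

    for item in items:
        key = (item["title"], item["type"], item["date"])

        if key in seen:
            duplicates.append(item["title"])
        else:
            seen.add(key)

    return duplicates


rows = [
    {"title": "Duplicate Title", "type": "Report", "date": "2021-03-17"},
    {"title": "Duplicate Title", "type": "Report", "date": "2021-03-17"},
    {"title": "Duplicate Title", "type": "Book", "date": "2021-03-17"},
]

print(find_possible_duplicates(rows))
```

Using a tuple as the key avoids accidental collisions that plain string concatenation could produce, and a set keeps each membership test cheap.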


@@ -1,4 +1,9 @@
import re
import langid
import pandas as pd import pandas as pd
from colorama import Fore
from pycountry import languages
def correct_language(row): def correct_language(row):
@@ -10,10 +15,6 @@ def correct_language(row):
language and returns the value in the language field if it does match. language and returns the value in the language field if it does match.
""" """
from pycountry import languages
import langid
import re
# Initialize some variables at global scope so that we can set them in the # Initialize some variables at global scope so that we can set them in the
# loop scope below and still be able to access them afterwards. # loop scope below and still be able to access them afterwards.
language = "" language = ""
@@ -83,13 +84,13 @@ def correct_language(row):
detected_language = languages.get(alpha_2=langid_classification[0]) detected_language = languages.get(alpha_2=langid_classification[0])
if len(language) == 2 and language != detected_language.alpha_2: if len(language) == 2 and language != detected_language.alpha_2:
print( print(
f"Possibly incorrect language {language} (detected {detected_language.alpha_2}): {title}" f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_2}): {Fore.RESET}{title}"
) )
elif len(language) == 3 and language != detected_language.alpha_3: elif len(language) == 3 and language != detected_language.alpha_3:
print( print(
f"Possibly incorrect language {language} (detected {detected_language.alpha_3}): {title}" f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_3}): {Fore.RESET}{title}"
) )
else: else:
return language return


@@ -1,9 +1,14 @@
import re import re
from unicodedata import normalize
import pandas as pd import pandas as pd
from colorama import Fore
from ftfy import fix_text
from csv_metadata_quality.util import is_mojibake, is_nfc
def whitespace(field): def whitespace(field, field_name):
"""Fix whitespace issues. """Fix whitespace issues.
Return string with leading, trailing, and consecutive whitespace trimmed. Return string with leading, trailing, and consecutive whitespace trimmed.
@@ -26,7 +31,9 @@ def whitespace(field):
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f"Excessive whitespace: {value}") print(
f"{Fore.GREEN}Removing excessive whitespace ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, " ", value) value = re.sub(pattern, " ", value)
# Save cleaned value # Save cleaned value
@@ -38,8 +45,15 @@ def whitespace(field):
return new_field return new_field
def separators(field): def separators(field, field_name):
"""Fix for invalid multi-value separators (ie "|").""" """Fix for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
@@ -50,12 +64,22 @@ def separators(field):
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split("||"): for value in field.split("||"):
# Check if the value is blank and skip it
if value == "":
print(
f"{Fore.GREEN}Fixing unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
)
continue
# After splitting, see if there are any remaining "|" characters # After splitting, see if there are any remaining "|" characters
pattern = re.compile(r"\|") pattern = re.compile(r"\|")
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f"Fixing invalid multi-value separator: {value}") print(
f"{Fore.GREEN}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, "||", value) value = re.sub(pattern, "||", value)
@@ -74,10 +98,10 @@ def unnecessary_unicode(field):
Removes unnecessary Unicode characters like: Removes unnecessary Unicode characters like:
- Zero-width space (U+200B) - Zero-width space (U+200B)
- Replacement character (U+FFFD) - Replacement character (U+FFFD)
- No-break space (U+00A0)
Replaces unnecessary Unicode characters like: Replaces unnecessary Unicode characters like:
- Soft hyphen (U+00AD) → hyphen - Soft hyphen (U+00AD) → hyphen
- No-break space (U+00A0) → space
Return string with characters removed or replaced. Return string with characters removed or replaced.
""" """
@@ -91,7 +115,7 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+200B): {field}") print(f"{Fore.GREEN}Removing unnecessary Unicode (U+200B): {Fore.RESET}{field}")
field = re.sub(pattern, "", field) field = re.sub(pattern, "", field)
# Check for replacement characters (U+FFFD) # Check for replacement characters (U+FFFD)
@@ -99,7 +123,7 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+FFFD): {field}") print(f"{Fore.GREEN}Removing unnecessary Unicode (U+FFFD): {Fore.RESET}{field}")
field = re.sub(pattern, "", field) field = re.sub(pattern, "", field)
# Check for no-break spaces (U+00A0) # Check for no-break spaces (U+00A0)
@@ -107,21 +131,25 @@ def unnecessary_unicode(field):
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Removing unnecessary Unicode (U+00A0): {field}") print(
field = re.sub(pattern, "", field) f"{Fore.GREEN}Replacing unnecessary Unicode (U+00A0): {Fore.RESET}{field}"
)
field = re.sub(pattern, " ", field)
# Check for soft hyphens (U+00AD), sometimes preceded with a normal hyphen # Check for soft hyphens (U+00AD), sometimes preceded with a normal hyphen
pattern = re.compile(r"\u002D*?\u00AD") pattern = re.compile(r"\u002D*?\u00AD")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f"Replacing unnecessary Unicode (U+00AD): {field}") print(
f"{Fore.GREEN}Replacing unnecessary Unicode (U+00AD): {Fore.RESET}{field}"
)
field = re.sub(pattern, "-", field) field = re.sub(pattern, "-", field)
return field return field
def duplicates(field): def duplicates(field, field_name):
"""Remove duplicate metadata values.""" """Remove duplicate metadata values."""
# Skip fields with missing values # Skip fields with missing values
@@ -140,7 +168,9 @@ def duplicates(field):
if value not in new_values: if value not in new_values:
new_values.append(value) new_values.append(value)
else: else:
print(f"Dropping duplicate value: {value}") print(
f"{Fore.GREEN}Removing duplicate value ({field_name}): {Fore.RESET}{value}"
)
# Create a new field consisting of all values joined with "||" # Create a new field consisting of all values joined with "||"
new_field = "||".join(new_values) new_field = "||".join(new_values)
@@ -173,7 +203,7 @@ def newlines(field):
match = re.findall(r"\n", field) match = re.findall(r"\n", field)
if match: if match:
print(f"Removing newline: {field}") print(f"{Fore.GREEN}Removing newline: {Fore.RESET}{field}")
field = field.replace("\n", "") field = field.replace("\n", "")
return field return field
@@ -197,7 +227,49 @@ def comma_space(field, field_name):
match = re.findall(r",\w", field) match = re.findall(r",\w", field)
if match: if match:
print(f"Adding space after comma ({field_name}): {field}") print(
f"{Fore.GREEN}Adding space after comma ({field_name}): {Fore.RESET}{field}"
)
field = re.sub(r",(\w)", r", \1", field) field = re.sub(r",(\w)", r", \1", field)
return field return field
def normalize_unicode(field, field_name):
"""Fix occurrences of decomposed Unicode characters by normalizing them
with NFC to their canonical forms, for example:
Ouédraogo, Mathieu → Ouédraogo, Mathieu
Return normalized string.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_nfc(field):
print(f"{Fore.GREEN}Normalizing Unicode ({field_name}): {Fore.RESET}{field}")
field = normalize("NFC", field)
return field
def mojibake(field, field_name):
"""Attempts to fix mojibake (text that was encoded in one encoding and
decoded in another, perhaps multiple times). See util.py.
Return fixed string.
"""
# Skip fields with missing values
if pd.isna(field):
return field
if is_mojibake(field):
print(f"{Fore.GREEN}Fixing encoding issue ({field_name}): {Fore.RESET}{field}")
return fix_text(field)
else:
return field
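For readers who want to see the separator cases from the `separators` docstring (`value|value`, `value|||value`, `value||value||`) in isolation, the following is a simplified sketch. It deliberately takes a different approach from the function in this diff: it splits on any run of pipes rather than on "||" first. The name `fix_separators_sketch` is hypothetical.

```python
import re


def fix_separators_sketch(field: str) -> str:
    """Normalize multi-value separators to "||" (hypothetical helper)."""
    # Treat any run of one or more "|" characters as a single separator
    values = re.split(r"\|+", field)

    # Drop blank values left behind by leading or trailing separators
    values = [value for value in values if value != ""]

    return "||".join(values)


for example in ("Kenya|Uganda", "Kenya|||Uganda", "Kenya||Uganda||"):
    print(fix_separators_sketch(example))
```

This sketch does not print what it changed, so it is less informative than the real fix, and it would also collapse a literal "|" that was intended as part of a value.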


@@ -0,0 +1,49 @@
from ftfy.badness import sequence_weirdness
def is_nfc(field):
"""Utility function to check whether a string is using normalized Unicode.
Python's built-in unicodedata library has the is_normalized() function, but
it was only introduced in Python 3.8. By using a simple utility function we
are able to run on Python >= 3.6 again.
See: https://docs.python.org/3/library/unicodedata.html
Return boolean.
"""
from unicodedata import normalize
return field == normalize("NFC", field)
def is_mojibake(field):
"""Determines whether a string contains mojibake.
We commonly deal with CSV files that were *encoded* in UTF-8, but decoded
as something else like CP-1252 (Windows Latin). This manifests in the form
of "mojibake", for example:
- CIAT PublicaÃ§ao
- CIAT PublicaciÃ³n
This uses the excellent "fixes text for you" (ftfy) library to determine
whether a string contains characters that have been encoded in one encoding
and decoded in another.
Inspired by this code snippet from Martijn Pieters on StackOverflow:
https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
Return boolean.
"""
if not sequence_weirdness(field):
# Nothing weird, should be okay
return False
try:
field.encode("sloppy-windows-1252")
except UnicodeEncodeError:
# Not CP-1252 encodable, probably fine
return False
else:
# Encodable as CP-1252, Mojibake alert level high
return True
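The intuition behind `is_mojibake` can be approximated without ftfy by testing whether a string survives a CP-1252 to UTF-8 round trip and comes out different. The sketch below is a cruder heuristic than the ftfy-based check above and uses Python's strict `cp1252` codec rather than ftfy's `sloppy-windows-1252`, so it will miss some cases. The function name is hypothetical.

```python
def looks_like_mojibake(text: str) -> bool:
    """Guess whether text is UTF-8 that was decoded as CP-1252 (sketch only)."""
    try:
        # Undo the suspected mistake: back to bytes, then decode as UTF-8
        candidate = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way, so probably not this kind of mojibake
        return False

    # Plain ASCII round-trips unchanged; only a changed result is suspicious
    return candidate != text


print(looks_like_mojibake("CIAT PublicaÃ§ao"))
print(looks_like_mojibake("CIAT Publicaçao"))
print(looks_like_mojibake("CIAT Publication"))
```

Correctly encoded accented text usually fails the UTF-8 decode step, because a lone CP-1252 byte such as 0xE7 is not a valid UTF-8 sequence, which is what keeps false positives low.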


@@ -1 +1 @@
VERSION = "0.3.0" VERSION = "0.4.8-dev"


@@ -1,28 +1,35 @@
dc.title,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename dc.title,dcterms.issued,dc.identifier.issn,dc.identifier.isbn,dcterms.language,dcterms.subject,cg.coverage.country,filename,dcterms.license,dcterms.type
Leading space,2019-07-29,,,,,, Leading space,2019-07-29,,,,,,,,
Trailing space ,2019-07-29,,,,,, Trailing space ,2019-07-29,,,,,,,,
Excessive space,2019-07-29,,,,,, Excessive space,2019-07-29,,,,,,,,
Miscellaenous ||whitespace | issues ,2019-07-29,,,,,, Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,,,
Duplicate||Duplicate,2019-07-29,,,,,, Duplicate||Duplicate,2019-07-29,,,,,,,,
Invalid ISSN,2019-07-29,2321-2302,,,,, Invalid ISSN,2019-07-29,2321-2302,,,,,,,
Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,, Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,,,
Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,, Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,,,
Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,, Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,,,
Invalid date,2019-07-260,,,,,, Invalid date,2019-07-260,,,,,,,,
Multiple dates,2019-07-26||2019-01-10,,,,,, Multiple dates,2019-07-26||2019-01-10,,,,,,,,
Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,, Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,,,
Unnecessary Unicode,2019-07-29,,,,,, Unnecessary Unicode,2019-07-29,,,,,,,,
Suspicious character||foreˆt,2019-07-29,,,,,, Suspicious character||foreˆt,2019-07-29,,,,,,,,
Invalid ISO 639-1 (alpha 2) language,2019-07-29,,,jp,,, Invalid ISO 639-1 (alpha 2) language,2019-07-29,,,jp,,,,,
Invalid ISO 639-3 (alpha 3) language,2019-07-29,,,chi,,, Invalid ISO 639-3 (alpha 3) language,2019-07-29,,,chi,,,,,
Invalid language,2019-07-29,,,Span,,, Invalid language,2019-07-29,,,Span,,,,,
Invalid AGROVOC subject,2019-07-29,,,,FOREST,, Invalid AGROVOC subject,2019-07-29,,,,FOREST,,,,
Newline (LF),2019-07-30,,,,"TANZA Newline (LF),2019-07-30,,,,"TANZA
NIA",, NIA",,,,
Missing date,,,,,,, Missing date,,,,,,,,,
Invalid country,2019-08-01,,,,,KENYAA, Invalid country,2019-08-01,,,,,KENYAA,,,
Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck,,
Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,, Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,,,
"Missing space,after comma",2019-08-27,,,,,, "Missing space,after comma",2019-08-27,,,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,, Incorrect ISO 639-1 language,2019-09-26,,,es,,,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,, Incorrect ISO 639-3 language,2019-09-26,,,spa,,,,,
Composéd Unicode,2020-01-14,,,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,,,
Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,,,
Invalid SPDX license identifier,2021-03-11,,,,,,,CC-BY,
Duplicate Title,2021-03-17,,,,,,,,Report
Duplicate Title,2021-03-17,,,,,,,,Report
Mojibake,2021-03-18,,,,CIAT PublicaÃ§ao,,,,Report

1314
poetry.lock generated Normal file

File diff suppressed because it is too large

37
pyproject.toml Normal file

@@ -0,0 +1,37 @@
[tool.poetry]
name = "csv-metadata-quality"
version = "0.4.8-dev"
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
authors = ["Alan Orth <alan.orth@gmail.com>"]
license="GPL-3.0-only"
repository = "https://github.com/ilri/csv-metadata-quality"
homepage = "https://github.com/ilri/csv-metadata-quality"
[tool.poetry.scripts]
csv-metadata-quality = 'csv_metadata_quality.__main__:main'
[tool.poetry.dependencies]
python = "^3.7.1"
pandas = "^1.0.4"
python-stdnum = "^1.13"
xlrd = "^1.2.0"
requests = "^2.23.0"
requests-cache = "^0.5.2"
pycountry = "^19.8.18"
langid = "^1.1.6"
colorama = "^0.4.4"
spdx-license-list = "^0.5.2"
ftfy = "^5.9"
[tool.poetry.dev-dependencies]
pytest = "^6.1.1"
ipython = { version = "^7.18.1", python = "^3.7" }
flake8 = "^3.8.4"
pytest-clarity = "^0.3.0-alpha.0"
black = "20.8b1"
isort = "^5.5.4"
csvkit = "^1.0.5"
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


@@ -1,5 +1,5 @@
[pytest] [pytest]
addopts= -rsxX -s -v --strict --capture=sys addopts= -rsxX -s -v --strict-markers --capture=sys
filterwarnings = filterwarnings =
error::UserWarning error::UserWarning
ignore:.*U.* is deprecated:DeprecationWarning ignore:.*U.* is deprecated:DeprecationWarning


@@ -1,57 +1,76 @@
-i https://pypi.org/simple agate-dbf==0.2.2
agate-dbf==0.2.1
agate-excel==0.2.3 agate-excel==0.2.3
agate-sql==0.5.4 agate-sql==0.5.6
agate==1.6.1 agate==1.6.2
appdirs==1.4.3 appdirs==1.4.4; python_version >= "3.6"
atomicwrites==1.3.0 appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
attrs==19.1.0 atomicwrites==1.4.0; python_version >= "3.6" and python_full_version < "3.0.0" and sys_platform == "win32" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") or sys_platform == "win32" and python_version >= "3.6" and python_full_version >= "3.4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
babel==2.7.0 attrs==20.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
backcall==0.1.0 babel==2.9.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
black==19.3b0 backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
click==7.0 black==20.8b1; python_version >= "3.6"
csvkit==1.0.4 certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
csvkit==1.0.5
dbfread==2.0.7 dbfread==2.0.7
decorator==4.4.0 decorator==4.4.2; python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "4.0" or python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.2.0"
entrypoints==0.3 et-xmlfile==1.0.1; python_version >= "3.6"
et-xmlfile==1.0.1 flake8==3.9.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
flake8==3.7.8 ftfy==5.9; python_version >= "3.5"
future==0.17.1 greenlet==1.0.0; python_version >= "3" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3"
importlib-metadata==0.23 ; python_version < '3.8' idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
ipython-genutils==0.2.0 importlib-metadata==3.7.3; python_version < "3.8" and python_version >= "3.6" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.5.0" and python_version < "3.8" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.6" and python_version < "3.8") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version < "3.8" and python_version >= "3.6")
ipython==7.8.0 iniconfig==1.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
ipython==7.21.0; python_version >= "3.7" and python_version < "4.0"
isodate==0.6.0 isodate==0.6.0
isort==4.3.21 isort==5.7.0; python_version >= "3.6" and python_version < "4.0"
jdcal==1.4.1 jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
jedi==0.15.1 langid==1.1.6
leather==0.3.3 leather==0.3.3
mccabe==0.6.1 mccabe==0.6.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
more-itertools==7.2.0 mypy-extensions==0.4.3; python_version >= "3.6"
openpyxl==3.0.0 numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
packaging==19.2 openpyxl==3.0.7; python_version >= "3.6"
parsedatetime==2.4 packaging==20.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
parso==0.5.1 pandas==1.2.3; python_full_version >= "3.7.1"
pexpect==4.7.0 ; sys_platform != 'win32' parsedatetime==2.6
pickleshare==0.7.5 parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
pluggy==0.13.0 pathspec==0.8.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
prompt-toolkit==2.0.9 pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
ptyprocess==0.6.0 pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
py==1.8.0 pluggy==0.13.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pycodestyle==2.5.0 prompt-toolkit==3.0.17; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
pyflakes==2.1.1 ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
pygments==2.4.2 py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyparsing==2.4.2 pycodestyle==2.7.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
pytest-clarity==0.2.0a1 pycountry==19.8.18
pytest==5.1.3 pyflakes==2.3.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
python-slugify==3.0.4 pygments==2.8.1; python_version >= "3.7" and python_version < "4.0"
pyicu==2.6
pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pytest-clarity==0.3.0a0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
pytest==6.2.2; python_version >= "3.6"
python-dateutil==2.8.1; python_full_version >= "3.7.1"
python-slugify==4.0.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
python-stdnum==1.16
pytimeparse==1.1.8 pytimeparse==1.1.8
pytz==2019.2 pytz==2021.1; python_full_version >= "3.7.1"
six==1.12.0 regex==2020.11.13; python_version >= "3.6"
sqlalchemy==1.3.8 requests-cache==0.5.2
termcolor==1.1.0 requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
text-unidecode==1.3 six==1.15.0; python_full_version >= "3.7.1"
toml==0.10.0 spdx-license-list==0.5.2
traitlets==4.3.3.dev0 sqlalchemy==1.4.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
wcwidth==0.1.7 termcolor==1.1.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
xlrd==1.2.0 text-unidecode==1.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
zipp==0.6.0 toml==0.10.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
typed-ast==1.4.2; python_version >= "3.6"
typing-extensions==3.7.4.3; python_version < "3.8" and python_version >= "3.6"
urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
zipp==3.4.1; python_version < "3.8" and python_version >= "3.6"


@@ -1,17 +1,19 @@
-i https://pypi.org/simple certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
-e . chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
certifi==2019.9.11 colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
chardet==3.0.4 ftfy==5.9; python_version >= "3.5"
idna==2.8 idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
langid==1.1.6 langid==1.1.6
numpy==1.17.2 numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
pandas==0.25.1 pandas==1.2.3; python_full_version >= "3.7.1"
pycountry==19.8.18 pycountry==19.8.18
python-dateutil==2.8.0 python-dateutil==2.8.1; python_full_version >= "3.7.1"
python-stdnum==1.11 python-stdnum==1.16
pytz==2019.2 pytz==2021.1; python_full_version >= "3.7.1"
requests-cache==0.5.2 requests-cache==0.5.2
requests==2.22.0 requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
six==1.12.0 six==1.15.0; python_full_version >= "3.7.1"
urllib3==1.25.6 spdx-license-list==0.5.2
xlrd==1.2.0 urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
wcwidth==0.2.5; python_version >= "3.5"
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")


@@ -4,17 +4,17 @@ with open("README.md", "r") as fh:
long_description = fh.read() long_description = fh.read()
install_requires = [ install_requires = [
'pandas', "pandas",
'python-stdnum', "python-stdnum",
'requests', "requests",
'requests-cache', "requests-cache",
'pycountry', "pycountry",
'langid' "langid",
] ]
setuptools.setup( setuptools.setup(
name="csv-metadata-quality", name="csv-metadata-quality",
version="0.3.0", version="0.4.8-dev",
author="Alan Orth", author="Alan Orth",
author_email="aorth@mjanja.ch", author_email="aorth@mjanja.ch",
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.", description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@@ -23,17 +23,15 @@ setuptools.setup(
long_description_content_type="text/markdown", long_description_content_type="text/markdown",
url="https://github.com/alanorth/csv-metadata-quality", url="https://github.com/alanorth/csv-metadata-quality",
classifiers=[ classifiers=[
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent", "Operating System :: OS Independent",
"Development Status :: 4 - Beta"
], ],
packages=['csv_metadata_quality'], packages=["csv_metadata_quality"],
entry_points={ entry_points={
'console_scripts': [ "console_scripts": ["csv-metadata-quality = csv_metadata_quality.__main__:main"]
'csv-metadata-quality = csv_metadata_quality.__main__:main'
]
}, },
install_requires=install_requires install_requires=install_requires,
) )


@@ -1,6 +1,8 @@
import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental import csv_metadata_quality.experimental as experimental
import pandas as pd
def test_check_invalid_issn(capsys): def test_check_invalid_issn(capsys):
@@ -11,7 +13,7 @@ def test_check_invalid_issn(capsys):
check.issn(value) check.issn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISSN: {value}\n" assert captured.out == f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}\n"
def test_check_valid_issn(): def test_check_valid_issn():
@@ -21,7 +23,7 @@ def test_check_valid_issn():
result = check.issn(value) result = check.issn(value)
assert result == value assert result == None
def test_check_invalid_isbn(capsys): def test_check_invalid_isbn(capsys):
@@ -32,7 +34,7 @@ def test_check_invalid_isbn(capsys):
check.isbn(value) check.isbn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISBN: {value}\n" assert captured.out == f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}\n"
def test_check_valid_isbn(): def test_check_valid_isbn():
@@ -42,28 +44,7 @@ def test_check_valid_isbn():
result = check.isbn(value) result = check.isbn(value)
assert result == value assert result == None
def test_check_invalid_separators(capsys):
"""Test checking invalid multi-value separators."""
value = "Alan|Orth"
check.separators(value)
captured = capsys.readouterr()
assert captured.out == f"Invalid multi-value separator: {value}\n"
def test_check_valid_separators():
"""Test checking valid multi-value separators."""
value = "Alan||Orth"
result = check.separators(value)
assert result == value
def test_check_missing_date(capsys): def test_check_missing_date(capsys):
@@ -76,7 +57,7 @@ def test_check_missing_date(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Missing date ({field_name}).\n" assert captured.out == f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}\n"
def test_check_multiple_dates(capsys): def test_check_multiple_dates(capsys):
@@ -89,7 +70,10 @@ def test_check_multiple_dates(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Multiple dates not allowed ({field_name}): {value}\n" assert (
captured.out
== f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_invalid_date(capsys): def test_check_invalid_date(capsys):
@@ -102,7 +86,9 @@ def test_check_invalid_date(capsys):
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid date ({field_name}): {value}\n" assert (
captured.out == f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_date(): def test_check_valid_date():
@@ -114,7 +100,7 @@ def test_check_valid_date():
result = check.date(value, field_name) result = check.date(value, field_name)
assert result == value assert result == None
def test_check_suspicious_characters(capsys): def test_check_suspicious_characters(capsys):
@@ -127,7 +113,10 @@ def test_check_suspicious_characters(capsys):
check.suspicious_characters(value, field_name) check.suspicious_characters(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Suspicious character ({field_name}): ˆt\n" assert (
captured.out
== f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}ˆt\n"
)
def test_check_valid_iso639_1_language(): def test_check_valid_iso639_1_language():
@@ -137,7 +126,7 @@ def test_check_valid_iso639_1_language():
result = check.language(value) result = check.language(value)
assert result == value assert result == None
def test_check_valid_iso639_3_language(): def test_check_valid_iso639_3_language():
@@ -147,7 +136,7 @@ def test_check_valid_iso639_3_language():
result = check.language(value) result = check.language(value)
assert result == value assert result == None
def test_check_invalid_iso639_1_language(capsys): def test_check_invalid_iso639_1_language(capsys):
@@ -158,7 +147,9 @@ def test_check_invalid_iso639_1_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISO 639-1 language: {value}\n" assert (
captured.out == f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_iso639_3_language(capsys): def test_check_invalid_iso639_3_language(capsys):
@@ -169,7 +160,9 @@ def test_check_invalid_iso639_3_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid ISO 639-3 language: {value}\n" assert (
captured.out == f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_language(capsys): def test_check_invalid_language(capsys):
@@ -180,30 +173,33 @@ def test_check_invalid_language(capsys):
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid language: {value}\n" assert captured.out == f"{Fore.RED}Invalid language: {Fore.RESET}{value}\n"
def test_check_invalid_agrovoc(capsys): def test_check_invalid_agrovoc(capsys):
"""Test invalid AGROVOC subject.""" """Test invalid AGROVOC subject."""
value = "FOREST" value = "FOREST"
field_name = "dc.subject" field_name = "dcterms.subject"
check.agrovoc(value, field_name) check.agrovoc(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Invalid AGROVOC ({field_name}): {value}\n" assert (
captured.out
== f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_agrovoc(): def test_check_valid_agrovoc():
"""Test valid AGROVOC subject.""" """Test valid AGROVOC subject."""
value = "FORESTS" value = "FORESTS"
field_name = "dc.subject" field_name = "dcterms.subject"
result = check.agrovoc(value, field_name) result = check.agrovoc(value, field_name)
assert result == value assert result == None
def test_check_uncommon_filename_extension(capsys): def test_check_uncommon_filename_extension(capsys):
@@ -214,7 +210,10 @@ def test_check_uncommon_filename_extension(capsys):
check.filename_extension(value) check.filename_extension(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f"Filename with uncommon extension: {value}\n" assert (
captured.out
== f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}\n"
)
def test_check_common_filename_extension(): def test_check_common_filename_extension():
@@ -224,7 +223,7 @@ def test_check_common_filename_extension():
result = check.filename_extension(value) result = check.filename_extension(value)
assert result == value assert result == None
def test_check_incorrect_iso_639_1_language(capsys): def test_check_incorrect_iso_639_1_language(capsys):
@@ -242,7 +241,7 @@ def test_check_incorrect_iso_639_1_language(capsys):
captured = capsys.readouterr() captured = capsys.readouterr()
assert ( assert (
captured.out captured.out
== f"Possibly incorrect language {language} (detected en): {title}\n" == f"{Fore.YELLOW}Possibly incorrect language {language} (detected en): {Fore.RESET}{title}\n"
) )
@@ -261,7 +260,7 @@ def test_check_incorrect_iso_639_3_language(capsys):
captured = capsys.readouterr() captured = capsys.readouterr()
assert ( assert (
captured.out captured.out
== f"Possibly incorrect language {language} (detected eng): {title}\n" == f"{Fore.YELLOW}Possibly incorrect language {language} (detected eng): {Fore.RESET}{title}\n"
) )
@@ -277,7 +276,7 @@ def test_check_correct_iso_639_1_language():
result = experimental.correct_language(series) result = experimental.correct_language(series)
assert result == language assert result == None
def test_check_correct_iso_639_3_language(): def test_check_correct_iso_639_3_language():
@@ -292,4 +291,77 @@ def test_check_correct_iso_639_3_language():
result = experimental.correct_language(series) result = experimental.correct_language(series)
assert result == language assert result == None
def test_check_valid_spdx_license_identifier():
"""Test valid SPDX license identifier."""
license = "CC-BY-SA-4.0"
result = check.spdx_license_identifier(license)
assert result == None
def test_check_invalid_spdx_license_identifier(capsys):
"""Test invalid SPDX license identifier."""
license = "CC-BY-SA"
result = check.spdx_license_identifier(license)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Non-SPDX license identifier: {Fore.RESET}{license}\n"
)
def test_check_duplicate_item(capsys):
"""Test item with duplicate title, type, and date."""
item_title = "Title"
item_type = "Report"
item_date = "2021-03-17"
d = {
"dc.title": [item_title, item_title],
"dcterms.type": [item_type, item_type],
"dcterms.issued": [item_date, item_date],
}
df = pd.DataFrame(data=d)
result = check.duplicate_items(df)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possible duplicate (dc.title): {Fore.RESET}{item_title}\n"
)
def test_check_no_mojibake():
"""Test string with no mojibake."""
field = "CIAT Publicaçao"
field_name = "dcterms.isPartOf"
result = check.mojibake(field, field_name)
assert result == None
def test_check_mojibake(capsys):
"""Test string with mojibake."""
field = "CIAT PublicaÃ§ao"
field_name = "dcterms.isPartOf"
result = check.mojibake(field, field_name)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}\n"
)
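The mojibake tests above exercise strings whose UTF-8 bytes were decoded as CP-1252. As a rough illustration of how such a string arises and how it can be reversed, here is a minimal standard-library sketch. The helper names `make_mojibake` and `repair_mojibake` are hypothetical and are not part of csv-metadata-quality, whose checks rely on ftfy according to the commit notes.

```python
def make_mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being misread as CP-1252 (hypothetical helper)."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Undo the misreading; return the input unchanged if it does not round-trip."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Clean text normally fails here, so it is left alone.
        return text


correct = "CIAT Publicaçao"
broken = make_mojibake(correct)

print(broken)                    # CIAT PublicaÃ§ao
print(repair_mojibake(broken))   # CIAT Publicaçao
print(repair_mojibake(correct))  # CIAT Publicaçao (unchanged)
```

A blind round trip like this is only illustrative. Per the commit notes, the project first asks ftfy whether a string looks "weird" and only then checks whether it encodes as CP-1252, which is why the fix is treated as unsafe.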


@@ -6,7 +6,9 @@ def test_fix_leading_whitespace():
value = " Alan" value = " Alan"
assert fix.whitespace(value) == "Alan" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_trailing_whitespace(): def test_fix_trailing_whitespace():
@@ -14,7 +16,9 @@ def test_fix_trailing_whitespace():
value = "Alan " value = "Alan "
assert fix.whitespace(value) == "Alan" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_excessive_whitespace(): def test_fix_excessive_whitespace():
@@ -22,7 +26,9 @@ def test_fix_excessive_whitespace():
value = "Alan Orth" value = "Alan Orth"
assert fix.whitespace(value) == "Alan Orth" field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan Orth"
def test_fix_invalid_separators(): def test_fix_invalid_separators():
@@ -30,7 +36,19 @@ def test_fix_invalid_separators():
value = "Alan|Orth" value = "Alan|Orth"
assert fix.separators(value) == "Alan||Orth" field_name = "dc.contributor.author"
assert fix.separators(value, field_name) == "Alan||Orth"
def test_fix_unnecessary_separators():
"""Test fixing unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
assert fix.separators(field, field_name) == "Alan||Orth"
def test_fix_unnecessary_unicode(): def test_fix_unnecessary_unicode():
@@ -46,7 +64,9 @@ def test_fix_duplicates():
value = "Kenya||Kenya" value = "Kenya||Kenya"
assert fix.duplicates(value) == "Kenya" field_name = "dc.contributor.author"
assert fix.duplicates(value, field_name) == "Kenya"
def test_fix_newlines(): def test_fix_newlines():
@@ -66,3 +86,34 @@ def test_fix_comma_space():
field_name = "dc.contributor.author" field_name = "dc.contributor.author"
assert fix.comma_space(value, field_name) == "Orth, Alan S." assert fix.comma_space(value, field_name) == "Orth, Alan S."
def test_fix_normalized_unicode():
"""Test fixing a string that is already in its normalized (NFC) Unicode form."""
# string using the normalized canonical form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_decomposed_unicode():
"""Test fixing a string that contains decomposed (NFD) Unicode characters."""
# string using the decomposed form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_mojibake():
"""Test fixing a string with mojibake."""
field = "CIAT PublicaÃ§ao"
field_name = "dcterms.isPartOf"
assert fix.mojibake(field, field_name) == "CIAT Publicaçao"
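The composed and decomposed forms of "é" used in the Unicode normalization tests look identical on screen, so the difference is easy to miss. The following sketch uses only the standard library's `unicodedata` module and spells the code points out with escapes. It shows the NFC normalization the tests expect, not the internals of `fix.normalize_unicode`, which are not shown in this diff.

```python
import unicodedata

# "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed, NFD)
decomposed = "Oue\u0301draogo, Mathieu"
# "é" written as the single code point U+00E9 (composed, NFC)
composed = "Ou\u00e9draogo, Mathieu"

normalized = unicodedata.normalize("NFC", decomposed)

print(decomposed == composed)            # False: different code points
print(normalized == composed)            # True: NFC composes e + U+0301 into U+00E9
print(len(decomposed), len(composed))    # 19 18
```

Normalizing a string that is already in NFC returns it unchanged, which matches what `test_fix_normalized_unicode` asserts.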