mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-10-24 02:11:14 +02:00

206 Commits

Author SHA1 Message Date
28f9026286 README.md: Minor edit
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 16:26:31 +02:00
cfe09f7126 Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
8eddb76aab Bump version to 0.4.8-dev
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 11:53:56 +02:00
a04dbc50db Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
28335ed159 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-19 10:29:15 +02:00
773a0a2695 poetry.lock: Run poetry update 2021-03-19 10:28:55 +02:00
39a4b1a487 Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
898bb412c3 Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
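
A minimal sketch of the repair step described above, assuming exactly one round of UTF-8 text being mis-decoded as CP-1252. It uses only the standard library and is a stand-in for illustration, not the ftfy-based detection this commit uses; the name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes that were decoded as CP-1252.

    Hypothetical helper for illustration only. If the text cannot be
    encoded as CP-1252, or the resulting bytes are not valid UTF-8,
    the input is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them correctly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake, so leave the text alone.
        return text


print(fix_mojibake("CIAT Publicaci\u00c3\u00b3n"))
```

Correctly encoded text containing accented characters usually fails the UTF-8 decode and is returned as-is, which is why the round trip can double as a rough check. It is still an "unsafe" fix in the sense that some legitimate strings could survive the round trip and be altered.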
2021-03-19 10:22:21 +02:00
e92ec5d371 README.md: Add note about duplicate checking
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:12:03 +02:00
f816e17fe7 Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:00:34 +02:00
9061c7c79b setup.py: Remove beta tag
I think this is only used by pypi.org?
2021-03-17 10:00:09 +02:00
661d05b977 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-17 09:58:35 +02:00
652b7ea98c CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
65da6e9b05 poetry.lock: Run pipenv update 2021-03-17 09:57:31 +02:00
a313b7527a CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
51ee370697 data/test.csv: Add duplicate item 2021-03-17 09:54:14 +02:00
e8422bfa74 tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
9f2dc0a0f5 Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
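
The "key" idea can be sketched roughly as follows. This is a plain-Python illustration, not the project's implementation; the column names `dc.title`, `dcterms.type`, and `dcterms.issued` and the function name `find_duplicates` are assumptions.

```python
def find_duplicates(rows):
    """Return rows whose (title, type, date issued) key was already seen.

    Hypothetical sketch: `rows` is a list of dicts, one per CSV row,
    and the three column names below are assumed for illustration.
    """
    seen = set()
    duplicates = []
    for row in rows:
        key = (row["dc.title"], row["dcterms.type"], row["dcterms.issued"])
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"dc.title": "Report", "dcterms.type": "Book", "dcterms.issued": "2021"},
    {"dc.title": "Report", "dcterms.type": "Book", "dcterms.issued": "2021"},
    {"dc.title": "Report", "dcterms.type": "Book", "dcterms.issued": "2020"},
]
print(len(find_duplicates(rows)))
```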
2021-03-17 09:53:07 +02:00
14010896a5 csv_metadata_quality/experimental.py: Move all imports to top of file
All checks were successful
continuous-integration/drone/push Build is passing
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
ab3af2ec62 csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00
1aa2084230 CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
330a7b7b9c Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actually
need to return a value at all, as the point of checks is to print a
warning to the screen. In Python a "return" statement without a
value returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
9a5e3fd6ef README.md: Add TODO about detecting duplicates 2021-03-16 14:03:26 +02:00
ed084da08c CHANGELOG.md: Add note about multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 21:04:19 +02:00
10612cf891 Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't
actually need to check for them; I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
3656e9f976 Update CI workflows to use DCTERMS instead of DC
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 15:52:51 +02:00
c9c277f8df csv_metadata_quality/app.py: Update help text
All checks were successful
continuous-integration/drone/push Build is passing
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
fb35afd937 CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
0e9176f0a6 csv_metadata_quality/check.py: requests cache
Allow overriding the directory for the requests cache. In the case
of csv-metadata-quality-web, which currently runs on Google's App
Engine, we can only write to /tmp.
2021-03-14 09:07:35 +02:00
1008acf35e Always fix invalid multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
f00a07e2cd README.md: Reorganize unsafe functionality
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-13 11:56:52 +02:00
46098861ed poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 22:45:32 +02:00
fa84cfa440 Bump version to 0.4.6-dev 2021-03-11 22:44:36 +02:00
6cc1401f88 pyproject.toml: Minimum Python is technically 3.7.1
All checks were successful
continuous-integration/drone/push Build is passing
See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html
2021-03-11 13:41:58 +02:00
ad2cda8a41 README.md: Add note about SPDX license identifiers
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:21:34 +02:00
dc6920802e .github/workflows/python-app.yml: Use Python 3.9
I now use this version in my development environment. Eventually I
should add a matrix of versions to use, but I don't know the GitHub
Actions syntax well enough yet.
2021-03-11 12:17:57 +02:00
6ca449d8ed README.md: Update note about Python 3.8 to 3.8+
Currently the lower bound on Python version support is 3.7 because
of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.
2021-03-11 12:16:07 +02:00
1554cfd5c9 Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
b19d81abdd .drone.yml: We need some stuff to build pyicu now
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:07:28 +02:00
a0ea829f5c csv_metadata_quality/fix.py: Fixes should be green 2021-03-11 11:47:24 +02:00
0089efa914 tests/test_check.py: Use dcterms.subject instead of dc.subject
Trying to move some old DC fields to DCTERMS.
2021-03-11 11:45:25 +02:00
3dbe656f9f Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-11 11:11:19 +02:00
7ad821dcad CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
cd876c4fb3 poetry.lock: Run poetry update 2021-03-11 11:10:02 +02:00
d88ea56488 csv_metadata_quality/check.py: Move all imports to top of file
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-11 10:52:20 +02:00
e0e3ca6c58 CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
abae8ca4fb data/test.csv: Move some DC fields to DCTERMS
The original Dublin Core elements set was superseded by DCTERMS in
2008 and we have started using them in our DSpace repository so I
think it's good to update them in our test data. Old DC fields are
still checked and fixed in this tool, though.

It's worth noting that currently supported DSpace versions (4, 5,
and 6) all have hard-coded a few fields like dc.title internally so
we can't migrate those to their DCTERMS counterparts just yet.
2021-03-11 10:49:05 +02:00
d7d4d4efca CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
5318953150 tests/test_check.py: Add tests for licenses 2021-03-11 10:36:26 +02:00
3b17914002 data/test.csv: Add invalid SPDX license
Now we are checking dcterms.license against the list of SPDX license
identifiers using https://pypi.org/project/spdx-license-list/.
2021-03-11 10:34:58 +02:00
6e4b0e5c1b Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
b16fa9121f pyproject.toml: Add csv-metadata-quality as a script
All checks were successful
continuous-integration/drone/push Build is passing
For some reason I stopped having csv-metadata-quality available in
my poetry environment after install. It seems I need to add it as a
poetry tool script? I had already done this in setup.py years ago,
which works for regular python setup.py installs, but hadn't needed
to do it in poetry for a year or more that I've been using it, until
now.
2021-03-08 09:50:05 +02:00
202bda862a Bump version to 0.4.5
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-04 21:38:10 +02:00
7479310ac0 setup.py: Bump version to 0.4.4
I forgot to increase this when I actually released version 0.4.4 so
I will do it in a separate commit now before I bump the version to
0.4.5.
2021-03-04 21:35:08 +02:00
98a91bc9c2 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-04 21:33:33 +02:00
fc5bedcc5c CHANGELOG.md: Add poetry update 2021-03-04 21:32:46 +02:00
44d12d771a poetry.lock: Run poetry update 2021-03-04 21:32:21 +02:00
4a7000e975 README.md: Add more ideas to do 2021-03-04 21:26:53 +02:00
27b2d81ca8 CHANGELOG.md: Add note about dcterms.issued
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-28 15:14:39 +02:00
91ebd0f606 README.md: Update TODOs
A few of these date things have been addressed.
2021-02-28 15:13:36 +02:00
dd2cfae047 csv_metadata_quality/app.py: Match dcterms.issued for dates
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
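
One way to express that matching rule, as a rough sketch only: the regular expression and the name `is_date_field` are assumptions, not the code in app.py.

```python
import re

# Assumed pattern: any column with "date" in its name, plus dcterms.issued.
DATE_FIELD = re.compile(r"date|^dcterms\.issued")


def is_date_field(column: str) -> bool:
    """Return True if a column name looks like a date field (hypothetical)."""
    return DATE_FIELD.search(column) is not None


for name in ("dc.date.issued", "dcterms.issued", "dc.title"):
    print(name, is_date_field(name))
```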
2021-02-28 15:11:06 +02:00
d76e72532a Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00
9aaaa62461 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-21 13:10:52 +02:00
a7fc5a246c Colorize output
Some checks failed
continuous-integration/drone/push Build is failing
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
7fb8acb866 Add colorama for colored output
Red for errors, yellow for warnings or information, and green for
fixes.
2021-02-21 13:00:31 +02:00
9f5d2c2c4f poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-15 15:13:12 +02:00
202abf140c CHANGELOG.md: Add note about poetry
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-04 21:48:12 +02:00
0cd6d3dfe6 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-02-04 21:46:49 +02:00
a458beac55 poetry.lock: Run poetry update 2021-02-04 21:45:30 +02:00
e62ecb0a8f CHANGELOG.md: Add note about new date format 2021-02-04 21:43:44 +02:00
de92f32ab6 csv_metadata_quality/check.py: More date formats
We should also allow ISO 8601 extended in combined date and time
format. DSpace does not have a problem with dates in this format
and I have found some metadata that uses this date format.

For example: 2020-08-31T11:04:56Z

See: https://en.wikipedia.org/wiki/ISO_8601
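
A sketch of accepting the combined date and time format alongside plainer date forms, using only `datetime.strptime`. The list of formats other than the combined one is an assumption, and `is_valid_date` is a hypothetical name rather than the function in check.py.

```python
from datetime import datetime

# The last entry is the combined format from this commit; the first
# three are assumed for illustration.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ")


def is_valid_date(value: str) -> bool:
    """Return True if the value parses with any of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_valid_date("2020-08-31T11:04:56Z"))
```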
2021-02-04 21:39:14 +02:00
dbbbc0944a README.md: Add handle to citation
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-27 10:33:37 +02:00
d17bf3033c README.md: Add citation 2021-01-27 10:32:26 +02:00
2ec52f1b73 README.md: Update description
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-26 15:43:41 +02:00
aa1abf15a7 README.md: Adjust title 2021-01-26 15:35:21 +02:00
cbf94490f2 Version 0.4.3 2021-01-26 15:22:40 +02:00
f3d0d5ef07 setup.py: Remove Python 3.6
I actually removed Python 3.6 support a few weeks ago after updating
to Pandas 1.2.0, but forgot to update this.
2021-01-26 15:22:08 +02:00
4b7b99c94c CHANGELOG.md: Add note about multi-value separators 2021-01-26 15:20:22 +02:00
df670e81b9 README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
ae357d8c6c Revert "Update requirements"
This reverts commit ca80340f7a.

Nope, we still need the --without-hashes because this still fails
on Python 3.7, but not 3.8 or 3.9. From looking around it seems
that nobody can agree whether poetry should handle this, pip should
handle it, or upstream projects should pin their dependencies.
2021-01-26 14:15:31 +02:00
ca80340f7a Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt

Trying to see if we no longer need --without-hashes since we don't
support Python 3.6 anymore.
2021-01-26 11:46:05 +02:00
cc1743b86d Remove .build.yml
I will just use GitHub Actions and Drone.
2021-01-26 11:41:30 +02:00
bcb9885c6b Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-26 10:36:48 +02:00
b484b75178 poetry.lock: Run poetry update 2021-01-26 10:36:04 +02:00
d3880a9dfa Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
7edb8b19d7 tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
a6709c7f82 Update requirements
Some checks failed
continuous-integration/drone/push Build is failing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-01-03 15:42:00 +02:00
d489ea4609 poetry.lock: Run poetry update 2021-01-03 15:41:08 +02:00
96634cbb67 pytest.ini: Change --strict to --strict-markers
This is deprecated since pytest 6.2.0.

See: https://docs.pytest.org/en/stable/deprecations.html#the-strict-command-line-option
2021-01-03 15:40:14 +02:00
29e67a0887 Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
32cea2055f data/test.csv: Add unnecessary multi-value separator 2021-01-03 15:33:04 +02:00
0dc66c5c4e Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
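
A minimal sketch of such a fix, assuming "||" as the separator; the function name `strip_empty_values` is hypothetical and this is not the code from fix.py.

```python
def strip_empty_values(field: str, separator: str = "||") -> str:
    """Drop blank values left behind by stray multi-value separators."""
    values = [value for value in field.split(separator) if value.strip() != ""]
    return separator.join(values)


print(strip_empty_values("Kenya||Tanzania||"))
```

Splitting and re-joining also removes a leading separator or a doubled one in the middle of a field, since both produce empty values.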
2021-01-03 15:30:03 +02:00
c26ad83534 .github: Test CLI invocation 2020-12-14 23:47:09 +02:00
72ca9d99bf setup.py: Add Python 3.9
[SKIP CI]
2020-12-14 23:44:35 +02:00
ae33a9b793 Add .drone.yml 2020-12-14 23:42:23 +02:00
fc0367bfc8 README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034 README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
349fca03b8 .github/workflows/python-app.yml: Rename
This name is displayed in the badge so it should be something more
relevant.
2020-12-08 10:46:39 +02:00
52d8904870 Remove .travis.yml
They changed their free tier and I might as well use GitHub Actions
for ILRI stuff anyways.
2020-12-08 10:41:36 +02:00
971c69e535 Create python-app.yml
Try GitHub Actions for Python 3.8 using GitHub's Python example.
2020-12-08 10:38:52 +02:00
f8cc233e25 .travis.yml: Use Amazon Graviton2 ARM environment
These are the new hotness and should have faster build times.

See: https://blog.travis-ci.com/2020-09-11-arm-on-aws
2020-12-06 10:49:03 +02:00
aa7b7a9592 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running on Python 3.6 in Travis:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2020-11-03 07:42:45 +02:00
57b455bde7 poetry.lock: Run poetry update 2020-11-03 07:40:56 +02:00
23b95fa368 .travis.yml: Use Ubuntu 20.04 "Focal" environment 2020-10-29 00:14:54 +03:00
6985f76aa3 .travis.yml: Bump Python versions
Test Python 3.9 now that it was released, and allow tests to fail
on nightly builds.
2020-10-29 00:14:36 +03:00
98a6a19e12 Update requirements-dev.txt
Generated with poetry export:

    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:48:46 +03:00
f4914c414f Only install ipython on Python 3.7+ 2020-10-06 17:48:16 +03:00
d352fe8017 Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-10-06 17:21:33 +03:00
f13c360084 Update poetry package dependencies 2020-10-06 17:20:16 +03:00
7cfd4c0b59 csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
826509ddcf poetry.lock: Run poetry update
List of updated modules:

  - Updating numpy (1.19.1 -> 1.19.2)
  - Updating pygments (2.6.1 -> 2.7.1)
  - Updating pandas (1.1.1 -> 1.1.2)

All tests still pass according to pytest.
2020-09-26 12:18:23 +03:00
22b5c0f7a1 CHANGELOG.md: Add note about dependencies update 2020-09-08 15:04:40 +03:00
774e274b32 poetry.lock: Run poetry update
Update dependencies to latest version:

  - Updating attrs (19.3.0 -> 20.2.0)
  - Updating more-itertools (8.4.0 -> 8.5.0)
  - Updating openpyxl (3.0.4 -> 3.0.5)
  - Updating parso (0.7.0 -> 0.7.1)
  - Updating sqlalchemy (1.3.18 -> 1.3.19)
  - Updating urllib3 (1.25.9 -> 1.25.10)
  - Updating agate-dbf (0.2.1 -> 0.2.2)
  - Updating agate-sql (0.5.4 -> 0.5.5)
  - Updating jedi (0.17.1 -> 0.17.2)
  - Updating numpy (1.19.0 -> 1.19.1)
  - Updating prompt-toolkit (3.0.5 -> 3.0.7)
  - Updating regex (2020.6.8 -> 2020.7.14)
  - Updating traitlets (4.3.3 -> 5.0.4)
  - Updating ipython (7.16.1 -> 7.18.1)
  - Updating pandas (1.0.5 -> 1.1.1)
  - Updating python-stdnum (1.13 -> 1.14)

All tests still pass according to pytest.
2020-09-08 15:04:00 +03:00
db474a802f README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
e241f8461b CHANGELOG.md: Add notes 2020-07-06 14:10:46 +03:00
431e6331c8 csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
cb07d357d4 Version 0.4.2 2020-07-06 14:04:34 +03:00
65cd48a26f CHANGELOG.md: Update changes 2020-07-06 14:00:21 +03:00
0f883f640c Remove pipenv 2020-07-06 13:59:49 +03:00
f4c5c5781e README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
6aa784ad8c Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-07-06 13:57:07 +03:00
7b8da94f41 poetry.lock: Update Python dependencies 2020-07-06 13:56:31 +03:00
2a1566af62 csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
5fcaa63bd5 csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
aa9e23b46c pyproject.toml: Update license specifier
We need to use valid SPDX license identifiers.
2020-06-09 14:22:53 +03:00
73acb1661f Update requirements
Generated with poetry export:

    $ poetry export -f requirements.txt > requirements.txt
    $ poetry export --dev -f requirements.txt > requirements-dev.txt
2020-05-31 17:51:16 +03:00
2a068fddc4 .build.yml: Fix test 2020-05-31 17:44:37 +03:00
c6c2f13e88 .build.yml: Fix poetry install invocation
Poetry apparently installs dev dependencies by default.
2020-05-31 17:37:09 +03:00
56f16e37ed .build.yml: Use poetry in SourceHut CI 2020-05-31 17:35:04 +03:00
0c44b967b6 Add poetry project file and lock
I want to try to use poetry instead of pipenv because pipenv takes
forever to do dependency resolution sometimes. Also, I have had a
few issues with Python modules like black that don't have releases
other than pre-releases, and even including the project itself in
the dependencies (pip install -e . ...?). My initial experience is
that poetry handles this better.
2020-05-31 17:33:40 +03:00
8a267bb40b .travis.yml: Try to build with Python 3.8-dev
But allow failures.
2020-03-29 16:40:11 +03:00
8fda8f1ef1 Pipfile.lock: Run pipenv update
All tests still passing.
2020-03-20 16:22:04 +02:00
5e471813e8 CHANGELOG.md: Add note about python dependencies 2020-01-29 12:41:43 +02:00
79244b9ac3 Pipfile.lock: Run pipenv update 2020-01-29 12:39:12 +02:00
5e81a33482 CHANGELOG.md: Add note about field names 2020-01-16 12:37:11 +02:00
28b5996aa6 Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
40ba9bae6c README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
0b2d211455 Version 0.4.1 2020-01-15 12:19:42 +02:00
7f1df0b47c Support Python 3.6 and 3.7 again 2020-01-15 12:19:17 +02:00
365ecda324 Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
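
On Python versions before 3.8, one simple way to get the same answer is to normalize the string and compare it to the original. This is a sketch of that idea, not necessarily the utility function added in this commit; the name `is_nfc` is hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if the text is already in NFC form.

    Works on older Pythons because it relies only on
    unicodedata.normalize(), at the cost of normalizing the whole
    string just to compare it.
    """
    return unicodedata.normalize("NFC", text) == text


# "o" followed by a combining acute accent is not NFC.
print(is_nfc("Publicacio\u0301n"))
```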
2020-01-15 12:17:52 +02:00
550ce7fb7e .travis.yml: Only test Python 3.8
The Unicode normalization feature requires Python 3.8 because the
unicodedata.is_normalized() function only appears there. If I find
another way to check if a string is normalized without normalizing
it first I will drop the requirements back down to Python 3.6.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:57:21 +02:00
705127fd28 Version 0.4.0 2020-01-15 11:44:56 +02:00
894e0a196d setup.py: Change Python requirements
The `unicodedata.is_normalized()` function requires Python 3.8.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 11:43:25 +02:00
87181bc7b8 Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
8de5d862b6 CHANGELOG.md: Add note about Unicode normalization 2020-01-15 11:40:40 +02:00
49e3543878 Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
403b253762 CHANGELOG.md: Update python library versions 2020-01-15 10:58:44 +02:00
c5fbaf407a Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2020-01-15 10:51:58 +02:00
4f81f6c83c Pipfile.lock: Run pipenv update 2020-01-15 10:51:19 +02:00
4b9d1e060f setup.py: Add Python 3.8 classifier 2019-12-14 12:56:11 +02:00
c8a71e3143 Pipfile.lock: Run pipenv update 2019-12-14 12:53:39 +02:00
7964d98ca5 Pipfile: Specify exact version of black
Black only releases pre-release versions, which causes issues with
pipenv. Instead of always running pipenv with "--pre" and potentially
letting in some other pre-release versions for other dependencies,
I would rather specify the latest black version explicitly.

See: https://github.com/psf/black/issues/517
See: https://github.com/microsoft/vscode-python/issues/5171
2019-12-14 12:41:28 +02:00
64ffc2f1da .travis.yml: Install packages from requirements.txt too 2019-11-14 23:42:28 +02:00
7b1bc29a92 .travis.yml: Try using pip instead of pipenv
The Pipfile knows it was created with Python 3.8, yet we're running
with multiple Python versions on Travis. I'm curious if it would work
better to use pip to install dependencies instead of pipenv in this
case.
2019-11-14 23:37:25 +02:00
f0110d8e74 CHANGELOG.md: Add note about requirements 2019-11-14 23:30:26 +02:00
86498deee8 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-11-14 23:28:42 +02:00
251647a15f CHANGELOG.md: Add TravisCI changes 2019-11-14 23:24:08 +02:00
0bd28e22ec .travis.yml: Test Python 3.8 2019-11-14 23:22:37 +02:00
63fdce7d13 .travis.yml: Use Ubuntu 18.04 "Bionic" 2019-11-14 23:22:19 +02:00
f068c0e16a CHANGELOG.md: Use Python 3.8.0 for pipenv 2019-11-14 23:11:43 +02:00
79b8f62a85 Use Python 3.8 for pipenv
Python 3.8.0 entered Arch Linux core repositories now and all tests
pass with Python 3.8.0 so it's time...
2019-11-14 23:10:20 +02:00
6c1e132531 CHANGELOG.md: Add unreleased changes 2019-11-14 09:19:19 +02:00
c0f3c866bd Pipfile.lock: Run pipenv update
Updates the following dependencies:

- numpy 1.17.2→1.17.4
- pandas 0.25.1→0.25.3
- flake8 3.7.8→3.7.9
- pytest 5.1.3→5.2.2
- black 19.3b0→19.10b0
2019-11-14 09:17:31 +02:00
36d0474b95 CHANGELOG.md: Move unreleased changes to v0.3.1 2019-10-01 17:11:52 +03:00
efdc3a841a Version 0.3.1 2019-10-01 17:11:13 +03:00
fd2ba6845d CHANGELOG.md: Update unreleased notes 2019-10-01 17:10:23 +03:00
e55380b4d5 csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
85ae16d9b7 CHANGELOG.md: Add note about non-breaking spaces 2019-10-01 16:56:37 +03:00
c42f8b4812 csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal
spaces instead of removing them.
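
The difference between removing and replacing can be sketched as below; `replace_nbsp` is a hypothetical name and not the function in fix.py.

```python
def replace_nbsp(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with regular spaces."""
    return text.replace("\u00a0", " ")


# Removing the character instead would glue the two words together.
print(replace_nbsp("Kenya\u00a0Tanzania"))
```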
2019-10-01 16:55:04 +03:00
1c75608d54 README.md: Update introduction text
We should mention that this is not DSpace specific. Rather, it is
much more realistically Dublin Core specific.
2019-09-26 14:19:13 +03:00
0b15a8ed3b README.md: Remove TODO about lack of space after comma
This was added as an automatic global fix a few weeks ago.
2019-09-26 14:16:33 +03:00
9ca266f5f0 data/test.csv: Change birthdate column to dc.date.issued
More accurately reflects actual data we will be validating.
2019-09-26 14:15:48 +03:00
0d3f948708 CHANGELOG.md: Update comment about language validation 2019-09-26 14:14:57 +03:00
c04207fcfc CHANGELOG.md: Fix header formatting 2019-09-26 14:13:50 +03:00
9d4eceddc7 .build.yml: Enable experimental CLI checks on SourceHut 2019-09-26 14:11:35 +03:00
e15c98cccb Move unreleased changes to v0.3.0 2019-09-26 14:06:31 +03:00
93c4e1a993 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-09-26 14:05:37 +03:00
9963b2bb64 Pipfile.lock: Run pipenv update 2019-09-26 14:04:50 +03:00
76291c1876 CHANGELOG.md: Add note about language validation 2019-09-26 14:03:18 +03:00
604bd5bda6 Reformat tests with black 2019-09-26 14:02:51 +03:00
e7c220039b README.md: Add note about experimental language validation 2019-09-26 13:59:50 +03:00
d7b5e378bc setup.py: Add langid 2019-09-26 13:49:32 +03:00
8435ee242d Experimental language detection using langid
Works decently well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
7ac1c6f554 README.md: Update comment about ISO 639-3
The pycountry library is actually using ISO 639-3 apparently.

See: https://pypi.org/project/pycountry/
2019-09-26 07:51:41 +03:00
86d4623fd3 More ISO 639-1 and ISO 639-3 fixes
ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes.
Technically there are also ISO 639-2/T and ISO 639-2/B, which use
three-letter codes as well, but those are not supported by the
pycountry library so I won't even worry about them.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-26 07:44:39 +03:00
ddbe970342 data/test.csv: Update titles of language tests
ISO 639-1 is alpha 2 and ISO 639-3 is alpha 3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-26 07:40:27 +03:00
31c78ca6f3 data/test.csv: Rename contributor column to title
This makes more sense as a description of each test and the titles
are obviously not authors.
2019-09-26 05:50:40 +03:00
154d05b5e2 CHANGELOG.md: Update notes 2019-09-24 18:55:05 +03:00
186f146edb Pipfile.lock: Run pipenv update
Synchronizes state with the Pipfile and brings some new deps.
2019-09-24 18:54:49 +03:00
a4cb301943 CHANGELOG.md: Add note about csvkit 2019-09-24 18:49:20 +03:00
219e37526d Pipfile: Add csvkit to dev requirements
Used to inspect CSV files during testing and development.
2019-09-24 18:48:01 +03:00
f304ca6a33 csv_metadata_quality/app.py: Use simpler column iteration
I don't know where I got the other one...
2019-09-21 17:19:39 +03:00
3d5c8bdf5d CHANGELOG.md: Add notes about updated python packages 2019-09-11 16:45:39 +03:00
480956d54d Pipfile.lock: Run pipenv update 2019-09-11 16:45:16 +03:00
d9fc09f121 Fix references to ISO 639
It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2
is the three-letter codes, aka alpha2 and alpha3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-11 16:36:53 +03:00
b5899001b7 CHANGELOG.md: Add note about black and isort 2019-08-29 01:26:11 +03:00
c92977d1ca Update requirements-dev.txt
Generated with:

  $ pipenv lock -r -d > requirements-dev.txt
2019-08-29 01:25:14 +03:00
280a99c8a8 Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
0388145b81 Add configuration for isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:14:31 +03:00
d97dcd19db Format with black 2019-08-29 01:10:39 +03:00
b375f0e895 Add black and isort to pipenv dev dependencies
These do a very opinionated automatic formatting and validation of
code.

See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:08:38 +03:00
865c61d316 Add note about updated python dependencies 2019-08-28 21:02:21 +03:00
3b2ba57b75 Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-08-28 21:01:48 +03:00
2805c556a9 Pipfile.lock: Run pipenv update
Brings numpy 1.17.1, pandas 0.25.1, and requests-cache 0.5.2.
2019-08-28 20:58:35 +03:00
25 changed files with 2624 additions and 805 deletions

@@ -1,19 +0,0 @@
image: archlinux
packages:
  - python-pipenv
sources:
  - https://git.sr.ht/~alanorth/csv-metadata-quality
tasks:
  - setup: |
      cd csv-metadata-quality
      pipenv install --dev
  - pytest: |
      cd csv-metadata-quality
      pipenv run pytest
  - testcli: |
      cd csv-metadata-quality
      pipenv run pip install .
      pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -u --agrovoc-fields dc.subject,cg.coverage.country
environment:
  PIPENV_NOSPIN: 'True'
  PIPENV_HIDE_EMOJIS: 'True'

.drone.yml (normal file, 52 lines changed)

@@ -0,0 +1,52 @@
---
kind: pipeline
type: docker
name: python39
steps:
- name: test
  image: python:3.9-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python38
steps:
- name: test
  image: python:3.8-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
---
kind: pipeline
type: docker
name: python37
steps:
- name: test
  image: python:3.7-slim
  commands:
  - id
  - python -V
  - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
  - csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country
# vim: ts=2 sw=2 et

.github/workflows/python-app.yml (vendored, normal file, 41 lines changed)

@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build and Test
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest
    - name: Test CLI
      run: |
        python setup.py install
        csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dcterms.subject,cg.coverage.country

@@ -1,11 +0,0 @@
dist: xenial
language: python
python:
  - "3.6"
  - "3.7"
install:
  - "pip install pipenv --upgrade-strategy=only-if-needed"
  - "pipenv install --dev"
script: pytest
# vim: ts=2 sw=2 et

@@ -4,6 +4,118 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Added
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
## [0.4.7] - 2021-03-17
### Changed
- Fixing invalid multi-value separators like `|` and `|||` is no longer
classified as "unsafe" as I have yet to see a case where this was intentional
- Not user visible, but now checks only print a warning to the screen instead
of returning a value and re-writing the DataFrame, which should be faster and
use less memory
### Added
- Configurable directory for AGROVOC requests cache (to allow running the web
version from Google App Engine where we can only write to /tmp)
- Ability to check for duplicate items in the data set (uses a combination of
the title, type, and date issued to determine uniqueness)
### Removed
- Checks for invalid and unnecessary multi-value separators because now I fix
them whenever I see them, so there is no need to have checks for them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.6] - 2021-03-11
### Added
- Validation of dcterms.license field against SPDX license identifiers
### Changed
- Use DCTERMS fields where possible in `data/test.csv`
### Updated
- Run `poetry update` to update project dependencies
### Fixed
- Output for all fixes should be green, because it is good
## [0.4.5] - 2021-03-04
### Added
- Check dates in dcterms.issued field as well, not just fields that have the
word "date" in them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.4] - 2021-02-21
### Added
- Accept dates formatted in ISO 8601 extended with combined date and time, for
example: 2020-08-31T11:04:56Z
- Colorized output: red for errors, yellow for warnings and information, green
for changes
### Updated
- Run `poetry update` to update project dependencies
## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
that only works in Python >= 3.8
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)
### Updated
- Update python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt
### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds
## [0.3.1] - 2019-10-01
### Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them
- Harmonize language of script output when fixing various issues
## [0.3.0] - 2019-09-26
### Updated
- Update python dependencies to latest versions, including numpy 1.17.2, pandas
0.25.1, pytest 5.1.3, and requests-cache 0.5.2
### Added
- csvkit to dev requirements (csvcut etc are useful during development)
- Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)
### Changed
- Re-formatted code with black and isort
## [0.2.2] - 2019-08-27
### Changed
- Output of date checks to include column names (helps debugging in case there are multiple date fields)

Pipfile (25 lines changed)

@@ -1,25 +0,0 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
pytest = "*"
ipython = "*"
flake8 = "*"
pytest-clarity = "*"
[packages]
pandas = "*"
python-stdnum = "*"
xlrd = "*"
requests = "*"
requests-cache = "*"
pycountry = "*"
csv-metadata-quality = {editable = true,path = "."}
[requires]
python_version = "3.7"
[pipenv]
allow_prereleases = true

Pipfile.lock (generated, 376 lines changed)

@@ -1,376 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "f8f0a9f208ec41f4d8183ecfc68356b40674b083b2f126c37468b3c9533ba5df"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"certifi": {
"hashes": [
"sha256:046832c04d4e752f37383b628bc601a7ea7211496b4638f6514d0e5b9acc4939",
"sha256:945e3ba63a0b9f577b1395204e13c3a231f9bc0223888be653286534e5873695"
],
"version": "==2019.6.16"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"csv-metadata-quality": {
"editable": true,
"path": "."
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"numpy": {
"hashes": [
"sha256:03e311b0a4c9f5755da7d52161280c6a78406c7be5c5cc7facfbcebb641efb7e",
"sha256:0cdd229a53d2720d21175012ab0599665f8c9588b3b8ffa6095dd7b90f0691dd",
"sha256:312bb18e95218bedc3563f26fcc9c1c6bfaaf9d453d15942c0839acdd7e4c473",
"sha256:464b1c48baf49e8505b1bb754c47a013d2c305c5b14269b5c85ea0625b6a988a",
"sha256:5adfde7bd3ee4864536e230bcab1c673f866736698724d5d28c11a4d63672658",
"sha256:7724e9e31ee72389d522b88c0d4201f24edc34277999701ccd4a5392e7d8af61",
"sha256:8d36f7c53ae741e23f54793ffefb2912340b800476eb0a831c6eb602e204c5c4",
"sha256:910d2272403c2ea8a52d9159827dc9f7c27fb4b263749dca884e2e4a8af3b302",
"sha256:951fefe2fb73f84c620bec4e001e80a80ddaa1b84dce244ded7f1e0cbe0ed34a",
"sha256:9588c6b4157f493edeb9378788dcd02cb9e6a6aeaa518b511a1c79d06cbd8094",
"sha256:9ce8300950f2f1d29d0e49c28ebfff0d2f1e2a7444830fbb0b913c7c08f31511",
"sha256:be39cca66cc6806652da97103605c7b65ee4442c638f04ff064a7efd9a81d50a",
"sha256:c3ab2d835b95ccb59d11dfcd56eb0480daea57cdf95d686d22eff35584bc4554",
"sha256:eb0fc4a492cb896346c9e2c7a22eae3e766d407df3eb20f4ce027f23f76e4c54",
"sha256:ec0c56eae6cee6299f41e780a0280318a93db519bbb2906103c43f3e2be1206c",
"sha256:f4e4612de60a4f1c4d06c8c2857cdcb2b8b5289189a12053f37d3f41f06c60d0"
],
"version": "==1.17.0"
},
"pandas": {
"hashes": [
"sha256:074a032f99bb55d178b93bd98999c971542f19317829af08c99504febd9e9b8b",
"sha256:20f1728182b49575c2f6f681b3e2af5fac9e84abdf29488e76d569a7969b362e",
"sha256:2745ba6e16c34d13d765c3657bb64fa20a0e2daf503e6216a36ed61770066179",
"sha256:32c44e5b628c48ba17703f734d59f369d4cdcb4239ef26047d6c8a8bfda29a6b",
"sha256:3b9f7dcee6744d9dcdd53bce19b91d20b4311bf904303fa00ef58e7df398e901",
"sha256:544f2033250980fb6f069ce4a960e5f64d99b8165d01dc39afd0b244eeeef7d7",
"sha256:58f9ef68975b9f00ba96755d5702afdf039dea9acef6a0cfd8ddcde32918a79c",
"sha256:9023972a92073a495eba1380824b197ad1737550fe1c4ef8322e65fe58662888",
"sha256:914341ad2d5b1ea522798efa4016430b66107d05781dbfe7cf05eba8f37df995",
"sha256:9d151bfb0e751e2c987f931c57792871c8d7ff292bcdfcaa7233012c367940ee",
"sha256:b932b127da810fef57d427260dde1ad54542c136c44b227a1e367551bb1a684b",
"sha256:cfb862aa37f4dd5be0730731fdb8185ac935aba8b51bf3bd035658111c9ee1c9",
"sha256:de7ecb4b120e98b91e8a2a21f186571266a8d1faa31d92421e979c7ca67d8e5c",
"sha256:df7e1933a0b83920769611c5d6b9a1bf301e3fa6a544641c6678c67621fe9843"
],
"index": "pypi",
"version": "==0.25.0"
},
"pycountry": {
"hashes": [
"sha256:68e58bfd3bedeea49ba9d4b38f2bd5e042f9753628eba9a819fb03f551d89096"
],
"index": "pypi",
"version": "==19.7.15"
},
"python-dateutil": {
"hashes": [
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
],
"version": "==2.8.0"
},
"python-stdnum": {
"hashes": [
"sha256:d5f0af1bee9ddd9a20b398b46ce062dbd4d41fcc9646940f2667256a44df3854",
"sha256:f445ec32bf5246c90389204cabba465f494545371c29a83fa2d30e6c872a6763"
],
"index": "pypi",
"version": "==1.11"
},
"pytz": {
"hashes": [
"sha256:26c0b32e437e54a18161324a2fca3c4b9846b74a8dccddd843113109e1116b32",
"sha256:c894d57500a4cd2d5c71114aaab77dbab5eabd9022308ce5ac9bb93a60a6f0c7"
],
"version": "==2019.2"
},
"requests": {
"hashes": [
"sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4",
"sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31"
],
"index": "pypi",
"version": "==2.22.0"
},
"requests-cache": {
"hashes": [
"sha256:6822f788c5ee248995c4bfbd725de2002ad710182ba26a666e85b64981866060",
"sha256:73a7211870f7d67af5fd81cad2f67cfe1cd3eb4ee6a85155e07613968cc72dfc"
],
"index": "pypi",
"version": "==0.5.0"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"urllib3": {
"hashes": [
"sha256:b246607a25ac80bedac05c6f282e3cdaf3afb65420fd024ac94435cabe6e18d1",
"sha256:dbe59173209418ae49d485b87d1681aefa36252ee85884c31346debd19463232"
],
"version": "==1.25.3"
},
"xlrd": {
"hashes": [
"sha256:546eb36cee8db40c3eaa46c351e67ffee6eeb5fa2650b71bc4c758a29a1b29b2",
"sha256:e551fb498759fa3a5384a94ccd4c3c02eb7c00ea424426e212ac0c57be9dfbde"
],
"index": "pypi",
"version": "==1.2.0"
}
},
"develop": {
"atomicwrites": {
"hashes": [
"sha256:03472c30eb2c5d1ba9227e4c2ca66ab8287fbfbbda3888aa93dc2e28fc6811b4",
"sha256:75a9445bac02d8d058d5e1fe689654ba5a6556a1dfd8ce6ec55a0ed79866cfa6"
],
"version": "==1.3.0"
},
"attrs": {
"hashes": [
"sha256:69c0dbf2ed392de1cb5ec704444b08a5ef81680a61cb899dc08127123af36a79",
"sha256:f0b870f674851ecbfbbbd364d6b5cbdff9dcedbc7f3f5e18a6891057f21fe399"
],
"version": "==19.1.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"flake8": {
"hashes": [
"sha256:19241c1cbc971b9962473e4438a2ca19749a7dd002dd1a946eaba171b4114548",
"sha256:8e9dfa3cecb2400b3738a42c54c3043e821682b9c840b0448c0503f781130696"
],
"index": "pypi",
"version": "==3.7.8"
},
"importlib-metadata": {
"hashes": [
"sha256:23d3d873e008a513952355379d93cbcab874c58f4f034ff657c7a87422fa64e8",
"sha256:80d2de76188eabfbfcf27e6a37342c2827801e59c4cc14b0371c56fed43820e3"
],
"version": "==0.19"
},
"ipython": {
"hashes": [
"sha256:1d3a1692921e932751bc1a1f7bb96dc38671eeefdc66ed33ee4cbc57e92a410e",
"sha256:537cd0176ff6abd06ef3e23f2d0c4c2c8a4d9277b7451544c6cbf56d1c79a83d"
],
"index": "pypi",
"version": "==7.7.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"jedi": {
"hashes": [
"sha256:53c850f1a7d3cfcd306cc513e2450a54bdf5cacd7604b74e42dd1f0758eaaf36",
"sha256:e07457174ef7cb2342ff94fa56484fe41cec7ef69b0059f01d3f812379cb6f7c"
],
"version": "==0.14.1"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"more-itertools": {
"hashes": [
"sha256:409cd48d4db7052af495b09dec721011634af3753ae1ef92d2b32f73a745f832",
"sha256:92b8c4b06dac4f0611c0729b2f2ede52b2e1bac1ab48f089c7ddc12e26bb60c4"
],
"version": "==7.2.0"
},
"packaging": {
"hashes": [
"sha256:a7ac867b97fdc07ee80a8058fe4435ccd274ecc3b0ed61d852d7d53055528cf9",
"sha256:c491ca87294da7cc01902edbe30a5bc6c4c28172b5138ab4e4aa1b9d7bfaeafe"
],
"version": "==19.1"
},
"parso": {
"hashes": [
"sha256:63854233e1fadb5da97f2744b6b24346d2750b85965e7e399bec1620232797dc",
"sha256:666b0ee4a7a1220f65d367617f2cd3ffddff3e205f3f16a0284df30e774c2a9c"
],
"version": "==0.5.1"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"pluggy": {
"hashes": [
"sha256:0825a152ac059776623854c1543d65a4ad408eb3d33ee114dff91e57ec6ae6fc",
"sha256:b9817417e95936bf75d85d3f8767f7df6cdde751fc40aed3bb3074cbcb77757c"
],
"version": "==0.12.0"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"py": {
"hashes": [
"sha256:64f65755aee5b381cea27766a3a147c3f15b9b6b9ac88676de66ba2ae36793fa",
"sha256:dc639b046a6e2cff5bbe40194ad65936d6ba360b52b3c3fe1d08a82dd50b5e53"
],
"version": "==1.8.0"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:71e430bc85c88a430f000ac1d9b331d2407f681d6f6aec95e8bcfbc3df5b0127",
"sha256:881c4c157e45f30af185c1ffe8d549d48ac9127433f2c380c24b84572ad66297"
],
"version": "==2.4.2"
},
"pyparsing": {
"hashes": [
"sha256:6f98a7b9397e206d78cc01df10131398f1c8b8510a2f4d97d9abd82e1aacdd80",
"sha256:d9338df12903bbf5d65a0e4e87c2161968b10d2e489652bb47001d82a9b028b4"
],
"version": "==2.4.2"
},
"pytest": {
"hashes": [
"sha256:6ef6d06de77ce2961156013e9dff62f1b2688aa04d0dc244299fe7d67e09370d",
"sha256:a736fed91c12681a7b34617c8fcefe39ea04599ca72c608751c31d89579a3f77"
],
"index": "pypi",
"version": "==5.0.1"
},
"pytest-clarity": {
"hashes": [
"sha256:3f40d5ae7cb21cc95e622fc4f50d9466f80ae0f91460225b8c95c07afbf93e20"
],
"index": "pypi",
"version": "==0.2.0a1"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"termcolor": {
"hashes": [
"sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
],
"version": "==1.1.0"
},
"traitlets": {
"hashes": [
"sha256:9c4bd2d267b7153df9152698efb1050a5d84982d3384a37b2c1f7723ba3e7835",
"sha256:c6cb5e6f57c5a9bdaa40fa71ce7b4af30298fbab9ece9815b5d995ab6217c7d9"
],
"version": "==4.3.2"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
},
"zipp": {
"hashes": [
"sha256:4970c3758f4e89a7857a973b1e2a5d75bcdc47794442f2e2dd4fe8e0466e809a",
"sha256:8a5712cfd3bb4248015eb3b0b3c54a5f6ee3f2425963ef2a0125b8bc40aafaec"
],
"version": "==0.5.2"
}
}
}

@@ -1,31 +1,40 @@
-# CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
-A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
-Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+# DSpace CSV Metadata Quality Checker ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
+A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, unnecessary Unicode, AGROVOC terms, etc.
+Requires Python 3.7.1 or greater (3.8+ recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
If you use the DSpace CSV metadata quality checker please cite:
*Orth, A. 2019. DSpace CSV metadata quality checker. Nairobi, Kenya: ILRI. https://hdl.handle.net/10568/110997.*
## Functionality
- Validate dates, ISSNs, ISBNs, and multi-value separators ("||")
-- Validate languages against ISO 639-2 and ISO 639-3
+- Validate languages against ISO 639-1 (alpha2) and ISO 639-3 (alpha3)
- Experimental validation of titles and abstracts against item's Dublin Core language field
- Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
- Validation of licenses against the list of [SPDX license identifiers](https://spdx.org/licenses)
- Fix leading, trailing, and excessive (ie, more than one) whitespace
-- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
+- Fix invalid and unnecessary multi-value separators (`|`)
- Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Perform [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html) on strings using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Check for "mojibake" characters (and attempt to fix with `--unsafe-fixes`)
- Remove duplicate metadata values
- Check for duplicate items, using the title, type, and date issued as an indicator
## Installation
-The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
+The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
$ cd csv-metadata-quality
-$ pipenv install
-$ pipenv shell
+$ poetry install
+$ poetry shell
```
-Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
+Otherwise, if you don't have poetry, you can use a vanilla Python virtual environment:
```
$ git clone https://github.com/ilri/csv-metadata-quality.git
@@ -48,15 +57,33 @@ To validate and clean a CSV file you must specify input and output files using t
$ csv-metadata-quality -i data/test.csv -o /tmp/test.csv
```
-## Unsafe Fixes
-You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will attempt to fix invalid multi-value separators and remove newlines.
-### Invalid Multi-Value Separators
-This is considered "unsafe" because it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, though in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. The `--unsafe-fixes` option will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
+## Invalid Multi-Value Separators
+While it is *theoretically* possible for a single `|` character to be used legitimately in a metadata value, in my experience it is always a typo. For example, if a user mistakenly writes `Kenya|Tanzania` when attempting to indicate two countries, the result will be one metadata value with the literal text `Kenya|Tanzania`. This utility will correct the invalid multi-value separator so that there are two metadata values, ie `Kenya||Tanzania`.
+This will also remove unnecessary trailing multi-value separators, for example `Kenya||Tanzania||`.
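The separator fix can be illustrated with a few lines of standard-library Python. This is only a minimal sketch of the idea, assuming a value is repaired by collapsing any run of pipes to `||` and dropping trailing separators; the helper name `fix_separators` is hypothetical and this is not necessarily how the project implements the fix.

```python
import re


def fix_separators(value: str) -> str:
    """Normalize multi-value separators in one metadata value (hypothetical helper)."""
    # Collapse any run of pipes (|, |||, ...) into the standard "||" separator
    value = re.sub(r"\|+", "||", value)
    # Drop separators left dangling at the end of the value
    return value.rstrip("|")


print(fix_separators("Kenya|Tanzania"))     # Kenya||Tanzania
print(fix_separators("Kenya||Tanzania||"))  # Kenya||Tanzania
```

A value that is already correct, such as `Kenya||Tanzania`, passes through unchanged.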
## Unsafe Fixes
You can enable several "unsafe" fixes with the `--unsafe-fixes` option. Currently this will remove newlines, perform Unicode normalization, and attempt to fix "mojibake" characters.
### Newlines
This is considered "unsafe" because some systems give special importance to vertical space and render it properly. DSpace does not support rendering newlines in its XMLUI and has, at times, suffered from parsing errors that cause the import process to fail if an input file had newlines. The `--unsafe-fixes` option strips Unix line feeds (U+000A).
### Unicode Normalization
[Unicode](https://en.wikipedia.org/wiki/Unicode) is a standard for encoding text. As the standard aims to support most of the world's languages, characters can often be represented in different ways and still be valid Unicode. This leads to interesting problems that can be confusing unless you know what's going on behind the scenes. For example, the characters `é` and `é` *look* the same, but are not; technically they refer to different code points in the Unicode standard:
- `é` is the Unicode code point `U+00E9`
- `é` is the Unicode code points `U+0065` + `U+0301`
Read more about [Unicode normalization](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html).
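The difference between the two representations can be seen with Python's built-in `unicodedata` module. This is a small sketch, not the project's code; the choice of the NFC and NFD forms here is an assumption made for illustration, since the text above does not say which normalization form the tool applies.

```python
import unicodedata

composed = "\u00e9"     # U+00E9, a single precomposed code point
decomposed = "e\u0301"  # U+0065 followed by the combining acute accent U+0301

# The two strings render identically but do not compare equal
print(composed == decomposed)  # False

# Normalizing to NFC composes the pair into the single code point
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Normalizing to NFD decomposes the single code point into the pair
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalizing every value to one form before comparing or de-duplicating avoids treating visually identical strings as different.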
### Encoding Issues aka "Mojibake"
[Mojibake](https://en.wikipedia.org/wiki/Mojibake) is a phenomenon that occurs when text is decoded using an unintended character encoding. This usually presents itself in the form of strange, garbled characters in the text. Enabling "unsafe" fixes will attempt to correct these, for example:
- CIAT PublicaÃ§ao → CIAT Publicaçao
- CIAT PublicaciÃ³n → CIAT Publicación
Pay special attention to the output of the script as well as the resulting file to make sure no new issues have been introduced. The ideal way to solve these issues is to avoid them in the first place. See [this guide about opening CSVs in UTF-8 format in Excel](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0).
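To see how this kind of damage arises and why it is sometimes reversible, the sketch below reproduces one common case using only the standard library: UTF-8 bytes misread as Windows-1252. The helper `repair_mojibake` is hypothetical and handles just this single round trip; it is not the detection logic used by this tool, which relies on the ftfy library.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 text decoded as Windows-1252 (hypothetical helper)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this particular kind of mojibake, so leave the text alone
        return text


original = "CIAT Publicaci\u00f3n"  # "CIAT Publicación"
# Simulate the damage: the UTF-8 bytes are misread as Windows-1252
garbled = original.encode("utf-8").decode("cp1252")

print(garbled == "CIAT Publicaci\u00c3\u00b3n")  # True
print(repair_mojibake(garbled) == original)      # True
print(repair_mojibake(original) == original)     # True
```

The last line shows why the fallback matters: text that was never garbled fails the reverse round trip and is returned untouched.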
## AGROVOC Validation
You can enable validation of metadata values in certain fields against the AGROVOC REST API with the `--agrovoc-fields` option. For example, in addition to agricultural subjects, many countries and regions are also present in AGROVOC. Enable this validation by specifying a comma-separated list of fields:
@@ -69,18 +96,36 @@ Invalid AGROVOC (cg.coverage.country): KENYAA
*Note: Requests to the AGROVOC REST API are cached using [requests_cache](https://pypi.org/project/requests-cache/) to speed up subsequent runs with the same data and to be kind to the system's administrators.*
## Experimental Checks
You can enable experimental support for validating whether the value of an item's `dc.language.iso` or `dcterms.language` field matches the actual language used in its title, abstract, and citation.
```
$ csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e
...
Possibly incorrect language es (detected en): Incorrect ISO 639-1 language
Possibly incorrect language spa (detected eng): Incorrect ISO 639-3 language
```
This currently uses the [Python langid](https://github.com/saffsd/langid.py) library. In the future I would like to move to the fastText library, but there is currently an [issue with their Python bindings](https://github.com/facebookresearch/fastText/issues/909) that makes this unfeasible.
## Todo
- Reporting / summary
- Better logging, for example with INFO, WARN, and ERR levels
- Verbose, debug, or quiet options
- Warn if an author is shorter than 3 characters?
- Validate dc.rights field against SPDX? Perhaps with an option like `-m spdx` to enable the spdx module?
- Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
- Warn if two items use the same file in `filename` column
- Add an option to drop invalid AGROVOC subjects?
- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
- Add tests for application invocation, ie `tests/test_app.py`?
- Validate ISSNs or journal titles against CrossRef API?
- Add configurable field validation, like specify a field name and a validation file?
  - Perhaps like --validate=field.name,filename
- Add some row-based item sanity checks and fixes:
  - Warn if item is Open Access, but missing a filename or URL
  - Warn if item is Open Access, but missing a license
  - Warn if item has an ISSN but no journal title
  - Update journal titles from ISSN
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).


@@ -1,10 +1,13 @@
from csv_metadata_quality import app # SPDX-License-Identifier: GPL-3.0-only
from sys import argv from sys import argv
from csv_metadata_quality import app
def main(): def main():
app.run(argv) app.run(argv)
if __name__ == '__main__': if __name__ == "__main__":
main() main()


@@ -1,21 +1,57 @@
from csv_metadata_quality.version import VERSION # SPDX-License-Identifier: GPL-3.0-only
import argparse import argparse
import csv_metadata_quality.check as check
import csv_metadata_quality.fix as fix
import pandas as pd
import re import re
import signal import signal
import sys import sys
import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental
import csv_metadata_quality.fix as fix
from csv_metadata_quality.version import VERSION
def parse_args(argv): def parse_args(argv):
parser = argparse.ArgumentParser(description='Metadata quality checker and fixer.') parser = argparse.ArgumentParser(description="Metadata quality checker and fixer.")
parser.add_argument('--agrovoc-fields', '-a', help='Comma-separated list of fields to validate against AGROVOC, for example: dc.subject,cg.coverage.country') parser.add_argument(
parser.add_argument('--input-file', '-i', help='Path to input file. Can be UTF-8 CSV or Excel XLSX.', required=True, type=argparse.FileType('r', encoding='UTF-8')) "--agrovoc-fields",
parser.add_argument('--output-file', '-o', help='Path to output file (always CSV).', required=True, type=argparse.FileType('w', encoding='UTF-8')) "-a",
parser.add_argument('--unsafe-fixes', '-u', help='Perform unsafe fixes.', action='store_true') help="Comma-separated list of fields to validate against AGROVOC, for example: dcterms.subject,cg.coverage.country",
parser.add_argument('--version', '-V', action='version', version=f'CSV Metadata Quality v{VERSION}') )
parser.add_argument('--exclude-fields', '-x', help='Comma-separated list of fields to skip, for example: dc.contributor.author,dc.identifier.citation') parser.add_argument(
"--experimental-checks",
"-e",
help="Enable experimental checks like language detection",
action="store_true",
)
parser.add_argument(
"--input-file",
"-i",
help="Path to input file. Can be UTF-8 CSV or Excel XLSX.",
required=True,
type=argparse.FileType("r", encoding="UTF-8"),
)
parser.add_argument(
"--output-file",
"-o",
help="Path to output file (always CSV).",
required=True,
type=argparse.FileType("w", encoding="UTF-8"),
)
parser.add_argument(
"--unsafe-fixes", "-u", help="Perform unsafe fixes.", action="store_true"
)
parser.add_argument(
"--version", "-V", action="version", version=f"CSV Metadata Quality v{VERSION}"
)
parser.add_argument(
"--exclude-fields",
"-x",
help="Comma-separated list of fields to skip, for example: dc.contributor.author,dcterms.bibliographicCitation",
)
args = parser.parse_args() args = parser.parse_args()
return args return args
@@ -34,22 +70,22 @@ def run(argv):
# Read all fields as strings so dates don't get converted from 1998 to 1998.0 # Read all fields as strings so dates don't get converted from 1998 to 1998.0
df = pd.read_csv(args.input_file, dtype=str) df = pd.read_csv(args.input_file, dtype=str)
for column in df.columns.values.tolist(): for column in df.columns:
# Check if the user requested to skip any fields # Check if the user requested to skip any fields
if args.exclude_fields: if args.exclude_fields:
skip = False skip = False
# Split the list of excludes on ',' so we can test exact matches # Split the list of excludes on ',' so we can test exact matches
# rather than fuzzy matches with regexes or "if word in string" # rather than fuzzy matches with regexes or "if word in string"
for exclude in args.exclude_fields.split(','): for exclude in args.exclude_fields.split(","):
if column == exclude and skip is False: if column == exclude and skip is False:
skip = True skip = True
if skip: if skip:
print(f'Skipping {column}') print(f"{Fore.YELLOW}Skipping {Fore.RESET}{column}")
continue continue
# Fix: whitespace # Fix: whitespace
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: newlines # Fix: newlines
if args.unsafe_fixes: if args.unsafe_fixes:
@@ -58,58 +94,101 @@ def run(argv):
# Fix: missing space after comma. Only run on author and citation # Fix: missing space after comma. Only run on author and citation
# fields for now, as this problem is mostly an issue in names. # fields for now, as this problem is mostly an issue in names.
if args.unsafe_fixes: if args.unsafe_fixes:
match = re.match(r'^.*?(author|citation).*$', column) match = re.match(r"^.*?(author|citation).*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(fix.comma_space, field_name=column) df[column] = df[column].apply(fix.comma_space, field_name=column)
# Fix: perform Unicode normalization (NFC) to convert decomposed
# characters into their canonical forms.
if args.unsafe_fixes:
df[column] = df[column].apply(fix.normalize_unicode, field_name=column)
# Fix: unnecessary Unicode # Fix: unnecessary Unicode
df[column] = df[column].apply(fix.unnecessary_unicode) df[column] = df[column].apply(fix.unnecessary_unicode)
# Check: invalid multi-value separator
df[column] = df[column].apply(check.separators)
# Check: suspicious characters # Check: suspicious characters
df[column] = df[column].apply(check.suspicious_characters, field_name=column) df[column].apply(check.suspicious_characters, field_name=column)
# Fix: invalid multi-value separator # Check: mojibake
df[column].apply(check.mojibake, field_name=column)
# Fix: mojibake
if args.unsafe_fixes: if args.unsafe_fixes:
df[column] = df[column].apply(fix.separators) df[column] = df[column].apply(fix.mojibake, field_name=column)
# Fix: invalid and unnecessary multi-value separators
df[column] = df[column].apply(fix.separators, field_name=column)
# Run whitespace fix again after fixing invalid separators # Run whitespace fix again after fixing invalid separators
df[column] = df[column].apply(fix.whitespace) df[column] = df[column].apply(fix.whitespace, field_name=column)
# Fix: duplicate metadata values # Fix: duplicate metadata values
df[column] = df[column].apply(fix.duplicates) df[column] = df[column].apply(fix.duplicates, field_name=column)
# Check: invalid AGROVOC subject # Check: invalid AGROVOC subject
if args.agrovoc_fields: if args.agrovoc_fields:
# Identify fields the user wants to validate against AGROVOC # Identify fields the user wants to validate against AGROVOC
for field in args.agrovoc_fields.split(','): for field in args.agrovoc_fields.split(","):
if column == field: if column == field:
df[column] = df[column].apply(check.agrovoc, field_name=column) df[column].apply(check.agrovoc, field_name=column)
# Check: invalid language # Check: invalid language
match = re.match(r'^.*?language.*$', column) match = re.match(r"^.*?language.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.language) df[column].apply(check.language)
# Check: invalid ISSN # Check: invalid ISSN
match = re.match(r'^.*?issn.*$', column) match = re.match(r"^.*?issn.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.issn) df[column].apply(check.issn)
# Check: invalid ISBN # Check: invalid ISBN
match = re.match(r'^.*?isbn.*$', column) match = re.match(r"^.*?isbn.*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.isbn) df[column].apply(check.isbn)
# Check: invalid date # Check: invalid date
match = re.match(r'^.*?date.*$', column) match = re.match(r"^.*?(date|dcterms\.issued).*$", column)
if match is not None: if match is not None:
df[column] = df[column].apply(check.date, field_name=column) df[column].apply(check.date, field_name=column)
# Check: filename extension # Check: filename extension
if column == 'filename': if column == "filename":
df[column] = df[column].apply(check.filename_extension) df[column].apply(check.filename_extension)
# Check: SPDX license identifier
match = re.match(r"dcterms\.license.*$", column)
if match is not None:
df[column].apply(check.spdx_license_identifier)
### End individual column checks ###
# Check: duplicate items
# We extract just the title, type, and date issued columns to analyze
duplicates_df = df.filter(
regex=r"dcterms\.title|dc\.title|dcterms\.type|dc\.type|dcterms\.issued|dc\.date\.issued"
)
check.duplicate_items(duplicates_df)
# Delete the temporary duplicates DataFrame
del duplicates_df
##
# Perform some checks on rows so we can consider items as a whole rather
# than simply on a field-by-field basis. This allows us to check whether
# the language used in the title and abstract matches the language indi-
# cated in the language field, for example.
#
# This is slower and apparently frowned upon in the Pandas community be-
# cause it requires iterating over rows rather than using apply over a
# column. For now it will have to do.
##
if args.experimental_checks:
# Transpose the DataFrame so we can consider each row as a column
df_transposed = df.T
for column in df_transposed.columns:
experimental.correct_language(df_transposed[column])
# Write # Write
df.to_csv(args.output_file, index=False) df.to_csv(args.output_file, index=False)
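The exclusion logic in `run()` above splits `--exclude-fields` on commas so that only exact column names are skipped. A minimal sketch of why that matters, using a hypothetical helper `is_excluded` that is not part of the tool:

```python
def is_excluded(column: str, exclude_fields: str) -> bool:
    """Return True only when the column exactly matches one excluded name."""
    return column in exclude_fields.split(",")


excludes = "dc.contributor.author,dcterms.bibliographicCitation"

# Exact match: skipped
print(is_excluded("dc.contributor.author", excludes))

# A substring test ("if word in string") would wrongly match this column,
# but the exact comparison does not
print("dc.contributor" in excludes)
print(is_excluded("dc.contributor", excludes))
```

The membership test on the split list behaves like the loop with the `skip` flag in the diff, just in one expression.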


@@ -1,4 +1,19 @@
# SPDX-License-Identifier: GPL-3.0-only
import os
import re
from datetime import datetime, timedelta
import pandas as pd import pandas as pd
import requests
import requests_cache
import spdx_license_list
from colorama import Fore
from pycountry import languages
from stdnum import isbn as stdnum_isbn
from stdnum import issn as stdnum_issn
from csv_metadata_quality.util import is_mojibake
def issn(field): def issn(field):
@@ -11,19 +26,17 @@ def issn(field):
See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid
""" """
from stdnum import issn
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split('||'): for value in field.split("||"):
if not issn.is_valid(value): if not stdnum_issn.is_valid(value):
print(f'Invalid ISSN: {value}') print(f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}")
return field return
def isbn(field): def isbn(field):
@@ -36,44 +49,18 @@ def isbn(field):
See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid See: https://arthurdejong.org/python-stdnum/doc/1.11/index.html#stdnum.module.is_valid
""" """
from stdnum import isbn
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split('||'): for value in field.split("||"):
if not isbn.is_valid(value): if not stdnum_isbn.is_valid(value):
print(f'Invalid ISBN: {value}') print(f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}")
return field
def separators(field):
"""Check for invalid multi-value separators (ie "|" or "|||").
Prints the field with the invalid multi-value separator.
"""
import re
# Skip fields with missing values
if pd.isna(field):
return return
# Try to split multi-value field on "||" separator
for value in field.split('||'):
# After splitting, see if there are any remaining "|" characters
match = re.findall(r'^.*?\|.*$', value)
if match:
print(f'Invalid multi-value separator: {field}')
return field
def date(field, field_name): def date(field, field_name):
"""Check if a date is valid. """Check if a date is valid.
@@ -85,47 +72,56 @@ def date(field, field_name):
Prints the date if invalid. Prints the date if invalid.
""" """
from datetime import datetime
if pd.isna(field): if pd.isna(field):
print(f'Missing date ({field_name}).') print(f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}")
return return
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
multiple_dates = field.split('||') multiple_dates = field.split("||")
# We don't allow multi-value date fields # We don't allow multi-value date fields
if len(multiple_dates) > 1: if len(multiple_dates) > 1:
print(f'Multiple dates not allowed ({field_name}): {field}') print(
f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{field}"
)
return field return
try: try:
# Check if date is valid YYYY format # Check if date is valid YYYY format
datetime.strptime(field, '%Y') datetime.strptime(field, "%Y")
return field return
except ValueError: except ValueError:
pass pass
try: try:
# Check if date is valid YYYY-MM format # Check if date is valid YYYY-MM format
datetime.strptime(field, '%Y-%m') datetime.strptime(field, "%Y-%m")
return field return
except ValueError: except ValueError:
pass pass
try: try:
# Check if date is valid YYYY-MM-DD format # Check if date is valid YYYY-MM-DD format
datetime.strptime(field, '%Y-%m-%d') datetime.strptime(field, "%Y-%m-%d")
return field return
except ValueError: except ValueError:
print(f'Invalid date ({field_name}): {field}') pass
return field try:
# Check if date is valid YYYY-MM-DDTHH:MM:SSZ format
datetime.strptime(field, "%Y-%m-%dT%H:%M:%SZ")
return
except ValueError:
print(f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{field}")
return
def suspicious_characters(field, field_name): def suspicious_characters(field, field_name):
@@ -140,7 +136,7 @@ def suspicious_characters(field, field_name):
return return
# List of suspicious characters, for example: ́ˆ~` # List of suspicious characters, for example: ́ˆ~`
suspicious_characters = ['\u00B4', '\u02C6', '\u007E', '\u0060'] suspicious_characters = ["\u00B4", "\u02C6", "\u007E", "\u0060"]
for character in suspicious_characters: for character in suspicious_characters:
# Find the position of the suspicious character in the string # Find the position of the suspicious character in the string
@@ -156,20 +152,18 @@ def suspicious_characters(field, field_name):
# character and spanning enough of the rest to give a preview, # character and spanning enough of the rest to give a preview,
# but not too much to cause the line to break in terminals with # but not too much to cause the line to break in terminals with
# a default of 80 characters width. # a default of 80 characters width.
suspicious_character_msg = f'Suspicious character ({field_name}): {field_subset}' suspicious_character_msg = f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}{field_subset}"
print(f'{suspicious_character_msg:1.80}') print(f"{suspicious_character_msg:1.80}")
return field return
def language(field): def language(field):
"""Check if a language is valid ISO 639-2 or ISO 639-3. """Check if a language is valid ISO 639-1 (alpha 2) or ISO 639-3 (alpha 3).
Prints the value if it is invalid. Prints the value if it is invalid.
""" """
from pycountry import languages
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
@@ -177,24 +171,24 @@ def language(field):
# need to handle "Other" values here... # need to handle "Other" values here...
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split('||'): for value in field.split("||"):
# After splitting, check if language value is 2 or 3 characters so we # After splitting, check if language value is 2 or 3 characters so we
# can check it against ISO 639-2 or ISO 639-3 accordingly. # can check it against ISO 639-1 or ISO 639-3 accordingly.
if len(value) == 2: if len(value) == 2:
if not languages.get(alpha_2=value): if not languages.get(alpha_2=value):
print(f'Invalid ISO 639-2 language: {value}') print(f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}")
pass pass
elif len(value) == 3: elif len(value) == 3:
if not languages.get(alpha_3=value): if not languages.get(alpha_3=value):
print(f'Invalid ISO 639-3 language: {value}') print(f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}")
pass pass
else: else:
print(f'Invalid language: {value}') print(f"{Fore.RED}Invalid language: {Fore.RESET}{value}")
return field return
def agrovoc(field, field_name): def agrovoc(field, field_name):
@@ -211,35 +205,38 @@ def agrovoc(field, field_name):
Prints a warning if the value is invalid. Prints a warning if the value is invalid.
""" """
from datetime import timedelta
import requests
import requests_cache
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# Try to split multi-value field on "||" separator
for value in field.split('||'):
request_url = f'http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search?query={value}'
# enable transparent request cache with thirty days expiry # enable transparent request cache with thirty days expiry
expire_after = timedelta(days=30) expire_after = timedelta(days=30)
requests_cache.install_cache('agrovoc-response-cache', expire_after=expire_after) # Allow overriding the location of the requests cache, just in case we are
# running in an environment where we can't write to the current working di-
request = requests.get(request_url) # rectory (for example from csv-metadata-quality-web).
REQUESTS_CACHE_DIR = os.environ.get("REQUESTS_CACHE_DIR", ".")
requests_cache.install_cache(
f"{REQUESTS_CACHE_DIR}/agrovoc-response-cache", expire_after=expire_after
)
# prune old cache entries # prune old cache entries
requests_cache.core.remove_expired_responses() requests_cache.core.remove_expired_responses()
# Try to split multi-value field on "||" separator
for value in field.split("||"):
request_url = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/agrovoc/search"
request_params = {"query": value}
request = requests.get(request_url, params=request_params)
if request.status_code == requests.codes.ok: if request.status_code == requests.codes.ok:
data = request.json() data = request.json()
# check if there are any results # check if there are any results
if len(data['results']) == 0: if len(data["results"]) == 0:
print(f'Invalid AGROVOC ({field_name}): {value}') print(f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}")
return field return
def filename_extension(field): def filename_extension(field):
@@ -253,17 +250,23 @@ def filename_extension(field):
than .pdf, .xls(x), .doc(x), ppt(x), case insensitive). than .pdf, .xls(x), .doc(x), ppt(x), case insensitive).
""" """
import re
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
return return
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
values = field.split('||') values = field.split("||")
# List of common filename extensions # List of common filename extensions
common_filename_extensions = ['.pdf', '.doc', '.docx', '.ppt', '.pptx', '.xls', '.xlsx'] common_filename_extensions = [
".pdf",
".doc",
".docx",
".ppt",
".pptx",
".xls",
".xlsx",
]
# Iterate over all values # Iterate over all values
for value in values: for value in values:
@@ -272,7 +275,7 @@ def filename_extension(field):
for filename_extension in common_filename_extensions: for filename_extension in common_filename_extensions:
# Check for extension at the end of the filename # Check for extension at the end of the filename
pattern = re.escape(filename_extension) + r'$' pattern = re.escape(filename_extension) + r"$"
match = re.search(pattern, value, re.IGNORECASE) match = re.search(pattern, value, re.IGNORECASE)
if match is not None: if match is not None:
@@ -282,6 +285,86 @@ def filename_extension(field):
break break
if filename_extension_match is False: if filename_extension_match is False:
print(f'Filename with uncommon extension: {value}') print(f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}")
return field return
def spdx_license_identifier(field):
"""Check if a license is a valid SPDX identifier.
Prints the value if it is invalid.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Try to split multi-value field on "||" separator
for value in field.split("||"):
if value not in spdx_license_list.LICENSES:
print(f"{Fore.YELLOW}Non-SPDX license identifier: {Fore.RESET}{value}")
pass
return
def duplicate_items(df):
"""Attempt to identify duplicate items.
First we check the total number of titles and compare it with the number of
unique titles. If there are fewer unique titles than total titles we expand
the search by creating a key (of sorts) for each item that includes their
title, type, and date issued, and compare it with all the others. If there
are multiple occurrences of the same title, type, date string then it's a
very good indicator that the items are duplicates.
"""
# Extract the names of the title, type, and date issued columns so we can
# reference them later. First we filter columns by likely patterns, then
# we extract the name from the first item of the resulting object, ie:
#
# Index(['dcterms.title[en_US]'], dtype='object')
#
title_column_name = df.filter(regex=r"dcterms\.title|dc\.title").columns[0]
type_column_name = df.filter(regex=r"dcterms\.type|dc\.type").columns[0]
date_column_name = df.filter(
regex=r"dcterms\.issued|dc\.date\.issued"
).columns[0]
items_count_total = df[title_column_name].count()
items_count_unique = df[title_column_name].nunique()
if items_count_unique < items_count_total:
# Create a list to hold our items while we check for duplicates
items = list()
for index, row in df.iterrows():
item_title_type_date = f"{row[title_column_name]}{row[type_column_name]}{row[date_column_name]}"
if item_title_type_date in items:
print(
f"{Fore.YELLOW}Possible duplicate ({title_column_name}): {Fore.RESET}{row[title_column_name]}"
)
else:
items.append(item_title_type_date)
def mojibake(field, field_name):
"""Check for mojibake (text that was encoded in one encoding and decoded in
another, perhaps multiple times). See util.py.
Prints the string if it contains suspected mojibake.
"""
# Skip fields with missing values
if pd.isna(field):
return
if is_mojibake(field):
print(
f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}"
)
return
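The cascading `try`/`except` blocks in `date()` above accept a value as soon as one of four formats parses. The same idea can be sketched as a loop over the format strings. `DATE_FORMATS` and `is_valid_date` are illustrative names, not part of the module, and the sketch leaves out the missing-value and multi-value handling.

```python
from datetime import datetime

# The four formats the check accepts, from least to most specific
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ")


def is_valid_date(value: str) -> bool:
    """Return True if the value parses with any of the accepted formats."""
    for date_format in DATE_FORMATS:
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            # Not this format; try the next one
            continue
    return False


for candidate in ("1998", "1998-02", "1998-02-30", "2021-03-19T16:26:31Z"):
    print(candidate, is_valid_date(candidate))
```

Because `strptime` validates the calendar as well as the shape, a value such as `1998-02-30` is rejected even though it looks like `YYYY-MM-DD`.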


@@ -0,0 +1,98 @@
# SPDX-License-Identifier: GPL-3.0-only
import re
import langid
import pandas as pd
from colorama import Fore
from pycountry import languages
def correct_language(row):
"""Analyze the text used in the title, abstract, and citation fields to pre-
dict the language being used and compare it with the item's dc.language.iso
field.
Function prints a warning if the language field does not match the detected
language. It does not return a value in either case.
"""
# Initialize some variables at function scope so that we can set them in the
# loop scope below and still be able to access them afterwards.
language = ""
sample_strings = list()
title = None
# Iterate over the labels of the current row's values. Before we transposed
# the DataFrame these were the columns in the CSV, ie dc.title and dc.type.
for label in row.axes[0]:
# Skip fields with missing values
if pd.isna(row[label]):
continue
# Check if current row has multiple language values (separated by "||")
match = re.match(r"^.*?language.*$", label)
if match is not None:
# Skip fields with multiple language values
if "||" in row[label]:
return
language = row[label]
# Extract title if it is present
match = re.match(r"^.*?title.*$", label)
if match is not None:
title = row[label]
# Append title to sample strings
sample_strings.append(row[label])
# Extract abstract if it is present
match = re.match(r"^.*?abstract.*$", label)
if match is not None:
sample_strings.append(row[label])
# Extract citation if it is present
match = re.match(r"^.*?citation.*$", label)
if match is not None:
sample_strings.append(row[label])
# Make sure language is not blank and is valid ISO 639-1/639-3 before proceeding with language prediction
if language != "":
# Check language value like "es"
if len(language) == 2:
if not languages.get(alpha_2=language):
return
# Check language value like "spa"
elif len(language) == 3:
if not languages.get(alpha_3=language):
return
# Language value is something else like "Span", do not proceed
else:
return
# Language is blank, do not proceed
else:
return
# Concatenate all sample strings into one string
sample_text = " ".join(sample_strings)
# Restrict the langid detection space to reduce false positives
langid.set_languages(
["ar", "de", "en", "es", "fr", "hi", "it", "ja", "ko", "pt", "ru", "vi", "zh"]
)
langid_classification = langid.classify(sample_text)
# langid returns an ISO 639-1 (alpha 2) representation of the detected language, but the current item's language field might be ISO 639-3 (alpha 3), so we use a pycountry Language object to compare both representations and give error messages that match the format used in the input file.
detected_language = languages.get(alpha_2=langid_classification[0])
if len(language) == 2 and language != detected_language.alpha_2:
print(
f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_2}): {Fore.RESET}{title}"
)
elif len(language) == 3 and language != detected_language.alpha_3:
print(
f"{Fore.YELLOW}Possibly incorrect language {language} (detected {detected_language.alpha_3}): {Fore.RESET}{title}"
)
else:
return


@@ -1,8 +1,16 @@
import pandas as pd # SPDX-License-Identifier: GPL-3.0-only
import re import re
from unicodedata import normalize
import pandas as pd
from colorama import Fore
from ftfy import fix_text
from csv_metadata_quality.util import is_mojibake, is_nfc
def whitespace(field): def whitespace(field, field_name):
"""Fix whitespace issues. """Fix whitespace issues.
Return string with leading, trailing, and consecutive whitespace trimmed. Return string with leading, trailing, and consecutive whitespace trimmed.
@@ -16,29 +24,38 @@ def whitespace(field):
values = list() values = list()
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split('||'): for value in field.split("||"):
# Strip leading and trailing whitespace # Strip leading and trailing whitespace
value = value.strip() value = value.strip()
# Replace excessive whitespace (>2) with one space # Replace excessive whitespace (>2) with one space
pattern = re.compile(r'\s{2,}') pattern = re.compile(r"\s{2,}")
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f'Excessive whitespace: {value}') print(
value = re.sub(pattern, ' ', value) f"{Fore.GREEN}Removing excessive whitespace ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, " ", value)
# Save cleaned value # Save cleaned value
values.append(value) values.append(value)
# Create a new field consisting of all values joined with "||" # Create a new field consisting of all values joined with "||"
new_field = '||'.join(values) new_field = "||".join(values)
return new_field return new_field
def separators(field): def separators(field, field_name):
"""Fix for invalid multi-value separators (ie "|").""" """Fix for invalid and unnecessary multi-value separators, for example:
value|value
value|||value
value||value||
Prints the field with the invalid multi-value separator.
"""
# Skip fields with missing values # Skip fields with missing values
if pd.isna(field): if pd.isna(field):
@@ -48,21 +65,31 @@ def separators(field):
values = list() values = list()
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
for value in field.split('||'): for value in field.split("||"):
# Check if the value is blank and skip it
if value == "":
print(
f"{Fore.GREEN}Fixing unnecessary multi-value separator ({field_name}): {Fore.RESET}{field}"
)
continue
# After splitting, see if there are any remaining "|" characters # After splitting, see if there are any remaining "|" characters
pattern = re.compile(r'\|') pattern = re.compile(r"\|")
match = re.findall(pattern, value) match = re.findall(pattern, value)
if match: if match:
print(f'Fixing invalid multi-value separator: {value}') print(
f"{Fore.GREEN}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
)
value = re.sub(pattern, '||', value) value = re.sub(pattern, "||", value)
# Save cleaned value # Save cleaned value
values.append(value) values.append(value)
# Create a new field consisting of all values joined with "||" # Create a new field consisting of all values joined with "||"
new_field = '||'.join(values) new_field = "||".join(values)
return new_field return new_field
@@ -73,10 +100,10 @@ def unnecessary_unicode(field):
Removes unnecessary Unicode characters like: Removes unnecessary Unicode characters like:
- Zero-width space (U+200B) - Zero-width space (U+200B)
- Replacement character (U+FFFD) - Replacement character (U+FFFD)
- No-break space (U+00A0)
Replaces unnecessary Unicode characters like: Replaces unnecessary Unicode characters like:
- Soft hyphen (U+00AD) → hyphen - Soft hyphen (U+00AD) → hyphen
- No-break space (U+00A0) → space
Return string with characters removed or replaced. Return string with characters removed or replaced.
""" """
@@ -86,41 +113,45 @@ def unnecessary_unicode(field):
return return
# Check for zero-width space characters (U+200B) # Check for zero-width space characters (U+200B)
pattern = re.compile(r'\u200B') pattern = re.compile(r"\u200B")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f'Removing unnecessary Unicode (U+200B): {field}') print(f"{Fore.GREEN}Removing unnecessary Unicode (U+200B): {Fore.RESET}{field}")
field = re.sub(pattern, '', field) field = re.sub(pattern, "", field)
# Check for replacement characters (U+FFFD) # Check for replacement characters (U+FFFD)
pattern = re.compile(r'\uFFFD') pattern = re.compile(r"\uFFFD")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f'Removing unnecessary Unicode (U+FFFD): {field}') print(f"{Fore.GREEN}Removing unnecessary Unicode (U+FFFD): {Fore.RESET}{field}")
field = re.sub(pattern, '', field) field = re.sub(pattern, "", field)
# Check for no-break spaces (U+00A0) # Check for no-break spaces (U+00A0)
pattern = re.compile(r'\u00A0') pattern = re.compile(r"\u00A0")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f'Removing unnecessary Unicode (U+00A0): {field}') print(
field = re.sub(pattern, '', field) f"{Fore.GREEN}Replacing unnecessary Unicode (U+00A0): {Fore.RESET}{field}"
)
field = re.sub(pattern, " ", field)
# Check for soft hyphens (U+00AD), sometimes preceded with a normal hyphen # Check for soft hyphens (U+00AD), sometimes preceded with a normal hyphen
pattern = re.compile(r'\u002D*?\u00AD') pattern = re.compile(r"\u002D*?\u00AD")
match = re.findall(pattern, field) match = re.findall(pattern, field)
if match: if match:
print(f'Replacing unnecessary Unicode (U+00AD): {field}') print(
field = re.sub(pattern, '-', field) f"{Fore.GREEN}Replacing unnecessary Unicode (U+00AD): {Fore.RESET}{field}"
)
field = re.sub(pattern, "-", field)
return field return field
def duplicates(field): def duplicates(field, field_name):
"""Remove duplicate metadata values.""" """Remove duplicate metadata values."""
# Skip fields with missing values # Skip fields with missing values
@@ -128,7 +159,7 @@ def duplicates(field):
return return
# Try to split multi-value field on "||" separator # Try to split multi-value field on "||" separator
values = field.split('||') values = field.split("||")
# Initialize an empty list to hold the de-duplicated values # Initialize an empty list to hold the de-duplicated values
new_values = list() new_values = list()
@@ -139,10 +170,12 @@ def duplicates(field):
if value not in new_values: if value not in new_values:
new_values.append(value) new_values.append(value)
else: else:
print(f'Dropping duplicate value: {value}') print(
f"{Fore.GREEN}Removing duplicate value ({field_name}): {Fore.RESET}{value}"
)
# Create a new field consisting of all values joined with "||" # Create a new field consisting of all values joined with "||"
new_field = '||'.join(new_values) new_field = "||".join(new_values)
return new_field return new_field
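The de-duplication above keeps the first occurrence of each value and preserves order. A minimal sketch of the same idea, assuming a hypothetical function name `drop_duplicates` and omitting the logging and missing-value handling:

```python
def drop_duplicates(field):
    """Sketch: remove repeated values from a "||"-separated field.

    Hypothetical helper for illustration only.
    """
    new_values = []

    for value in field.split("||"):
        # Keep only the first occurrence so the original order is preserved
        if value not in new_values:
            new_values.append(value)

    return "||".join(new_values)
```

A list is used rather than a set because the order of metadata values usually matters; for very long fields a set alongside the list would avoid the linear membership test.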
@@ -169,11 +202,11 @@ def newlines(field):
return return
# Check for Unix line feed (LF) # Check for Unix line feed (LF)
match = re.findall(r'\n', field) match = re.findall(r"\n", field)
if match: if match:
print(f'Removing newline: {field}') print(f"{Fore.GREEN}Removing newline: {Fore.RESET}{field}")
field = field.replace('\n', '') field = field.replace("\n", "")
return field return field
@@ -193,10 +226,52 @@ def comma_space(field, field_name):
return return
# Check for comma followed by a word character # Check for comma followed by a word character
match = re.findall(r',\w', field) match = re.findall(r",\w", field)
if match: if match:
print(f'Adding space after comma ({field_name}): {field}') print(
field = re.sub(r',(\w)', r', \1', field) f"{Fore.GREEN}Adding space after comma ({field_name}): {Fore.RESET}{field}"
)
field = re.sub(r",(\w)", r", \1", field)
return field return field
def normalize_unicode(field, field_name):
"""Fix occurrences of decomposed Unicode characters by normalizing them
with NFC to their canonical forms, for example:
Ouédraogo, Mathieu → Ouédraogo, Mathieu
Return normalized string.
"""
# Skip fields with missing values
if pd.isna(field):
return
# Check if the current string is using normalized Unicode (NFC)
if not is_nfc(field):
print(f"{Fore.GREEN}Normalizing Unicode ({field_name}): {Fore.RESET}{field}")
field = normalize("NFC", field)
return field
def mojibake(field, field_name):
"""Attempts to fix mojibake (text that was encoded in one encoding and
decoded in another, perhaps multiple times). See util.py.
Return fixed string.
"""
# Skip fields with missing values
if pd.isna(field):
return field
if is_mojibake(field):
print(f"{Fore.GREEN}Fixing encoding issue ({field_name}): {Fore.RESET}{field}")
return fix_text(field)
else:
return field
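The Unicode fixes in this file can be tried out with the standard library alone. The sketch below assumes nothing beyond `unicodedata`; the sample strings are illustrative and the variable names are not taken from the project.

```python
from unicodedata import normalize


def is_nfc(text):
    """Return True if text is already in NFC form (no Python 3.8+ needed)."""
    return text == normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "Oue\u0301draogo, Mathieu"

# NFC composes the pair into the single code point U+00E9
composed = normalize("NFC", decomposed)

# A no-break space (U+00A0) is replaced with a plain space rather than removed
spaced = "Alan\u00a0Orth".replace("\u00a0", " ")
```

The two forms render identically, which is why the check compares the string against its normalized copy instead of relying on visual inspection.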


@@ -0,0 +1,51 @@
# SPDX-License-Identifier: GPL-3.0-only
from ftfy.badness import sequence_weirdness
def is_nfc(field):
"""Utility function to check whether a string is using normalized Unicode.
Python's built-in unicodedata library has the is_normalized() function, but
it was only introduced in Python 3.8. By using a simple utility function we
are able to run on Python >= 3.6 again.
See: https://docs.python.org/3/library/unicodedata.html
Return boolean.
"""
from unicodedata import normalize
return field == normalize("NFC", field)
def is_mojibake(field):
"""Determines whether a string contains mojibake.
We commonly deal with CSV files that were *encoded* in UTF-8, but decoded
as something else like CP-1252 (Windows Latin). This manifests in the form
of "mojibake", for example:
- CIAT PublicaÃ§ao
- CIAT PublicaciÃ³n
This uses the excellent "fixes text for you" (ftfy) library to determine
whether a string contains characters that have been encoded in one encoding
and decoded in another.
Inspired by this code snippet from Martijn Pieters on StackOverflow:
https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
Return boolean.
"""
if not sequence_weirdness(field):
# Nothing weird, should be okay
return False
try:
field.encode("sloppy-windows-1252")
except UnicodeEncodeError:
# Not CP-1252 encodable, probably fine
return False
else:
# Encodable as CP-1252, Mojibake alert level high
return True
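The round trip described in the `is_mojibake` docstring can be reproduced with the standard library. This is a sketch under stated assumptions: it uses Python's built-in `cp1252` codec as a stand-in for the `sloppy-windows-1252` codec that ftfy registers, and it reverses a single known mis-decoding by hand, whereas the project delegates the repair to ftfy's `fix_text`.

```python
# The correctly encoded string ("CIAT Publicación")
original = "CIAT Publicaci\u00f3n"

# Simulate the damage: UTF-8 bytes read back as CP-1252 (Windows Latin)
garbled = original.encode("utf-8").decode("cp1252")

# Undo it by reversing the two steps; this only works when the wrong
# codec is known and every character is encodable in it
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

The fact that the garbled text encodes cleanly as CP-1252 is the same signal the `try`/`except` block above relies on: text that cannot be encoded that way is unlikely to be this kind of mojibake.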


@@ -1 +1,3 @@
VERSION = '0.2.2' # SPDX-License-Identifier: GPL-3.0-only
VERSION = "0.4.8-dev"


@@ -1,26 +1,35 @@
dc.contributor.author,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename dc.title,dcterms.issued,dc.identifier.issn,dc.identifier.isbn,dcterms.language,dcterms.subject,cg.coverage.country,filename,dcterms.license,dcterms.type
Leading space,2019-07-29,,,,,, Leading space,2019-07-29,,,,,,,,
Trailing space ,2019-07-29,,,,,, Trailing space ,2019-07-29,,,,,,,,
Excessive space,2019-07-29,,,,,, Excessive space,2019-07-29,,,,,,,,
Miscellaenous ||whitespace | issues ,2019-07-29,,,,,, Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,,,
Duplicate||Duplicate,2019-07-29,,,,,, Duplicate||Duplicate,2019-07-29,,,,,,,,
Invalid ISSN,2019-07-29,2321-2302,,,,, Invalid ISSN,2019-07-29,2321-2302,,,,,,,
Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,, Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,,,
Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,, Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,,,
Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,, Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,,,
Invalid date,2019-07-260,,,,,, Invalid date,2019-07-260,,,,,,,,
Multiple dates,2019-07-26||2019-01-10,,,,,, Multiple dates,2019-07-26||2019-01-10,,,,,,,,
Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,, Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,,,
Unnecessary Unicode,2019-07-29,,,,,, Unnecessary Unicode,2019-07-29,,,,,,,,
Suspicious character||foreˆt,2019-07-29,,,,,, Suspicious character||foreˆt,2019-07-29,,,,,,,,
Invalid ISO 639-2 language,2019-07-29,,,jp,,, Invalid ISO 639-1 (alpha 2) language,2019-07-29,,,jp,,,,,
Invalid ISO 639-3 language,2019-07-29,,,chi,,, Invalid ISO 639-3 (alpha 3) language,2019-07-29,,,chi,,,,,
Invalid language,2019-07-29,,,Span,,, Invalid language,2019-07-29,,,Span,,,,,
Invalid AGROVOC subject,2019-07-29,,,,FOREST,, Invalid AGROVOC subject,2019-07-29,,,,FOREST,,,,
Newline (LF),2019-07-30,,,,"TANZA Newline (LF),2019-07-30,,,,"TANZA
NIA",, NIA",,,,
Missing date,,,,,,, Missing date,,,,,,,,,
Invalid country,2019-08-01,,,,,KENYAA, Invalid country,2019-08-01,,,,,KENYAA,,,
Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck,,
Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,, Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,,,
"Missing space,after comma",2019-08-27,,,,,, "Missing space,after comma",2019-08-27,,,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,,,
Composéd Unicode,2020-01-14,,,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,,,
Unnecessary multi-value separator,2021-01-03,0378-5955||,,,,,,,
Invalid SPDX license identifier,2021-03-11,,,,,,,CC-BY,
Duplicate Title,2021-03-17,,,,,,,,Report
Duplicate Title,2021-03-17,,,,,,,,Report
Mojibake,2021-03-18,,,,CIAT PublicaÃ§ao,,,,Report

1314
poetry.lock generated Normal file

File diff suppressed because it is too large

37
pyproject.toml Normal file

@@ -0,0 +1,37 @@
[tool.poetry]
name = "csv-metadata-quality"
version = "0.4.8-dev"
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
authors = ["Alan Orth <alan.orth@gmail.com>"]
license="GPL-3.0-only"
repository = "https://github.com/ilri/csv-metadata-quality"
homepage = "https://github.com/ilri/csv-metadata-quality"
[tool.poetry.scripts]
csv-metadata-quality = 'csv_metadata_quality.__main__:main'
[tool.poetry.dependencies]
python = "^3.7.1"
pandas = "^1.0.4"
python-stdnum = "^1.13"
xlrd = "^1.2.0"
requests = "^2.23.0"
requests-cache = "^0.5.2"
pycountry = "^19.8.18"
langid = "^1.1.6"
colorama = "^0.4.4"
spdx-license-list = "^0.5.2"
ftfy = "^5.9"
[tool.poetry.dev-dependencies]
pytest = "^6.1.1"
ipython = { version = "^7.18.1", python = "^3.7" }
flake8 = "^3.8.4"
pytest-clarity = "^0.3.0-alpha.0"
black = "20.8b1"
isort = "^5.5.4"
csvkit = "^1.0.5"
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


@@ -1,5 +1,5 @@
[pytest] [pytest]
addopts= -rsxX -s -v --strict --capture=sys addopts= -rsxX -s -v --strict-markers --capture=sys
filterwarnings = filterwarnings =
error::UserWarning error::UserWarning
ignore:.*U.* is deprecated:DeprecationWarning ignore:.*U.* is deprecated:DeprecationWarning


@@ -1,32 +1,76 @@
-i https://pypi.org/simple agate-dbf==0.2.2
atomicwrites==1.3.0 agate-excel==0.2.3
attrs==19.1.0 agate-sql==0.5.6
backcall==0.1.0 agate==1.6.2
decorator==4.4.0 appdirs==1.4.4; python_version >= "3.6"
entrypoints==0.3 appnope==0.1.2; python_version >= "3.7" and python_version < "4.0" and sys_platform == "darwin"
flake8==3.7.8 atomicwrites==1.4.0; python_version >= "3.6" and python_full_version < "3.0.0" and sys_platform == "win32" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") or sys_platform == "win32" and python_version >= "3.6" and python_full_version >= "3.4.0" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6")
importlib-metadata==0.19 attrs==20.3.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
ipython-genutils==0.2.0 babel==2.9.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
ipython==7.7.0 backcall==0.2.0; python_version >= "3.7" and python_version < "4.0"
jedi==0.14.1 black==20.8b1; python_version >= "3.6"
mccabe==0.6.1 certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
more-itertools==7.2.0 chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
packaging==19.1 click==7.1.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
parso==0.5.1 colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
pexpect==4.7.0 ; sys_platform != 'win32' csvkit==1.0.5
pickleshare==0.7.5 dbfread==2.0.7
pluggy==0.12.0 decorator==4.4.2; python_version >= "3.7" and python_full_version < "3.0.0" and python_version < "4.0" or python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.2.0"
prompt-toolkit==2.0.9 et-xmlfile==1.0.1; python_version >= "3.6"
ptyprocess==0.6.0 flake8==3.9.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
py==1.8.0 ftfy==5.9; python_version >= "3.5"
pycodestyle==2.5.0 greenlet==1.0.0; python_version >= "3" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version >= "3"
pyflakes==2.1.1 idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
pygments==2.4.2 importlib-metadata==3.7.3; python_version < "3.8" and python_version >= "3.6" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.5.0" and python_version < "3.8" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.4.0" and python_version >= "3.6" and python_version < "3.8") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.8" or python_full_version >= "3.6.0" and python_version < "3.8" and python_version >= "3.6")
pyparsing==2.4.2 iniconfig==1.1.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pytest-clarity==0.2.0a1 ipython-genutils==0.2.0; python_version >= "3.7" and python_version < "4.0"
pytest==5.0.1 ipython==7.21.0; python_version >= "3.7" and python_version < "4.0"
six==1.12.0 isodate==0.6.0
termcolor==1.1.0 isort==5.7.0; python_version >= "3.6" and python_version < "4.0"
traitlets==4.3.2 jedi==0.18.0; python_version >= "3.7" and python_version < "4.0"
wcwidth==0.1.7 langid==1.1.6
zipp==0.5.2 leather==0.3.3
mccabe==0.6.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
mypy-extensions==0.4.3; python_version >= "3.6"
numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
openpyxl==3.0.7; python_version >= "3.6"
packaging==20.9; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pandas==1.2.3; python_full_version >= "3.7.1"
parsedatetime==2.6
parso==0.8.1; python_version >= "3.7" and python_version < "4.0"
pathspec==0.8.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.6"
pexpect==4.8.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
pickleshare==0.7.5; python_version >= "3.7" and python_version < "4.0"
pluggy==0.13.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
prompt-toolkit==3.0.17; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
ptyprocess==0.7.0; python_version >= "3.7" and python_version < "4.0" and sys_platform != "win32"
py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pycodestyle==2.7.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
pycountry==19.8.18
pyflakes==2.3.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
pygments==2.8.1; python_version >= "3.7" and python_version < "4.0"
pyicu==2.6
pyparsing==2.4.7; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pytest-clarity==0.3.0a0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
pytest==6.2.2; python_version >= "3.6"
python-dateutil==2.8.1; python_full_version >= "3.7.1"
python-slugify==4.0.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
python-stdnum==1.16
pytimeparse==1.1.8
pytz==2021.1; python_full_version >= "3.7.1"
regex==2020.11.13; python_version >= "3.6"
requests-cache==0.5.2
requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
six==1.15.0; python_full_version >= "3.7.1"
spdx-license-list==0.5.2
sqlalchemy==1.4.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
termcolor==1.1.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
text-unidecode==1.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
toml==0.10.2; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
traitlets==5.0.5; python_version >= "3.7" and python_version < "4.0"
typed-ast==1.4.2; python_version >= "3.6"
typing-extensions==3.7.4.3; python_version < "3.8" and python_version >= "3.6"
urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
wcwidth==0.2.5; python_version >= "3.7" and python_version < "4.0" and python_full_version >= "3.6.1"
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")
zipp==3.4.1; python_version < "3.8" and python_version >= "3.6"


@@ -1,16 +1,19 @@
-i https://pypi.org/simple certifi==2020.12.5; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
-e . chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
certifi==2019.6.16 colorama==0.4.4; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
chardet==3.0.4 ftfy==5.9; python_version >= "3.5"
idna==2.8 idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
numpy==1.17.0 langid==1.1.6
pandas==0.25.0 numpy==1.20.1; python_version >= "3.7" and python_full_version >= "3.7.1"
pycountry==19.7.15 pandas==1.2.3; python_full_version >= "3.7.1"
python-dateutil==2.8.0 pycountry==19.8.18
python-stdnum==1.11 python-dateutil==2.8.1; python_full_version >= "3.7.1"
pytz==2019.2 python-stdnum==1.16
requests-cache==0.5.0 pytz==2021.1; python_full_version >= "3.7.1"
requests==2.22.0 requests-cache==0.5.2
six==1.12.0 requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
urllib3==1.25.3 six==1.15.0; python_full_version >= "3.7.1"
xlrd==1.2.0 spdx-license-list==0.5.2
urllib3==1.26.4; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4"
wcwidth==0.2.5; python_version >= "3.5"
xlrd==1.2.0; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.4.0")

6
setup.cfg Normal file

@@ -0,0 +1,6 @@
[isort]
multi_line_output=3
include_trailing_comma=True
force_grid_wrap=0
use_parentheses=True
line_length=88


@@ -4,16 +4,17 @@ with open("README.md", "r") as fh:
long_description = fh.read() long_description = fh.read()
install_requires = [ install_requires = [
'pandas', "pandas",
'python-stdnum', "python-stdnum",
'requests', "requests",
'requests-cache', "requests-cache",
'pycountry' "pycountry",
"langid",
] ]
setuptools.setup( setuptools.setup(
name="csv-metadata-quality", name="csv-metadata-quality",
version="0.2.2", version="0.4.8-dev",
author="Alan Orth", author="Alan Orth",
author_email="aorth@mjanja.ch", author_email="aorth@mjanja.ch",
description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.", description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
@@ -22,17 +23,15 @@ setuptools.setup(
long_description_content_type="text/markdown", long_description_content_type="text/markdown",
url="https://github.com/alanorth/csv-metadata-quality", url="https://github.com/alanorth/csv-metadata-quality",
classifiers=[ classifiers=[
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent", "Operating System :: OS Independent",
"Development Status :: 4 - Beta"
], ],
packages=['csv_metadata_quality'], packages=["csv_metadata_quality"],
entry_points={ entry_points={
'console_scripts': [ "console_scripts": ["csv-metadata-quality = csv_metadata_quality.__main__:main"]
'csv-metadata-quality = csv_metadata_quality.__main__:main'
]
}, },
install_requires=install_requires install_requires=install_requires,
) )


@@ -1,225 +1,369 @@
# SPDX-License-Identifier: GPL-3.0-only
import pandas as pd
from colorama import Fore
import csv_metadata_quality.check as check import csv_metadata_quality.check as check
import csv_metadata_quality.experimental as experimental
def test_check_invalid_issn(capsys): def test_check_invalid_issn(capsys):
'''Test checking invalid ISSN.''' """Test checking invalid ISSN."""
value = '2321-2302' value = "2321-2302"
check.issn(value) check.issn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid ISSN: {value}\n' assert captured.out == f"{Fore.RED}Invalid ISSN: {Fore.RESET}{value}\n"
def test_check_valid_issn(): def test_check_valid_issn():
'''Test checking valid ISSN.''' """Test checking valid ISSN."""
value = '0024-9319' value = "0024-9319"
result = check.issn(value) result = check.issn(value)
assert result == value assert result == None
def test_check_invalid_isbn(capsys): def test_check_invalid_isbn(capsys):
'''Test checking invalid ISBN.''' """Test checking invalid ISBN."""
value = '99921-58-10-6' value = "99921-58-10-6"
check.isbn(value) check.isbn(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid ISBN: {value}\n' assert captured.out == f"{Fore.RED}Invalid ISBN: {Fore.RESET}{value}\n"
def test_check_valid_isbn(): def test_check_valid_isbn():
'''Test checking valid ISBN.''' """Test checking valid ISBN."""
value = '99921-58-10-7' value = "99921-58-10-7"
result = check.isbn(value) result = check.isbn(value)
assert result == value assert result == None
def test_check_invalid_separators(capsys):
'''Test checking invalid multi-value separators.'''
value = 'Alan|Orth'
check.separators(value)
captured = capsys.readouterr()
assert captured.out == f'Invalid multi-value separator: {value}\n'
def test_check_valid_separators():
'''Test checking valid multi-value separators.'''
value = 'Alan||Orth'
result = check.separators(value)
assert result == value
def test_check_missing_date(capsys): def test_check_missing_date(capsys):
'''Test checking missing date.''' """Test checking missing date."""
value = None value = None
field_name = 'dc.date.issued' field_name = "dc.date.issued"
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Missing date ({field_name}).\n' assert captured.out == f"{Fore.RED}Missing date ({field_name}).{Fore.RESET}\n"
def test_check_multiple_dates(capsys): def test_check_multiple_dates(capsys):
'''Test checking multiple dates.''' """Test checking multiple dates."""
value = '1990||1991' value = "1990||1991"
field_name = 'dc.date.issued' field_name = "dc.date.issued"
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Multiple dates not allowed ({field_name}): {value}\n' assert (
captured.out
== f"{Fore.RED}Multiple dates not allowed ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_invalid_date(capsys): def test_check_invalid_date(capsys):
'''Test checking invalid ISO8601 date.''' """Test checking invalid ISO8601 date."""
value = '1990-0' value = "1990-0"
field_name = 'dc.date.issued' field_name = "dc.date.issued"
check.date(value, field_name) check.date(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid date ({field_name}): {value}\n' assert (
captured.out == f"{Fore.RED}Invalid date ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_date(): def test_check_valid_date():
'''Test checking valid ISO8601 date.''' """Test checking valid ISO8601 date."""
value = '1990' value = "1990"
field_name = 'dc.date.issued' field_name = "dc.date.issued"
result = check.date(value, field_name) result = check.date(value, field_name)
assert result == value assert result == None
def test_check_suspicious_characters(capsys): def test_check_suspicious_characters(capsys):
'''Test checking for suspicious characters.''' """Test checking for suspicious characters."""
value = 'foreˆt' value = "foreˆt"
field_name = 'dc.contributor.author' field_name = "dc.contributor.author"
check.suspicious_characters(value, field_name) check.suspicious_characters(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Suspicious character ({field_name}): ˆt\n' assert (
captured.out
== f"{Fore.YELLOW}Suspicious character ({field_name}): {Fore.RESET}ˆt\n"
)
def test_check_valid_iso639_2_language(): def test_check_valid_iso639_1_language():
'''Test valid ISO 639-2 language.''' """Test valid ISO 639-1 (alpha 2) language."""
value = 'ja' value = "ja"
result = check.language(value) result = check.language(value)
assert result == value assert result == None
def test_check_valid_iso639_3_language(): def test_check_valid_iso639_3_language():
'''Test invalid ISO 639-3 language.''' """Test valid ISO 639-3 (alpha 3) language."""
value = 'eng' value = "eng"
result = check.language(value) result = check.language(value)
assert result == value assert result == None
def test_check_invalid_iso639_2_language(capsys): def test_check_invalid_iso639_1_language(capsys):
'''Test invalid ISO 639-2 language.''' """Test invalid ISO 639-1 (alpha 2) language."""
value = 'jp' value = "jp"
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid ISO 639-2 language: {value}\n' assert (
captured.out == f"{Fore.RED}Invalid ISO 639-1 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_iso639_3_language(capsys): def test_check_invalid_iso639_3_language(capsys):
'''Test invalid ISO 639-3 language.''' """Test invalid ISO 639-3 (alpha 3) language."""
value = 'chi' value = "chi"
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid ISO 639-3 language: {value}\n' assert (
captured.out == f"{Fore.RED}Invalid ISO 639-3 language: {Fore.RESET}{value}\n"
)
def test_check_invalid_language(capsys): def test_check_invalid_language(capsys):
'''Test invalid language.''' """Test invalid language."""
value = 'Span' value = "Span"
check.language(value) check.language(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid language: {value}\n' assert captured.out == f"{Fore.RED}Invalid language: {Fore.RESET}{value}\n"
def test_check_invalid_agrovoc(capsys): def test_check_invalid_agrovoc(capsys):
'''Test invalid AGROVOC subject.''' """Test invalid AGROVOC subject."""
value = 'FOREST' value = "FOREST"
field_name = 'dc.subject' field_name = "dcterms.subject"
check.agrovoc(value, field_name) check.agrovoc(value, field_name)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Invalid AGROVOC ({field_name}): {value}\n' assert (
captured.out
== f"{Fore.RED}Invalid AGROVOC ({field_name}): {Fore.RESET}{value}\n"
)
def test_check_valid_agrovoc(): def test_check_valid_agrovoc():
'''Test valid AGROVOC subject.''' """Test valid AGROVOC subject."""
value = 'FORESTS' value = "FORESTS"
field_name = 'dc.subject' field_name = "dcterms.subject"
result = check.agrovoc(value, field_name) result = check.agrovoc(value, field_name)
assert result == value assert result == None
def test_check_uncommon_filename_extension(capsys): def test_check_uncommon_filename_extension(capsys):
'''Test uncommon filename extension.''' """Test uncommon filename extension."""
value = 'file.pdf.lck' value = "file.pdf.lck"
check.filename_extension(value) check.filename_extension(value)
captured = capsys.readouterr() captured = capsys.readouterr()
assert captured.out == f'Filename with uncommon extension: {value}\n' assert (
captured.out
== f"{Fore.YELLOW}Filename with uncommon extension: {Fore.RESET}{value}\n"
)
def test_check_common_filename_extension(): def test_check_common_filename_extension():
'''Test common filename extension.''' """Test common filename extension."""
value = 'file.pdf' value = "file.pdf"
result = check.filename_extension(value) result = check.filename_extension(value)
assert result == value assert result == None
def test_check_incorrect_iso_639_1_language(capsys):
"""Test incorrect ISO 639-1 language, as determined by comparing the item's language field with the actual language predicted in the item's title."""
title = "A randomised vaccine field trial in Kenya demonstrates protection against wildebeest-associated malignant catarrhal fever in cattle"
language = "es"
# Create a dictionary to mimic Pandas series
row = {"dc.title": title, "dc.language.iso": language}
series = pd.Series(row)
experimental.correct_language(series)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possibly incorrect language {language} (detected en): {Fore.RESET}{title}\n"
)
def test_check_incorrect_iso_639_3_language(capsys):
"""Test incorrect ISO 639-3 language, as determined by comparing the item's language field with the actual language predicted in the item's title."""
title = "A randomised vaccine field trial in Kenya demonstrates protection against wildebeest-associated malignant catarrhal fever in cattle"
language = "spa"
# Create a dictionary to mimic Pandas series
row = {"dc.title": title, "dc.language.iso": language}
series = pd.Series(row)
experimental.correct_language(series)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possibly incorrect language {language} (detected eng): {Fore.RESET}{title}\n"
)
def test_check_correct_iso_639_1_language():
"""Test correct ISO 639-1 language, as determined by comparing the item's language field with the actual language predicted in the item's title."""
title = "A randomised vaccine field trial in Kenya demonstrates protection against wildebeest-associated malignant catarrhal fever in cattle"
language = "en"
# Create a dictionary to mimic Pandas series
row = {"dc.title": title, "dc.language.iso": language}
series = pd.Series(row)
result = experimental.correct_language(series)
assert result is None
def test_check_correct_iso_639_3_language():
"""Test correct ISO 639-3 language, as determined by comparing the item's language field with the actual language predicted in the item's title."""
title = "A randomised vaccine field trial in Kenya demonstrates protection against wildebeest-associated malignant catarrhal fever in cattle"
language = "eng"
# Create a dictionary to mimic Pandas series
row = {"dc.title": title, "dc.language.iso": language}
series = pd.Series(row)
result = experimental.correct_language(series)
assert result is None
def test_check_valid_spdx_license_identifier():
"""Test valid SPDX license identifier."""
license = "CC-BY-SA-4.0"
result = check.spdx_license_identifier(license)
assert result is None
def test_check_invalid_spdx_license_identifier(capsys):
"""Test invalid SPDX license identifier."""
license = "CC-BY-SA"
result = check.spdx_license_identifier(license)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Non-SPDX license identifier: {Fore.RESET}{license}\n"
)
def test_check_duplicate_item(capsys):
"""Test item with duplicate title, type, and date."""
item_title = "Title"
item_type = "Report"
item_date = "2021-03-17"
d = {
"dc.title": [item_title, item_title],
"dcterms.type": [item_type, item_type],
"dcterms.issued": [item_date, item_date],
}
df = pd.DataFrame(data=d)
result = check.duplicate_items(df)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possible duplicate (dc.title): {Fore.RESET}{item_title}\n"
)
def test_check_no_mojibake():
"""Test string with no mojibake."""
field = "CIAT Publicaçao"
field_name = "dcterms.isPartOf"
result = check.mojibake(field, field_name)
assert result is None
def test_check_mojibake(capsys):
"""Test string with mojibake."""
field = "CIAT PublicaÃ§ao"
field_name = "dcterms.isPartOf"
result = check.mojibake(field, field_name)
captured = capsys.readouterr()
assert (
captured.out
== f"{Fore.YELLOW}Possible encoding issue ({field_name}): {Fore.RESET}{field}\n"
)


@@ -1,68 +1,121 @@
# SPDX-License-Identifier: GPL-3.0-only
import csv_metadata_quality.fix as fix import csv_metadata_quality.fix as fix
def test_fix_leading_whitespace(): def test_fix_leading_whitespace():
'''Test fixing leading whitespace.''' """Test fixing leading whitespace."""
value = ' Alan' value = " Alan"
assert fix.whitespace(value) == 'Alan' field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_trailing_whitespace(): def test_fix_trailing_whitespace():
'''Test fixing trailing whitespace.''' """Test fixing trailing whitespace."""
value = 'Alan ' value = "Alan "
assert fix.whitespace(value) == 'Alan' field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan"
def test_fix_excessive_whitespace(): def test_fix_excessive_whitespace():
'''Test fixing excessive whitespace.''' """Test fixing excessive whitespace."""
value = 'Alan  Orth' value = "Alan  Orth"
assert fix.whitespace(value) == 'Alan Orth' field_name = "dc.contributor.author"
assert fix.whitespace(value, field_name) == "Alan Orth"
def test_fix_invalid_separators(): def test_fix_invalid_separators():
'''Test fixing invalid multi-value separators.''' """Test fixing invalid multi-value separators."""
value = 'Alan|Orth' value = "Alan|Orth"
assert fix.separators(value) == 'Alan||Orth' field_name = "dc.contributor.author"
assert fix.separators(value, field_name) == "Alan||Orth"
def test_fix_unnecessary_separators():
"""Test fixing unnecessary multi-value separators."""
field = "Alan||Orth||"
field_name = "dc.contributor.author"
assert fix.separators(field, field_name) == "Alan||Orth"
def test_fix_unnecessary_unicode(): def test_fix_unnecessary_unicode():
'''Test fixing unnecessary Unicode.''' """Test fixing unnecessary Unicode."""
value = 'Alan Orth' value = "Alan Orth"
assert fix.unnecessary_unicode(value) == 'Alan Orth' assert fix.unnecessary_unicode(value) == "Alan Orth"
def test_fix_duplicates(): def test_fix_duplicates():
'''Test fixing duplicate metadata values.''' """Test fixing duplicate metadata values."""
value = 'Kenya||Kenya' value = "Kenya||Kenya"
assert fix.duplicates(value) == 'Kenya' field_name = "dc.contributor.author"
assert fix.duplicates(value, field_name) == "Kenya"
def test_fix_newlines(): def test_fix_newlines():
'''Test fixing newlines.''' """Test fixing newlines."""
value = '''Ken value = """Ken
ya''' ya"""
assert fix.newlines(value) == 'Kenya' assert fix.newlines(value) == "Kenya"
def test_fix_comma_space(): def test_fix_comma_space():
'''Test adding space after comma.''' """Test adding space after comma."""
value = 'Orth,Alan S.' value = "Orth,Alan S."
field_name = 'dc.contributor.author' field_name = "dc.contributor.author"
assert fix.comma_space(value, field_name) == 'Orth, Alan S.' assert fix.comma_space(value, field_name) == "Orth, Alan S."
def test_fix_normalized_unicode():
"""Test fixing a string that is already in its normalized (NFC) Unicode form."""
# string using the normalized canonical form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_decomposed_unicode():
"""Test fixing a string that contains decomposed (NFD) Unicode characters."""
# string using the decomposed form of é
value = "Ouédraogo, Mathieu"
field_name = "dc.contributor.author"
assert fix.normalize_unicode(value, field_name) == "Ouédraogo, Mathieu"
def test_fix_mojibake():
"""Test fixing a string with mojibake."""
field = "CIAT PublicaÃ§ao"
field_name = "dcterms.isPartOf"
assert fix.mojibake(field, field_name) == "CIAT Publicaçao"
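
For context on the mojibake tests above, here is a minimal sketch of how such strings can arise and be reversed, assuming the text was UTF-8 that was misread as CP-1252 (the scenario described in the commit message). The helper names `garble` and `repair` are hypothetical and use only the standard library; this is not the project's `fix.mojibake` implementation, which according to the commit message builds on the ftfy library.

```python
def garble(text: str) -> str:
    """Simulate mojibake: encode as UTF-8, then wrongly decode as CP-1252.

    Hypothetical helper for illustration only.
    """
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the mistake above, assuming exactly one UTF-8 -> CP-1252 round.

    Hypothetical helper; raises UnicodeError if the assumption does not hold.
    """
    return text.encode("cp1252").decode("utf-8")


original = "CIAT Publicaçao"
garbled = garble(original)   # "ç" (bytes C3 A7) is misread as "Ã§"
restored = repair(garbled)

print(garbled)
print(restored)
```

A real fixer cannot assume which encodings were involved or how many times the text was mangled, which is why the fix is treated as unsafe in the commit above.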