mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-05 05:53:02 +01:00
Commit Graph

548 Commits

Author SHA1 Message Date
39a4b1a487
Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
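
Roughly, the approach looks like this (a minimal sketch, not the
project's exact code; it assumes ftfy 6+, where the badness check is
exposed as is_bad(), and the example string is illustrative):

    from ftfy import fix_text
    from ftfy.badness import is_bad  # older ftfy exposes sequence_weirdness() instead

    def is_mojibake(value: str) -> bool:
        """Return True if a string looks like mojibake we could fix."""
        if not is_bad(value):
            # Nothing "weird" about this string according to ftfy
            return False
        try:
            # If the weird text also encodes cleanly as CP-1252, it was
            # probably UTF-8 that got mis-decoded as Windows Latin
            value.encode("cp1252")
        except UnicodeEncodeError:
            return False
        return True

    if is_mojibake("CIAT PublicaciÃ³n"):
        print(fix_text("CIAT PublicaciÃ³n"))  # CIAT Publicación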
2021-03-19 10:22:21 +02:00
e92ec5d371
README.md: Add note about duplicate checking
2021-03-17 10:12:03 +02:00
f816e17fe7
Version 0.4.7
2021-03-17 10:00:34 +02:00
9061c7c79b
setup.py: Remove beta tag
I think this is only used by pypi.org?
2021-03-17 10:00:09 +02:00
661d05b977
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-17 09:58:35 +02:00
652b7ea98c
CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
65da6e9b05
poetry.lock: Run poetry update 2021-03-17 09:57:31 +02:00
a313b7527a
CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
51ee370697
data/test.csv: Add duplicate item 2021-03-17 09:54:14 +02:00
e8422bfa74
tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
9f2dc0a0f5
Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
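
A rough sketch of the idea with pandas (the column names are
illustrative DC/DCTERMS fields, not necessarily the project's exact
configuration):

    import pandas as pd

    df = pd.read_csv("data/test.csv")
    # Treat title + type + issue date as a composite key for "same item"
    key = ["dc.title", "dcterms.type", "dcterms.issued"]
    duplicates = df[df.duplicated(subset=key, keep="first")]
    for _, row in duplicates.iterrows():
        print(f"Possible duplicate item: {row['dc.title']}")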
2021-03-17 09:53:07 +02:00
14010896a5
csv_metadata_quality/experimental.py: Move all imports to top of file
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
ab3af2ec62
csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00
1aa2084230
CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actually
need to return a value at all, as the point of checks is to print a
warning to the screen. In Python a "return" statement without a value
returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
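
For illustration, the difference looks something like this (the check
and column names here are hypothetical):

    import pandas as pd

    def check_issn(value):
        # Hypothetical check: it only prints a warning and returns None
        if pd.notna(value) and len(str(value).replace("-", "")) != 8:
            print(f"Invalid ISSN: {value}")

    df = pd.DataFrame({"dc.identifier.issn": ["0378-5955", "1234"]})

    # Before: assigning the result rewrote the column with None values
    # df["dc.identifier.issn"] = df["dc.identifier.issn"].apply(check_issn)

    # After: run the check only for its warning side effect
    df["dc.identifier.issn"].apply(check_issn)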
2021-03-16 16:04:19 +02:00
9a5e3fd6ef
README.md: Add TODO about detecting duplicates 2021-03-16 14:03:26 +02:00
ed084da08c
CHANGELOG.md: Add note about multi-value separators
2021-03-14 21:04:19 +02:00
10612cf891
Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't
actually need to check for them; I can just fix them when I see them.
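
The fix itself is simple; a sketch along these lines (the regex is
illustrative, not necessarily the project's exact pattern):

    import re

    def fix_separators(value: str) -> str:
        # DSpace CSVs separate multiple values with "||", so treat a
        # lone "|" as an invalid separator and double it
        return re.sub(r"(?<!\|)\|(?!\|)", "||", value)

    print(fix_separators("Kenya|Uganda"))   # Kenya||Uganda
    print(fix_separators("Kenya||Uganda"))  # already valid, unchanged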
2021-03-14 21:01:21 +02:00
3656e9f976
Update CI workflows to use DCTERMS instead of DC
2021-03-14 15:52:51 +02:00
c9c277f8df
csv_metadata_quality/app.py: Update help text
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
fb35afd937
CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
0e9176f0a6
csv_metadata_quality/check.py: requests cache
Allow overriding the directory for the requests cache. In the case
of csv-metadata-quality-web, which currently runs on Google's App
Engine, we can only write to /tmp.
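
A minimal sketch of the idea, assuming the requests-cache library and
an environment variable name chosen here purely for illustration:

    import os
    import requests_cache

    # Fall back to the current directory, but let deployments such as
    # App Engine point the cache at somewhere writable like /tmp
    cache_dir = os.environ.get("REQUESTS_CACHE_DIR", ".")
    requests_cache.install_cache(f"{cache_dir}/requests-cache")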
2021-03-14 09:07:35 +02:00
1008acf35e
Always fix invalid multi-value separators
This is no longer classified as "unsafe", as I have yet to see a
case where it was intentional, and it always causes issues when
you import the data into a DSpace repository.
2021-03-13 12:59:45 +02:00
f00a07e2cd
README.md: Reorganize unsafe functionality
2021-03-13 11:56:52 +02:00
46098861ed
poetry.lock: Run poetry update
2021-03-11 22:45:32 +02:00
fa84cfa440
Bump version to 0.4.6-dev 2021-03-11 22:44:36 +02:00
6cc1401f88
pyproject.toml: Minimum Python is technically 3.7.1
See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html
2021-03-11 13:41:58 +02:00
ad2cda8a41
README.md: Add note about SPDX license identifiers
2021-03-11 12:21:34 +02:00
dc6920802e
.github/workflows/python-app.yml: Use Python 3.9
I now use this version in my development environment. Eventually I
should add a matrix of versions to use, but I don't know the GitHub
Actions syntax well enough yet.
2021-03-11 12:17:57 +02:00
6ca449d8ed
README.md: Update note about Python 3.8 to 3.8+
Currently the lower bound on Python version support is 3.7 because
of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.
2021-03-11 12:16:07 +02:00
1554cfd5c9
Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d
CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
b19d81abdd
.drone.yml: We need some stuff to build pyicu now
2021-03-11 12:07:28 +02:00
a0ea829f5c
csv_metadata_quality/fix.py: Fixes should be green 2021-03-11 11:47:24 +02:00
0089efa914
tests/test_check.py: Use dcterms.subject instead of dc.subject
Trying to move some old DC fields to DCTERMS.
2021-03-11 11:45:25 +02:00
3dbe656f9f
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-11 11:11:19 +02:00
7ad821dcad
CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
cd876c4fb3
poetry.lock: Run poetry update 2021-03-11 11:10:02 +02:00
d88ea56488
csv_metadata_quality/check.py: Move all imports to top of file
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-11 10:52:20 +02:00
e0e3ca6c58
CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
abae8ca4fb
data/test.csv: Move some DC fields to DCTERMS
The original Dublin Core elements set was superseded by DCTERMS in
2008 and we have started using them in our DSpace repository, so I
think it's good to update them in our test data. Old DC fields are
still checked and fixed in this tool, though.

It's worth noting that currently supported DSpace versions (4, 5,
and 6) all hard-code a few fields like dc.title internally, so
we can't migrate those to their DCTERMS counterparts just yet.
2021-03-11 10:49:05 +02:00
d7d4d4efca
CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
5318953150
tests/test_check.py: Add tests for licenses 2021-03-11 10:36:26 +02:00
3b17914002
data/test.csv: Add invalid SPDX license
Now we are checking dcterms.license against the list of SPDX license
identifiers using https://pypi.org/project/spdx-license-list/.
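
Roughly, the check could look like this (a sketch that assumes the
spdx-license-list package exposes its identifiers as a LICENSES
mapping; the field name follows the commit message):

    import spdx_license_list

    def check_spdx(value: str) -> None:
        """Warn if a dcterms.license value is not a known SPDX identifier."""
        if value not in spdx_license_list.LICENSES:
            print(f"Non-SPDX license identifier: {value}")

    check_spdx("CC-BY-4.0")                         # known identifier, no warning
    check_spdx("Creative Commons Attribution 4.0")  # prints a warning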
2021-03-11 10:34:58 +02:00
6e4b0e5c1b
Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
b16fa9121f
pyproject.toml: Add csv-metadata-quality as a script
For some reason I stopped having csv-metadata-quality available in
my poetry environment after install. It seems I need to add it as a
poetry tool script? I had already done this in setup.py years ago,
which works for regular python setup.py installs, but I hadn't needed
to do it in poetry in the year or more that I've been using it, until
now.
2021-03-08 09:50:05 +02:00
202bda862a
Bump version to 0.4.5
2021-03-04 21:38:10 +02:00
7479310ac0
setup.py: Bump version to 0.4.4
I forgot to increase this when I actually released version 0.4.4, so
I will do it in a separate commit now before bumping the version to
0.4.5.
2021-03-04 21:35:08 +02:00
98a91bc9c2
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-03-04 21:33:33 +02:00