Commit Graph

368 Commits

Author SHA1 Message Date
Alan Orth af3493c724
CITATION.cff: Remove YAML formatting
continuous-integration/drone/push Build is passing Details
GitHub says it can't parse my CITATION.cff file. The example in the
docs shows version 1.2.0 also, I wonder if that's relevant.

See: https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files
2021-07-28 21:23:30 +03:00
Alan Orth 52644bf83e
Add CITATION.cff
Created with the cffinit tool:

https://citation-file-format.github.io/cff-initializer-javascript/
2021-07-28 21:11:11 +03:00
Alan Orth c8f5539d21
Update requirements
continuous-integration/drone/push Build is passing Details
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-07-06 15:47:44 +03:00
Alan Orth 382d0d6aed
Run poetry update 2021-07-06 15:37:57 +03:00
Alan Orth b8f4be9ebb
pyproject.toml: Update pytest-clarity and black
These seem to have much newer versions that didn't get updated in
this project due to the version pinning selector I was using with
poetry.

In the case of pytest-clarity the previous version was 0.3.1 and
the version selector was a caret (^), which will never update the
left-most (major) number. Now they seem to be on 1.x.x so it will
be OK in the future.

In the case of black, they use weird numbering so it's anyone's
guess how this will work! Luckily it's only used for linting and
formatting.
2021-07-06 15:30:41 +03:00
Alan Orth 4e2eab68b0
Update requests-cache
Apparently we were stuck on an older version of requests-cache due
to the fact that we were using the caret, which will never update
the left-most (major) version. Upstream requests-cache is currently
version 0.6.4, and there seems to have been some changes to the API.
2021-07-06 15:24:39 +03:00
Alan Orth 55165cb4ce
Update requirements
continuous-integration/drone/push Build is passing Details
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-06-14 12:52:47 +03:00
Alan Orth 93d3eabfba
poetry.lock: Run poetry update 2021-06-14 12:52:28 +03:00
Alan Orth a8fe623f4c
csv_metadata_quality/check.py: Remove unnecessary pass
continuous-integration/drone/push Build is passing Details
LGTM warned that these pass statements are not necessary.

See: https://lgtm.com/rules/910088/
2021-04-20 08:20:13 +03:00
Alan Orth dbc0437d59
CHANGELOG.md: Add note about Python deps
continuous-integration/drone/push Build is passing Details
2021-04-14 16:16:02 +03:00
Alan Orth 96ce1daa90
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-04-14 16:15:28 +03:00
Alan Orth 3adb52d7c0
poetry.lock: Run poetry update 2021-04-14 16:14:37 +03:00
Alan Orth f958d1879f
poetry.lock: Run poetry update
continuous-integration/drone/push Build is passing Details
2021-04-02 16:19:16 +03:00
Alan Orth bd8943f36a
csv_metadata_quality/app.py: Don't crash if fields are missing
continuous-integration/drone/push Build is passing Details
We don't need to crash if someone feeds us a CSV file that is miss-
ing commont DSpace fields like title, type, and subject.
2021-03-21 19:47:29 +02:00
Alan Orth 28f9026286
README.md: Minor edit
continuous-integration/drone/push Build is passing Details
2021-03-19 16:26:31 +02:00
Alan Orth cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
Alan Orth 8eddb76aab
Bump version to 0.4.8-dev
continuous-integration/drone/push Build is passing Details
2021-03-19 11:53:56 +02:00
Alan Orth a04dbc50db
Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
Alan Orth 28335ed159
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-19 10:29:15 +02:00
Alan Orth 773a0a2695
poetry.lock: Run poetry update 2021-03-19 10:28:55 +02:00
Alan Orth 39a4b1a487
Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT Publicaçao
    - CIAT Publicación

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
Alan Orth e92ec5d371
README.md: Add note about duplicate checking
continuous-integration/drone/push Build is passing Details
2021-03-17 10:12:03 +02:00
Alan Orth f816e17fe7
Version 0.4.7
continuous-integration/drone/push Build is passing Details
2021-03-17 10:00:34 +02:00
Alan Orth 9061c7c79b
setup.py: Remove beta tag
I think this is only used by pypi.org?
2021-03-17 10:00:09 +02:00
Alan Orth 661d05b977
Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-17 09:58:35 +02:00
Alan Orth 652b7ea98c
CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
Alan Orth 65da6e9b05
poetry.lock: Run pipenv update 2021-03-17 09:57:31 +02:00
Alan Orth a313b7527a
CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
Alan Orth 51ee370697
data/test.csv: Add duplicate item 2021-03-17 09:54:14 +02:00
Alan Orth e8422bfa74
tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
Alan Orth 9f2dc0a0f5
Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
2021-03-17 09:53:07 +02:00
Alan Orth 14010896a5
csv_metadata_quality/experimental.py: Move all imports to top of file
continuous-integration/drone/push Build is passing Details
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
Alan Orth ab3af2ec62
csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00
Alan Orth 1aa2084230
CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
Alan Orth 330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
Alan Orth 9a5e3fd6ef
README.md: Add TODO about detecting duplicates 2021-03-16 14:03:26 +02:00
Alan Orth ed084da08c
CHANGELOG.md: Add note about multi-value separators
continuous-integration/drone/push Build is passing Details
2021-03-14 21:04:19 +02:00
Alan Orth 10612cf891
Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't a
ctually need to check for them—I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
Alan Orth 3656e9f976
Update CI workflows to use DCTERMS instead of DC
continuous-integration/drone/push Build is passing Details
2021-03-14 15:52:51 +02:00
Alan Orth c9c277f8df
csv_metadata_quality/app.py: Update help text
continuous-integration/drone/push Build is passing Details
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
Alan Orth fb35afd937
CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
Alan Orth 0e9176f0a6
csv_metadata_quality/check.py: requests cache
Allow overriding the directory for the requests cache. In the case
of csv-metadata-quality-web, which currently runs on Google's App
Engine, we can only write to /tmp.
2021-03-14 09:07:35 +02:00
Alan Orth 1008acf35e
Always fix invalid multi-value separators
continuous-integration/drone/push Build is passing Details
This is no longer class-ified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
Alan Orth f00a07e2cd
README.md: Reorganize unsafe functionality
continuous-integration/drone/push Build is passing Details
2021-03-13 11:56:52 +02:00
Alan Orth 46098861ed
poetry.lock: Run poetry update
continuous-integration/drone/push Build is passing Details
2021-03-11 22:45:32 +02:00
Alan Orth fa84cfa440
Bump version to 0.4.6-dev 2021-03-11 22:44:36 +02:00
Alan Orth 6cc1401f88
pyproject.toml: Minimum Python is technically 3.7.1
continuous-integration/drone/push Build is passing Details
See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html
2021-03-11 13:41:58 +02:00
Alan Orth ad2cda8a41
README.md: Add note about SPDX license identifiers
continuous-integration/drone/push Build is passing Details
2021-03-11 12:21:34 +02:00
Alan Orth dc6920802e
.github/workflows/python-app.yml: Use Python 3.9
I now use this version in my development environment. Eventually I
should add a matrix of versions to use, but I don't know the GitHub
Actions syntax well enough yet.
2021-03-11 12:17:57 +02:00