1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-05-09 06:36:00 +02:00
Commit Graph

384 Commits

Author SHA1 Message Date
8a27fb2589 Add check for missing DOIs
All checks were successful
continuous-integration/drone/push Build is passing
Sometimes an editor includes a DOI in the citation field, but does
not add a standalone DOI field.
2021-10-06 21:25:39 +03:00
831ce979c3 CHANGELOG.md: Clarify regex fixes 2021-10-06 21:23:35 +03:00
58ef62fbcd Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-10-06 21:20:35 +03:00
8c59f57e76 poetry.lock: Run poetry update 2021-10-06 21:19:54 +03:00
72dd3e7272 CHANGELOG.md: Add notes about regexes 2021-10-06 19:35:59 +03:00
6ba16d5d4c csv_metadata_quality/check.py: Fix duplicate checker
Fix the incorrect type field regex, and improve the title regex to
consider dcterms.title and dc.title (along with the DSpace language
variants like dc.title[en_US]), but ignore dc.title.alternative.

See: https://regex101.com/r/I4m06F/1
2021-10-06 19:32:40 +03:00
81069259ba CHANGELOG.md: Add note about bibliographicCitation
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-06 16:16:51 +03:00
54ab869297 csv_metadata_quality/experimental.py: Adjust citation match
We need to match both of these citation fields:

- dc.identifier.citation
- dcterms.bibliographicCitation
2021-10-06 16:13:10 +03:00
22b359c8a8 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-09-27 14:15:01 +03:00
3e06788d88 poetry.lock: Run poetry update 2021-09-27 14:11:21 +03:00
3c41cc283f Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-09-06 21:04:05 +03:00
5741e94571 poetry.lock: Run poetry update 2021-09-06 21:03:30 +03:00
215d61c188 pyproject.toml: limit SQLAlchemy to < 1.4.23
SQLAlchemy gets pulled in by csvkit's agate-sql dependency and there
is currently an issue with Poetry's parsing of the SQLAlchemy 1.4.23
constraints. Temporarily explicitly install a version of SQLAlchemy
that works (can remove later once Poetry fixes this). Anyways, I am
not using any SQLAlchemy features that I know of.

See: https://github.com/python-poetry/poetry/issues/4402
2021-09-06 21:01:09 +03:00
11ddde3327 data/test.csv: Update mojibake example
All checks were successful
continuous-integration/drone/push Build is passing
I was trying to find where I got this one and it seems to have been
the other way around. Doesn't matter here only that I was curious.
2021-08-19 15:48:41 +03:00
a347878d43 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-08-12 21:49:36 +03:00
a89bc331f0 poetry.lock: Run poetry update
Lots of minor dependencies updates. All tests still passing with
pytest.
2021-08-12 21:47:46 +03:00
af3493c724 CITATION.cff: Remove YAML formatting
All checks were successful
continuous-integration/drone/push Build is passing
GitHub says it can't parse my CITATION.cff file. The example in the
docs shows version 1.2.0 also, I wonder if that's relevant.

See: https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files
2021-07-28 21:23:30 +03:00
52644bf83e Add CITATION.cff
Created with the cffinit tool:

https://citation-file-format.github.io/cff-initializer-javascript/
2021-07-28 21:11:11 +03:00
c8f5539d21 Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-07-06 15:47:44 +03:00
382d0d6aed Run poetry update 2021-07-06 15:37:57 +03:00
b8f4be9ebb pyproject.toml: Update pytest-clarity and black
These seem to have much newer versions that didn't get updated in
this project due to the version pinning selector I was using with
poetry.

In the case of pytest-clarity the previous version was 0.3.1 and
the version selector was a caret (^), which will never update the
left-most (major) number. Now they seem to be on 1.x.x so it will
be OK in the future.

In the case of black, they use weird numbering so it's anyone's
guess how this will work! Luckily it's only used for linting and
formatting.
2021-07-06 15:30:41 +03:00
4e2eab68b0 Update requests-cache
Apparently we were stuck on an older version of requests-cache due
to the fact that we were using the caret, which will never update
the left-most (major) version. Upstream requests-cache is currently
version 0.6.4, and there seems to have been some changes to the API.
2021-07-06 15:24:39 +03:00
55165cb4ce Update requirements
All checks were successful
continuous-integration/drone/push Build is passing
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.
2021-06-14 12:52:47 +03:00
93d3eabfba poetry.lock: Run poetry update 2021-06-14 12:52:28 +03:00
a8fe623f4c csv_metadata_quality/check.py: Remove unnecessary pass
All checks were successful
continuous-integration/drone/push Build is passing
LGTM warned that these pass statements are not necessary.

See: https://lgtm.com/rules/910088/
2021-04-20 08:20:13 +03:00
dbc0437d59 CHANGELOG.md: Add note about Python deps
All checks were successful
continuous-integration/drone/push Build is passing
2021-04-14 16:16:02 +03:00
96ce1daa90 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-04-14 16:15:28 +03:00
3adb52d7c0 poetry.lock: Run poetry update 2021-04-14 16:14:37 +03:00
f958d1879f poetry.lock: Run poetry update
All checks were successful
continuous-integration/drone/push Build is passing
2021-04-02 16:19:16 +03:00
bd8943f36a csv_metadata_quality/app.py: Don't crash if fields are missing
All checks were successful
continuous-integration/drone/push Build is passing
We don't need to crash if someone feeds us a CSV file that is miss-
ing commont DSpace fields like title, type, and subject.
2021-03-21 19:47:29 +02:00
28f9026286 README.md: Minor edit
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 16:26:31 +02:00
cfe09f7126 Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
8eddb76aab Bump version to 0.4.8-dev
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 11:53:56 +02:00
a04dbc50db Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
28335ed159 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-19 10:29:15 +02:00
773a0a2695 poetry.lock: Run poetry update 2021-03-19 10:28:55 +02:00
39a4b1a487 Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
898bb412c3 Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT Publicaçao
    - CIAT Publicación

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
e92ec5d371 README.md: Add note about duplicate checking
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:12:03 +02:00
f816e17fe7 Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
v0.4.7
2021-03-17 10:00:34 +02:00
9061c7c79b setup.py: Remove beta tag
I think this is only used by pypi.org?
2021-03-17 10:00:09 +02:00
661d05b977 Update requirements
Generated with poetry export:

    $ poetry export --without-hashes -f requirements.txt > requirements.txt
    $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt

I am trying `--without-hashes` to work around an error on pip install
when running in CI:

    ERROR: In --require-hashes mode, all requirements must have
their versions pinned with ==.
2021-03-17 09:58:35 +02:00
652b7ea98c CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
65da6e9b05 poetry.lock: Run pipenv update 2021-03-17 09:57:31 +02:00
a313b7527a CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
51ee370697 data/test.csv: Add duplicate item 2021-03-17 09:54:14 +02:00
e8422bfa74 tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
9f2dc0a0f5 Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
2021-03-17 09:53:07 +02:00
14010896a5 csv_metadata_quality/experimental.py: Move all imports to top of file
All checks were successful
continuous-integration/drone/push Build is passing
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
ab3af2ec62 csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00