Commit Graph

28 Commits

Author SHA1 Message Date
Alan Orth cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
Alan Orth 39a4b1a487
Add mojibake to data/test.csv and tests 2021-03-19 10:28:33 +02:00
Alan Orth e8422bfa74
tests/test_check.py: Add test for duplicate items 2021-03-17 09:54:02 +02:00
Alan Orth 330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
Alan Orth 10612cf891
Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't a
ctually need to check for them—I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
Alan Orth 0089efa914
tests/test_check.py: Use dcterms.subject instead of dc.subject
Trying to move some old DC fields to DCTERMS.
2021-03-11 11:45:25 +02:00
Alan Orth 5318953150
tests/test_check.py: Add tests for licenses 2021-03-11 10:36:26 +02:00
Alan Orth a7fc5a246c
Colorize output
continuous-integration/drone/push Build is failing Details
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
Alan Orth 7edb8b19d7
tests/test_check.py: Reformat with black 2021-01-03 15:50:21 +02:00
Alan Orth 29e67a0887
Add tests for unnecessary multi-value separators 2021-01-03 15:37:18 +02:00
Alan Orth 28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
Alan Orth 87181bc7b8
Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
Alan Orth 604bd5bda6
Reformat tests with black 2019-09-26 14:02:51 +03:00
Alan Orth 8435ee242d
Experimental language detection using langid
Works decenty well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
Alan Orth 86d4623fd3
More ISO 639-1 and ISO 639-3 fixes
ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes.
Technically there ISO 639-2/T and ISO 639-2/B, which also uses three
letter codes, but those are not supported by the pycountry library
so I won't even worry about them.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-26 07:44:39 +03:00
Alan Orth d9fc09f121
Fix references to ISO 639
It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2
is the three-letter codes, aka alpha2 and alpha3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-11 16:36:53 +03:00
Alan Orth e7cb8920db
tests/test_check.py: Update date tests 2019-08-21 15:34:52 +03:00
Alan Orth e801042340
tests/test_check.py: Fix unused result
We don't need to capture the function's return value here because
pytest will capture stdout from the function.
2019-08-10 23:45:41 +03:00
Alan Orth 62ef2a4489
tests/test_check.py: Add tests for file extensions 2019-08-10 23:44:13 +03:00
Alan Orth d93c2aae13
tests/test_check.py: Update suspicious character check
The suspicious character check was updated to include the name of
the field where the metadata value with the  suspicious character
exists.
2019-08-09 01:26:38 +03:00
Alan Orth 456b8a2f26
Update tests 2019-08-01 23:59:11 +03:00
Alan Orth 1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
Alan Orth a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
Alan Orth fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
Alan Orth 87b1997051
Fix whitespace errors found by flake8 2019-07-28 17:47:28 +03:00
Alan Orth ce8f140c66
tests: Remove unused pytest import 2019-07-28 17:42:54 +03:00
Alan Orth 196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).

This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
Alan Orth a849615b41
Add tests for check functions
Relies on capturing stdout.

See: https://docs.pytest.org/en/5.0.1/capture.html
2019-07-27 02:10:13 +03:00