Commit Graph

72 Commits

Author SHA1 Message Date
Alan Orth 0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
Alan Orth 7cfd4c0b59
csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
Alan Orth 431e6331c8
csv_metadata_quality/check.py: Format with black 2020-07-06 14:10:19 +03:00
Alan Orth cb07d357d4
Version 0.4.2 2020-07-06 14:04:34 +03:00
Alan Orth 2a1566af62
csv_metadata_quality/check.py: Parameterize AGROVOC request 2020-07-06 13:44:46 +03:00
Alan Orth 5fcaa63bd5
csv_metadata_quality/check.py: Prune requests cache once
We only need to prune the requests cache once before using it, not
for every value we check.
2020-07-06 13:42:19 +03:00
Alan Orth 28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
Alan Orth 0b2d211455
Version 0.4.1 2020-01-15 12:19:42 +02:00
Alan Orth 365ecda324
Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
Alan Orth 705127fd28
Version 0.4.0 2020-01-15 11:44:56 +02:00
Alan Orth 87181bc7b8
Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
Alan Orth 49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
Alan Orth efdc3a841a
Version 0.3.1 2019-10-01 17:11:13 +03:00
Alan Orth e55380b4d5
csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
Alan Orth c42f8b4812
csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal sp-
aces instead of removing them.
2019-10-01 16:55:04 +03:00
Alan Orth e15c98cccb
Move unreleased changes to v0.3.0 2019-09-26 14:06:31 +03:00
Alan Orth 8435ee242d
Experimental language detection using langid
Works decenty well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
Alan Orth 86d4623fd3
More ISO 639-1 and ISO 639-3 fixes
ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes.
Technically there ISO 639-2/T and ISO 639-2/B, which also uses three
letter codes, but those are not supported by the pycountry library
so I won't even worry about them.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-26 07:44:39 +03:00
Alan Orth f304ca6a33
csv_metadata_quality/app.py: Use simpler column iteration
I don't know where I got the other one...
2019-09-21 17:19:39 +03:00
Alan Orth d9fc09f121
Fix references to ISO 639
It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2
is the three-letter codes, aka alpha2 and alpha3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-11 16:36:53 +03:00
Alan Orth 280a99c8a8
Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
Alan Orth d97dcd19db
Format with black 2019-08-29 01:10:39 +03:00
Alan Orth c354a3687c
Release version 0.2.2 2019-08-28 00:10:17 +03:00
Alan Orth 81190d56bb
Add fix for missing space after commas
This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
2019-08-28 00:05:52 +03:00
Alan Orth 113e7cd8b6
csv_metadata_quality/app.py: Add ability to skip fields
The user may want to skip the checking and fixing of certain fields
in the input file.
2019-08-27 00:10:07 +03:00
Alan Orth 884e8f970d
csv_metadata_quality/check.py: Simplify AGROVOC check
I recycled this code from a separate agrovoc-lookup.py script that
checks lines in a text file to see if they are valid AGROVOC terms
or not. There I was concerned about skipping comments or something
I think, but we don't need to check that here. We simply check the
term that is in the field and inform the user if it's valid or not.
2019-08-21 16:35:29 +03:00
Alan Orth ed5612fbcf
Add column name to output in date checks
This makes it easier to understand where the error is in case a CSV
has multiple date fields, for example:

    Missing date (dc.date.issued).
    Missing date (dc.date.issued[]).

If you have 126 items and you get 126 "Missing date" messages then
it's likely that 100 of the items have dates in one field, and the
others have dates in other field.
2019-08-21 15:31:12 +03:00
Alan Orth 7255bf4707
Version 0.2.1 2019-08-11 10:39:39 +03:00
Alan Orth 232ff99898
csv_metadata_quality/fix.py: Add more unneccessary Unicode fixes
Add a check for soft hyphens (U+00AD). In one sample CSV I have a
normal hyphen followed by a soft hyphen in an ISBN. This causes the
ISBN validation to fail.
2019-08-11 00:07:21 +03:00
Alan Orth 13d5221378
csv_metadata_quality/check.py: Fix test for False 2019-08-10 23:52:53 +03:00
Alan Orth 9ce7dc6716
Add check for uncommon filenames
Generally we want people to upload documents in accessible formats
like PDF, Word, Excel, and PowerPoint. This check warns if a file
is using an uncommon extension.
2019-08-10 23:41:16 +03:00
Alan Orth 5ff584a8d7
Version 0.2.0 2019-08-09 01:39:51 +03:00
Alan Orth 62fea95087
Improve suspicious character detection
Now it will print just the part of the metadata value that contains
the suspicious character (up to 80 characters, so we don't make the
line break on terminals that use 80 character width by default).

Also, print the name of the field in which the metadata value is so
that it is easier for the user to locate.
2019-08-09 01:25:40 +03:00
Alan Orth 8772bdec51
csv_metadata_quality/app.py: Explicitly exit with success 2019-08-04 09:10:37 +03:00
Alan Orth 6d4ecd75aa
csv_metadata_quality/app.py: Close files before exit 2019-08-04 09:10:19 +03:00
Alan Orth f4e7fd73f5
csv_metadata_quality/app.py: Handle Ctrl-C
Instead of printing an ugly two-page stack trace.
2019-08-03 21:11:57 +03:00
Alan Orth 85ae7bdc5a
Increment version to 0.1.0 2019-08-02 00:12:47 +03:00
Alan Orth 0561300ebe
Add option to print version with `--version` or `-V`
I guess `-v` is more commonly used for "verbose" so I will use the
short option of `-V` for version.
2019-08-02 00:09:54 +03:00
Alan Orth bf876a046a
Rework AGROVOC validation
AGROVOC validation is now disabled by default, but can be enabled
on a field-by-field basis. For example, countries and regions are
also present in AGROVOC. Fields with these values can be enabled
using the new `--agrovoc-fields` option.

I reworked the script output to show the field name when printing
an invalid term so that the user knows in which field the term is.
2019-08-01 23:51:58 +03:00
Alan Orth 576b3a3638
csv_metadata_quality/__main__.py: Fix spacing
Identified by flake8.
2019-08-01 23:28:16 +03:00
Alan Orth 9100efdf50
Re-work as a proper standalone Python package
Add a setup.py so that installation is easier and a standalone CLI
script called csv-metadata-quality is provided. Now the user only
needs to run this from a virtual environment inside the project
directory:

    $ pip install .

Eventually I could publish this on PyPi when I settle on a more
appropriate package name.

See: https://packaging.python.org/tutorials/packaging-projects/
See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/
2019-07-31 17:34:36 +03:00
Alan Orth 40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't fig-
ure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A) direc-
tly or typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
Alan Orth 3c798fb504
Use pycountry instead of iso-639 for languages
The latter is a fork that hasn't been updated since 2016 and the
original still seems to be well maintained, with recent database
updates as well as tests for Python 3.7.

Also, pycountry supports ISO 3166-2 (administrative zones), which
we could eventually use for sub regions.
2019-07-30 16:39:26 +03:00
Alan Orth 4e3511cd55
csv_metadata_quality/check.py: Fix AGROVOC lookup
We actually only need to see if there are more than zero matches
because a term like "Nigeria" will match in English, Spanish, etc,
whereas terms that *really* don't match will have zero results.
2019-07-30 14:51:44 +03:00
Alan Orth 1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
Alan Orth a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
Alan Orth 1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
Alan Orth d7888d59a8
csv_metadata_quality/check.py: Return date even if it is invalid
Otherwise it is missing from the final CSV and then we can't even
fix it. :)
2019-07-29 17:40:14 +03:00
Alan Orth 50ae4e17f2
csv_metadata_quality/fix.py: Fix indent 2019-07-29 17:14:48 +03:00
Alan Orth fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00