Commit Graph

50 Commits

Author SHA1 Message Date
Alan Orth cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT Publicaçao
    - CIAT Publicación

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
Alan Orth 9f2dc0a0f5
Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
2021-03-17 09:53:07 +02:00
Alan Orth 330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
Alan Orth 10612cf891
Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't a
ctually need to check for them—I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
Alan Orth c9c277f8df
csv_metadata_quality/app.py: Update help text
continuous-integration/drone/push Build is passing Details
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
Alan Orth 1008acf35e
Always fix invalid multi-value separators
continuous-integration/drone/push Build is passing Details
This is no longer class-ified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
Alan Orth 6e4b0e5c1b
Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
Alan Orth dd2cfae047
csv_metadata_quality/app.py: Match dcterms.issued for dates
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
2021-02-28 15:11:06 +02:00
Alan Orth a7fc5a246c
Colorize output
continuous-integration/drone/push Build is failing Details
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
Alan Orth 0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
Alan Orth 28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
Alan Orth 87181bc7b8
Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
Alan Orth 49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
Alan Orth 8435ee242d
Experimental language detection using langid
Works decenty well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
Alan Orth f304ca6a33
csv_metadata_quality/app.py: Use simpler column iteration
I don't know where I got the other one...
2019-09-21 17:19:39 +03:00
Alan Orth 280a99c8a8
Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
Alan Orth d97dcd19db
Format with black 2019-08-29 01:10:39 +03:00
Alan Orth 81190d56bb
Add fix for missing space after commas
This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
2019-08-28 00:05:52 +03:00
Alan Orth 113e7cd8b6
csv_metadata_quality/app.py: Add ability to skip fields
The user may want to skip the checking and fixing of certain fields
in the input file.
2019-08-27 00:10:07 +03:00
Alan Orth ed5612fbcf
Add column name to output in date checks
This makes it easier to understand where the error is in case a CSV
has multiple date fields, for example:

    Missing date (dc.date.issued).
    Missing date (dc.date.issued[]).

If you have 126 items and you get 126 "Missing date" messages then
it's likely that 100 of the items have dates in one field, and the
others have dates in other field.
2019-08-21 15:31:12 +03:00
Alan Orth 9ce7dc6716
Add check for uncommon filenames
Generally we want people to upload documents in accessible formats
like PDF, Word, Excel, and PowerPoint. This check warns if a file
is using an uncommon extension.
2019-08-10 23:41:16 +03:00
Alan Orth 62fea95087
Improve suspicious character detection
Now it will print just the part of the metadata value that contains
the suspicious character (up to 80 characters, so we don't make the
line break on terminals that use 80 character width by default).

Also, print the name of the field in which the metadata value is so
that it is easier for the user to locate.
2019-08-09 01:25:40 +03:00
Alan Orth 8772bdec51
csv_metadata_quality/app.py: Explicitly exit with success 2019-08-04 09:10:37 +03:00
Alan Orth 6d4ecd75aa
csv_metadata_quality/app.py: Close files before exit 2019-08-04 09:10:19 +03:00
Alan Orth f4e7fd73f5
csv_metadata_quality/app.py: Handle Ctrl-C
Instead of printing an ugly two-page stack trace.
2019-08-03 21:11:57 +03:00
Alan Orth 0561300ebe
Add option to print version with `--version` or `-V`
I guess `-v` is more commonly used for "verbose" so I will use the
short option of `-V` for version.
2019-08-02 00:09:54 +03:00
Alan Orth bf876a046a
Rework AGROVOC validation
AGROVOC validation is now disabled by default, but can be enabled
on a field-by-field basis. For example, countries and regions are
also present in AGROVOC. Fields with these values can be enabled
using the new `--agrovoc-fields` option.

I reworked the script output to show the field name when printing
an invalid term so that the user knows in which field the term is.
2019-08-01 23:51:58 +03:00
Alan Orth 9100efdf50
Re-work as a proper standalone Python package
Add a setup.py so that installation is easier and a standalone CLI
script called csv-metadata-quality is provided. Now the user only
needs to run this from a virtual environment inside the project
directory:

    $ pip install .

Eventually I could publish this on PyPi when I settle on a more
appropriate package name.

See: https://packaging.python.org/tutorials/packaging-projects/
See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/
2019-07-31 17:34:36 +03:00
Alan Orth 40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't fig-
ure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A) direc-
tly or typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
Alan Orth 1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
Alan Orth a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
Alan Orth 1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
Alan Orth fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
Alan Orth 8047a57cc5
Add support for fixing "unnecessary" Unicode
These are things like non-breaking spaces, "replacement" characters,
etc that add nothing to the metadata and often cause errors during
parsing or displaying in a UI.
2019-07-29 16:38:10 +03:00
Alan Orth d73f7b54b1
csv_metadata_quality/app.py: Improve comments 2019-07-29 16:24:35 +03:00
Alan Orth 7b5db1f5d9
csv_metadata_quality/app.py: Remove erroneous comment 2019-07-29 16:15:25 +03:00
Alan Orth 40e77db713
Add "unsafe fixes" runtime option
In this case it fixes occurences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
Alan Orth a93b5b31c5
Add support for command line arguments
Currently only supports specifying input and output files with -i
and -o. Eventually I'll add more options like dry run, debug, and
maybe things like forcing unsafe fixes.
2019-07-28 20:31:57 +03:00
Alan Orth aadb3117eb
csv_metadata_quality/app.py: Remove unused test input files 2019-07-28 17:45:05 +03:00
Alan Orth e88d35ace3
csv_metadata_quality/app.py: Use regex in column match
Check for a column that has "issn" or "isbn" in the name rather
than by its explicit name, as the column is dc.identifier.issn now,
but will be cg.issn in the future if CG Core v2 happens.
2019-07-28 17:27:20 +03:00
Alan Orth 196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).

This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
Alan Orth e2bb2d4df9
Main function should be "main()" 2019-07-27 23:09:16 +03:00
Alan Orth c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
Alan Orth 18f26c343d
csv_metadata_quality/app.py: Fix path to test.csv 2019-07-27 00:25:30 +03:00
Alan Orth 84c3b17678
csv_metadata_quality/app.py: Add comment 2019-07-26 23:49:13 +03:00
Alan Orth aaf3537ba4
Add check for invalid multi-value separators 2019-07-26 23:48:24 +03:00
Alan Orth dfd961d720
Bring test.csv into project 2019-07-26 23:14:37 +03:00
Alan Orth e160b17fb0
Add ISSN and ISBN checks using python-stdnum 2019-07-26 23:14:10 +03:00
Alan Orth 232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, docs,
and tests in the future. Currently can be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00