1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-26 23:58:18 +01:00
Commit Graph

18 Commits

Author SHA1 Message Date
ed5612fbcf
Add column name to output in date checks
This makes it easier to understand where the error is in case a CSV
has multiple date fields, for example:

    Missing date (dc.date.issued).
    Missing date (dc.date.issued[]).

If you have 126 items and you get 126 "Missing date" messages then
it's likely that 100 of the items have dates in one field, and the
others have dates in other field.
2019-08-21 15:31:12 +03:00
13d5221378
csv_metadata_quality/check.py: Fix test for False 2019-08-10 23:52:53 +03:00
9ce7dc6716
Add check for uncommon filenames
Generally we want people to upload documents in accessible formats
like PDF, Word, Excel, and PowerPoint. This check warns if a file
is using an uncommon extension.
2019-08-10 23:41:16 +03:00
62fea95087
Improve suspicious character detection
Now it will print just the part of the metadata value that contains
the suspicious character (up to 80 characters, so we don't make the
line break on terminals that use 80 character width by default).

Also, print the name of the field in which the metadata value is so
that it is easier for the user to locate.
2019-08-09 01:25:40 +03:00
bf876a046a
Rework AGROVOC validation
AGROVOC validation is now disabled by default, but can be enabled
on a field-by-field basis. For example, countries and regions are
also present in AGROVOC. Fields with these values can be enabled
using the new `--agrovoc-fields` option.

I reworked the script output to show the field name when printing
an invalid term so that the user knows in which field the term is.
2019-08-01 23:51:58 +03:00
3c798fb504
Use pycountry instead of iso-639 for languages
The latter is a fork that hasn't been updated since 2016 and the
original still seems to be well maintained, with recent database
updates as well as tests for Python 3.7.

Also, pycountry supports ISO 3166-2 (administrative zones), which
we could eventually use for sub regions.
2019-07-30 16:39:26 +03:00
4e3511cd55
csv_metadata_quality/check.py: Fix AGROVOC lookup
We actually only need to see if there are more than zero matches
because a term like "Nigeria" will match in English, Spanish, etc,
whereas terms that *really* don't match will have zero results.
2019-07-30 14:51:44 +03:00
1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
d7888d59a8
csv_metadata_quality/check.py: Return date even if it is invalid
Otherwise it is missing from the final CSV and then we can't even
fix it. :)
2019-07-29 17:40:14 +03:00
fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
42920e9c7c
Test Python regular expression matches directly
Match objects always have a boolean value of True.

See: https://docs.python.org/3.7/library/re.html
2019-07-29 16:16:30 +03:00
87b1997051
Fix whitespace errors found by flake8 2019-07-28 17:47:28 +03:00
196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).

This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
3cf9f9452b
csv_metadata_quality/check.py: Always return field
We always need to return the field back so apply doesn't set it to
null when creating the new data frame.
2019-07-27 01:28:08 +03:00
aaf3537ba4
Add check for invalid multi-value separators 2019-07-26 23:48:24 +03:00
02f9d8a736
csv_metadata_quality/check.py: Add check for missing isbn values 2019-07-26 23:45:18 +03:00
e160b17fb0
Add ISSN and ISBN checks using python-stdnum 2019-07-26 23:14:10 +03:00