AGROVOC validation is now disabled by default, but can be enabled
on a field-by-field basis using the new `--agrovoc-fields` option.
This is useful beyond subject fields: countries and regions, for
example, are also present in AGROVOC.
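
A minimal sketch of how an opt-in, per-field option like this could be
parsed. Only the `--agrovoc-fields` name comes from the text above; the
comma-separated format, the helper names, and the example field names
are assumptions for illustration, not the script's actual code.

```python
import argparse


def parse_agrovoc_fields(argv):
    """Return the list of field names that opted in to AGROVOC validation."""
    parser = argparse.ArgumentParser(description="Sketch of opt-in AGROVOC validation")
    # Assumption: the option takes a comma-separated list of field names.
    parser.add_argument(
        "--agrovoc-fields",
        default="",
        help="comma-separated list of fields to validate against AGROVOC",
    )
    args = parser.parse_args(argv)
    # An empty default means validation stays disabled for every field.
    return [name.strip() for name in args.agrovoc_fields.split(",") if name.strip()]


def should_validate(field_name, agrovoc_fields):
    """Validate a field only if the user explicitly listed it."""
    return field_name in agrovoc_fields
```

With no option given the list is empty, so no field is validated.
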
I reworked the script output to show the field name when printing
an invalid term so that the user knows in which field the term is.
This was tricky because of the nature of newlines. We only remove
Unix line feeds (U+000A) here, because Windows carriage returns are
already removed by the string stripping in the whitespace fix.
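
Roughly, the two steps could look like the sketch below. The function
name is hypothetical, and it assumes the whitespace fix is a plain
`str.strip()` that runs before the newline fix.

```python
def remove_newlines(value):
    """Strip surrounding whitespace, then drop embedded line feeds."""
    # Assumption: the whitespace fix is a plain strip(), which removes
    # leading and trailing whitespace, including a trailing "\r\n".
    stripped = value.strip()
    # Only Unix line feeds (U+000A) are removed here; a carriage return
    # in the middle of the value would be left untouched.
    return stripped.replace("\n", "")
```
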
Creating the test case in Vim was difficult because I couldn't
figure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A)
directly nor typing an "Enter" character after ^V worked. Grrr.
It makes the summary of passes and fails more annoying to read due
to the lines being long and including newlines. I am actually not
sure why this fixes it, though...
See: https://pytest.readthedocs.io/en/latest/capture.html
The latter is a fork that hasn't been updated since 2016 and the
original still seems to be well maintained, with recent database
updates as well as tests for Python 3.7.
Also, pycountry supports ISO 3166-2 (country subdivisions), which
we could eventually use for subregions.
We only need to check whether there are more than zero matches,
because a term like "Nigeria" will match in English, Spanish, etc.,
whereas terms that *really* don't match will have zero results.
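
As a sketch, the check reduces to counting results. The response shape
here (a decoded JSON object with a "results" list) and the function
name are assumptions for illustration.

```python
def term_matches(response):
    """Return True if a search response contains at least one match."""
    # Assumption: the decoded JSON has a "results" list with one entry
    # per match. The same term may appear once per language, so the
    # count itself is not meaningful, only whether it is non-zero.
    return len(response.get("results", [])) > 0
```
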
This seems to produce more informative output, though I'm not
sure how to filter the annoying deprecation warnings pytest throws
about things that are usually done by modules I'm using, not by me.
From: https://github.com/taoufik07/responder/blob/master/pytest.ini
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.
Does not currently have any support for generic values like "Other".
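
A sketch of dispatching on the length of the value. The tiny code sets
and the function name are hypothetical stand-ins; it assumes two- and
three-character values are looked up in separate code lists, and a real
check would consult the full lists instead.

```python
# Hypothetical stand-ins for the full code lists.
TWO_LETTER_CODES = {"en", "es", "fr"}
THREE_LETTER_CODES = {"eng", "spa", "fra"}


def is_valid_language(value):
    """Pick the code list by length; anything else is invalid."""
    if len(value) == 2:
        return value in TWO_LETTER_CODES
    if len(value) == 3:
        return value in THREE_LETTER_CODES
    # Generic values such as "Other" fall through and are reported
    # as invalid.
    return False
```
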
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.
It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
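
A sketch of such a check. The set of characters and the function names
are assumptions; the actual check may look for a different set.

```python
import re

# Assumption: a small set of standalone (spacing) accent characters:
# U+02C6 circumflex, U+00B4 acute, U+00A8 diaeresis, U+02DC small tilde.
SPACING_ACCENTS = re.compile("[\u02c6\u00b4\u00a8\u02dc]")


def has_spacing_accent(value):
    """Return True if the value contains a standalone accent character."""
    return bool(SPACING_ACCENTS.search(value))


def warn_spacing_accent(value, field_name):
    """Print a warning only; the value itself is not modified."""
    if has_spacing_accent(value):
        print(f"Suspicious character ({field_name}): {value}")
```

A precomposed character such as "ê" (U+00EA) is not flagged, only the
standalone accent next to a plain letter.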