This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
I recycled this code from a separate agrovoc-lookup.py script that
checks lines in a text file to see if they are valid AGROVOC terms
or not. There I was concerned about skipping comments or something
I think, but we don't need to check that here. We simply check the
term that is in the field and inform the user if it's valid or not.
This makes it easier to understand where the error is in case a CSV
has multiple date fields, for example:
Missing date (dc.date.issued).
Missing date (dc.date.issued[]).
If you have 126 items and you get 126 "Missing date" messages then
it's likely that 100 of the items have dates in one field, and the
others have dates in other field.
Add a check for soft hyphens (U+00AD). In one sample CSV I have a
normal hyphen followed by a soft hyphen in an ISBN. This causes the
ISBN validation to fail.
Generally we want people to upload documents in accessible formats
like PDF, Word, Excel, and PowerPoint. This check warns if a file
is using an uncommon extension.
This helps you understand the cryptic assertion error output from
pytest. For some reason pytest-clarity is a pre-release package so
we need to install it in pipenv with --pre.
Now it will print just the part of the metadata value that contains
the suspicious character (up to 80 characters, so we don't make the
line break on terminals that use 80 character width by default).
Also, print the name of the field in which the metadata value is so
that it is easier for the user to locate.