We actually want to do this after we try to fix mojibake with ftfy.
These "unnecessary" Unicode characters could actually help ftfy in
some cases because often times they indicate that some character
from another encoding was there before (like an accent, dash, or
smart quote).
Initialize the titles and citations before the for loop so we can
access them later. This makes it easier to check if the item actua-
lly has a citation.
This checks if the item title exists in the citation. If it is not
present it could just be missing, or could have minor differences
in the whitespace, accents, etc.
If unsafe fixes (-u) are enabled then we don't need to do the check
first before actually fixing them. Doing the check first creates e-
tra output that needs to be reviewed by the user.
Fix the incorrect type field regex, and improve the title regex to
consider dcterms.title and dc.title (along with the DSpace language
variants like dc.title[en_US]), but ignore dc.title.alternative.
See: https://regex101.com/r/I4m06F/1
Apparently we were stuck on an older version of requests-cache due
to the fact that we were using the caret, which will never update
the left-most (major) version. Upstream requests-cache is currently
version 0.6.4, and there seems to have been some changes to the API.
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.
For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:
- CIAT Publicaçao
- CIAT Publicación
The correct version of these in UTF-8 would be:
- CIAT Publicaçao
- CIAT Publicación
I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.
See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).
Imports sorted with isort.
See: https://www.python.org/dev/peps/pep-0008/#imports
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.
I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
Allow overriding the directory for the requests cache. In the case
of csv-metadata-quality-web, which currently runs on Google's App
Engine, we can only write to /tmp.
This is no longer class-ified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).
Imports sorted with isort.
See: https://www.python.org/dev/peps/pep-0008/#imports
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
We should also allow ISO 8601 extended in combined date and time
format. DSpace does not have a problem with dates in this format
and I have found some metadata that uses this date format.
For example: 2020-08-31T11:04:56Z
See: https://en.wikipedia.org/wiki/ISO_8601
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.
For example: "Kenya||Tanzania||"
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.