I suspect this undermines the PyArrow backend performance gains in
recent Pandas 2.0.0, but we are dealing with messy data sometimes
and we must rely on data being strings.
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
Don't run the invalid separators fix on title fields because some
items use "|" in the title to indicate something like a subtitle.
For example:
Progress Review and Work Planning Meeting | Day 1
Add the country to the message about missing regions. This makes it
easier to see which country is triggering the missing region error,
and helps in case of debugging possible mistakes in the data coming
from the country_converter library.
I never used this and it seems xlrd doesn't even support .xlsx any-
more anyways. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.
Now the excludes are respected for item-based checks.
We need to make sure we're only manipulating the regions if we have
any missing. The previous code was always manipulating the existing
row, even when there were no missing regions, which resulted in new
values like "Eastern Africa||".
By default country_converter prints "not found in regex" if a coun-
try is not found. We can silence this by switching the logging lev-
el to something above WARNING.
The country_converter documentation says we should instantiate the
CountryConverter() class once instead of calling coco.convert() in
each iteration of the loop so we don't end up loading the data file
more than once.
It seems there was another logic error raised by the test in pytest.
With my real data, it was enough to check if the region column was
None, but with my test I was explicitly setting the region to "" (an
empty string). So to be really sure we should check if the string
is not None *and* if its length is greater than 0.
We actually want to do this after we try to fix mojibake with ftfy.
These "unnecessary" Unicode characters could actually help ftfy in
some cases because often times they indicate that some character
from another encoding was there before (like an accent, dash, or
smart quote).
Initialize the titles and citations before the for loop so we can
access them later. This makes it easier to check if the item actua-
lly has a citation.
This checks if the item title exists in the citation. If it is not
present it could just be missing, or could have minor differences
in the whitespace, accents, etc.
If unsafe fixes (-u) are enabled then we don't need to do the check
first before actually fixing them. Doing the check first creates e-
tra output that needs to be reviewed by the user.
Fix the incorrect type field regex, and improve the title regex to
consider dcterms.title and dc.title (along with the DSpace language
variants like dc.title[en_US]), but ignore dc.title.alternative.
See: https://regex101.com/r/I4m06F/1
Apparently we were stuck on an older version of requests-cache due
to the fact that we were using the caret, which will never update
the left-most (major) version. Upstream requests-cache is currently
version 0.6.4, and there seems to have been some changes to the API.