151 Commits

Author SHA1 Message Date
Alan Orth d5c25f82fa
Update SPDX license list
From: https://github.com/spdx/license-list-data/blob/main/json/licenses.json
2024-03-02 10:38:27 +03:00
Alan Orth a21ffb0fa8
Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.

See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
Alan Orth f6018c51b6
Apply fixes from fixit
Apply recommended fix from fixit:

    RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must
    be looked up in the global scope in case it has been rebound.
2023-11-22 21:54:50 +03:00
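The fixit rule quoted in the commit above can be checked with the standard `dis` module. This is a minimal sketch, assuming CPython; the helper functions are hypothetical and only illustrate the difference in bytecode, they are not taken from the project.

```python
import dis


def make_with_call():
    # Calling list() requires looking up the name "list" at run time,
    # because it could have been rebound in the global scope.
    return list()


def make_with_literal():
    # The empty literal compiles directly to a list-building instruction.
    return []


def opnames(func):
    """Return the set of bytecode operation names used by func."""
    return {instruction.opname for instruction in dis.get_instructions(func)}


print(sorted(opnames(make_with_call)))
print(sorted(opnames(make_with_literal)))
```

On CPython the first set should include a global name lookup, while the second should only build the list.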
Alan Orth 1f637f32cd
Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
Alan Orth b8dc19cc3f
csv_metadata_quality/check.py: enable requests-cache
This was disabled at some point. We also need to use the new delete
method instead.
2023-10-15 23:21:58 +03:00
Alan Orth 93c9b739ac
csv_metadata_quality/check.py: use HTTPS
Use HTTPS for AGROVOC REST API.
2023-10-15 22:38:45 +03:00
Alan Orth d21d2621e3
csv_metadata_quality/app.py: read fields as strings
I suspect this undermines the PyArrow backend performance gains in
recent Pandas 2.0.0, but we are dealing with messy data sometimes
and we must rely on data being strings.
2023-06-12 10:42:50 +03:00
Alan Orth f3fb1ff7fb
Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
Alan Orth e2d46e9495
csv_metadata_quality/app.py: skip newline fix on description
The description field often has free-form text like the abstract and
there are too many legitimate newlines here to be correcting them
automatically.
2023-04-22 12:16:13 -07:00
Alan Orth 1491e1edb0
Fix path to data/licenses.json
When we install and run this from CI, this file needs to exist in
the package's folder inside site-packages. Then we can use __file__
to get the path relative to the package.

See: https://python-packaging.readthedocs.io/en/latest/non-code-files.html
2023-04-05 15:28:21 +03:00
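The `__file__` approach described in the commit above might look roughly like the following. This is a sketch only: `package_data_path` is a hypothetical helper name, and the `data` folder is taken from the commit title.

```python
from pathlib import Path


def package_data_path(module_file: str, name: str) -> Path:
    """Build the path to a data file stored next to a module.

    module_file is expected to be the module's __file__ value, so the
    result points inside the installed package folder.
    """
    return Path(module_file).resolve().parent / "data" / name


# Inside a package module this would be called with __file__, for example:
# licenses_path = package_data_path(__file__, "licenses.json")
```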
Alan Orth d4aed378cf
Switch to pandas 2.0.0rc1
Seems to work fine with the new PyArrow datatypes.
2023-03-22 12:16:56 +03:00
Alan Orth d661ffe439
Check comma space on bibliographicCitation too
The regex was only matching `dc.identifier.citation`, but we need
to match `dcterms.bibliographicCitation` too.
2023-03-10 16:13:16 +03:00
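A pattern covering both citation field names from the commit above could be sketched as follows. The exact regex used by the project is not shown in the log, so this pattern, the optional language qualifier, and the function name are assumptions.

```python
import re

# Hypothetical pattern: matches either citation field name, optionally
# followed by a language qualifier such as "[en_US]".
CITATION_FIELD = re.compile(
    r"^(dc\.identifier\.citation|dcterms\.bibliographicCitation)(\[.*\])?$"
)


def is_citation_field(column: str) -> bool:
    """Return True if the column name is one of the citation fields."""
    return CITATION_FIELD.match(column) is not None
```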
Alan Orth 45a310387a
Don't fix multi-value separators on citations
2023-03-10 16:12:30 +03:00
Alan Orth fdccdf7318
Version 0.6.1
2023-02-23 13:46:56 +03:00
Alan Orth 53fdb50906
csv_metadata_quality/check.py: run black
2023-02-18 22:10:04 +03:00
Alan Orth 8bc4cd419c
Strip filename descriptions before checking
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
2023-02-13 11:00:57 +03:00
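Stripping the SAF Bundler description before looking at the extension, as the commit above describes, can be sketched like this. The function names are hypothetical; only the `__description:` marker comes from the commit message.

```python
from pathlib import Path

# Marker that separates the filename from its SAF Bundler description.
DESCRIPTION_MARKER = "__description:"


def strip_description(filename: str) -> str:
    """Drop everything from the description marker onwards."""
    return filename.split(DESCRIPTION_MARKER, 1)[0]


def extension(filename: str) -> str:
    """Return the lowercased file extension, ignoring any description."""
    return Path(strip_description(filename)).suffix.lower()
```

With the example from the commit message, `extension("Annual_Report_2020.pdf__description:Report")` gives `.pdf` rather than a suffix polluted by the description.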
Alan Orth 8db1e36a6d
csv_metadata_quality/app.py: skip abstract in separator check
Also skip abstract in the separator check, since it's rare to have
any "|" here, but more likely that if one is present then it's for
a reason.
2023-02-13 10:37:33 +03:00
Alan Orth fbb625be5c
Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
Alan Orth 545bb8cd0c
csv_metadata_quality/app.py: disable whitespace on abstracts
It's too aggressive on abstracts. If people paste in text from a
PDF there are often newlines, and most of the time this is what
they want.
2023-02-07 16:48:40 +03:00
Alan Orth 3596381d03
csv_metadata_quality/app.py: separators fix
Don't run the invalid separators fix on title fields because some
items use "|" in the title to indicate something like a subtitle.

For example:

    Progress Review and Work Planning Meeting | Day 1
2023-01-24 14:13:55 +03:00
Alan Orth 7cc49b500d
Use licenses.json from SPDX instead of spdx-license-list
spdx-license-list has been deprecated[1] and already has outdated
information compared to recent SPDX data releases. Now I use the
JSON license data directly from SPDX[2] (currently version 3.19).

The JSON file is loaded from the package's data directory using
Python 3's stdlib functions from importlib[3], though we now need
Python 3.9 as a minimum for importlib.resources.files[4].

Also note that the data directory is not properly packaged via
setuptools, so this only works for local installs, and not via
versions published to pypi, for example (I'm currently not doing
this anyways). If I want to publish this in the future I will
need to modify setup.py/pyproject.toml to include the data files.

[1] https://gitlab.com/uniqx/spdx-license-list
[2] https://github.com/spdx/license-list-data/blob/main/json/licenses.json
[3] https://copdips.com/2022/09/adding-data-files-to-python-package-with-setup-py.html
[4] https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files
2022-12-13 10:39:17 +03:00
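Loading the JSON file with `importlib.resources.files`, as described in the commit above, might look like the sketch below. It assumes the SPDX file has a top-level `licenses` list whose entries carry a `licenseId` key. The `demo_pkg` package and both function names are hypothetical and exist only so the example can run on its own.

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files
from pathlib import Path


def load_license_ids(package: str) -> set[str]:
    """Read data/licenses.json from an importable package (Python 3.9+)."""
    resource = files(package) / "data" / "licenses.json"
    document = json.loads(resource.read_text(encoding="utf-8"))
    return {entry["licenseId"] for entry in document["licenses"]}


def make_demo_package(root: Path, name: str = "demo_pkg") -> None:
    """Create a throwaway package with a tiny licenses.json (demo only)."""
    data_dir = root / name / "data"
    data_dir.mkdir(parents=True)
    (root / name / "__init__.py").write_text("", encoding="utf-8")
    sample = {"licenses": [{"licenseId": "MIT"}, {"licenseId": "GPL-3.0-only"}]}
    (data_dir / "licenses.json").write_text(json.dumps(sample), encoding="utf-8")


# Build the demo package in a temporary directory and make it importable.
demo_root = Path(tempfile.mkdtemp())
make_demo_package(demo_root)
sys.path.insert(0, str(demo_root))
importlib.invalidate_caches()

license_ids = load_license_ids("demo_pkg")
print(sorted(license_ids))
```

As the commit notes, this only works when the data directory actually ships with the installed package.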
Alan Orth 051777bcec
Ignore subregion field for missing region checks
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
Alan Orth 141b2e1da3
csv_metadata_quality/check.py: update region output
Add the country to the message about missing regions. This makes it
easier to see which country is triggering the missing region error,
and helps in case of debugging possible mistakes in the data coming
from the country_converter library.
2022-11-28 17:40:27 +03:00
Alan Orth 58b7b6e9d8
Version 0.6.0
2022-09-02 16:35:58 +03:00
Alan Orth 566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx
anymore anyways. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
Alan Orth 040e56fc76
Improve exclude function
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.

Now the excludes are respected for item-based checks.
2022-09-02 15:59:22 +03:00
Alan Orth 1f76247353
csv_metadata_quality/app.py: rework exclude/skip
Instead of processing the excludes inside the for column loop we do
it once before and then only need to check if the current column is
in the list.
2022-09-02 10:35:04 +03:00
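The rework in the commit above (parse the excludes once, then do a cheap membership check per column) can be sketched as follows. The comma-separated format of the `-x` value and both function names are assumptions made for this illustration.

```python
def parse_excludes(raw: str) -> list[str]:
    """Split the exclude option once, before looping over columns.

    Assumes a comma-separated list of field names.
    """
    return [name.strip() for name in raw.split(",") if name.strip()]


def columns_to_check(columns: list[str], excludes: list[str]) -> list[str]:
    """Keep only the columns the user did not ask to skip."""
    return [column for column in columns if column not in excludes]


excludes = parse_excludes("dc.title, dcterms.abstract")
for column in columns_to_check(["dc.title", "dcterms.issued"], excludes):
    print(f"Checking {column}")
```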
Alan Orth 117c6ca85d
csv_metadata_quality/check.py: missing region fixes
Port over the recent fixes and logic improvements to regions from
fix.py.
2022-09-01 16:38:35 +03:00
Alan Orth f49214fa2e
csv_metadata_quality/fix.py: fix bug in regions
We need to make sure we're only manipulating the regions if we have
any missing. The previous code was always manipulating the existing
row, even when there were no missing regions, which resulted in new
values like "Eastern Africa||".
2022-09-01 16:15:32 +03:00
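The bug described in the commit above is easy to reproduce in isolation. In this sketch both functions are hypothetical, the `||` separator is taken from the example value in the commit message, and `existing` is assumed to be a non-empty string.

```python
SEPARATOR = "||"


def naive_append(existing: str, missing: list[str]) -> str:
    """Always appends, which leaves a dangling separator when nothing is missing."""
    return existing + SEPARATOR + SEPARATOR.join(missing)


def add_missing_regions(existing: str, missing: list[str]) -> str:
    """Only touch the existing value when there is something to add."""
    if not missing:
        return existing
    return SEPARATOR.join([existing, *missing])


print(naive_append("Eastern Africa", []))
print(add_missing_regions("Eastern Africa", []))
```

The first call produces the stray `Eastern Africa||` value, while the guarded version returns the row unchanged.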
Alan Orth 7ce20726d0
csv_metadata_quality/fix.py: minor change
Print missing regions when we know they are missing, instead of
doing another check later and looping over them again.
2022-09-01 16:03:49 +03:00
Alan Orth 473be5ac2f
csv_metadata_quality/fix.py: don't add "not found" region
country_converter returns the literal "not found" string if a
country cannot be found. In that case we do not want to consider that
as a region!
2022-09-01 15:46:21 +03:00
Alan Orth 7c61cae417
csv_metadata_quality/fix.py: silence warning
By default country_converter prints "not found in regex" if a
country is not found. We can silence this by switching the logging
level to something above WARNING.
2022-09-01 15:44:50 +03:00
Alan Orth ae16289637
csv_metadata_quality/fix.py: Minor change
The country_converter documentation says we should instantiate the
CountryConverter() class once instead of calling coco.convert() in
each iteration of the loop so we don't end up loading the data file
more than once.
2022-09-01 15:40:45 +03:00
Alan Orth 0cf0bc97f0
csv_metadata_quality/fix.py: fix logic error again
It seems there was another logic error raised by the test in pytest.
With my real data, it was enough to check if the region column was
None, but with my test I was explicitly setting the region to "" (an
empty string). So to be really sure we should check if the string
is not None *and* if its length is greater than 0.
2022-08-03 20:51:14 +03:00
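The stricter check from the commit above (not None and longer than zero) can be written as a small predicate. `has_value` is a hypothetical name used only for this sketch.

```python
from typing import Optional


def has_value(field: Optional[str]) -> bool:
    """True only for a string that is present and non-empty."""
    return field is not None and len(field) > 0


for candidate in (None, "", "Eastern Africa"):
    print(repr(candidate), has_value(candidate))
```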
Alan Orth 40c3585bab
csv_metadata_quality/fix.py: fix logic error
Fix string concatenation with existing regions.
2022-08-03 18:26:08 +03:00
Alan Orth b9c44aed7d
csv_metadata_quality/fix.py: fix logic issue
Forgot to return the row as-is if we don't find any countries.
2022-08-02 10:17:30 +03:00
Alan Orth 689ee184f7
Add unsafe check to add missing regions
2022-07-28 16:52:43 +03:00
Alan Orth a7727b8431
Add support for dropping invalid AGROVOC terms
Requires --agrovoc-fields <field.name> to do the actual validation,
and -d to drop invalid ones.
2021-12-23 12:43:55 +02:00
Alan Orth 7763a021c5
csv_metadata_quality/fix.py: sort imports with isort
2021-12-15 23:15:02 +02:00
Alan Orth e4faf114dc
csv_metadata_quality/util.py: update for ftfy 6.0
The sequence_weirdness() heuristic is deprecated. Now we should use
is_bad().

See: https://ftfy.readthedocs.io/en/v6.0/heuristic.html
See: https://github.com/rspeer/python-ftfy/blob/master/CHANGELOG.md#version-60-april-2-2021
2021-12-15 21:58:07 +02:00
Alan Orth ff49a80432
csv_metadata_quality/fix.py: configure ftfy
Don't replace smart quotes in ftfy. If our text has them we should
keep them.
2021-12-15 21:51:51 +02:00
Alan Orth e7322efadd
csv_metadata_quality/app.py: move unnecessary Unicode fix
We actually want to do this after we try to fix mojibake with ftfy.
These "unnecessary" Unicode characters could actually help ftfy in
some cases because they often indicate that some character
from another encoding was there before (like an accent, dash, or
smart quote).
2021-12-15 13:53:25 +02:00
Alan Orth 95015febbd
csv_metadata_quality/fix.py: fix thin spaces
Replace thin spaces with normal spaces. Sometimes I see these get
mishandled on Windows machines and they end up as "?" or so.
2021-12-09 23:22:53 +02:00
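A minimal sketch of the thin space fix from the commit above, assuming "thin space" means U+2009 THIN SPACE. The function name is hypothetical.

```python
# U+2009 THIN SPACE, which some tools render as "?".
THIN_SPACE = "\u2009"


def replace_thin_spaces(text: str) -> str:
    """Swap thin spaces for ordinary ASCII spaces."""
    return text.replace(THIN_SPACE, " ")


print(replace_thin_spaces("10\u2009kg"))
```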
Alan Orth 9905e183ea
Bump version to 0.6.0-dev
2021-12-09 23:21:30 +02:00
Alan Orth cc34db7ff8
Version 0.5.0
2021-12-08 15:29:46 +02:00
Alan Orth ccc2a73456
Add check for countries without matching regions
If we have country "Kenya" we should have region "Eastern Africa"
according to the UN M.49 geolocation scheme.
2021-12-08 15:02:20 +02:00
Alan Orth 4d5696c4cb
csv_metadata_quality/check.py: update title in citation check
Initialize the titles and citations before the for loop so we can
access them later. This makes it easier to check if the item
actually has a citation.
2021-12-05 16:21:44 +02:00
Alan Orth 3b40a68279
Add check for title in citation
This checks if the item title exists in the citation. If it is not
present it could just be missing, or could have minor differences
in the whitespace, accents, etc.
2021-12-05 15:52:42 +02:00
Alan Orth 999cc65097
csv_metadata_quality/app.py: adjust mojibake check
If unsafe fixes (-u) are enabled then we don't need to do the check
first before actually fixing them. Doing the check first creates
extra output that needs to be reviewed by the user.
2021-12-05 15:18:35 +02:00
Alan Orth 787fa9e8d9
Add field name to fix.newlines output
2021-10-08 14:36:43 +03:00