1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-06-26 16:13:46 +02:00
Commit Graph

137 Commits

Author SHA1 Message Date
53fdb50906
csv_metadata_quality/check.py: run black
Some checks failed
continuous-integration/drone/push Build is failing
2023-02-18 22:10:04 +03:00
8bc4cd419c
Strip filename descriptions before checking
Some checks failed
continuous-integration/drone/push Build is failing
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
2023-02-13 11:00:57 +03:00
8db1e36a6d
csv_metadata_quality/app.py: skip abstract in separator check
Also skip abstract in the separator check, since it's rare to have
any "|" here, but more likely that if one is present then it's for
a reason.
2023-02-13 10:37:33 +03:00
fbb625be5c
Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
545bb8cd0c
csv_metadata_quality/app.py: disable whitespace on abstracts
It's too aggressive on abstracts. If people paste in text from a
PDF there are often newlines, and most of the time this is what
they want.
2023-02-07 16:48:40 +03:00
3596381d03
csv_metadata_quality/app.py: separators fix
Don't run the invalid separators fix on title fields because some
items use "|" in the title to indicate something like a subtitle.

For example:

    Progress Review and Work Planning Meeting | Day 1
2023-01-24 14:13:55 +03:00
7cc49b500d
Use licenses.json from SPDX instead of spdx-license-list
spdx-license-list has been deprecated[1] and already has outdated
information compared to recent SPDX data releases. Now I use the
JSON license data directly from SPDX[2] (currently version 3.19).

The JSON file is loaded from the package's data directory using
Python 3's stdlib functions from importlib[3], though we now need
Python 3.9 as a minimum for importlib.resources.files[4].

Also note that the data directory is not properly packaged via
setuptools, so this only works for local installs, and not via
versions published to pypi, for example (I'm currently not doing
this anyways). If I want to publish this in the future I will
need to modify setup.py/pyproject.toml to include the data files.

[1] https://gitlab.com/uniqx/spdx-license-list
[2] https://github.com/spdx/license-list-data/blob/main/json/licenses.json
[3] https://copdips.com/2022/09/adding-data-files-to-python-package-with-setup-py.html
[4] https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files
2022-12-13 10:39:17 +03:00
051777bcec
Ignore subregion field for missing region checks
All checks were successful
continuous-integration/drone/push Build is passing
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
141b2e1da3
csv_metadata_quality/check.py: update region output
Add the country to the message about missing regions. This makes it
easier to see which country is triggering the missing region error,
and helps in case of debugging possible mistakes in the data coming
from the country_converter library.
2022-11-28 17:40:27 +03:00
58b7b6e9d8
Version 0.6.0
All checks were successful
continuous-integration/drone/push Build is passing
2022-09-02 16:35:58 +03:00
566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx any-
more anyways. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
040e56fc76
Improve exclude function
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.

Now the excludes are respected for item-based checks.
2022-09-02 15:59:22 +03:00
1f76247353
csv_metadata_quality/app.py: rework exclude/skip
Instead of processing the excludes inside the for column loop we do
it once before and then only need to check if the current column is
in the list.
2022-09-02 10:35:04 +03:00
117c6ca85d
csv_metadata_quality/check.py: missing region fixes
Port over the recent fixes and logic improvements to regions from
fix.py.
2022-09-01 16:38:35 +03:00
f49214fa2e
csv_metadata_quality/fix.py: fix bug in regions
We need to make sure we're only manipulating the regions if we have
any missing. The previous code was always manipulating the existing
row, even when there were no missing regions, which resulted in new
values like "Eastern Africa||".
2022-09-01 16:15:32 +03:00
7ce20726d0
csv_metadata_quality/fix.py: minor change
Print missing regions when we know they are missing, instead of do-
ing another check later and looping over them again.
2022-09-01 16:03:49 +03:00
473be5ac2f
csv_metadata_quality/fix.py: don't add "not found" region
country_converter returns the literal "not found" string if a coun-
try cannot be found. In that case we do not want to consider that as
a region!
2022-09-01 15:46:21 +03:00
7c61cae417 csv_metadata_quality/fix.py: silence warning
By default country_converter prints "not found in regex" if a coun-
try is not found. We can silence this by switching the logging lev-
el to something above WARNING.
2022-09-01 15:44:50 +03:00
ae16289637
csv_metadata_quality/fix.py: Minor change
The country_converter documentation says we should instantiate the
CountryConverter() class once instead of calling coco.convert() in
each iteration of the loop so we don't end up loading the data file
more than once.
2022-09-01 15:40:45 +03:00
0cf0bc97f0
csv_metadata_quality/fix.py: fix logic error again
All checks were successful
continuous-integration/drone/push Build is passing
It seems there was another logic error raised by the test in pytest.
With my real data, it was enough to check if the region column was
None, but with my test I was explicitly setting the region to "" (an
empty string). So to be really sure we should check if the string
is not None *and* if its length is greater than 0.
2022-08-03 20:51:14 +03:00
40c3585bab
csv_metadata_quality/fix.py: fix logic error
Fix string concatenation with existing regions.
2022-08-03 18:26:08 +03:00
b9c44aed7d
csv_metadata_quality/fix.py: fix logic issue
All checks were successful
continuous-integration/drone/push Build is passing
Forgot to return the row as-is if we don't find any countries.
2022-08-02 10:17:30 +03:00
689ee184f7
Add unsafe check to add missing regions 2022-07-28 16:52:43 +03:00
a7727b8431
Add support for dropping invalid AGROVOC terms
Requires --agrovoc-fields <field.name> to do the actual validation,
and -d to drop invalid ones.
2021-12-23 12:43:55 +02:00
7763a021c5
csv_metadata_quality/fix.py: sort imports with isort
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-15 23:15:02 +02:00
e4faf114dc
csv_metadata_quality/util.py: update for ftfy 6.0
The sequence_weirdness() heuristic is deprecated. Now we should use
is_bad().

See: https://ftfy.readthedocs.io/en/v6.0/heuristic.html
See: https://github.com/rspeer/python-ftfy/blob/master/CHANGELOG.md#version-60-april-2-2021
2021-12-15 21:58:07 +02:00
ff49a80432
csv_metadata_quality/fix.py: configure ftfy
Don't replace smart quotes in ftfy. If our text has them we should
keep them.
2021-12-15 21:51:51 +02:00
e7322efadd
csv_metadata_quality/app.py: move unnecessary Unicode fix
We actually want to do this after we try to fix mojibake with ftfy.
These "unnecessary" Unicode characters could actually help ftfy in
some cases because often times they indicate that some character
from another encoding was there before (like an accent, dash, or
smart quote).
2021-12-15 13:53:25 +02:00
95015febbd
csv_metadata_quality/fix.py: fix thin spaces
All checks were successful
continuous-integration/drone/push Build is passing
Replace thin spaces with normal spaces. Sometimes I see these get
mis handled on Windows machines and they end up as "?" or so.
2021-12-09 23:22:53 +02:00
9905e183ea
Bump version to 0.6.0-dev 2021-12-09 23:21:30 +02:00
cc34db7ff8
Version 0.5.0
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-08 15:29:46 +02:00
ccc2a73456
Add check for countries without matching regions
If we have country "Kenya" we should have region "Eastern Africa"
according to the UN M.49 geolocation scheme.
2021-12-08 15:02:20 +02:00
4d5696c4cb
csv_metadata_quality/check.py: update title in citation check
Initialize the titles and citations before the for loop so we can
access them later. This makes it easier to check if the item actua-
lly has a citation.
2021-12-05 16:21:44 +02:00
3b40a68279
Add check for title in citation
This checks if the item title exists in the citation. If it is not
present it could just be missing, or could have minor differences
in the whitespace, accents, etc.
2021-12-05 15:52:42 +02:00
999cc65097
csv_metadata_quality/app.py: adjust mojibake check
If unsafe fixes (-u) are enabled then we don't need to do the check
first before actually fixing them. Doing the check first creates e-
tra output that needs to be reviewed by the user.
2021-12-05 15:18:35 +02:00
787fa9e8d9
Add field name to fix.newlines output 2021-10-08 14:36:43 +03:00
8a27fb2589
Add check for missing DOIs
All checks were successful
continuous-integration/drone/push Build is passing
Sometimes an editor includes a DOI in the citation field, but does
not add a standalone DOI field.
2021-10-06 21:25:39 +03:00
6ba16d5d4c
csv_metadata_quality/check.py: Fix duplicate checker
Fix the incorrect type field regex, and improve the title regex to
consider dcterms.title and dc.title (along with the DSpace language
variants like dc.title[en_US]), but ignore dc.title.alternative.

See: https://regex101.com/r/I4m06F/1
2021-10-06 19:32:40 +03:00
54ab869297
csv_metadata_quality/experimental.py: Adjust citation match
We need to match both of these citation fields:

- dc.identifier.citation
- dcterms.bibliographicCitation
2021-10-06 16:13:10 +03:00
4e2eab68b0
Update requests-cache
Apparently we were stuck on an older version of requests-cache due
to the fact that we were using the caret, which will never update
the left-most (major) version. Upstream requests-cache is currently
version 0.6.4, and there seems to have been some changes to the API.
2021-07-06 15:24:39 +03:00
a8fe623f4c
csv_metadata_quality/check.py: Remove unnecessary pass
All checks were successful
continuous-integration/drone/push Build is passing
LGTM warned that these pass statements are not necessary.

See: https://lgtm.com/rules/910088/
2021-04-20 08:20:13 +03:00
bd8943f36a
csv_metadata_quality/app.py: Don't crash if fields are missing
All checks were successful
continuous-integration/drone/push Build is passing
We don't need to crash if someone feeds us a CSV file that is miss-
ing commont DSpace fields like title, type, and subject.
2021-03-21 19:47:29 +02:00
cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
8eddb76aab
Bump version to 0.4.8-dev
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 11:53:56 +02:00
898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT Publicaçao
    - CIAT Publicación

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to de-
tect whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
f816e17fe7
Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:00:34 +02:00
9f2dc0a0f5
Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
2021-03-17 09:53:07 +02:00
14010896a5
csv_metadata_quality/experimental.py: Move all imports to top of file
All checks were successful
continuous-integration/drone/push Build is passing
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
ab3af2ec62
csv_metadata_quality/check.py: Reformat with black 2021-03-16 16:12:33 +02:00
330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00