csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-14 23:48:15 +02:00

Author	SHA1	Message	Date
Alan Orth	8db1e36a6d	csv_metadata_quality/app.py: skip abstract in separator check Also skip abstract in the separator check, since it's rare to have any "|" here, but more likely that if one is present then it's for a reason.	2023-02-13 10:37:33 +03:00
Alan Orth	fbb625be5c	Ignore common non-SPDX licenses This is meant to catch licenses that are supposed to be SPDX but aren't, not licenses that aren't supposed to be SPDX. We have so many free-text license descriptions like "Copyrighted" and "Other" that I'm sick of seeing warnings for them!	2023-02-07 17:01:56 +03:00
Alan Orth	545bb8cd0c	csv_metadata_quality/app.py: disable whitespace on abstracts It's too aggressive on abstracts. If people paste in text from a PDF there are often newlines, and most of the time this is what they want.	2023-02-07 16:48:40 +03:00
Alan Orth	3596381d03	csv_metadata_quality/app.py: separators fix Don't run the invalid separators fix on title fields because some items use "|" in the title to indicate something like a subtitle. For example: Progress Review and Work Planning Meeting | Day 1	2023-01-24 14:13:55 +03:00
Alan Orth	7cc49b500d	Use licenses.json from SPDX instead of spdx-license-list spdx-license-list has been deprecated[1] and already has outdated information compared to recent SPDX data releases. Now I use the JSON license data directly from SPDX[2] (currently version 3.19). The JSON file is loaded from the package's data directory using Python 3's stdlib functions from importlib[3], though we now need Python 3.9 as a minimum for importlib.resources.files[4]. Also note that the data directory is not properly packaged via setuptools, so this only works for local installs, and not via versions published to pypi, for example (I'm currently not doing this anyways). If I want to publish this in the future I will need to modify setup.py/pyproject.toml to include the data files. [1] https://gitlab.com/uniqx/spdx-license-list [2] https://github.com/spdx/license-list-data/blob/main/json/licenses.json [3] https://copdips.com/2022/09/adding-data-files-to-python-package-with-setup-py.html [4] https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files	2022-12-13 10:39:17 +03:00
Alan Orth	051777bcec	Ignore subregion field for missing region checks Due to a sloppy regex I was sometimes matching the subregion field when checking for missing UN M.49 regions in the region field.	2022-12-07 23:18:47 +01:00
Alan Orth	141b2e1da3	csv_metadata_quality/check.py: update region output Add the country to the message about missing regions. This makes it easier to see which country is triggering the missing region error, and helps in case of debugging possible mistakes in the data coming from the country_converter library.	2022-11-28 17:40:27 +03:00
Alan Orth	58b7b6e9d8	Version 0.6.0	2022-09-02 16:35:58 +03:00
Alan Orth	566c2b45cf	Remove Excel support I never used this and it seems xlrd doesn't even support .xlsx anymore anyways. If this was needed I could theoretically use openpyxl but I'd rather just stick to CSV.	2022-09-02 16:14:24 +03:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	1f76247353	csv_metadata_quality/app.py: rework exclude/skip Instead of processing the excludes inside the for column loop we do it once before and then only need to check if the current column is in the list.	2022-09-02 10:35:04 +03:00
Alan Orth	117c6ca85d	csv_metadata_quality/check.py: missing region fixes Port over the recent fixes and logic improvements to regions from fix.py.	2022-09-01 16:38:35 +03:00
Alan Orth	f49214fa2e	csv_metadata_quality/fix.py: fix bug in regions We need to make sure we're only manipulating the regions if we have any missing. The previous code was always manipulating the existing row, even when there were no missing regions, which resulted in new values like "Eastern Africa||".	2022-09-01 16:15:32 +03:00
Alan Orth	7ce20726d0	csv_metadata_quality/fix.py: minor change Print missing regions when we know they are missing, instead of doing another check later and looping over them again.	2022-09-01 16:03:49 +03:00
Alan Orth	473be5ac2f	csv_metadata_quality/fix.py: don't add "not found" region country_converter returns the literal "not found" string if a country cannot be found. In that case we do not want to consider that as a region!	2022-09-01 15:46:21 +03:00
Alan Orth	7c61cae417	csv_metadata_quality/fix.py: silence warning By default country_converter prints "not found in regex" if a country is not found. We can silence this by switching the logging level to something above WARNING.	2022-09-01 15:44:50 +03:00
Alan Orth	ae16289637	csv_metadata_quality/fix.py: Minor change The country_converter documentation says we should instantiate the CountryConverter() class once instead of calling coco.convert() in each iteration of the loop so we don't end up loading the data file more than once.	2022-09-01 15:40:45 +03:00
Alan Orth	0cf0bc97f0	csv_metadata_quality/fix.py: fix logic error again It seems there was another logic error raised by the test in pytest. With my real data, it was enough to check if the region column was None, but with my test I was explicitly setting the region to "" (an empty string). So to be really sure we should check if the string is not None and if its length is greater than 0.	2022-08-03 20:51:14 +03:00
Alan Orth	40c3585bab	csv_metadata_quality/fix.py: fix logic error Fix string concatenation with existing regions.	2022-08-03 18:26:08 +03:00
Alan Orth	b9c44aed7d	csv_metadata_quality/fix.py: fix logic issue Forgot to return the row as-is if we don't find any countries.	2022-08-02 10:17:30 +03:00
Alan Orth	689ee184f7	Add unsafe check to add missing regions	2022-07-28 16:52:43 +03:00
Alan Orth	a7727b8431	Add support for dropping invalid AGROVOC terms Requires --agrovoc-fields <field.name> to do the actual validation, and -d to drop invalid ones.	2021-12-23 12:43:55 +02:00
Alan Orth	7763a021c5	csv_metadata_quality/fix.py: sort imports with isort	2021-12-15 23:15:02 +02:00
Alan Orth	e4faf114dc	csv_metadata_quality/util.py: update for ftfy 6.0 The sequence_weirdness() heuristic is deprecated. Now we should use is_bad(). See: https://ftfy.readthedocs.io/en/v6.0/heuristic.html See: https://github.com/rspeer/python-ftfy/blob/master/CHANGELOG.md#version-60-april-2-2021	2021-12-15 21:58:07 +02:00
Alan Orth	ff49a80432	csv_metadata_quality/fix.py: configure ftfy Don't replace smart quotes in ftfy. If our text has them we should keep them.	2021-12-15 21:51:51 +02:00
Alan Orth	e7322efadd	csv_metadata_quality/app.py: move unnecessary Unicode fix We actually want to do this after we try to fix mojibake with ftfy. These "unnecessary" Unicode characters could actually help ftfy in some cases because often times they indicate that some character from another encoding was there before (like an accent, dash, or smart quote).	2021-12-15 13:53:25 +02:00
Alan Orth	95015febbd	csv_metadata_quality/fix.py: fix thin spaces Replace thin spaces with normal spaces. Sometimes I see these get mishandled on Windows machines and they end up as "?" or so.	2021-12-09 23:22:53 +02:00
Alan Orth	9905e183ea	Bump version to 0.6.0-dev	2021-12-09 23:21:30 +02:00
Alan Orth	cc34db7ff8	Version 0.5.0	2021-12-08 15:29:46 +02:00
Alan Orth	ccc2a73456	Add check for countries without matching regions If we have country "Kenya" we should have region "Eastern Africa" according to the UN M.49 geolocation scheme.	2021-12-08 15:02:20 +02:00
Alan Orth	4d5696c4cb	csv_metadata_quality/check.py: update title in citation check Initialize the titles and citations before the for loop so we can access them later. This makes it easier to check if the item actually has a citation.	2021-12-05 16:21:44 +02:00
Alan Orth	3b40a68279	Add check for title in citation This checks if the item title exists in the citation. If it is not present it could just be missing, or could have minor differences in the whitespace, accents, etc.	2021-12-05 15:52:42 +02:00
Alan Orth	999cc65097	csv_metadata_quality/app.py: adjust mojibake check If unsafe fixes (-u) are enabled then we don't need to do the check first before actually fixing them. Doing the check first creates extra output that needs to be reviewed by the user.	2021-12-05 15:18:35 +02:00
Alan Orth	787fa9e8d9	Add field name to fix.newlines output	2021-10-08 14:36:43 +03:00
Alan Orth	8a27fb2589	Add check for missing DOIs Sometimes an editor includes a DOI in the citation field, but does not add a standalone DOI field.	2021-10-06 21:25:39 +03:00
Alan Orth	6ba16d5d4c	csv_metadata_quality/check.py: Fix duplicate checker Fix the incorrect type field regex, and improve the title regex to consider dcterms.title and dc.title (along with the DSpace language variants like dc.title[en_US]), but ignore dc.title.alternative. See: https://regex101.com/r/I4m06F/1	2021-10-06 19:32:40 +03:00
Alan Orth	54ab869297	csv_metadata_quality/experimental.py: Adjust citation match We need to match both of these citation fields: - dc.identifier.citation - dcterms.bibliographicCitation	2021-10-06 16:13:10 +03:00
Alan Orth	4e2eab68b0	Update requests-cache Apparently we were stuck on an older version of requests-cache due to the fact that we were using the caret, which will never update the left-most (major) version. Upstream requests-cache is currently version 0.6.4, and there seem to have been some changes to the API.	2021-07-06 15:24:39 +03:00
Alan Orth	a8fe623f4c	csv_metadata_quality/check.py: Remove unnecessary pass LGTM warned that these pass statements are not necessary. See: https://lgtm.com/rules/910088/	2021-04-20 08:20:13 +03:00
Alan Orth	bd8943f36a	csv_metadata_quality/app.py: Don't crash if fields are missing We don't need to crash if someone feeds us a CSV file that is missing common DSpace fields like title, type, and subject.	2021-03-21 19:47:29 +02:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	8eddb76aab	Bump version to 0.4.8-dev	2021-03-19 11:53:56 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	f816e17fe7	Version 0.4.7	2021-03-17 10:00:34 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	ab3af2ec62	csv_metadata_quality/check.py: Reformat with black	2021-03-16 16:12:33 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them; I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	c9c277f8df	csv_metadata_quality/app.py: Update help text Use DCTERMS fields where possible.	2021-03-14 10:52:58 +02:00
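
Commit 7cc49b500d describes loading SPDX's licenses.json from the package's data directory with importlib.resources.files (Python 3.9+). The following is a minimal sketch of that pattern, not the project's actual code: the package name demo_pkg, the data/licenses.json layout, and both helper functions are assumptions made for illustration, and the sample JSON is a trimmed stand-in for the real SPDX file.

```python
import importlib
import json
import sys
import tempfile
from importlib.resources import files  # requires Python 3.9 or newer
from pathlib import Path


def build_demo_package(root: Path, name: str = "demo_pkg") -> str:
    """Create a throwaway package with a data/licenses.json file (hypothetical layout)."""
    data_dir = root / name / "data"
    data_dir.mkdir(parents=True)
    (root / name / "__init__.py").write_text("")
    # Trimmed stand-in for SPDX's licenses.json, which has a top-level "licenses" list.
    sample = {
        "licenseListVersion": "3.19",
        "licenses": [{"licenseId": "MIT"}, {"licenseId": "GPL-3.0-only"}],
    }
    (data_dir / "licenses.json").write_text(json.dumps(sample), encoding="utf-8")
    return name


def load_spdx_ids(package: str) -> set:
    """Read the license identifiers from the package's bundled JSON file."""
    resource = files(package) / "data" / "licenses.json"
    licenses = json.loads(resource.read_text(encoding="utf-8"))["licenses"]
    return {entry["licenseId"] for entry in licenses}


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)  # make the throwaway package importable
    try:
        package_name = build_demo_package(Path(tmp))
        importlib.invalidate_caches()
        ids = load_spdx_ids(package_name)
    finally:
        sys.path.remove(tmp)

print(sorted(ids))
```

As the commit notes, this only works when the data directory is actually installed alongside the package, so a published distribution would also need the data files declared in setup.py or pyproject.toml.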

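The region fixes in commits b9c44aed7d through f49214fa2e share one idea: only touch the region value when something is actually missing. The sketch below illustrates that logic under stated assumptions. The function name, the "||" separator constant, and the small COUNTRY_TO_REGION lookup are hypothetical; the real tool resolves regions through the country_converter library rather than a dictionary.

```python
from typing import Optional

SEPARATOR = "||"  # assumed multi-value separator

# Hypothetical lookup standing in for country_converter's UN M.49 data.
COUNTRY_TO_REGION = {
    "Kenya": "Eastern Africa",
    "Uganda": "Eastern Africa",
    "Peru": "South America",
}


def add_missing_regions(countries: Optional[str], regions: Optional[str]) -> Optional[str]:
    """Return the region value with any missing regions appended."""
    # No countries: return the value as-is.
    if countries is None or len(countries) == 0:
        return regions

    # Treat both None and "" as "no regions yet".
    if regions is not None and len(regions) > 0:
        existing = regions.split(SEPARATOR)
    else:
        existing = []

    missing = []
    for country in countries.split(SEPARATOR):
        region = COUNTRY_TO_REGION.get(country, "not found")
        # Never treat the literal "not found" marker as a region.
        if region == "not found":
            continue
        if region not in existing and region not in missing:
            missing.append(region)

    # Nothing missing: leave the original value untouched, which avoids
    # producing values with a dangling separator.
    if not missing:
        return regions

    return SEPARATOR.join(existing + missing)


print(add_missing_regions("Kenya||Peru", "Eastern Africa"))
```

Returning early when nothing is missing is what prevents a value such as "Eastern Africa" from being rewritten with a trailing separator.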

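The mojibake described in commit 898bb412c3 can be reproduced with the standard library alone. This is a simplified sketch, not the project's implementation: it skips the ftfy "weirdness" heuristic that the commit relies on and simply attempts the CP-1252 round trip, and both function names are made up for illustration.

```python
def simulate_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as CP-1252 (what the commit describes)."""
    return text.encode("utf-8").decode("cp1252")


def try_fix_mojibake(text: str) -> str:
    """Reverse the mistake if the string round-trips; otherwise return it unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP-1252, or not valid UTF-8 afterwards:
        # the text was probably fine, so leave it alone.
        return text


broken = simulate_mojibake("CIAT Publicación")
print(broken)                    # CIAT PublicaciÃ³n
print(try_fix_mojibake(broken))  # CIAT Publicación
```

Without a heuristic such as ftfy's in front of it, this round trip could in principle alter text that was already correct, which is one reason to treat the fix as unsafe.
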
135 Commits
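
Commit 9f2dc0a0f5 uses the title, type, and date issued as a sort of "key" for spotting duplicate items. A minimal sketch of that approach follows; the function name and the field names dcterms.title, dcterms.type, and dcterms.issued are assumptions, and the real check works on a DataFrame rather than a list of dictionaries.

```python
from typing import Dict, List, Set, Tuple


def find_duplicates(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return items whose (title, type, date issued) key was already seen."""
    seen: Set[Tuple] = set()
    duplicates = []
    for item in items:
        # Field names are assumed for illustration.
        key = (
            item.get("dcterms.title"),
            item.get("dcterms.type"),
            item.get("dcterms.issued"),
        )
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"dcterms.title": "Annual report", "dcterms.type": "Report", "dcterms.issued": "2020"},
    {"dcterms.title": "Annual report", "dcterms.type": "Report", "dcterms.issued": "2021"},
    {"dcterms.title": "Annual report", "dcterms.type": "Report", "dcterms.issued": "2020"},
]
for duplicate in find_duplicates(rows):
    print("Possible duplicate:", duplicate["dcterms.title"], duplicate["dcterms.issued"])
```

Including the date in the key keeps recurring titles, such as annual reports from different years, from being flagged.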