csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-10 21:57:02 +02:00

Author	SHA1	Message	Date
Alan Orth	92ff0ee51b	Normalize DOIs with %2f These seem to be incorrectly URL encoded.	2024-06-25 11:54:09 +03:00
Alan Orth	ae38a826ec	csv_metadata_quality/fix.py: minor update to DOI fix Normalize www.doi.org to doi.org in DOI field.	2024-06-25 11:48:45 +03:00
Alan Orth	5be2195325	Add fix for normalizing DOIs	2024-04-25 12:49:19 +03:00
Alan Orth	736948ed2c	csv_metadata_quality/check.py: run rye fmt	2024-04-12 13:40:55 +03:00
Alan Orth	ee0b448355	csv_metadata_quality/check.py: remove unused import	2024-04-12 11:07:36 +03:00
Alan Orth	d5c25f82fa	Update SPDX license list From: https://github.com/spdx/license-list-data/blob/main/json/licenses.json	2024-03-02 10:38:27 +03:00
Alan Orth	a21ffb0fa8	Use py3langid instead of langid Faster and more modern code for Python 3 as a drop-in replacement. See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html	2023-12-28 14:11:21 +03:00
Alan Orth	f6018c51b6	Apply fixes from fixit Apply recommended fix from fixit: RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must be looked up in the global scope in case it has been rebound.	2023-11-22 21:54:50 +03:00
Alan Orth	1f637f32cd	Rework requests-cache We should only be running this once per invocation, not for every row we check. This should be more efficient, but it means that we don't cache responses when running via pytest, which is actually probably a good thing.	2023-10-15 23:37:38 +03:00
Alan Orth	b8dc19cc3f	csv_metadata_quality/check.py: enable requests-cache This was disabled at some point. We also need to use the new delete method instead.	2023-10-15 23:21:58 +03:00
Alan Orth	93c9b739ac	csv_metadata_quality/check.py: use HTTPS Use HTTPS for AGROVOC REST API.	2023-10-15 22:38:45 +03:00
Alan Orth	d21d2621e3	csv_metadata_quality/app.py: read fields as strings I suspect this undermines the PyArrow backend performance gains in recent Pandas 2.0.0, but we are dealing with messy data sometimes and we must rely on data being strings.	2023-06-12 10:42:50 +03:00
Alan Orth	f3fb1ff7fb	Don't crash when title is missing We shouldn't crash the country/region checker/fixer when the title field is missing, since we only use it to show status to the user.	2023-06-12 10:42:50 +03:00
Alan Orth	e2d46e9495	csv_metadata_quality/app.py: skip newline fix on description The description field often has free-form text like the abstract and there are too many legitimate newlines here to be correcting them automatically.	2023-04-22 12:16:13 -07:00
Alan Orth	1491e1edb0	Fix path to data/licenses.json When we install and run this from CI, this file needs to exist in the package's folder inside site-packages. Then we can use __file__ to get the path relative to the package. See: https://python-packaging.readthedocs.io/en/latest/non-code-files.html	2023-04-05 15:28:21 +03:00
Alan Orth	d4aed378cf	Switch to pandas 2.0.0rc1 Seems to work fine with the new PyArrow datatypes.	2023-03-22 12:16:56 +03:00
Alan Orth	d661ffe439	Check comma space on bibliographicCitation too The regex was only matching `dc.identifier.citation`, but we need to match `dcterms.bibliographicCitation` too.	2023-03-10 16:13:16 +03:00
Alan Orth	45a310387a	Don't fix multi-value separators on citations	2023-03-10 16:12:30 +03:00
Alan Orth	fdccdf7318	Version 0.6.1	2023-02-23 13:46:56 +03:00
Alan Orth	53fdb50906	csv_metadata_quality/check.py: run black	2023-02-18 22:10:04 +03:00
Alan Orth	8bc4cd419c	Strip filename descriptions before checking When checking for uncommon file extensions in the filename field we should strip descriptions that are meant for SAF Bundler, for example: Annual_Report_2020.pdf__description:Report. This ends up as a false positive that spams the output with warnings.	2023-02-13 11:00:57 +03:00
Alan Orth	8db1e36a6d	csv_metadata_quality/app.py: skip abstract in separator check Also skip abstract in the separator check, since it's rare to have any "|" here, but more likely that if one is present then it's for a reason.	2023-02-13 10:37:33 +03:00
Alan Orth	fbb625be5c	Ignore common non-SPDX licenses This is meant to catch licenses that are supposed to be SPDX but aren't, not licenses that aren't supposed to be SPDX. We have so many free-text license descriptions like "Copyrighted" and "Other" that I'm sick of seeing warnings for them!	2023-02-07 17:01:56 +03:00
Alan Orth	545bb8cd0c	csv_metadata_quality/app.py: disable whitespace on abstracts It's too aggressive on abstracts. If people paste in text from a PDF there are often newlines, and most of the time this is what they want.	2023-02-07 16:48:40 +03:00
Alan Orth	3596381d03	csv_metadata_quality/app.py: separators fix Don't run the invalid separators fix on title fields because some items use "|" in the title to indicate something like a subtitle. For example: Progress Review and Work Planning Meeting | Day 1	2023-01-24 14:13:55 +03:00
Alan Orth	7cc49b500d	Use licenses.json from SPDX instead of spdx-license-list spdx-license-list has been deprecated[1] and already has outdated information compared to recent SPDX data releases. Now I use the JSON license data directly from SPDX[2] (currently version 3.19). The JSON file is loaded from the package's data directory using Python 3's stdlib functions from importlib[3], though we now need Python 3.9 as a minimum for importlib.resources.files[4]. Also note that the data directory is not properly packaged via setuptools, so this only works for local installs, and not via versions published to pypi, for example (I'm currently not doing this anyways). If I want to publish this in the future I will need to modify setup.py/pyproject.toml to include the data files. [1] https://gitlab.com/uniqx/spdx-license-list [2] https://github.com/spdx/license-list-data/blob/main/json/licenses.json [3] https://copdips.com/2022/09/adding-data-files-to-python-package-with-setup-py.html [4] https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files	2022-12-13 10:39:17 +03:00
Alan Orth	051777bcec	Ignore subregion field for missing region checks Due to a sloppy regex I was sometimes matching the subregion field when checking for missing UN M.49 regions in the region field.	2022-12-07 23:18:47 +01:00
Alan Orth	141b2e1da3	csv_metadata_quality/check.py: update region output Add the country to the message about missing regions. This makes it easier to see which country is triggering the missing region error, and helps in case of debugging possible mistakes in the data coming from the country_converter library.	2022-11-28 17:40:27 +03:00
Alan Orth	58b7b6e9d8	Version 0.6.0	2022-09-02 16:35:58 +03:00
Alan Orth	566c2b45cf	Remove Excel support I never used this and it seems xlrd doesn't even support .xlsx anymore anyways. If this was needed I could theoretically use openpyxl but I'd rather just stick to CSV.	2022-09-02 16:14:24 +03:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	1f76247353	csv_metadata_quality/app.py: rework exclude/skip Instead of processing the excludes inside the for column loop we do it once before and then only need to check if the current column is in the list.	2022-09-02 10:35:04 +03:00
Alan Orth	117c6ca85d	csv_metadata_quality/check.py: missing region fixes Port over the recent fixes and logic improvements to regions from fix.py.	2022-09-01 16:38:35 +03:00
Alan Orth	f49214fa2e	csv_metadata_quality/fix.py: fix bug in regions We need to make sure we're only manipulating the regions if we have any missing. The previous code was always manipulating the existing row, even when there were no missing regions, which resulted in new values like "Eastern Africa||".	2022-09-01 16:15:32 +03:00
Alan Orth	7ce20726d0	csv_metadata_quality/fix.py: minor change Print missing regions when we know they are missing, instead of doing another check later and looping over them again.	2022-09-01 16:03:49 +03:00
Alan Orth	473be5ac2f	csv_metadata_quality/fix.py: don't add "not found" region country_converter returns the literal "not found" string if a country cannot be found. In that case we do not want to consider that as a region!	2022-09-01 15:46:21 +03:00
Alan Orth	7c61cae417	csv_metadata_quality/fix.py: silence warning By default country_converter prints "not found in regex" if a country is not found. We can silence this by switching the logging level to something above WARNING.	2022-09-01 15:44:50 +03:00
Alan Orth	ae16289637	csv_metadata_quality/fix.py: Minor change The country_converter documentation says we should instantiate the CountryConverter() class once instead of calling coco.convert() in each iteration of the loop so we don't end up loading the data file more than once.	2022-09-01 15:40:45 +03:00
Alan Orth	0cf0bc97f0	csv_metadata_quality/fix.py: fix logic error again It seems there was another logic error raised by the test in pytest. With my real data, it was enough to check if the region column was None, but with my test I was explicitly setting the region to "" (an empty string). So to be really sure we should check if the string is not None and if its length is greater than 0.	2022-08-03 20:51:14 +03:00
Alan Orth	40c3585bab	csv_metadata_quality/fix.py: fix logic error Fix string concatenation with existing regions.	2022-08-03 18:26:08 +03:00
Alan Orth	b9c44aed7d	csv_metadata_quality/fix.py: fix logic issue Forgot to return the row as-is if we don't find any countries.	2022-08-02 10:17:30 +03:00
Alan Orth	689ee184f7	Add unsafe check to add missing regions	2022-07-28 16:52:43 +03:00
Alan Orth	a7727b8431	Add support for dropping invalid AGROVOC terms Requires --agrovoc-fields <field.name> to do the actual validation, and -d to drop invalid ones.	2021-12-23 12:43:55 +02:00
Alan Orth	7763a021c5	csv_metadata_quality/fix.py: sort imports with isort	2021-12-15 23:15:02 +02:00
Alan Orth	e4faf114dc	csv_metadata_quality/util.py: update for ftfy 6.0 The sequence_weirdness() heuristic is deprecated. Now we should use is_bad(). See: https://ftfy.readthedocs.io/en/v6.0/heuristic.html See: https://github.com/rspeer/python-ftfy/blob/master/CHANGELOG.md#version-60-april-2-2021	2021-12-15 21:58:07 +02:00
Alan Orth	ff49a80432	csv_metadata_quality/fix.py: configure ftfy Don't replace smart quotes in ftfy. If our text has them we should keep them.	2021-12-15 21:51:51 +02:00
Alan Orth	e7322efadd	csv_metadata_quality/app.py: move unnecessary Unicode fix We actually want to do this after we try to fix mojibake with ftfy. These "unnecessary" Unicode characters could actually help ftfy in some cases because often times they indicate that some character from another encoding was there before (like an accent, dash, or smart quote).	2021-12-15 13:53:25 +02:00
Alan Orth	95015febbd	csv_metadata_quality/fix.py: fix thin spaces Replace thin spaces with normal spaces. Sometimes I see these get mishandled on Windows machines and they end up as "?" or so.	2021-12-09 23:22:53 +02:00
Alan Orth	9905e183ea	Bump version to 0.6.0-dev	2021-12-09 23:21:30 +02:00
Alan Orth	cc34db7ff8	Version 0.5.0	2021-12-08 15:29:46 +02:00

156 Commits