csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-10 00:05:51 +01:00

Author	SHA1	Message	Date
Alan Orth	a21ffb0fa8	Use py3langid instead of langid Faster and more modern code for Python 3 as a drop-in replacement. See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html	2023-12-28 14:11:21 +03:00
Alan Orth	f6018c51b6	Apply fixes from fixit Apply recommended fix from fixit: RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must be looked up in the global scope in case it has been rebound.	2023-11-22 21:54:50 +03:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	54ab869297	csv_metadata_quality/experimental.py: Adjust citation match We need to match both of these citation fields: - dc.identifier.citation - dcterms.bibliographicCitation	2021-10-06 16:13:10 +03:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
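
Commit 54ab869297 says the citation check must match both dc.identifier.citation and dcterms.bibliographicCitation. A minimal sketch of such a match follows. The pattern, the helper name is_citation_field, and the optional trailing qualifier such as "[en_US]" are assumptions for illustration, not the code in experimental.py.

```python
import re

# Hypothetical pattern (not taken from the repository): accept either
# citation field name, optionally followed by a bracketed qualifier.
CITATION_FIELD = re.compile(
    r"^(dc\.identifier\.citation|dcterms\.bibliographicCitation)(\[.*\])?$"
)


def is_citation_field(label: str) -> bool:
    """Return True if a column label names one of the two citation fields."""
    return CITATION_FIELD.match(label) is not None
```

Escaping the dots keeps the pattern from matching unrelated labels that merely share the same letters.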

9 Commits
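
The fixit rule quoted in commit f6018c51b6 (RewriteToLiteral) can be illustrated with the standard dis module. This is a sketch, not code from the repository: the function names with_call, with_literal, and opnames are hypothetical. It shows that list() requires a global name lookup, while the empty literal compiles directly to a list-building instruction.

```python
import dis


def with_call():
    # The name "list" must be looked up at run time, since it may be rebound.
    return list()


def with_literal():
    # The literal needs no name lookup.
    return []


def opnames(func):
    """Collect the bytecode instruction names used by a function."""
    return {ins.opname for ins in dis.get_instructions(func)}


if __name__ == "__main__":
    print(sorted(opnames(with_call)))
    print(sorted(opnames(with_literal)))
```

The exact instruction set varies between Python versions, so the sketch prints the names rather than assuming a fixed listing.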