csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-20 18:21:50 +02:00

Author	SHA1	Message	Date
Alan Orth	8c59f57e76	poetry.lock: Run poetry update	2021-10-06 21:19:54 +03:00
Alan Orth	72dd3e7272	CHANGELOG.md: Add notes about regexes	2021-10-06 19:35:59 +03:00
Alan Orth	6ba16d5d4c	csv_metadata_quality/check.py: Fix duplicate checker Fix the incorrect type field regex, and improve the title regex to consider dcterms.title and dc.title (along with the DSpace language variants like dc.title[en_US]), but ignore dc.title.alternative. See: https://regex101.com/r/I4m06F/1	2021-10-06 19:32:40 +03:00
Alan Orth	81069259ba	CHANGELOG.md: Add note about bibliographicCitation	2021-10-06 16:16:51 +03:00
Alan Orth	54ab869297	csv_metadata_quality/experimental.py: Adjust citation match We need to match both of these citation fields: - dc.identifier.citation - dcterms.bibliographicCitation	2021-10-06 16:13:10 +03:00
Alan Orth	22b359c8a8	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-09-27 14:15:01 +03:00
Alan Orth	3e06788d88	poetry.lock: Run poetry update	2021-09-27 14:11:21 +03:00
Alan Orth	3c41cc283f	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-09-06 21:04:05 +03:00
Alan Orth	5741e94571	poetry.lock: Run poetry update	2021-09-06 21:03:30 +03:00
Alan Orth	215d61c188	pyproject.toml: limit SQLAlchemy to < 1.4.23 SQLAlchemy gets pulled in by csvkit's agate-sql dependency and there is currently an issue with Poetry's parsing of the SQLAlchemy 1.4.23 constraints. Temporarily explicitly install a version of SQLAlchemy that works (can remove later once Poetry fixes this). Anyways, I am not using any SQLAlchemy features that I know of. See: https://github.com/python-poetry/poetry/issues/4402	2021-09-06 21:01:09 +03:00
Alan Orth	11ddde3327	data/test.csv: Update mojibake example I was trying to find where I got this one and it seems to have been the other way around. Doesn't matter here only that I was curious.	2021-08-19 15:48:41 +03:00
Alan Orth	a347878d43	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-08-12 21:49:36 +03:00
Alan Orth	a89bc331f0	poetry.lock: Run poetry update Lots of minor dependencies updates. All tests still passing with pytest.	2021-08-12 21:47:46 +03:00
Alan Orth	af3493c724	CITATION.cff: Remove YAML formatting GitHub says it can't parse my CITATION.cff file. The example in the docs shows version 1.2.0 also, I wonder if that's relevant. See: https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files	2021-07-28 21:23:30 +03:00
Alan Orth	52644bf83e	Add CITATION.cff Created with the cffinit tool: https://citation-file-format.github.io/cff-initializer-javascript/	2021-07-28 21:11:11 +03:00
Alan Orth	c8f5539d21	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-07-06 15:47:44 +03:00
Alan Orth	382d0d6aed	Run poetry update	2021-07-06 15:37:57 +03:00
Alan Orth	b8f4be9ebb	pyproject.toml: Update pytest-clarity and black These seem to have much newer versions that didn't get updated in this project due to the version pinning selector I was using with poetry. In the case of pytest-clarity the previous version was 0.3.1 and the version selector was a caret (^), which will never update the left-most (major) number. Now they seem to be on 1.x.x so it will be OK in the future. In the case of black, they use weird numbering so it's anyone's guess how this will work! Luckily it's only used for linting and formatting.	2021-07-06 15:30:41 +03:00
Alan Orth	4e2eab68b0	Update requests-cache Apparently we were stuck on an older version of requests-cache due to the fact that we were using the caret, which will never update the left-most (major) version. Upstream requests-cache is currently version 0.6.4, and there seems to have been some changes to the API.	2021-07-06 15:24:39 +03:00
Alan Orth	55165cb4ce	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-06-14 12:52:47 +03:00
Alan Orth	93d3eabfba	poetry.lock: Run poetry update	2021-06-14 12:52:28 +03:00
Alan Orth	a8fe623f4c	csv_metadata_quality/check.py: Remove unnecessary pass LGTM warned that these pass statements are not necessary. See: https://lgtm.com/rules/910088/	2021-04-20 08:20:13 +03:00
Alan Orth	dbc0437d59	CHANGELOG.md: Add note about Python deps	2021-04-14 16:16:02 +03:00
Alan Orth	96ce1daa90	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-04-14 16:15:28 +03:00
Alan Orth	3adb52d7c0	poetry.lock: Run poetry update	2021-04-14 16:14:37 +03:00
Alan Orth	f958d1879f	poetry.lock: Run poetry update	2021-04-02 16:19:16 +03:00
Alan Orth	bd8943f36a	csv_metadata_quality/app.py: Don't crash if fields are missing We don't need to crash if someone feeds us a CSV file that is missing common DSpace fields like title, type, and subject.	2021-03-21 19:47:29 +02:00
Alan Orth	28f9026286	README.md: Minor edit	2021-03-19 16:26:31 +02:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	8eddb76aab	Bump version to 0.4.8-dev	2021-03-19 11:53:56 +02:00
Alan Orth	a04dbc50db	Add notes about checking and fixing mojibake	2021-03-19 11:48:27 +02:00
Alan Orth	28335ed159	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-03-19 10:29:15 +02:00
Alan Orth	773a0a2695	poetry.lock: Run poetry update	2021-03-19 10:28:55 +02:00
Alan Orth	39a4b1a487	Add mojibake to data/test.csv and tests	2021-03-19 10:28:33 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	e92ec5d371	README.md: Add note about duplicate checking	2021-03-17 10:12:03 +02:00
Alan Orth	f816e17fe7	Version 0.4.7 v0.4.7	2021-03-17 10:00:34 +02:00
Alan Orth	9061c7c79b	setup.py: Remove beta tag I think this is only used by pypi.org?	2021-03-17 10:00:09 +02:00
Alan Orth	661d05b977	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-03-17 09:58:35 +02:00
Alan Orth	652b7ea98c	CHANGELOG.md: Add note about poetry dependencies	2021-03-17 09:58:02 +02:00
Alan Orth	65da6e9b05	poetry.lock: Run pipenv update	2021-03-17 09:57:31 +02:00
Alan Orth	a313b7527a	CHANGELOG.md: Add note about duplicate items	2021-03-17 09:55:07 +02:00
Alan Orth	51ee370697	data/test.csv: Add duplicate item	2021-03-17 09:54:14 +02:00
Alan Orth	e8422bfa74	tests/test_check.py: Add test for duplicate items	2021-03-17 09:54:02 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	ab3af2ec62	csv_metadata_quality/check.py: Reformat with black	2021-03-16 16:12:33 +02:00
Alan Orth	1aa2084230	CHANGELOG.md: Add note about checks	2021-03-16 16:11:24 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	9a5e3fd6ef	README.md: Add TODO about detecting duplicates	2021-03-16 14:03:26 +02:00
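
Commit 898bb412c3 above describes mojibake that arises when UTF-8 text is decoded as CP-1252 and saved again. The round trip it describes can be sketched with the standard library alone. This is only an illustration: it does not use ftfy as the commit does, the function name fix_mojibake is hypothetical, and it handles a single round of mis-decoding.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo one round of UTF-8 text having been decoded as CP-1252.

    Hypothetical helper for illustration; not the project's implementation.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside CP-1252, or the bytes
        # are not valid UTF-8: assume the text was fine and leave it alone.
        return text
    return repaired


print(fix_mojibake("CIAT PublicaciÃ³n"))
print(fix_mojibake("CIAT Publicación"))
```

Correctly encoded strings such as "CIAT Publicación" fail the UTF-8 decode step and are returned unchanged, which is why the commit treats the fix as something to attempt only on strings already flagged as suspicious.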

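Commit 9f2dc0a0f5 above says duplicates are detected by using the title, type, and date issued as a sort of key. A minimal sketch of that idea follows. The function name and the column names dc.title, dcterms.type, and dcterms.issued are assumptions made for the example, and rows are plain dictionaries rather than whatever structure the project uses.

```python
from typing import Dict, Iterable, List, Tuple


def find_duplicates(
    rows: Iterable[Dict[str, str]],
    key_fields: Tuple[str, ...] = ("dc.title", "dcterms.type", "dcterms.issued"),
) -> List[int]:
    """Return the indexes of rows whose key was already seen in an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Missing fields become empty strings so incomplete rows do not crash.
        key = tuple(row.get(field, "") for field in key_fields)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"dc.title": "Report A", "dcterms.type": "Report", "dcterms.issued": "2020"},
    {"dc.title": "Report B", "dcterms.type": "Report", "dcterms.issued": "2020"},
    {"dc.title": "Report A", "dcterms.type": "Report", "dcterms.issued": "2020"},
]
print(find_duplicates(rows))
```

Only the later occurrence is reported, so the first row with a given key is treated as the original.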

381 Commits
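
Commit 6ba16d5d4c above describes a title regex that accepts dc.title and dcterms.title, including DSpace language variants such as dc.title[en_US], while ignoring dc.title.alternative. The pattern below is one way such a rule could be written. It is an assumption made for illustration, not the regex from the commit or its regex101 link.

```python
import re

# Hypothetical pattern: a dc or dcterms title field, optionally followed
# by a bracketed language tag, with nothing after it.
TITLE_FIELD = re.compile(r"^(?:dc|dcterms)\.title(?:\[[A-Za-z_]*\])?$")


def is_title_field(column_name: str) -> bool:
    """Return True if the column name looks like a title field."""
    return TITLE_FIELD.match(column_name) is not None


for name in ("dc.title", "dcterms.title", "dc.title[en_US]", "dc.title.alternative"):
    print(name, is_title_field(name))
```

Anchoring the pattern with ^ and $ is what excludes dc.title.alternative, since any suffix other than a bracketed language tag prevents a match.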