csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-07-27 08:18:04 +02:00

Author	SHA1	Message	Date
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	8eddb76aab	Bump version to 0.4.8-dev	2021-03-19 11:53:56 +02:00
Alan Orth	a04dbc50db	Add notes about checking and fixing mojibake	2021-03-19 11:48:27 +02:00
Alan Orth	28335ed159	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-03-19 10:29:15 +02:00
Alan Orth	773a0a2695	poetry.lock: Run poetry update	2021-03-19 10:28:55 +02:00
Alan Orth	39a4b1a487	Add mojibake to data/test.csv and tests	2021-03-19 10:28:33 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	e92ec5d371	README.md: Add note about duplicate checking	2021-03-17 10:12:03 +02:00
Alan Orth	f816e17fe7	Version 0.4.7 v0.4.7	2021-03-17 10:00:34 +02:00
Alan Orth	9061c7c79b	setup.py: Remove beta tag I think this is only used by pypi.org?	2021-03-17 10:00:09 +02:00
Alan Orth	661d05b977	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-03-17 09:58:35 +02:00
Alan Orth	652b7ea98c	CHANGELOG.md: Add note about poetry dependencies	2021-03-17 09:58:02 +02:00
Alan Orth	65da6e9b05	poetry.lock: Run pipenv update	2021-03-17 09:57:31 +02:00
Alan Orth	a313b7527a	CHANGELOG.md: Add note about duplicate items	2021-03-17 09:55:07 +02:00
Alan Orth	51ee370697	data/test.csv: Add duplicate item	2021-03-17 09:54:14 +02:00
Alan Orth	e8422bfa74	tests/test_check.py: Add test for duplicate items	2021-03-17 09:54:02 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	ab3af2ec62	csv_metadata_quality/check.py: Reformat with black	2021-03-16 16:12:33 +02:00
Alan Orth	1aa2084230	CHANGELOG.md: Add note about checks	2021-03-16 16:11:24 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	9a5e3fd6ef	README.md: Add TODO about detecting duplicates	2021-03-16 14:03:26 +02:00
Alan Orth	ed084da08c	CHANGELOG.md: Add note about multi-value separators	2021-03-14 21:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them; I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	3656e9f976	Update CI workflows to use DCTERMS instead of DC	2021-03-14 15:52:51 +02:00
Alan Orth	c9c277f8df	csv_metadata_quality/app.py: Update help text Use DCTERMS fields where possible.	2021-03-14 10:52:58 +02:00
Alan Orth	fb35afd937	CHANGELOG.md: Add note about requests cache	2021-03-14 09:13:51 +02:00
Alan Orth	0e9176f0a6	csv_metadata_quality/check.py: requests cache Allow overriding the directory for the requests cache. In the case of csv-metadata-quality-web, which currently runs on Google's App Engine, we can only write to /tmp.	2021-03-14 09:07:35 +02:00
Alan Orth	1008acf35e	Always fix invalid multi-value separators This is no longer classified as "unsafe" as I have yet to see a case where this was intentional, and it always causes issues when you import the data in a DSpace repository.	2021-03-13 12:59:45 +02:00
Alan Orth	f00a07e2cd	README.md: Reorganize unsafe functionality	2021-03-13 11:56:52 +02:00
Alan Orth	46098861ed	poetry.lock: Run poetry update	2021-03-11 22:45:32 +02:00
Alan Orth	fa84cfa440	Bump version to 0.4.6-dev	2021-03-11 22:44:36 +02:00
Alan Orth	6cc1401f88	pyproject.toml: Minimum Python is technically 3.7.1 See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html	2021-03-11 13:41:58 +02:00
Alan Orth	ad2cda8a41	README.md: Add note about SPDX license identifiers	2021-03-11 12:21:34 +02:00
Alan Orth	dc6920802e	.github/workflows/python-app.yml: Use Python 3.9 I now use this version in my development environment. Eventually I should add a matrix of versions to use, but I don't know the GitHub Actions syntax well enough yet.	2021-03-11 12:17:57 +02:00
Alan Orth	6ca449d8ed	README.md: Update note about Python 3.8 to 3.8+ Currently the lower bound on Python version support is 3.7 because of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.	2021-03-11 12:16:07 +02:00
Alan Orth	1554cfd5c9	Version 0.4.6 v0.4.6	2021-03-11 12:14:54 +02:00
Alan Orth	00b8faad6d	CHANGELOG.md: Fix headers	2021-03-11 12:13:22 +02:00
Alan Orth	b19d81abdd	.drone.yml: We need some stuff to build pyicu now	2021-03-11 12:07:28 +02:00
Alan Orth	a0ea829f5c	csv_metadata_quality/fix.py: Fixes should be green	2021-03-11 11:47:24 +02:00
Alan Orth	0089efa914	tests/test_check.py: Use dcterms.subject instead of dc.subject Trying to move some old DC fields to DCTERMS.	2021-03-11 11:45:25 +02:00
Alan Orth	3dbe656f9f	Update requirements Generated with poetry export: $ poetry export --without-hashes -f requirements.txt > requirements.txt $ poetry export --without-hashes --dev -f requirements.txt > requirements-dev.txt I am trying `--without-hashes` to work around an error on pip install when running in CI: ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==.	2021-03-11 11:11:19 +02:00
Alan Orth	7ad821dcad	CHANGELOG.md: Add note about poetry dependencies	2021-03-11 11:10:27 +02:00
Alan Orth	cd876c4fb3	poetry.lock: Run poetry update	2021-03-11 11:10:02 +02:00
Alan Orth	d88ea56488	csv_metadata_quality/check.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-11 10:52:20 +02:00
Alan Orth	e0e3ca6c58	CHANGELOG.md: Add notes about DCTERMS in data/test.csv	2021-03-11 10:50:52 +02:00
Alan Orth	abae8ca4fb	data/test.csv: Move some DC fields to DCTERMS The original Dublin Core elements set was superseded by DCTERMS in 2008 and we have started using them in our DSpace repository so I think it's good to update them in our test data. Old DC fields are still checked and fixed in this tool, though. It's worth noting that currently supported DSpace versions (4, 5, and 6) all have hard-coded a few fields like dc.title internally so we can't migrate those to their DCTERMS counterparts just yet.	2021-03-11 10:49:05 +02:00
Alan Orth	d7d4d4efca	CHANGELOG.md: Add note about SPDX license identifiers	2021-03-11 10:37:27 +02:00
Alan Orth	5318953150	tests/test_check.py: Add tests for licenses	2021-03-11 10:36:26 +02:00
Alan Orth	3b17914002	data/test.csv: Add invalid SPDX license Now we are checking dcterms.license against the list of SPDX license identifiers using https://pypi.org/project/spdx-license-list/.	2021-03-11 10:34:58 +02:00
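
The mojibake round trip described in commit 898bb412c3 above (UTF-8 text opened as CP-1252 and saved again as UTF-8) can be illustrated with a minimal sketch. This is not the project's code: the commit says it relies on the ftfy library to decide whether a string is "weird", whereas this standard-library-only version simply tries to reverse the round trip. The function name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo a UTF-8 -> CP-1252 -> UTF-8 round trip.

    Hypothetical helper for illustration only; it does not use ftfy.
    """
    try:
        # Re-encode as CP-1252 to recover the original bytes, then
        # decode those bytes as the UTF-8 they were meant to be.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way, so leave the text untouched.
        return text


print(fix_mojibake("CIAT PublicaÃ§ao"))
print(fix_mojibake("CIAT PublicaciÃ³n"))
```

Text that is already correct, such as "CIAT Publicación", fails the UTF-8 decode step and is returned unchanged, which is why the sketch catches both exceptions rather than assuming every input is damaged.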

553 Commits
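
Commit 9f2dc0a0f5 in the log above describes using the title, type, and date issued as a sort of "key" for detecting duplicate items. A minimal sketch of that idea follows, using plain dictionaries rather than the project's own data structures. The function name `find_duplicates` and the field names `dc.title`, `dcterms.type`, and `dcterms.issued` are assumptions made for illustration.

```python
def find_duplicates(rows):
    """Return rows whose (title, type, date issued) key was already seen.

    Hypothetical helper; the field names below are assumed, not taken
    from the project's code.
    """
    seen = set()
    duplicates = []
    for row in rows:
        key = (
            row.get("dc.title"),
            row.get("dcterms.type"),
            row.get("dcterms.issued"),
        )
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"dc.title": "Testing", "dcterms.type": "Report", "dcterms.issued": "2021-03-17"},
    {"dc.title": "Testing", "dcterms.type": "Report", "dcterms.issued": "2021-03-17"},
    {"dc.title": "Testing", "dcterms.type": "Book", "dcterms.issued": "2021-03-17"},
]
print(len(find_duplicates(rows)))
```

Keeping the key as a tuple of all three fields means two items sharing only a title are not flagged, which matches the commit's description of the three fields acting together.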