csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-12 14:47:03 +02:00

Author	SHA1	Message	Date
Alan Orth	e49b4e8f22	README.md: Try to simplify list of functionality	2019-07-29 18:25:38 +03:00
Alan Orth	0eb852a65b	README.md: Improve note about unsafe options	2019-07-29 18:14:50 +03:00
Alan Orth	8d92498bbf	Add .gitignore	2019-07-29 18:10:29 +03:00
Alan Orth	8c34c2d6e6	README.md: Add note about removing duplicate values	2019-07-29 18:09:48 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	d7888d59a8	csv_metadata_quality/check.py: Return date even if it is invalid Otherwise it is missing from the final CSV and then we can't even fix it. :)	2019-07-29 17:40:14 +03:00
Alan Orth	8509006165	data/test.csv: Use more descriptive tests To make it obvious what each item is testing.	2019-07-29 17:38:46 +03:00
Alan Orth	e33551776c	README.md: Update note about unsafe options	2019-07-29 17:25:42 +03:00
Alan Orth	7f781d7077	README.md: Finish writing usage section	2019-07-29 17:21:34 +03:00
Alan Orth	50ae4e17f2	csv_metadata_quality/fix.py: Fix indent	2019-07-29 17:14:48 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	ae66382046	README.md: Add note about unnecessary Unicode fixes	2019-07-29 16:34:39 +03:00
Alan Orth	d73f7b54b1	csv_metadata_quality/app.py: Improve comments	2019-07-29 16:24:35 +03:00
Alan Orth	42920e9c7c	Test Python regular expression matches directly Match objects always have a boolean value of True. See: https://docs.python.org/3.7/library/re.html	2019-07-29 16:16:30 +03:00
Alan Orth	7b5db1f5d9	csv_metadata_quality/app.py: Remove erroneous comment	2019-07-29 16:15:25 +03:00
Alan Orth	9ac9474c69	README.md: Add todo about duplicates	2019-07-29 12:40:53 +03:00
Alan Orth	e000bd1f88	README.md: Add Travis CI badge	2019-07-29 12:19:10 +03:00
Alan Orth	3554c2991f	README.md: Add note about Python version	2019-07-29 12:15:09 +03:00
Alan Orth	8e3e7a3573	.travis.yml: Only test Python 3.6 and 3.7 I'm using f-strings, which are only supported in Python 3.6+.	2019-07-29 12:13:55 +03:00
Alan Orth	913500cfff	.travis.yml: Remove env This seems to make Travis run one job for each variable.	2019-07-29 11:51:12 +03:00
Alan Orth	17f4a26ac0	Add Travis CI support I'm happy with SourceHut's build support, but Travis allows us to test other Python versions.	2019-07-29 11:45:58 +03:00
Alan Orth	ac127e7f8a	README.md: Add usage section	2019-07-29 11:30:06 +03:00
Alan Orth	aabb57321c	README.md: Improve Reorganize functionality section and add installation section.	2019-07-29 11:15:51 +03:00
Alan Orth	a8a41d60b6	README.md: Add note about Pandas	2019-07-29 10:56:02 +03:00
Alan Orth	cf6c01caaf	README.md: Add notes about unsafe fixes	2019-07-28 23:01:00 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	a93b5b31c5	Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes.	2019-07-28 20:31:57 +03:00
Alan Orth	4e6225c0a9	README.md: Add note about Excel files	2019-07-28 18:38:36 +03:00
Alan Orth	d293214d3a	README.md: Add note about checking dates	2019-07-28 18:37:46 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	c6e7d6d9b5	Add flake8 to pipenv dev environment To help check PEP8 formatting/style compliance.	2019-07-28 17:46:30 +03:00
Alan Orth	aadb3117eb	csv_metadata_quality/app.py: Remove unused test input files	2019-07-28 17:45:05 +03:00
Alan Orth	ce8f140c66	tests: Remove unused pytest import	2019-07-28 17:42:54 +03:00
Alan Orth	e88d35ace3	csv_metadata_quality/app.py: Use regex in column match Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.	2019-07-28 17:27:20 +03:00
Alan Orth	4687e2f5fa	README.md: Remove todo for date validation	2019-07-28 17:26:39 +03:00
Alan Orth	209a290de8	Update python requirements Generated using pipenv: $ pipenv lock -r > requirements.txt $ pipenv lock -r -d > requirements-dev.txt	2019-07-28 17:12:20 +03:00
Alan Orth	646146d59f	Add Excel version of test file	2019-07-28 17:07:33 +03:00
Alan Orth	c2f28194eb	Add xlrd to pipenv for Pandas read_excel support	2019-07-28 17:05:17 +03:00
Alan Orth	5771764ad2	data/test.csv: Add some new records to test dates Test invalid, missing, and multiple dates.	2019-07-28 16:23:55 +03:00
Alan Orth	196bb434fa	Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.	2019-07-28 16:11:36 +03:00
Alan Orth	73b4061c7b	Add ipython to pipenv dev packages	2019-07-28 10:06:41 +03:00
Alan Orth	e2bb2d4df9	Main function should be "main()"	2019-07-27 23:09:16 +03:00
Alan Orth	7e16968bf2	README.md: Add todos	2019-07-27 19:24:35 +03:00
Alan Orth	c47c064a13	Make output less debuggy	2019-07-27 09:21:13 +03:00
Alan Orth	a849615b41	Add tests for check functions Relies on capturing stdout. See: https://docs.pytest.org/en/5.0.1/capture.html	2019-07-27 02:10:13 +03:00
Alan Orth	2b41f9416b	csv_metadata_quality/fix.py: Remove extra newline	2019-07-27 01:29:22 +03:00
Alan Orth	3cf9f9452b	csv_metadata_quality/check.py: Always return field We always need to return the field back so apply doesn't set it to null when creating the new data frame.	2019-07-27 01:28:08 +03:00
Alan Orth	1d861f263b	.build.yml: Fix setup script I wasn't changing into the project directory so the pipenv virtual environment was not getting created in the correct place.	2019-07-27 00:41:57 +03:00
Alan Orth	33121f8a01	.build.yml: Add tests	2019-07-27 00:38:34 +03:00

623 Commits
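
Several commits above describe behaviour that is easier to follow with a small example: the "unsafe" fix that turns a stray "|" into the DSpace multi-value separator "||" (40e77db713), the removal of duplicate values (1e444cf040), and the validation of issue dates as YYYY, YYYY-MM, or YYYY-MM-DD (196bb434fa). The following is a minimal sketch of those ideas using only the Python standard library. The function names `fix_separators`, `fix_duplicates`, `is_valid_date`, and `date_problems` are hypothetical and are not taken from the repository; the project's actual check functions reportedly print warnings and return the field, whereas this sketch returns a list of messages to keep it easy to test.

```python
import re
from datetime import datetime
from typing import List

# DSpace multi-value separator, as described in the commit messages.
SEPARATOR = "||"

# Date formats the commit message lists as typical for issue dates.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def fix_separators(field: str) -> str:
    """Replace a lone "|" with "||" (hypothetical helper).

    "Unsafe" because a single "|" might be legitimate data.
    """
    return re.sub(r"(?<!\|)\|(?!\|)", SEPARATOR, field)


def fix_duplicates(field: str) -> str:
    """Drop repeated values in a multi-value field, keeping first-seen order."""
    seen: List[str] = []
    for value in field.split(SEPARATOR):
        if value not in seen:
            seen.append(value)
    return SEPARATOR.join(seen)


def is_valid_date(value: str) -> bool:
    """Return True if value parses with one of DATE_FORMATS."""
    for date_format in DATE_FORMATS:
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            continue
    return False


def date_problems(field: str) -> List[str]:
    """Collect warnings for a date field (assumed message wording)."""
    if not field:
        return ["Missing date"]
    values = field.split(SEPARATOR)
    problems = []
    if len(values) > 1:
        problems.append(f"Multiple dates not allowed: {field}")
    for value in values:
        if not is_valid_date(value):
            problems.append(f"Invalid date: {value}")
    return problems


if __name__ == "__main__":
    print(fix_separators("Kenya|Uganda"))
    print(fix_duplicates("Kenya||Uganda||Kenya"))
    print(date_problems("1990||1991"))
```

Returning messages instead of printing is a design choice for this sketch only; when such a function is used with Pandas `apply`, the commit log notes that the field itself must always be returned so the value is not set to null in the new data frame.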