csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-16 02:57:04 +01:00

Author	SHA1	Message	Date
Alan Orth	a36454a3ac	Add support for validating languages. Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters. These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode. These are things like non-breaking spaces, "replacement" characters, etc. that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	d73f7b54b1	csv_metadata_quality/app.py: Improve comments	2019-07-29 16:24:35 +03:00
Alan Orth	7b5db1f5d9	csv_metadata_quality/app.py: Remove erroneous comment	2019-07-29 16:15:25 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option. In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	a93b5b31c5	Add support for command line arguments. Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes.	2019-07-28 20:31:57 +03:00
Alan Orth	aadb3117eb	csv_metadata_quality/app.py: Remove unused test input files	2019-07-28 17:45:05 +03:00
Alan Orth	e88d35ace3	csv_metadata_quality/app.py: Use regex in column match. Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.	2019-07-28 17:27:20 +03:00
Alan Orth	196bb434fa	Add date validation. I'm only concerned with validating issue dates here. In DSpace they are almost always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO 8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.	2019-07-28 16:11:36 +03:00
Alan Orth	e2bb2d4df9	Main function should be "main()"	2019-07-27 23:09:16 +03:00
Alan Orth	c47c064a13	Make output less debuggy	2019-07-27 09:21:13 +03:00
Alan Orth	18f26c343d	csv_metadata_quality/app.py: Fix path to test.csv	2019-07-27 00:25:30 +03:00
Alan Orth	84c3b17678	csv_metadata_quality/app.py: Add comment	2019-07-26 23:49:13 +03:00
Alan Orth	aaf3537ba4	Add check for invalid multi-value separators	2019-07-26 23:48:24 +03:00
Alan Orth	dfd961d720	Bring test.csv into project	2019-07-26 23:14:37 +03:00
Alan Orth	e160b17fb0	Add ISSN and ISBN checks using python-stdnum	2019-07-26 23:14:10 +03:00
Alan Orth	232d28e13e	Refactor as package with subpackages. This makes it cleaner for introducing checks, fixes, tests, and docs in the future. Currently it can be run with "python -m csv_metadata_quality"; CSV input and output paths are still hard-coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6	2019-07-26 22:11:10 +03:00
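
Several of the commits above deal with DSpace multi-value fields: the check for invalid separators (aaf3537ba4), the "unsafe" fix that rewrites a lone "|" as "||" (40e77db713), and the fix for duplicate values (1e444cf040). The following is a minimal sketch of how those three steps might look, not the project's actual code. The function names and the regular expression are hypothetical and chosen only for illustration.

```python
import re

# DSpace separates multiple values in one field with a double pipe.
SEPARATOR = "||"

# Assumption for this sketch: an "invalid" separator is a single "|"
# that is neither preceded nor followed by another "|".
INVALID_SEPARATOR = re.compile(r"(?<!\|)\|(?!\|)")


def has_invalid_separator(field: str) -> bool:
    """Return True if the field contains a lone "|"."""
    return bool(INVALID_SEPARATOR.search(field))


def fix_separators(field: str) -> str:
    """Rewrite every lone "|" as "||".

    This is "unsafe" because the metadata could legitimately contain "|".
    """
    return INVALID_SEPARATOR.sub(SEPARATOR, field)


def drop_duplicates(field: str) -> str:
    """Remove repeated values from a multi-value field, keeping first-seen order."""
    seen = []
    for value in field.split(SEPARATOR):
        if value not in seen:
            seen.append(value)
    return SEPARATOR.join(seen)
```

Chaining the two fixes on a field such as "Kenya|Uganda||Kenya" would first normalize the separator and then drop the repeated "Kenya". Keeping the check separate from the fix matches the commit log's distinction between warning about a problem and changing the data.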

19 Commits
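
The date validation commit (196bb434fa) describes three cases: a missing date, multiple dates in one field, and a date that is not YYYY, YYYY-MM, or YYYY-MM-DD. A rough sketch of that logic using only the standard library is shown below. The function name and the returned status strings are hypothetical, and the real check may report problems differently (for example by printing a warning).

```python
from datetime import datetime

# The three issue-date layouts named in the commit message.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def check_issue_date(field: str) -> str:
    """Classify an issue date as "missing", "multiple", "ok", or "invalid"."""
    if not field:
        return "missing"

    # Multiple dates such as "1990||1991" are valid in DSpace, but the
    # commit message says they have no practical value, so flag them.
    if "||" in field:
        return "multiple"

    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(field, fmt)
            return "ok"
        except ValueError:
            continue

    return "invalid"
```

Note that `datetime.strptime` is lenient about zero padding, so a stricter check would need an additional pattern match on the field.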