csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-10-31 19:43:00 +01:00

Author	SHA1	Message	Date
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to de- tect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	a0ea829f5c	csv_metadata_quality/fix.py: Fixes should be green	2021-03-11 11:47:24 +02:00
Alan Orth	a7fc5a246c	Colorize output Some checks failed continuous-integration/drone/push Build is failing Details Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	0dc66c5c4e	Expand check/fix for multi-value separators I just came across some metadata that had unnecessary multi-value separators at the end of a field, causing a blank value to be used. For example: "Kenya\|\|Tanzania\|\|"	2021-01-03 15:30:03 +02:00
Alan Orth	7cfd4c0b59	csv_metadata_quality: Move scoped imports to global According to PEP8 we should avoid scoped imports unless you have a good reason. Here there are two cases where we do (issn and isbn), but I will move the others to the global scope.	2020-10-06 17:11:39 +03:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	365ecda324	Add utility function to check normalization Python's built-in unicodedata library includes the is_normalized() function starting with Python 3.8. This utility function allows us to do the same thing with earlier Python versions. See: https://docs.python.org/3/library/unicodedata.html	2020-01-15 12:17:52 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	e55380b4d5	csv_metadata_quality/fix.py: Harmonize language in fix output We should always say if we're removing or replacing something.	2019-10-01 17:09:49 +03:00
Alan Orth	c42f8b4812	csv_metadata_quality/fix.py: Replace non-breaking spaces We should be replacing non-breaking spaces (U+00A0) with normal sp- aces instead of removing them.	2019-10-01 16:55:04 +03:00
Alan Orth	280a99c8a8	Sort imports with isort See: https://sourcery.ai/blog/python-best-practices/	2019-08-29 01:15:04 +03:00
Alan Orth	d97dcd19db	Format with black	2019-08-29 01:10:39 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	232ff99898	csv_metadata_quality/fix.py: Add more unneccessary Unicode fixes Add a check for soft hyphens (U+00AD). In one sample CSV I have a normal hyphen followed by a soft hyphen in an ISBN. This causes the ISBN validation to fail.	2019-08-11 00:07:21 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't fig- ure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) direc- tly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	50ae4e17f2	csv_metadata_quality/fix.py: Fix indent	2019-07-29 17:14:48 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	42920e9c7c	Test Python regular expression matches directly Match objects always have a boolean value of True. See: https://docs.python.org/3.7/library/re.html	2019-07-29 16:16:30 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurences of invalid multi-value separators. DSpace uses "\|\|" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "\|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "\|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	c47c064a13	Make output less debuggy	2019-07-27 09:21:13 +03:00
Alan Orth	2b41f9416b	csv_metadata_quality/fix.py: Remove extra newline	2019-07-27 01:29:22 +03:00
Alan Orth	30a4b0005f	csv_metadata_quality/fix.py: Remove test function	2019-07-26 22:56:40 +03:00
Alan Orth	232d28e13e	Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, tests, docs, and tests in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6	2019-07-26 22:11:10 +03:00

26 Commits