csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-22 05:45:02 +01:00

Author	SHA1	Message	Date
Alan Orth	051777bcec	Ignore subregion field for missing region checks Due to a sloppy regex I was sometimes matching the subregion field when checking for missing UN M.49 regions in the region field.	2022-12-07 23:18:47 +01:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	f49214fa2e	csv_metadata_quality/fix.py: fix bug in regions We need to make sure we're only manipulating the regions if we have any missing. The previous code was always manipulating the existing row, even when there were no missing regions, which resulted in new values like "Eastern Africa||".	2022-09-01 16:15:32 +03:00
Alan Orth	7ce20726d0	csv_metadata_quality/fix.py: minor change Print missing regions when we know they are missing, instead of doing another check later and looping over them again.	2022-09-01 16:03:49 +03:00
Alan Orth	473be5ac2f	csv_metadata_quality/fix.py: don't add "not found" region country_converter returns the literal "not found" string if a country cannot be found. In that case we do not want to consider that as a region!	2022-09-01 15:46:21 +03:00
Alan Orth	7c61cae417	csv_metadata_quality/fix.py: silence warning By default country_converter prints "not found in regex" if a country is not found. We can silence this by switching the logging level to something above WARNING.	2022-09-01 15:44:50 +03:00
Alan Orth	ae16289637	csv_metadata_quality/fix.py: Minor change The country_converter documentation says we should instantiate the CountryConverter() class once instead of calling coco.convert() in each iteration of the loop so we don't end up loading the data file more than once.	2022-09-01 15:40:45 +03:00
Alan Orth	0cf0bc97f0	csv_metadata_quality/fix.py: fix logic error again It seems there was another logic error raised by the test in pytest. With my real data, it was enough to check if the region column was None, but with my test I was explicitly setting the region to "" (an empty string). So to be really sure we should check if the string is not None and if its length is greater than 0.	2022-08-03 20:51:14 +03:00
Alan Orth	40c3585bab	csv_metadata_quality/fix.py: fix logic error Fix string concatenation with existing regions.	2022-08-03 18:26:08 +03:00
Alan Orth	b9c44aed7d	csv_metadata_quality/fix.py: fix logic issue Forgot to return the row as-is if we don't find any countries.	2022-08-02 10:17:30 +03:00
Alan Orth	689ee184f7	Add unsafe check to add missing regions	2022-07-28 16:52:43 +03:00
Alan Orth	7763a021c5	csv_metadata_quality/fix.py: sort imports with isort	2021-12-15 23:15:02 +02:00
Alan Orth	ff49a80432	csv_metadata_quality/fix.py: configure ftfy Don't replace smart quotes in ftfy. If our text has them we should keep them.	2021-12-15 21:51:51 +02:00
Alan Orth	95015febbd	csv_metadata_quality/fix.py: fix thin spaces Replace thin spaces with normal spaces. Sometimes I see these get mishandled on Windows machines and they end up as "?" or so.	2021-12-09 23:22:53 +02:00
Alan Orth	787fa9e8d9	Add field name to fix.newlines output	2021-10-08 14:36:43 +03:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	a0ea829f5c	csv_metadata_quality/fix.py: Fixes should be green	2021-03-11 11:47:24 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	0dc66c5c4e	Expand check/fix for multi-value separators I just came across some metadata that had unnecessary multi-value separators at the end of a field, causing a blank value to be used. For example: "Kenya||Tanzania||"	2021-01-03 15:30:03 +02:00
Alan Orth	7cfd4c0b59	csv_metadata_quality: Move scoped imports to global According to PEP8 we should avoid scoped imports unless you have a good reason. Here there are two cases where we do (issn and isbn), but I will move the others to the global scope.	2020-10-06 17:11:39 +03:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	365ecda324	Add utility function to check normalization Python's built-in unicodedata library includes the is_normalized() function starting with Python 3.8. This utility function allows us to do the same thing with earlier Python versions. See: https://docs.python.org/3/library/unicodedata.html	2020-01-15 12:17:52 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	e55380b4d5	csv_metadata_quality/fix.py: Harmonize language in fix output We should always say if we're removing or replacing something.	2019-10-01 17:09:49 +03:00
Alan Orth	c42f8b4812	csv_metadata_quality/fix.py: Replace non-breaking spaces We should be replacing non-breaking spaces (U+00A0) with normal spaces instead of removing them.	2019-10-01 16:55:04 +03:00
Alan Orth	280a99c8a8	Sort imports with isort See: https://sourcery.ai/blog/python-best-practices/	2019-08-29 01:15:04 +03:00
Alan Orth	d97dcd19db	Format with black	2019-08-29 01:10:39 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	232ff99898	csv_metadata_quality/fix.py: Add more unnecessary Unicode fixes Add a check for soft hyphens (U+00AD). In one sample CSV I have a normal hyphen followed by a soft hyphen in an ISBN. This causes the ISBN validation to fail.	2019-08-11 00:07:21 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	50ae4e17f2	csv_metadata_quality/fix.py: Fix indent	2019-07-29 17:14:48 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	42920e9c7c	Test Python regular expression matches directly Match objects always have a boolean value of True. See: https://docs.python.org/3.7/library/re.html	2019-07-29 16:16:30 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	c47c064a13	Make output less debuggy	2019-07-27 09:21:13 +03:00
Alan Orth	2b41f9416b	csv_metadata_quality/fix.py: Remove extra newline	2019-07-27 01:29:22 +03:00
Alan Orth	30a4b0005f	csv_metadata_quality/fix.py: Remove test function	2019-07-26 22:56:40 +03:00
Alan Orth	232d28e13e	Refactor as package with subpackages This makes it cleaner for introducing checks, fixes, docs, and tests in the future. Currently can be run like this: python -m csv_metadata_quality CSV input and output paths are still hard coded. See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6	2019-07-26 22:11:10 +03:00

41 Commits
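
Several of the commit messages above describe small string fixes: repairing multi-value separators, replacing or removing "unnecessary" Unicode characters, and normalizing to NFC. The following is a minimal sketch of those ideas using only the Python standard library. It is not the code in csv_metadata_quality/fix.py: the function names, signatures, and the exact set of characters handled are assumptions made for illustration.

```python
import re
import unicodedata


def fix_separators(value: str, unsafe: bool = False) -> str:
    """Hypothetical helper: repair DSpace-style "||" multi-value separators."""
    if unsafe:
        # Turn a lone "|" into "||". This is only safe if "|" is never
        # used for a legitimate purpose in the metadata.
        value = re.sub(r"(?<!\|)\|(?!\|)", "||", value)
    # Drop blank values left by leading, trailing, or doubled separators,
    # for example "Kenya||Tanzania||".
    parts = [part for part in value.split("||") if part.strip()]
    return "||".join(parts)


def fix_unnecessary_unicode(value: str) -> str:
    """Hypothetical helper: clean up characters that add nothing to metadata."""
    # Replace non-breaking spaces (U+00A0) and thin spaces (U+2009)
    # with normal spaces rather than deleting them.
    value = value.replace("\u00a0", " ").replace("\u2009", " ")
    # Remove soft hyphens (U+00AD), which can break ISBN validation.
    return value.replace("\u00ad", "")


def normalize_unicode(value: str) -> str:
    """Hypothetical helper: normalize a string to NFC."""
    return unicodedata.normalize("NFC", value)
```

For example, under these assumptions `fix_separators("Kenya||Tanzania||")` drops the trailing blank value, and `fix_separators("Kenya|Tanzania", unsafe=True)` rewrites the single pipe as a proper separator.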