csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-17 08:56:40 +02:00

Author	SHA1	Message	Date
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	f816e17fe7	Version 0.4.7	2021-03-17 10:00:34 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	ab3af2ec62	csv_metadata_quality/check.py: Reformat with black	2021-03-16 16:12:33 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them; I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	c9c277f8df	csv_metadata_quality/app.py: Update help text Use DCTERMS fields where possible.	2021-03-14 10:52:58 +02:00
Alan Orth	0e9176f0a6	csv_metadata_quality/check.py: requests cache Allow overriding the directory for the requests cache. In the case of csv-metadata-quality-web, which currently runs on Google's App Engine, we can only write to /tmp.	2021-03-14 09:07:35 +02:00
Alan Orth	1008acf35e	Always fix invalid multi-value separators This is no longer classified as "unsafe" as I have yet to see a case where this was intentional, and it always causes issues when you import the data in a DSpace repository.	2021-03-13 12:59:45 +02:00
Alan Orth	fa84cfa440	Bump version to 0.4.6-dev	2021-03-11 22:44:36 +02:00
Alan Orth	1554cfd5c9	Version 0.4.6	2021-03-11 12:14:54 +02:00
Alan Orth	a0ea829f5c	csv_metadata_quality/fix.py: Fixes should be green	2021-03-11 11:47:24 +02:00
Alan Orth	d88ea56488	csv_metadata_quality/check.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-11 10:52:20 +02:00
Alan Orth	6e4b0e5c1b	Add validation of SPDX license identifiers Currently this only checks the dcterms.license field and the result will only be a warning.	2021-03-11 10:33:16 +02:00
Alan Orth	202bda862a	Bump version to 0.4.5	2021-03-04 21:38:10 +02:00
Alan Orth	dd2cfae047	csv_metadata_quality/app.py: Match dcterms.issued for dates We used to only check fields that had "date" in their name because we were using DSpace's default dc.date.* fields. Now we are using dcterms.issued so I will add that one as well.	2021-02-28 15:11:06 +02:00
Alan Orth	d76e72532a	Move unreleased changes to v0.4.4	2021-02-21 13:25:22 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	de92f32ab6	csv_metadata_quality/check.py: More date formats We should also allow ISO 8601 extended in combined date and time format. DSpace does not have a problem with dates in this format and I have found some metadata that uses this date format. For example: 2020-08-31T11:04:56Z See: https://en.wikipedia.org/wiki/ISO_8601	2021-02-04 21:39:14 +02:00
Alan Orth	cbf94490f2	Version 0.4.3	2021-01-26 15:22:40 +02:00
Alan Orth	0dc66c5c4e	Expand check/fix for multi-value separators I just came across some metadata that had unnecessary multi-value separators at the end of a field, causing a blank value to be used. For example: "Kenya||Tanzania||"	2021-01-03 15:30:03 +02:00
Alan Orth	7cfd4c0b59	csv_metadata_quality: Move scoped imports to global According to PEP8 we should avoid scoped imports unless you have a good reason. Here there are two cases where we do (issn and isbn), but I will move the others to the global scope.	2020-10-06 17:11:39 +03:00
Alan Orth	431e6331c8	csv_metadata_quality/check.py: Format with black	2020-07-06 14:10:19 +03:00
Alan Orth	cb07d357d4	Version 0.4.2	2020-07-06 14:04:34 +03:00
Alan Orth	2a1566af62	csv_metadata_quality/check.py: Parameterize AGROVOC request	2020-07-06 13:44:46 +03:00
Alan Orth	5fcaa63bd5	csv_metadata_quality/check.py: Prune requests cache once We only need to prune the requests cache once before using it, not for every value we check.	2020-07-06 13:42:19 +03:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	0b2d211455	Version 0.4.1	2020-01-15 12:19:42 +02:00
Alan Orth	365ecda324	Add utility function to check normalization Python's built-in unicodedata library includes the is_normalized() function starting with Python 3.8. This utility function allows us to do the same thing with earlier Python versions. See: https://docs.python.org/3/library/unicodedata.html	2020-01-15 12:17:52 +02:00
Alan Orth	705127fd28	Version 0.4.0	2020-01-15 11:44:56 +02:00
Alan Orth	87181bc7b8	Run black, isort, and flake8.	2020-01-15 11:41:31 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	efdc3a841a	Version 0.3.1	2019-10-01 17:11:13 +03:00
Alan Orth	e55380b4d5	csv_metadata_quality/fix.py: Harmonize language in fix output We should always say if we're removing or replacing something.	2019-10-01 17:09:49 +03:00
Alan Orth	c42f8b4812	csv_metadata_quality/fix.py: Replace non-breaking spaces We should be replacing non-breaking spaces (U+00A0) with normal spaces instead of removing them.	2019-10-01 16:55:04 +03:00
Alan Orth	e15c98cccb	Move unreleased changes to v0.3.0	2019-09-26 14:06:31 +03:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	86d4623fd3	More ISO 639-1 and ISO 639-3 fixes ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes. Technically there are ISO 639-2/T and ISO 639-2/B, which also use three-letter codes, but those are not supported by the pycountry library so I won't even worry about them. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:44:39 +03:00
Alan Orth	f304ca6a33	csv_metadata_quality/app.py: Use simpler column iteration I don't know where I got the other one...	2019-09-21 17:19:39 +03:00
Alan Orth	d9fc09f121	Fix references to ISO 639 It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2 is the three-letter codes, aka alpha2 and alpha3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-11 16:36:53 +03:00
Alan Orth	280a99c8a8	Sort imports with isort See: https://sourcery.ai/blog/python-best-practices/	2019-08-29 01:15:04 +03:00
Alan Orth	d97dcd19db	Format with black	2019-08-29 01:10:39 +03:00
Alan Orth	c354a3687c	Release version 0.2.2	2019-08-28 00:10:17 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	113e7cd8b6	csv_metadata_quality/app.py: Add ability to skip fields The user may want to skip the checking and fixing of certain fields in the input file.	2019-08-27 00:10:07 +03:00
Alan Orth	884e8f970d	csv_metadata_quality/check.py: Simplify AGROVOC check I recycled this code from a separate agrovoc-lookup.py script that checks lines in a text file to see if they are valid AGROVOC terms or not. There I was concerned about skipping comments or something I think, but we don't need to check that here. We simply check the term that is in the field and inform the user if it's valid or not.	2019-08-21 16:35:29 +03:00
Alan Orth	ed5612fbcf	Add column name to output in date checks This makes it easier to understand where the error is in case a CSV has multiple date fields, for example: Missing date (dc.date.issued). Missing date (dc.date.issued[]). If you have 126 items and you get 126 "Missing date" messages then it's likely that 100 of the items have dates in one field, and the others have dates in another field.	2019-08-21 15:31:12 +03:00
Alan Orth	7255bf4707	Version 0.2.1	2019-08-11 10:39:39 +03:00
Alan Orth	232ff99898	csv_metadata_quality/fix.py: Add more unnecessary Unicode fixes Add a check for soft hyphens (U+00AD). In one sample CSV I have a normal hyphen followed by a soft hyphen in an ISBN. This causes the ISBN validation to fail.	2019-08-11 00:07:21 +03:00
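
Commit 898bb412c3 in the log above describes text that was encoded as UTF-8, decoded as CP-1252, and saved again as UTF-8. The round trip can be sketched with the Python standard library alone. This is an illustration only, not the project's code: the commit says the real check relies on ftfy to decide whether a string is "weird", and the helper name `fix_mojibake` here is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo a UTF-8 -> CP-1252 -> UTF-8 round trip.

    Hypothetical helper for illustration. Returns the input unchanged
    when the text cannot be reinterpreted this way.
    """
    try:
        # Encoding as CP-1252 recovers the original bytes; those bytes
        # were UTF-8 all along, so decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside CP-1252, or the bytes
        # are not valid UTF-8: assume it was not mojibake.
        return text


if __name__ == "__main__":
    for value in ["CIAT PublicaÃ§ao", "CIAT PublicaciÃ³n", "CIAT Publicación"]:
        print(repr(value), "->", repr(fix_mojibake(value)))
```

Correctly encoded text such as "CIAT Publicación" falls into the `except` branch and is returned unchanged, because its CP-1252 bytes are not valid UTF-8. A bare round trip like this can still misfire on unusual input, which is presumably why the commit message files the fix under "unsafe".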

93 Commits