csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-18 17:36:39 +02:00

Author	SHA1	Message	Date
Alan Orth	787fa9e8d9	Add field name to fix.newlines output	2021-10-08 14:36:43 +03:00
Alan Orth	8a27fb2589	Add check for missing DOIs Sometimes an editor includes a DOI in the citation field, but does not add a standalone DOI field.	2021-10-06 21:25:39 +03:00
Alan Orth	6ba16d5d4c	csv_metadata_quality/check.py: Fix duplicate checker Fix the incorrect type field regex, and improve the title regex to consider dcterms.title and dc.title (along with the DSpace language variants like dc.title[en_US]), but ignore dc.title.alternative. See: https://regex101.com/r/I4m06F/1	2021-10-06 19:32:40 +03:00
Alan Orth	54ab869297	csv_metadata_quality/experimental.py: Adjust citation match We need to match both of these citation fields: - dc.identifier.citation - dcterms.bibliographicCitation	2021-10-06 16:13:10 +03:00
Alan Orth	4e2eab68b0	Update requests-cache Apparently we were stuck on an older version of requests-cache due to the fact that we were using the caret, which will never update the left-most (major) version. Upstream requests-cache is currently version 0.6.4, and there seem to have been some changes to the API.	2021-07-06 15:24:39 +03:00
Alan Orth	a8fe623f4c	csv_metadata_quality/check.py: Remove unnecessary pass LGTM warned that these pass statements are not necessary. See: https://lgtm.com/rules/910088/	2021-04-20 08:20:13 +03:00
Alan Orth	bd8943f36a	csv_metadata_quality/app.py: Don't crash if fields are missing We don't need to crash if someone feeds us a CSV file that is missing common DSpace fields like title, type, and subject.	2021-03-21 19:47:29 +02:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	8eddb76aab	Bump version to 0.4.8-dev	2021-03-19 11:53:56 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	f816e17fe7	Version 0.4.7	2021-03-17 10:00:34 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	14010896a5	csv_metadata_quality/experimental.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-16 16:13:34 +02:00
Alan Orth	ab3af2ec62	csv_metadata_quality/check.py: Reformat with black	2021-03-16 16:12:33 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them; I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	c9c277f8df	csv_metadata_quality/app.py: Update help text Use DCTERMS fields where possible.	2021-03-14 10:52:58 +02:00
Alan Orth	0e9176f0a6	csv_metadata_quality/check.py: requests cache Allow overriding the directory for the requests cache. In the case of csv-metadata-quality-web, which currently runs on Google's App Engine, we can only write to /tmp.	2021-03-14 09:07:35 +02:00
Alan Orth	1008acf35e	Always fix invalid multi-value separators This is no longer classified as "unsafe" as I have yet to see a case where this was intentional, and it always causes issues when you import the data in a DSpace repository.	2021-03-13 12:59:45 +02:00
Alan Orth	fa84cfa440	Bump version to 0.4.6-dev	2021-03-11 22:44:36 +02:00
Alan Orth	1554cfd5c9	Version 0.4.6	2021-03-11 12:14:54 +02:00
Alan Orth	a0ea829f5c	csv_metadata_quality/fix.py: Fixes should be green	2021-03-11 11:47:24 +02:00
Alan Orth	d88ea56488	csv_metadata_quality/check.py: Move all imports to top of file PEP8 recommends keeping imports at the top of the file. Also, I had to re-work the issn/isbn so they didn't conflict with the functions in check.py (flake8 warned about them being redefined). Imports sorted with isort. See: https://www.python.org/dev/peps/pep-0008/#imports	2021-03-11 10:52:20 +02:00
Alan Orth	6e4b0e5c1b	Add validation of SPDX license identifiers Currently this only checks the dcterms.license field and the result will only be a warning.	2021-03-11 10:33:16 +02:00
Alan Orth	202bda862a	Bump version to 0.4.5	2021-03-04 21:38:10 +02:00
Alan Orth	dd2cfae047	csv_metadata_quality/app.py: Match dcterms.issued for dates We used to only check fields that had "date" in their name because we were using DSpace's default dc.date.* fields. Now we are using dcterms.issued so I will add that one as well.	2021-02-28 15:11:06 +02:00
Alan Orth	d76e72532a	Move unreleased changes to v0.4.4	2021-02-21 13:25:22 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	de92f32ab6	csv_metadata_quality/check.py: More date formats We should also allow ISO 8601 extended in combined date and time format. DSpace does not have a problem with dates in this format and I have found some metadata that uses this date format. For example: 2020-08-31T11:04:56Z See: https://en.wikipedia.org/wiki/ISO_8601	2021-02-04 21:39:14 +02:00
Alan Orth	cbf94490f2	Version 0.4.3	2021-01-26 15:22:40 +02:00
Alan Orth	0dc66c5c4e	Expand check/fix for multi-value separators I just came across some metadata that had unnecessary multi-value separators at the end of a field, causing a blank value to be used. For example: "Kenya||Tanzania||"	2021-01-03 15:30:03 +02:00
Alan Orth	7cfd4c0b59	csv_metadata_quality: Move scoped imports to global According to PEP8 we should avoid scoped imports unless you have a good reason. Here there are two cases where we do (issn and isbn), but I will move the others to the global scope.	2020-10-06 17:11:39 +03:00
Alan Orth	431e6331c8	csv_metadata_quality/check.py: Format with black	2020-07-06 14:10:19 +03:00
Alan Orth	cb07d357d4	Version 0.4.2	2020-07-06 14:04:34 +03:00
Alan Orth	2a1566af62	csv_metadata_quality/check.py: Parameterize AGROVOC request	2020-07-06 13:44:46 +03:00
Alan Orth	5fcaa63bd5	csv_metadata_quality/check.py: Prune requests cache once We only need to prune the requests cache once before using it, not for every value we check.	2020-07-06 13:42:19 +03:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	0b2d211455	Version 0.4.1	2020-01-15 12:19:42 +02:00
Alan Orth	365ecda324	Add utility function to check normalization Python's built-in unicodedata library includes the is_normalized() function starting with Python 3.8. This utility function allows us to do the same thing with earlier Python versions. See: https://docs.python.org/3/library/unicodedata.html	2020-01-15 12:17:52 +02:00
Alan Orth	705127fd28	Version 0.4.0	2020-01-15 11:44:56 +02:00
Alan Orth	87181bc7b8	Run black, isort, and flake8.	2020-01-15 11:41:31 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	efdc3a841a	Version 0.3.1	2019-10-01 17:11:13 +03:00
Alan Orth	e55380b4d5	csv_metadata_quality/fix.py: Harmonize language in fix output We should always say if we're removing or replacing something.	2019-10-01 17:09:49 +03:00
Alan Orth	c42f8b4812	csv_metadata_quality/fix.py: Replace non-breaking spaces We should be replacing non-breaking spaces (U+00A0) with normal spaces instead of removing them.	2019-10-01 16:55:04 +03:00
Alan Orth	e15c98cccb	Move unreleased changes to v0.3.0	2019-09-26 14:06:31 +03:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	86d4623fd3	More ISO 639-1 and ISO 639-3 fixes ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes. Technically there are ISO 639-2/T and ISO 639-2/B, which also use three-letter codes, but those are not supported by the pycountry library so I won't even worry about them. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:44:39 +03:00
Alan Orth	f304ca6a33	csv_metadata_quality/app.py: Use simpler column iteration I don't know where I got the other one...	2019-09-21 17:19:39 +03:00
Alan Orth	d9fc09f121	Fix references to ISO 639 It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2 is the three-letter codes, aka alpha2 and alpha3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-11 16:36:53 +03:00
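
Several of the string-level fixes described in the commit messages above (c42f8b4812, 0dc66c5c4e, 49e3543878, 365ecda324) can be sketched with the Python standard library alone. This is only an illustration of the ideas, not the code in csv_metadata_quality/fix.py or check.py: the function names and the SEPARATOR constant are hypothetical, and the "||" separator is taken from the example in 0dc66c5c4e.

```python
import unicodedata

# Assumption: "||" is the multi-value separator, as in "Kenya||Tanzania||".
SEPARATOR = "||"


def replace_nbsp(value: str) -> str:
    """Replace non-breaking spaces (U+00A0) with normal spaces."""
    return value.replace("\u00a0", " ")


def strip_trailing_separators(value: str) -> str:
    """Remove separators at the end of a field, which would otherwise
    produce a blank trailing value."""
    while value.endswith(SEPARATOR):
        value = value[: -len(SEPARATOR)]
    return value


def is_nfc(value: str) -> bool:
    """Check NFC normalization without unicodedata.is_normalized(),
    which only exists in Python 3.8 and later."""
    return unicodedata.normalize("NFC", value) == value


def normalize_nfc(value: str) -> str:
    """Return the NFC-normalized form of the string."""
    return unicodedata.normalize("NFC", value)
```

For example, strip_trailing_separators("Kenya||Tanzania||") yields "Kenya||Tanzania", and is_nfc() is False for a string that spells "ó" as "o" followed by the combining acute accent U+0301.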

102 Commits
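
The mojibake scenario described in 898bb412c3 (UTF-8 text opened as CP-1252 and saved again as UTF-8) can be illustrated with a simple round trip. This sketch deliberately does not use ftfy or the StackOverflow snippet mentioned in that commit: it substitutes a plain encode/decode attempt, and the function name fix_mojibake is hypothetical.

```python
def fix_mojibake(value: str) -> str:
    """Try to undo one round of UTF-8-read-as-CP-1252 corruption.

    If the string cannot be encoded as CP-1252, or the resulting bytes are
    not valid UTF-8, assume it was not corrupted this way and return it
    unchanged.
    """
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# The examples from the commit message:
print(fix_mojibake("CIAT PublicaÃ§ao"))
print(fix_mojibake("CIAT PublicaciÃ³n"))
```

A string that is already correct, such as "CIAT Publicación", fails the UTF-8 decoding step and is returned as-is. A bare round trip like this can still alter text that is legitimately valid in both encodings, which is presumably why the commit classes the fix as "unsafe" and first checks whether the string looks "weird".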