csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-06-09 13:25:08 +02:00

Author	SHA1	Message	Date
Alan Orth	7edb8b19d7	tests/test_check.py: Reformat with black	2021-01-03 15:50:21 +02:00
Alan Orth	29e67a0887	Add tests for unnecessary multi-value separators	2021-01-03 15:37:18 +02:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	87181bc7b8	Run black, isort, and flake8.	2020-01-15 11:41:31 +02:00
Alan Orth	604bd5bda6	Reformat tests with black	2019-09-26 14:02:51 +03:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	86d4623fd3	More ISO 639-1 and ISO 639-3 fixes ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes. Technically there are ISO 639-2/T and ISO 639-2/B, which also use three-letter codes, but those are not supported by the pycountry library so I won't even worry about them. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:44:39 +03:00
Alan Orth	d9fc09f121	Fix references to ISO 639 It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2 is the three-letter codes, aka alpha2 and alpha3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-11 16:36:53 +03:00
Alan Orth	e7cb8920db	tests/test_check.py: Update date tests	2019-08-21 15:34:52 +03:00
Alan Orth	e801042340	tests/test_check.py: Fix unused result We don't need to capture the function's return value here because pytest will capture stdout from the function.	2019-08-10 23:45:41 +03:00
Alan Orth	62ef2a4489	tests/test_check.py: Add tests for file extensions	2019-08-10 23:44:13 +03:00
Alan Orth	d93c2aae13	tests/test_check.py: Update suspicious character check The suspicious character check was updated to include the name of the field where the metadata value with the suspicious character exists.	2019-08-09 01:26:38 +03:00
Alan Orth	456b8a2f26	Update tests	2019-08-01 23:59:11 +03:00
Alan Orth	1f65a28307	Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py	2019-07-30 00:30:31 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	ce8f140c66	tests: Remove unused pytest import	2019-07-28 17:42:54 +03:00
Alan Orth	196bb434fa	Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.	2019-07-28 16:11:36 +03:00
Alan Orth	a849615b41	Add tests for check functions Relies on capturing stdout. See: https://docs.pytest.org/en/5.0.1/capture.html	2019-07-27 02:10:13 +03:00

20 Commits
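
The "Add date validation" commit (196bb434fa) describes three cases: a missing date, multiple dates joined by "||", and dates that are not in one of the usual DSpace forms (year, year-month, or full date). The following is a minimal sketch of that logic using only the Python standard library. It is not the repository's code: the function name check_issue_date, the DATE_FORMATS tuple, the separator parameter, and the warning strings are all hypothetical and chosen for illustration.

```python
from datetime import datetime

# Date forms named in the commit message: year, year-month, full date.
# (Hypothetical constant; the real project may organise this differently.)
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def check_issue_date(value, separator="||"):
    """Return a list of warning strings for one issue-date field value.

    An empty list means no problem was found. This is an illustrative
    sketch, not the function shipped in csv-metadata-quality.
    """
    # Case 1: the date is missing entirely.
    if value is None or value.strip() == "":
        return ["Missing date."]

    warnings = []

    # Case 2: more than one date in a single field.
    dates = value.split(separator)
    if len(dates) > 1:
        warnings.append(f"Multiple dates not allowed: {value}")

    # Case 3: each date must parse with at least one accepted format.
    for date in dates:
        date = date.strip()
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(date, fmt)
                break  # this format matched, so the date is acceptable
            except ValueError:
                continue
        else:
            # No format matched.
            warnings.append(f"Invalid date: {date}")

    return warnings
```

Returning the warnings rather than printing them is a choice made for this sketch so the result is easy to inspect; the test commits listed above note that the project's own checks are tested by capturing stdout.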