csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-08-13 08:18:08 +02:00

Author	SHA1	Message	Date
Alan Orth	f64435fc9d	tests/test_check.py: add missing excludes	2022-09-02 16:24:33 +03:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	689ee184f7	Add unsafe check to add missing regions	2022-07-28 16:52:43 +03:00
Alan Orth	c43095139a	tests/test_check.py: add tests for dropping invalid AGROVOC	2021-12-23 12:44:32 +02:00
Alan Orth	120e8cf09f	tests/test_check.py: add checks for countries without regions	2021-12-08 15:18:50 +02:00
Alan Orth	e02678cd7c	tests/test_check.py: add tests for title in citation	2021-12-05 16:01:11 +02:00
Alan Orth	01b4354a14	tests/test_check.py: fix comment	2021-12-05 15:58:25 +02:00
Alan Orth	787fa9e8d9	Add field name to fix.newlines output	2021-10-08 14:36:43 +03:00
Alan Orth	82261f7fe0	tests/test_check.py: Run black	2021-10-06 22:10:26 +03:00
Alan Orth	8a27fb2589	Add check for missing DOIs Sometimes an editor includes a DOI in the citation field, but does not add a standalone DOI field.	2021-10-06 21:25:39 +03:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	39a4b1a487	Add mojibake to data/test.csv and tests	2021-03-19 10:28:33 +02:00
Alan Orth	e8422bfa74	tests/test_check.py: Add test for duplicate items	2021-03-17 09:54:02 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them—I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	0089efa914	tests/test_check.py: Use dcterms.subject instead of dc.subject Trying to move some old DC fields to DCTERMS.	2021-03-11 11:45:25 +02:00
Alan Orth	5318953150	tests/test_check.py: Add tests for licenses	2021-03-11 10:36:26 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	7edb8b19d7	tests/test_check.py: Reformat with black	2021-01-03 15:50:21 +02:00
Alan Orth	29e67a0887	Add tests for unnecessary multi-value separators	2021-01-03 15:37:18 +02:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	87181bc7b8	Run black, isort, and flake8.	2020-01-15 11:41:31 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	604bd5bda6	Reformat tests with black	2019-09-26 14:02:51 +03:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	86d4623fd3	More ISO 639-1 and ISO 639-3 fixes ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes. Technically there are ISO 639-2/T and ISO 639-2/B, which also use three-letter codes, but those are not supported by the pycountry library so I won't even worry about them. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:44:39 +03:00
Alan Orth	d9fc09f121	Fix references to ISO 639 It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2 is the three-letter codes, aka alpha2 and alpha3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-11 16:36:53 +03:00
Alan Orth	07f80cb37f	tests/test_fix.py: Add test for missing space after comma	2019-08-28 00:08:56 +03:00
Alan Orth	e7cb8920db	tests/test_check.py: Update date tests	2019-08-21 15:34:52 +03:00
Alan Orth	e801042340	tests/test_check.py: Fix unused result We don't need to capture the function's return value here because pytest will capture stdout from the function.	2019-08-10 23:45:41 +03:00
Alan Orth	62ef2a4489	tests/test_check.py: Add tests for file extensions	2019-08-10 23:44:13 +03:00
Alan Orth	d93c2aae13	tests/test_check.py: Update suspicious character check The suspicious character check was updated to include the name of the field where the metadata value with the suspicious character exists.	2019-08-09 01:26:38 +03:00
Alan Orth	456b8a2f26	Update tests	2019-08-01 23:59:11 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	1f65a28307	Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py	2019-07-30 00:30:31 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	ce8f140c66	tests: Remove unused pytest import	2019-07-28 17:42:54 +03:00
Alan Orth	196bb434fa	Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.	2019-07-28 16:11:36 +03:00
Alan Orth	a849615b41	Add tests for check functions Relies on capturing stdout. See: https://docs.pytest.org/en/5.0.1/capture.html	2019-07-27 02:10:13 +03:00
Alan Orth	41a30f1b07	Add initial tests For now only test fixes because they return changed data. I'm not sure how to test the checks, because they don't return data and I can't modify them to return boolean values without breaking the app.	2019-07-27 00:36:40 +03:00
Alan Orth	f2060adadf	Move tests.csv to data directory	2019-07-27 00:02:47 +03:00
Alan Orth	844b968098	tests/test.csv: Add invalid multi-value field separator	2019-07-26 23:48:45 +03:00
Alan Orth	dfd961d720	Bring test.csv into project	2019-07-26 23:14:37 +03:00

48 Commits
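
Several commit messages above describe small string-level fixes and checks (invalid multi-value separators, newline removal, NFC Unicode normalization, issue date validation). The following is a minimal sketch of how such helpers might look, using only the Python standard library. All function names here are hypothetical and chosen for illustration; they are not taken from the csv-metadata-quality source, and the real implementation may differ.

```python
import re
import unicodedata
from datetime import datetime


def fix_separators(value: str) -> str:
    """Turn a lone "|" into the DSpace multi-value separator "||".

    Hypothetical helper: only safe when "|" has no legitimate use
    in the metadata, as the "unsafe fixes" commit message notes.
    """
    # Match a pipe that is neither preceded nor followed by another pipe.
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


def fix_newlines(value: str) -> str:
    """Remove Unix line feeds (U+000A) from a metadata value."""
    return value.replace("\n", "")


def normalize_unicode(value: str) -> str:
    """Normalize a string to Unicode NFC (composed) form."""
    return unicodedata.normalize("NFC", value)


def is_single_issue_date(value: str) -> bool:
    """Return True for exactly one date in YYYY, YYYY-MM, or YYYY-MM-DD form.

    Assumption for this sketch: missing dates and multiple dates such as
    "1990||1991" are both flagged by returning False.
    """
    if not value or "||" in value:
        return False
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False
```

A fix returns the changed string, so it can be asserted on directly in a test. A check in the project prints a warning instead of returning data, which is why the commit history mentions capturing stdout in pytest; the boolean return above is a simplification for this sketch.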