csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-08-05 04:25:41 +02:00

Author	SHA1	Message	Date
Alan Orth	92ff0ee51b	Normalize DOIs with %2f These seem to be incorrectly URL encoded.	2024-06-25 11:54:09 +03:00
Alan Orth	5be2195325	Add fix for normalizing DOIs	2024-04-25 12:49:19 +03:00
Alan Orth	171b35b015	Add data/abstract-check.csv A test file with several whitespace and newline scenarios in the abstract. I am currently disabling whitespace/newline fixes in the abstract because they are too aggressive.	2023-02-07 16:50:47 +03:00
Alan Orth	051777bcec	Ignore subregion field for missing region checks Due to a sloppy regex I was sometimes matching the subregion field when checking for missing UN M.49 regions in the region field.	2022-12-07 23:18:47 +01:00
Alan Orth	2e489fc921	Add new data/test-geography.csv test file This file has metadata to test different scenarios related to checking and fixing missing regions.	2022-09-01 16:57:29 +03:00
Alan Orth	b7efe2de40	data/test.csv: update invalid AGROVOC entry Now that we can drop invalid AGROVOC values we should have a valid value and an invalid value here. Depending on how the checker is invoked we will either print a warning or drop the invalid value.	2021-12-23 12:45:38 +02:00
Alan Orth	a4eb79f625	data/test.csv: add data for countries without regions check	2021-12-08 15:17:55 +02:00
Alan Orth	1b978159c1	data/text.csv: Add data for title in citation test	2021-12-05 16:23:06 +02:00
Alan Orth	8a27fb2589	Add check for missing DOIs Sometimes an editor includes a DOI in the citation field, but does not add a standalone DOI field.	2021-10-06 21:25:39 +03:00
Alan Orth	11ddde3327	data/test.csv: Update mojibake example I was trying to find where I got this one and it seems to have been the other way around. Doesn't matter here only that I was curious.	2021-08-19 15:48:41 +03:00
Alan Orth	39a4b1a487	Add mojibake to data/test.csv and tests	2021-03-19 10:28:33 +02:00
Alan Orth	51ee370697	data/test.csv: Add duplicate item	2021-03-17 09:54:14 +02:00
Alan Orth	abae8ca4fb	data/test.csv: Move some DC fields to DCTERMS The original Dublin Core elements set was superseded by DCTERMS in 2008 and we have started using them in our DSpace repository so I think it's good to update them in our test data. Old DC fields are still checked and fixed in this tool, though. It's worth noting that currently supported DSpace versions (4, 5, and 6) all have hard-coded a few fields like dc.title internally so we can't migrate those to their DCTERMS counterparts just yet.	2021-03-11 10:49:05 +02:00
Alan Orth	3b17914002	data/test.csv: Add invalid SPDX license Now we are checking dcterms.license against the list of SPDX license identifiers using https://pypi.org/project/spdx-license-list/.	2021-03-11 10:34:58 +02:00
Alan Orth	32cea2055f	data/test.csv: Add unnecessary multi-value separator	2021-01-03 15:33:04 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	9ca266f5f0	data/test.csv: Change birthdate column to dc.date.issued More accurately reflects actual data we will be validating.	2019-09-26 14:15:48 +03:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	ddbe970342	data/test.csv: Update titles of language tests ISO 639-1 is alpha 2 and ISO 639-3 is alpha 3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:40:27 +03:00
Alan Orth	31c78ca6f3	data/test.csv: Rename contributor column to title This makes more sense as a description of each test and the titles are obviously not authors.	2019-09-26 05:50:40 +03:00
Alan Orth	89d72540f1	data/test.csv: Add sample for missing space after comma	2019-08-28 00:08:26 +03:00
Alan Orth	e324e321a2	data/test.csv: Add test for replacement of unnecessary Unicode	2019-08-11 00:08:44 +03:00
Alan Orth	a99fbd8a51	data/test.csv: Add test case for uncommon filename extension	2019-08-10 23:46:56 +03:00
Alan Orth	456b8a2f26	Update tests	2019-08-01 23:59:11 +03:00
Alan Orth	63ffd77723	data/test.csv: Clarify that newline is a line feed	2019-07-31 13:03:43 +03:00
Alan Orth	020c87768e	data/test.csv: Add item with missing date	2019-07-30 21:02:51 +03:00
Alan Orth	abf74909ee	data/test.csv: Add missing date I was meaning to test for an invalid multi-value separator here!	2019-07-30 21:01:42 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	5ea2e856e4	data/test.csv: Add dc.subject column for AGROVOC tests	2019-07-30 00:33:31 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	8509006165	data/test.csv: Use more descriptive tests To make it obvious what each item is testing.	2019-07-29 17:38:46 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	646146d59f	Add Excel version of test file	2019-07-28 17:07:33 +03:00
Alan Orth	5771764ad2	data/test.csv: Add some new records to test dates Test invalid, missing, and multiple dates.	2019-07-28 16:23:55 +03:00
Alan Orth	f2060adadf	Move tests.csv to data directory	2019-07-27 00:02:47 +03:00

37 Commits
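
Several of the commit messages above describe string-level fixes: NFC Unicode normalization, removal of line feeds, repair of invalid multi-value separators, and DOIs containing an encoded "%2f". The following is only a rough sketch of what such fixes might look like in Python. The function names are hypothetical and are not taken from the csv-metadata-quality source; the tool's real implementation may differ in every detail.

```python
import re
import unicodedata


def normalize_unicode(value: str) -> str:
    """Return the NFC-normalized form of a string (hypothetical helper)."""
    return unicodedata.normalize("NFC", value)


def remove_line_feeds(value: str) -> str:
    """Remove Unix line feeds (U+000A) from a string (hypothetical helper)."""
    return value.replace("\n", "")


def fix_separators(value: str) -> str:
    """Replace a lone "|" with the DSpace multi-value separator "||".

    Assumption: a pipe with no pipe directly before or after it is a
    mistake. This is an "unsafe" fix because "|" could be legitimate data.
    """
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


def decode_doi_slash(value: str) -> str:
    """Turn an incorrectly URL-encoded "%2f" in a DOI back into "/"."""
    return re.sub(r"%2f", "/", value, flags=re.IGNORECASE)
```

Each helper takes and returns a plain string, so in this sketch they could be applied one after another to a single field value.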