csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-12-23 04:32:21 +01:00

Author	SHA1	Message	Date
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	ddbe970342	data/test.csv: Update titles of language tests ISO 639-1 is alpha 2 and ISO 639-3 is alpha 3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:40:27 +03:00
Alan Orth	31c78ca6f3	data/test.csv: Rename contributor column to title This makes more sense as a description of each test and the titles are obviously not authors.	2019-09-26 05:50:40 +03:00
Alan Orth	89d72540f1	data/test.csv: Add sample for missing space after comma	2019-08-28 00:08:26 +03:00
Alan Orth	e324e321a2	data/test.csv: Add test for replacement of unnecessary Unicode	2019-08-11 00:08:44 +03:00
Alan Orth	a99fbd8a51	data/test.csv: Add test case for uncommon filename extension	2019-08-10 23:46:56 +03:00
Alan Orth	456b8a2f26	Update tests	2019-08-01 23:59:11 +03:00
Alan Orth	63ffd77723	data/test.csv: Clarify that newline is a line feed	2019-07-31 13:03:43 +03:00
Alan Orth	020c87768e	data/test.csv: Add item with missing date	2019-07-30 21:02:51 +03:00
Alan Orth	abf74909ee	data/test.csv: Add missing date I was meaning to test for an invalid multi-value separator here!	2019-07-30 21:01:42 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	5ea2e856e4	data/test.csv: Add dc.subject column for AGROVOC tests	2019-07-30 00:33:31 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	8509006165	data/test.csv: Use more descriptive tests To make it obvious what each item is testing.	2019-07-29 17:38:46 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	5771764ad2	data/test.csv: Add some new records to test dates Test invalid, missing, and multiple dates.	2019-07-28 16:23:55 +03:00
Alan Orth	f2060adadf	Move tests.csv to data directory	2019-07-27 00:02:47 +03:00

19 Commits
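
Several of the commits above describe small string fixes: replacing a lone "|" with the DSpace multi-value separator "||", removing Unix line feeds (U+000A), and cleaning up "unnecessary" Unicode such as non-breaking spaces and replacement characters. The following is a minimal sketch of how such fixes might look, not the project's actual code. The function names are hypothetical, and the exact handling of each character (a non-breaking space becomes a regular space, a replacement character is dropped) is an assumption.

```python
import re


def fix_invalid_separators(value: str) -> str:
    """Replace a lone "|" with "||" (hypothetical helper).

    This is an "unsafe" fix: it is only correct if the metadata never
    uses a single "|" for some legitimate purpose.
    """
    # Match a pipe that is neither preceded nor followed by another pipe.
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


def remove_newlines(value: str) -> str:
    """Remove Unix line feeds (U+000A) from a field (hypothetical helper)."""
    return value.replace("\n", "")


def remove_unnecessary_unicode(value: str) -> str:
    """Clean up characters that add nothing to the metadata (hypothetical helper).

    Assumption: a non-breaking space (U+00A0) becomes a regular space and
    a replacement character (U+FFFD) is dropped.
    """
    value = value.replace("\u00a0", " ")
    return value.replace("\ufffd", "")


if __name__ == "__main__":
    print(fix_invalid_separators("Kenya|Uganda||Tanzania"))
    print(remove_newlines("Line one\nline two"))
    print(remove_unnecessary_unicode("Food\u00a0security\ufffd"))
```

The lookbehind and lookahead in the separator pattern leave an already correct "||" untouched, so running the fix twice gives the same result as running it once.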