mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-25 15:18:19 +01:00
csv-metadata-quality/data/test.csv
Alan Orth 49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
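
The NFC check described in the commit message can be sketched with Python's standard `unicodedata` module. This is a minimal illustration under stated assumptions, not the project's actual code: the helper names `is_nfc` and `normalize_nfc` are hypothetical, and the two sample strings are modeled on the "Composéd" and "Decomposéd" rows of data/test.csv.

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if the string is already in NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", value) == value


def normalize_nfc(value: str) -> str:
    """Return the NFC-normalized form of the string (hypothetical helper)."""
    return unicodedata.normalize("NFC", value)


# Precomposed "é" is the single code point U+00E9.
composed = "Compos\u00e9d Unicode"
# Decomposed "é" is "e" followed by the combining acute accent U+0301.
decomposed = "Decompose\u0301d Unicode"

for title in (composed, decomposed):
    if not is_nfc(title):
        print(f"Normalizing Unicode: {title!r}")
        title = normalize_nfc(title)
```

The two spellings render identically but compare unequal as strings, which is why a metadata checker would want to normalize them before comparing or deduplicating values.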

1.4 KiB

dc.title,dc.date.issued,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
Leading space,2019-07-29,,,,,,
Trailing space ,2019-07-29,,,,,,
Excessive space,2019-07-29,,,,,,
Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,
Duplicate||Duplicate,2019-07-29,,,,,,
Invalid ISSN,2019-07-29,2321-2302,,,,,
Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,
Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,
Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,
Invalid date,2019-07-260,,,,,,
Multiple dates,2019-07-26||2019-01-10,,,,,,
Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,
Unnecessary Unicode,2019-07-29,,,,,,
Suspicious character||foreˆt,2019-07-29,,,,,,
Invalid ISO 639-1 (alpha 2) language,2019-07-29,,,jp,,,
Invalid ISO 639-3 (alpha 3) language,2019-07-29,,,chi,,,
Invalid language,2019-07-29,,,Span,,,
Invalid AGROVOC subject,2019-07-29,,,,FOREST,,
Newline (LF),2019-07-30,,,,,TANZA NIA,
Missing date,,,,,,,
Invalid country,2019-08-01,,,,,KENYAA,
Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
"Missing space,after comma",2019-08-27,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
Composéd Unicode,2020-01-14,,,,,,
Decomposéd Unicode,2020-01-14,,,,,,
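
The ISSN and ISBN rows above are labeled valid or invalid according to their check digits. The sketch below shows the standard checksum rules (ISSN weights 8 down to 2 modulo 11, ISBN-10 weights 10 down to 1 modulo 11, ISBN-13 alternating weights 1 and 3 modulo 10) applied to those sample values. It is an illustration only: the function names are hypothetical, and the project itself may rely on a library rather than hand-written checks.

```python
def issn_is_valid(issn: str) -> bool:
    """Check an ISSN such as '0378-5955' (hypothetical helper)."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not (chars[:7].isascii() and chars[:7].isdigit()):
        return False
    # Weight the first seven digits 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def isbn10_is_valid(isbn: str) -> bool:
    """Check an ISBN-10 such as '99921-58-10-7' (hypothetical helper)."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10 or not (chars[:9].isascii() and chars[:9].isdigit()):
        return False
    if chars[9] not in "0123456789X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    # Weight all ten positions 10, 9, ..., 1; the sum must be divisible by 11.
    total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
    return total % 11 == 0


def isbn13_is_valid(isbn: str) -> bool:
    """Check an ISBN-13 such as '978-0-306-40615-7' (hypothetical helper)."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Alternate weights 1 and 3; the sum must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


# Values taken from the sample rows; multi-value fields are split on "||".
for issn in "0378-5955||0024-9319".split("||") + ["2321-2302"]:
    print(issn, issn_is_valid(issn))

for isbn in ["978-0-306-40615-6", "978-0-306-40615-7"]:
    print(isbn, isbn13_is_valid(isbn))
```

Only U+002D hyphens are stripped here, so an ISBN containing soft hyphens (U+00AD), like the one in the "Unneccesary unicode" row, fails the length check until those characters are removed.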