mirror of
https://github.com/ilri/csv-metadata-quality.git
synced 2024-12-29 07:24:29 +01:00
Alan Orth
49e3543878
This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
1.4 KiB
1 | dc.title | dc.date.issued | dc.identifier.issn | dc.identifier.isbn | dc.language.iso | dc.subject | cg.coverage.country | filename |
---|---|---|---|---|---|---|---|---|
2 | Leading space | 2019-07-29 | ||||||
3 | Trailing space | 2019-07-29 | ||||||
4 | Excessive space | 2019-07-29 | ||||||
5 | Miscellaenous ||whitespace | issues | 2019-07-29 | ||||||
6 | Duplicate||Duplicate | 2019-07-29 | ||||||
7 | Invalid ISSN | 2019-07-29 | 2321-2302 | |||||
8 | Invalid ISBN | 2019-07-29 | | 978-0-306-40615-6 | ||||
9 | Multiple valid ISSNs | 2019-07-29 | 0378-5955||0024-9319 | |||||
10 | Multiple valid ISBNs | 2019-07-29 | | 99921-58-10-7||978-0-306-40615-7 | ||||
11 | Invalid date | 2019-07-260 | ||||||
12 | Multiple dates | 2019-07-26||2019-01-10 | ||||||
13 | Invalid multi-value separator | 2019-07-29 | 0378-5955|0024-9319 | |||||
14 | Unnecessary Unicode | 2019-07-29 | ||||||
15 | Suspicious character||foreˆt | 2019-07-29 | ||||||
16 | Invalid ISO 639-1 (alpha 2) language | 2019-07-29 | | | jp | |||
17 | Invalid ISO 639-3 (alpha 3) language | 2019-07-29 | | | chi | |||
18 | Invalid language | 2019-07-29 | | | Span | |||
19 | Invalid AGROVOC subject | 2019-07-29 | | | | FOREST | ||
20 | Newline (LF) | 2019-07-30 | | | | | TANZA NIA | |
21 | Missing date | |||||||
22 | Invalid country | 2019-08-01 | | | | | KENYAA | |
23 | Uncommon filename extension | 2019-08-10 | | | | | | file.pdf.lck |
24 | Unneccesary unicode (U+002D + U+00AD) | 2019-08-10 | 978-92-9043-823-6 | |||||
25 | Missing space,after comma | 2019-08-27 | ||||||
26 | Incorrect ISO 639-1 language | 2019-09-26 | | | es | |||
27 | Incorrect ISO 639-3 language | 2019-09-26 | | | spa | |||
28 | Composéd Unicode | 2020-01-14 | ||||||
29 | Decomposéd Unicode | 2020-01-14 |
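
The commit message above says strings are checked for un-normalized Unicode and normalized using NFC; the last two sample rows (composed and decomposed "é") exercise that check. A minimal sketch of such a check in Python follows. It uses only the standard library's `unicodedata.normalize`; the helper names `is_nfc` and `normalize_nfc` are hypothetical and are not taken from the project's code. The sample strings use explicit escapes so the otherwise invisible difference between the two forms is visible.

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if the string is already in NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", value) == value


def normalize_nfc(value: str) -> str:
    """Return the NFC-normalized form of the string (hypothetical helper)."""
    return unicodedata.normalize("NFC", value)


# U+00E9 is the precomposed "é"; "e" + U+0301 is "e" followed by a
# combining acute accent. Both render the same but differ byte for byte.
composed = "Compos\u00e9d Unicode"
decomposed = "Decompose\u0301d Unicode"

print(is_nfc(composed))    # True
print(is_nfc(decomposed))  # False
print(normalize_nfc(decomposed) == "Decompos\u00e9d Unicode")  # True
```

Comparing a string with its normalized form is enough to flag a value; applying the normalization is then a separate, optional step.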