mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-12-22 20:22:18 +01:00
csv-metadata-quality/data/test.csv
Alan Orth 8435ee242d
Experimental language detection using langid
Works decently well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
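
The check described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `ALPHA2_TO_ALPHA3`, `to_alpha2`, and `language_mismatch` are hypothetical, the language table is a tiny hand-picked subset of ISO 639, and the detector is passed in as a plain function so the sketch runs without third-party packages.

```python
from typing import Callable, Optional

# Illustrative subset only (assumption); a real check would consult a
# complete ISO 639 table rather than three hard-coded languages.
ALPHA2_TO_ALPHA3 = {"en": "eng", "es": "spa", "fr": "fra"}
ALPHA3_TO_ALPHA2 = {a3: a2 for a2, a3 in ALPHA2_TO_ALPHA3.items()}


def to_alpha2(code: str) -> Optional[str]:
    """Normalize an ISO 639-1 or ISO 639-3 code to alpha-2, or None if unknown."""
    code = code.strip().lower()
    if code in ALPHA2_TO_ALPHA3:
        return code
    return ALPHA3_TO_ALPHA2.get(code)


def language_mismatch(
    declared: str, text: str, detect: Callable[[str], str]
) -> Optional[str]:
    """Return the detected alpha-2 code when it disagrees with the declared one.

    Returns None when the languages agree, or when the declared code is not
    recognized (invalid codes are assumed to be reported by a separate check).
    """
    declared_alpha2 = to_alpha2(declared)
    if declared_alpha2 is None:
        return None
    detected = detect(text)
    return detected if detected != declared_alpha2 else None
```

With langid installed, `detect` could plausibly be `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a `(language, score)` tuple. Passing the detector in keeps the comparison logic testable on its own.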

1.3 KiB

dc.title,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
Leading space,2019-07-29,,,,,,
Trailing space ,2019-07-29,,,,,,
Excessive space,2019-07-29,,,,,,
Miscellaenous ||whitespace | issues ,2019-07-29,,,,,,
Duplicate||Duplicate,2019-07-29,,,,,,
Invalid ISSN,2019-07-29,2321-2302,,,,,
Invalid ISBN,2019-07-29,,978-0-306-40615-6,,,,
Multiple valid ISSNs,2019-07-29,0378-5955||0024-9319,,,,,
Multiple valid ISBNs,2019-07-29,,99921-58-10-7||978-0-306-40615-7,,,,
Invalid date,2019-07-260,,,,,,
Multiple dates,2019-07-26||2019-01-10,,,,,,
Invalid multi-value separator,2019-07-29,0378-5955|0024-9319,,,,,
Unnecessary Unicode,2019-07-29,,,,,,
Suspicious character||foreˆt,2019-07-29,,,,,,
Invalid ISO 639-1 (alpha 2) language,2019-07-29,,,jp,,,
Invalid ISO 639-3 (alpha 3) language,2019-07-29,,,chi,,,
Invalid language,2019-07-29,,,Span,,,
Invalid AGROVOC subject,2019-07-29,,,,FOREST,,
Newline (LF),2019-07-30,,,,,TANZA NIA,
Missing date,,,,,,,
Invalid country,2019-08-01,,,,,KENYAA,
Uncommon filename extension,2019-08-10,,,,,,file.pdf.lck
Unneccesary unicode (U+002D + U+00AD),2019-08-10,,978-­92-­9043-­823-­6,,,,
"Missing space,after comma",2019-08-27,,,,,,
Incorrect ISO 639-1 language,2019-09-26,,,es,,,
Incorrect ISO 639-3 language,2019-09-26,,,spa,,,
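
Several of the rows above exercise multi-value fields separated by `||` and ISSN validity. As a rough sketch of what such checks might look like (the function names are hypothetical and this is not taken from the project's code), the standard ISSN check-digit rule and a simple separator test can be written with the standard library alone:

```python
import re
from typing import List


def split_values(field: str) -> List[str]:
    """Split a multi-value field on the "||" separator, dropping empty parts."""
    return [value.strip() for value in field.split("||") if value.strip()]


def has_invalid_separator(field: str) -> bool:
    """True if a lone "|" remains once every "||" separator is removed."""
    return "|" in field.replace("||", "")


def is_valid_issn(issn: str) -> bool:
    """Validate an ISSN using the standard modulus-11 check digit."""
    if not re.fullmatch(r"[0-9]{4}-?[0-9]{3}[0-9Xx]", issn):
        return False
    chars = issn.replace("-", "").upper()
    # The first seven digits are weighted 8 down to 2.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected
```

Applied to the test data, `is_valid_issn` accepts `0378-5955` and `0024-9319` and rejects `2321-2302`, while `has_invalid_separator` flags `0378-5955|0024-9319`.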