1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-12-22 20:22:18 +01:00
Commit Graph

12 Commits

Author SHA1 Message Date
63ffd77723
data/test.csv: Clarify that newline is a line feed 2019-07-31 13:03:43 +03:00
020c87768e
data/test.csv: Add item with missing date 2019-07-30 21:02:51 +03:00
abf74909ee
data/test.csv: Add missing date
I was meaning to test for an invalid multi-value separator here!
2019-07-30 21:01:42 +03:00
40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't fig-
ure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A) direc-
tly or typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
5ea2e856e4
data/test.csv: Add dc.subject column for AGROVOC tests 2019-07-30 00:33:31 +03:00
a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
8509006165
data/test.csv: Use more descriptive tests
To make it obvious what each item is testing.
2019-07-29 17:38:46 +03:00
fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
8047a57cc5
Add support for fixing "unnecessary" Unicode
These are things like non-breaking spaces, "replacement" characters,
etc that add nothing to the metadata and often cause errors during
parsing or displaying in a UI.
2019-07-29 16:38:10 +03:00
40e77db713
Add "unsafe fixes" runtime option
In this case it fixes occurences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
5771764ad2
data/test.csv: Add some new records to test dates
Test invalid, missing, and multiple dates.
2019-07-28 16:23:55 +03:00
f2060adadf
Move tests.csv to data directory 2019-07-27 00:02:47 +03:00