mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-16 02:57:04 +01:00
Commit Graph

28 Commits

Author SHA1 Message Date
40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. We are actually
removing Unix line feeds (U+000A) here, because Windows carriage
returns are already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't
figure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A)
directly nor typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
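The newline fix described in this commit can be illustrated with a minimal sketch. The function name `remove_newlines` and the choice to drop the line feed outright (rather than replace it with a space) are assumptions for illustration, not necessarily what the commit does.

```python
def remove_newlines(value: str) -> str:
    """Remove Unix line feeds (U+000A) from a metadata value.

    Hypothetical helper: carriage returns are deliberately left alone
    here, since the commit message says the whitespace fix handles them.
    """
    return value.replace("\n", "")


if __name__ == "__main__":
    # repr() makes the embedded line feed visible before and after
    broken = "Alan\nOrth"
    print(repr(broken), "->", repr(remove_newlines(broken)))
```

Printing with `repr()` is a convenient way to confirm that a test value really contains a line feed, which is otherwise hard to see in an editor.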
346e66ca98
README.md: Add more information to introduction 2019-07-30 17:44:30 +03:00
a85b410ab9
README.md: Improve introduction and functionality 2019-07-30 16:09:15 +03:00
1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
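As a rough sketch of the subject check described in this commit, the example below looks at the two field names from the commit message and reports values that a lookup does not recognize. The lookup is passed in as a callable so that no API details are assumed; `invalid_subjects` and `is_known_term` are hypothetical names, and a real check would query the AGROVOC REST API instead of a local set.

```python
from typing import Callable, Dict, List

# Field names taken from the commit message
SUBJECT_FIELDS = ("dc.subject", "dcterms.subject")


def invalid_subjects(
    row: Dict[str, str], is_known_term: Callable[[str], bool]
) -> List[str]:
    """Return subject values in a row that the lookup does not recognize."""
    invalid = []
    for field in SUBJECT_FIELDS:
        value = row.get(field, "").strip()
        # Skip empty fields; only non-empty values are looked up
        if value and not is_known_term(value):
            invalid.append(value)
    return invalid


if __name__ == "__main__":
    # Stand-in for a remote vocabulary lookup (assumption for illustration)
    known_terms = {"LIVESTOCK"}
    row = {"dc.title": "Example", "dc.subject": "LIVESTOCK", "dcterms.subject": "FOOBAR"}
    print(invalid_subjects(row, known_terms.__contains__))
```

Injecting the lookup keeps the field handling testable without network access.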
a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
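The length-based dispatch described in this commit might look roughly like the sketch below. It assumes the two lengths in question are two and three characters, and it uses tiny placeholder code sets rather than a real language registry; `check_language` and both set names are hypothetical.

```python
# Placeholder code sets for illustration only, not complete registries
TWO_LETTER_CODES = {"en", "fr", "es"}
THREE_LETTER_CODES = {"eng", "fra", "spa"}


def check_language(value: str) -> bool:
    """Return True if the value is a known code of the matching length."""
    code = value.strip()
    if len(code) == 2:
        return code in TWO_LETTER_CODES
    if len(code) == 3:
        return code in THREE_LETTER_CODES
    # Any other length is treated as invalid, including generic
    # values like "Other", which the commit says are not supported
    return False


if __name__ == "__main__":
    for candidate in ("en", "eng", "Other"):
        status = "valid" if check_language(candidate) else "invalid"
        print(f"{candidate}: {status}")
```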
e49b4e8f22
README.md: Try to simplify list of functionality 2019-07-29 18:25:38 +03:00
0eb852a65b
README.md: Improve note about unsafe options 2019-07-29 18:14:50 +03:00
8c34c2d6e6
README.md: Add note about removing duplicate values 2019-07-29 18:09:48 +03:00
1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
e33551776c
README.md: Update note about unsafe options 2019-07-29 17:25:42 +03:00
7f781d7077
README.md: Finish writing usage section 2019-07-29 17:21:34 +03:00
fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
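A check like the one described in this commit could be sketched with a regular expression over standalone accent characters. The character set below (acute accent, modifier circumflex, small tilde, diaeresis) is an assumption chosen to cover the foreˆt example; the commit does not list the characters it actually looks for, and `find_suspicious_characters` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Standalone accent characters (assumed set): U+00B4, U+02C6, U+02DC, U+00A8
SUSPICIOUS = re.compile("[\u00b4\u02c6\u02dc\u00a8]")


def find_suspicious_characters(value: str) -> List[Tuple[int, str]]:
    """Return (position, character) pairs for each suspicious character."""
    return [(match.start(), match.group()) for match in SUSPICIOUS.finditer(value)]


if __name__ == "__main__":
    # Only warn; the value is not modified, matching the commit's intent
    value = "fore\u02c6t"
    for position, char in find_suspicious_characters(value):
        print(f"Suspicious character U+{ord(char):04X} at position {position}: {value}")
```

The sketch only reports matches and leaves the value untouched, since the commit notes that these issues cannot be fixed automatically.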
ae66382046
README.md: Add note about unnecessary Unicode fixes 2019-07-29 16:34:39 +03:00
9ac9474c69
README.md: Add todo about duplicates 2019-07-29 12:40:53 +03:00
e000bd1f88
README.md: Add Travis CI badge 2019-07-29 12:19:10 +03:00
3554c2991f
README.md: Add note about Python version 2019-07-29 12:15:09 +03:00
ac127e7f8a
README.md: Add usage section 2019-07-29 11:30:06 +03:00
aabb57321c
README.md: Improve
Reorganize functionality section and add installation section.
2019-07-29 11:15:51 +03:00
a8a41d60b6
README.md: Add note about Pandas 2019-07-29 10:56:02 +03:00
cf6c01caaf
README.md: Add notes about unsafe fixes 2019-07-28 23:01:00 +03:00
4e6225c0a9
README.md: Add note about Excel files 2019-07-28 18:38:36 +03:00
d293214d3a
README.md: Add note about checking dates 2019-07-28 18:37:46 +03:00
4687e2f5fa
README.md: Remove todo for date validation 2019-07-28 17:26:39 +03:00
7e16968bf2
README.md: Add todos 2019-07-27 19:24:35 +03:00
0a751c1f25
README.md: Add SourceHut build badge 2019-07-26 23:59:31 +03:00
df1087b26f
README.md: Improve introduction, checks, and todo 2019-07-26 23:50:41 +03:00
64e7a73417
README.md: Add information about checks and fixes 2019-07-26 23:20:16 +03:00
b657c51fd2
Add initial README.md with intro, license, and todo 2019-07-26 22:18:38 +03:00