mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-22 22:05:03 +01:00
43 Commits

Author SHA1 Message Date
40ba9bae6c
README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
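The NFC check described in the commit above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual code; the function names `is_nfc` and `normalize_nfc` are hypothetical.

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if the string is already in NFC form."""
    return unicodedata.normalize("NFC", value) == value


def normalize_nfc(value: str) -> str:
    """Return the NFC form of the string (composed characters where possible)."""
    return unicodedata.normalize("NFC", value)


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders like "é" but is two
# code points; NFC composes the pair into the single code point U+00E9.
decomposed = "Oue\u0301draogo"
composed = normalize_nfc(decomposed)
```

Comparing the normalized string with the original is enough to decide whether to report a value, and it works on Python versions that predate `unicodedata.is_normalized`.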
1c75608d54
README.md: Update introduction text
We should mention that this is not DSpace specific. Rather, it is
much more realistically Dublin Core specific.
2019-09-26 14:19:13 +03:00
0b15a8ed3b
README.md: Remove TODO about lack of space after comma
This was added as an automatic global fix a few weeks ago.
2019-09-26 14:16:33 +03:00
e7c220039b
README.md: Add note about experimental language validation 2019-09-26 13:59:50 +03:00
7ac1c6f554
README.md: Update comment about ISO 639-3
The pycountry library is actually using ISO 639-3 apparently.

See: https://pypi.org/project/pycountry/
2019-09-26 07:51:41 +03:00
d9fc09f121
Fix references to ISO 639
It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2
is the three-letter codes, aka alpha2 and alpha3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-11 16:36:53 +03:00
2af714fb05
README.md: Add a handful of TODOs 2019-08-27 00:12:41 +03:00
bd984f3db5
README.md: Update TravisCI badge 2019-08-22 15:07:03 +03:00
3f4e84a638
README.md: Use ILRI GitHub remote 2019-08-22 14:54:12 +03:00
a00d3d7ea5
README.md: Simplify installation instructions
Pipenv has captured the local dependency with `-e .` so now it gets
installed by the Pipfile or requirements.txt.
2019-08-02 11:02:50 +03:00
0ed390dbd5
README.md: Update AGROVOC information
Now details the new `--agrovoc-fields` option.
2019-08-01 23:54:40 +03:00
fd3861e7cd
README.md: Update installation and usage instructions
It is much easier now that I have created a proper package.
2019-07-31 17:41:18 +03:00
4c4f4a3ba2
README.md: Update todos 2019-07-31 16:33:49 +03:00
22cc7bc793
README.md: Improve section on unsafe fixes 2019-07-31 16:00:05 +03:00
40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't figure
out how to manually enter a line feed character. In the end I used a
search and replace on a known pattern like "ALAN", replacing it with
\r. Neither entering the Unicode code point (U+000A) directly nor
typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
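The behaviour described in the commit above can be illustrated with plain string methods. This is a sketch under assumptions, not the project's implementation; `strip_whitespace` and `remove_line_feeds` are hypothetical names, and removing the line feed outright (rather than replacing it with a space) is an assumption about the intended result.

```python
def strip_whitespace(value: str) -> str:
    """Strip leading and trailing whitespace, which includes carriage returns."""
    return value.strip()


def remove_line_feeds(value: str) -> str:
    """Remove Unix line feeds (U+000A) from anywhere in the string."""
    return value.replace("\n", "")


# A trailing carriage return is handled by the whitespace stripping,
# while a line feed embedded in the middle of a value needs the explicit fix.
raw = "Ken\nya\r"
cleaned = remove_line_feeds(strip_whitespace(raw))
```

Note that `str.strip()` only touches the ends of a string, so a carriage return in the middle of a value would survive this sketch.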
346e66ca98
README.md: Add more information to introduction 2019-07-30 17:44:30 +03:00
a85b410ab9
README.md: Improve introduction and functionality 2019-07-30 16:09:15 +03:00
1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00
a36454a3ac
Add support for validating languages
Will validate against ISO 639-2 or ISO 639-3 depending on how long
the language field is. Otherwise will return that the language is
invalid.

Does not currently have any support for generic values like "Other".
2019-07-29 18:59:42 +03:00
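The length-based dispatch described in the commit above can be sketched as follows. The two code tables are hypothetical and deliberately tiny; a real check would look codes up in a complete ISO 639 list (later commits in this log mention the pycountry library and correct the naming of the standards). The function name `is_valid_language` is also an assumption.

```python
# Hypothetical, incomplete lookup tables used only for illustration.
TWO_LETTER_CODES = {"en", "fr", "es", "sw"}
THREE_LETTER_CODES = {"eng", "fra", "spa", "swa"}


def is_valid_language(code: str) -> bool:
    """Validate a language code against the table matching its length."""
    code = code.strip()
    if len(code) == 2:
        return code in TWO_LETTER_CODES
    if len(code) == 3:
        return code in THREE_LETTER_CODES
    # Any other length, including generic values like "Other", is invalid.
    return False
```

As the commit message notes, generic values such as "Other" have no special handling and are simply reported as invalid.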
e49b4e8f22
README.md: Try to simplify list of functionality 2019-07-29 18:25:38 +03:00
0eb852a65b
README.md: Improve note about unsafe options 2019-07-29 18:14:50 +03:00
8c34c2d6e6
README.md: Add note about removing duplicate values 2019-07-29 18:09:48 +03:00
1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
e33551776c
README.md: Update note about unsafe options 2019-07-29 17:25:42 +03:00
7f781d7077
README.md: Finish writing usage section 2019-07-29 17:21:34 +03:00
fa4fa3491b
Add check for "suspicious" characters
These standalone characters often indicate issues with encoding or
copy/paste in languages with accents like French and Spanish. For
example: foreˆt should be forêt.

It is not possible to fix these issues automatically, but this will
print a warning so you can notify the owner of the data.
2019-07-29 17:08:49 +03:00
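A check like the one described in the commit above could be sketched with a regular expression. The character set below is an assumption chosen for illustration (acute accent, modifier circumflex, backtick, tilde), and `find_suspicious_characters` is a hypothetical name; the project's actual list may differ.

```python
import re

# Assumed set of standalone accent characters that usually signal an
# encoding or copy/paste problem: U+00B4, U+02C6, U+0060, U+007E.
SUSPICIOUS = re.compile("[\u00b4\u02c6\u0060\u007e]")


def find_suspicious_characters(value: str) -> list:
    """Return every suspicious standalone character found in the value."""
    return SUSPICIOUS.findall(value)


def check(value: str) -> None:
    """Print a warning for each suspicious character; nothing is fixed."""
    for char in find_suspicious_characters(value):
        print(f"Suspicious character U+{ord(char):04X} in: {value}")
```

The sketch only reports matches, in line with the commit message's point that these issues cannot be fixed automatically.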
ae66382046
README.md: Add note about unnecessary Unicode fixes 2019-07-29 16:34:39 +03:00
9ac9474c69
README.md: Add todo about duplicates 2019-07-29 12:40:53 +03:00
e000bd1f88
README.md: Add Travis CI badge 2019-07-29 12:19:10 +03:00
3554c2991f
README.md: Add note about Python version 2019-07-29 12:15:09 +03:00
ac127e7f8a
README.md: Add usage section 2019-07-29 11:30:06 +03:00
aabb57321c
README.md: Improve
Reorganize functionality section and add installation section.
2019-07-29 11:15:51 +03:00
a8a41d60b6
README.md: Add note about Pandas 2019-07-29 10:56:02 +03:00
cf6c01caaf
README.md: Add notes about unsafe fixes 2019-07-28 23:01:00 +03:00
4e6225c0a9
README.md: Add note about Excel files 2019-07-28 18:38:36 +03:00
d293214d3a
README.md: Add note about checking dates 2019-07-28 18:37:46 +03:00
4687e2f5fa
README.md: Remove todo for date validation 2019-07-28 17:26:39 +03:00
7e16968bf2
README.md: Add todos 2019-07-27 19:24:35 +03:00
0a751c1f25
README.md: Add SourceHut build badge 2019-07-26 23:59:31 +03:00
df1087b26f
README.md: Improve introduction, checks, and todo 2019-07-26 23:50:41 +03:00
64e7a73417
README.md: Add information about checks and fixes 2019-07-26 23:20:16 +03:00
b657c51fd2
Add initial README.md with intro, license, and todo 2019-07-26 22:18:38 +03:00