mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-30 01:28:18 +01:00
Commit Graph

22 Commits

Author SHA1 Message Date
0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
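
The trailing-separator problem described in this commit can be sketched as follows. This is only an illustration: `strip_edge_separators` is a hypothetical helper name, not necessarily the function used in the repository.

```python
import re


def strip_edge_separators(field: str) -> str:
    """Hypothetical helper: drop pipes at the very start or end of a field."""
    # One or more "|" characters anchored to either end of the string.
    return re.sub(r"^\|+|\|+$", "", field)


raw = "Kenya||Tanzania||"

# Splitting the raw value yields a trailing blank value.
print(raw.split("||"))

# After stripping, only the real values remain.
print(strip_edge_separators(raw).split("||"))
```

Separators in the middle of the field are left alone, so legitimate multi-value fields are not changed.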
7cfd4c0b59
csv_metadata_quality: Move scoped imports to global
According to PEP 8 we should avoid scoped imports unless we have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
365ecda324
Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
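
A minimal sketch of such a compatibility helper, assuming NFC is the target form. The name `is_nfc` is hypothetical; the repository's own utility may be named and structured differently.

```python
import unicodedata


def is_nfc(field: str) -> bool:
    """Hypothetical helper: report whether a string is already in NFC."""
    try:
        # Available in the standard library from Python 3.8 onwards.
        return unicodedata.is_normalized("NFC", field)
    except AttributeError:
        # Older Pythons: normalize and compare with the original.
        return field == unicodedata.normalize("NFC", field)


print(is_nfc("Am\u00e9lie"))   # precomposed "é"
print(is_nfc("Ame\u0301lie"))  # "e" followed by a combining acute accent
```

The fallback is slower because it always builds a normalized copy, but it gives the same answer.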
49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
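
To illustrate what NFC normalization changes, here is a small sketch using only the standard library. The sample string is made up for the example and is not taken from data/test.csv.

```python
import unicodedata

# "é" written as a plain "e" plus a combining acute accent (U+0301).
decomposed = "Ame\u0301lie"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(decomposed == composed)
```

Both strings render identically, but they compare as different until one of them is normalized.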
e55380b4d5
csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
c42f8b4812
csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal
spaces instead of removing them.
2019-10-01 16:55:04 +03:00
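
The difference between removing and replacing a non-breaking space can be sketched like this. `replace_nbsp` is a hypothetical name used only for illustration.

```python
NBSP = "\u00a0"


def replace_nbsp(field: str) -> str:
    """Hypothetical helper: turn non-breaking spaces into normal spaces."""
    return field.replace(NBSP, " ")


value = "International" + NBSP + "Livestock"

# Removing the character glues the two words together.
print(value.replace(NBSP, ""))

# Replacing it keeps the word boundary.
print(replace_nbsp(value))
```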
280a99c8a8
Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
d97dcd19db
Format with black 2019-08-29 01:10:39 +03:00
81190d56bb
Add fix for missing space after commas
This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
2019-08-28 00:05:52 +03:00
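
One possible way to add the missing space, sketched with a regular expression. The helper name and the exact pattern are assumptions, not the repository's implementation.

```python
import re


def add_space_after_commas(field: str) -> str:
    """Hypothetical helper: insert a space after a comma that lacks one."""
    # A comma directly followed by a word character, e.g. "Orth,Alan".
    return re.sub(r",(\w)", r", \1", field)


print(add_space_after_commas("Orth,Alan S."))
print(add_space_after_commas("Orth, Alan S."))  # already correct, unchanged

# A pattern like this also rewrites numbers, one reason to treat the
# fix as unsafe and limit it to name-like fields.
print(add_space_after_commas("1,000"))
```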
232ff99898
csv_metadata_quality/fix.py: Add more unnecessary Unicode fixes
Add a check for soft hyphens (U+00AD). In one sample CSV I have a
normal hyphen followed by a soft hyphen in an ISBN. This causes the
ISBN validation to fail.
2019-08-11 00:07:21 +03:00
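
A sketch of the soft hyphen case, assuming the fix simply deletes U+00AD. The ISBN below is a made-up sample and `remove_soft_hyphens` is a hypothetical name.

```python
SOFT_HYPHEN = "\u00ad"


def remove_soft_hyphens(field: str) -> str:
    """Hypothetical helper: delete invisible soft hyphens (U+00AD)."""
    return field.replace(SOFT_HYPHEN, "")


# A normal hyphen followed by a soft hyphen, as described in the commit.
isbn = "978-" + SOFT_HYPHEN + "0-306-40615-7"

print(len(isbn), len(remove_soft_hyphens(isbn)))
print(remove_soft_hyphens(isbn))
```

The soft hyphen is usually invisible when displayed, which is why the value looks valid to a person but not to a validator.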
40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. We are actually
removing Unix line feeds (U+000A) here, because Windows carriage
returns are already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't
figure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A)
directly nor typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
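
The interaction between whitespace stripping and newline removal can be sketched as below. `remove_newlines` is a hypothetical helper, and the sample value is invented.

```python
def remove_newlines(field: str) -> str:
    """Hypothetical helper: delete every line feed (U+000A) in a value."""
    return field.replace("\n", "")


raw = "Orth,\nAlan\r"

# str.strip() removes whitespace at the ends only, which takes care of
# the trailing carriage return but not the embedded line feed.
stripped = raw.strip()
print(repr(stripped))

# The line feed in the middle needs its own fix.
print(repr(remove_newlines(stripped)))
```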
1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
50ae4e17f2
csv_metadata_quality/fix.py: Fix indent 2019-07-29 17:14:48 +03:00
8047a57cc5
Add support for fixing "unnecessary" Unicode
These are things like non-breaking spaces, "replacement" characters,
etc. that add nothing to the metadata and often cause errors during
parsing or displaying in a UI.
2019-07-29 16:38:10 +03:00
42920e9c7c
Test Python regular expression matches directly
Match objects always have a boolean value of True.

See: https://docs.python.org/3.7/library/re.html
2019-07-29 16:16:30 +03:00
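
A short sketch of testing a match directly. The four-digit-year pattern is a made-up example, not one of the project's checks.

```python
import re

# Hypothetical check: the value must be exactly four digits.
pattern = re.compile(r"^\d{4}$")

for value in ("2019", "July 2019"):
    match = pattern.match(value)
    # A Match object is always truthy and a failed match returns None,
    # so there is no need to compare against None or True.
    if match:
        print(f"{value!r} matches")
    else:
        print(f"{value!r} does not match")
```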
40e77db713
Add "unsafe fixes" runtime option
In this case it fixes occurrences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
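
One way the separator fix could be gated behind an opt-in flag, as a sketch only. The function name, the `unsafe_fixes` parameter, and the pattern are all assumptions made for illustration.

```python
import re


def fix_separators(field: str, unsafe_fixes: bool = False) -> str:
    """Hypothetical helper: turn a lone "|" into "||" when opted in."""
    if not unsafe_fixes:
        # Leave the value alone unless the user asked for unsafe fixes.
        return field
    # A pipe with no pipe directly before or after it.
    return re.sub(r"(?<!\|)\|(?!\|)", "||", field)


value = "Kenya|Tanzania||Uganda"

print(fix_separators(value))
print(fix_separators(value, unsafe_fixes=True))
```

Existing "||" separators are untouched because the lookarounds only match a single, isolated pipe.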
87b1997051
Fix whitespace errors found by flake8 2019-07-28 17:47:28 +03:00
c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline 2019-07-27 01:29:22 +03:00
30a4b0005f
csv_metadata_quality/fix.py: Remove test function 2019-07-26 22:56:40 +03:00
232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, and
docs in the future. Currently it can be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00