Commit Graph

44 Commits

Author SHA1 Message Date
Alan Orth 5be2195325
Add fix for normalizing DOIs 2024-04-25 12:49:19 +03:00
Alan Orth f6018c51b6
Apply fixes from fixit
Apply recommended fix from fixit:

    RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must
    be looked up in the global scope in case it has been rebound.
2023-11-22 21:54:50 +03:00
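For f6018c51b6, the fixit rule quoted above can be observed in the bytecode. This is only an illustrative sketch; `make_with_call` and `make_with_literal` are hypothetical names, not functions from the project:

```python
import dis


def make_with_call():
    # `list` is an ordinary name that could have been rebound,
    # so it has to be looked up every time the function runs
    return list()


def make_with_literal():
    # The literal compiles directly to a list-building instruction
    return []


call_ops = [ins.opname for ins in dis.get_instructions(make_with_call)]
literal_ops = [ins.opname for ins in dis.get_instructions(make_with_literal)]
```

Printing `call_ops` and `literal_ops` shows the extra name lookup and call in the first function; exact opcode names vary between CPython versions.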
Alan Orth f3fb1ff7fb Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
Alan Orth 051777bcec
Ignore subregion field for missing region checks
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
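For 051777bcec, a sketch of how a loose pattern can match the subregion field by accident. The column names and both patterns are assumptions for illustration; the commit message does not show the actual regex:

```python
import re

# Hypothetical column names, for illustration only
fields = ["cg.coverage.region", "cg.coverage.subregion"]

# Loose: "region" also occurs inside "subregion"
loose = re.compile(r"region")
# Stricter: require a literal dot right before "region" at the end of the name
strict = re.compile(r"\.region$")

loose_matches = [f for f in fields if loose.search(f)]
strict_matches = [f for f in fields if strict.search(f)]
```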
Alan Orth 040e56fc76
Improve exclude function
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.

Now the excludes are respected for item-based checks.
2022-09-02 15:59:22 +03:00
Alan Orth f49214fa2e
csv_metadata_quality/fix.py: fix bug in regions
We need to make sure we're only manipulating the regions if we have
any missing. The previous code was always manipulating the existing
row, even when there were no missing regions, which resulted in new
values like "Eastern Africa||".
2022-09-01 16:15:32 +03:00
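For f49214fa2e, a minimal sketch of the guard described above, assuming regions are stored as a "||"-separated string. `add_missing_regions` is a hypothetical helper, not the function in fix.py:

```python
def add_missing_regions(existing, missing):
    """Append missing regions to an existing "||"-separated value."""
    # Nothing to add: return the value untouched, so we never
    # produce something like "Eastern Africa||"
    if not missing:
        return existing

    joined = "||".join(missing)
    # Only put a separator in front when there is an existing value
    if existing:
        return f"{existing}||{joined}"
    return joined
```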
Alan Orth 7ce20726d0
csv_metadata_quality/fix.py: minor change
Print missing regions when we know they are missing, instead of doing
another check later and looping over them again.
2022-09-01 16:03:49 +03:00
Alan Orth 473be5ac2f
csv_metadata_quality/fix.py: don't add "not found" region
country_converter returns the literal "not found" string if a country
cannot be found. In that case we do not want to consider that as
a region!
2022-09-01 15:46:21 +03:00
Alan Orth 7c61cae417 csv_metadata_quality/fix.py: silence warning
By default country_converter prints "not found in regex" if a country
is not found. We can silence this by switching the logging level
to something above WARNING.
2022-09-01 15:44:50 +03:00
Alan Orth ae16289637
csv_metadata_quality/fix.py: Minor change
The country_converter documentation says we should instantiate the
CountryConverter() class once instead of calling coco.convert() in
each iteration of the loop so we don't end up loading the data file
more than once.
2022-09-01 15:40:45 +03:00
Alan Orth 0cf0bc97f0
csv_metadata_quality/fix.py: fix logic error again
It seems there was another logic error raised by the test in pytest.
With my real data, it was enough to check if the region column was
None, but with my test I was explicitly setting the region to "" (an
empty string). So to be really sure we should check if the string
is not None *and* if its length is greater than 0.
2022-08-03 20:51:14 +03:00
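For 0cf0bc97f0, the combined check can be sketched as follows. `has_value` is a hypothetical name used only for this example:

```python
def has_value(value):
    # `and` short-circuits, so len() is never called on None
    return value is not None and len(value) > 0
```

Both `None` (as seen in the real data) and `""` (as set in the test) are treated as "no region".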
Alan Orth 40c3585bab
csv_metadata_quality/fix.py: fix logic error
Fix string concatenation with existing regions.
2022-08-03 18:26:08 +03:00
Alan Orth b9c44aed7d
csv_metadata_quality/fix.py: fix logic issue
Forgot to return the row as-is if we don't find any countries.
2022-08-02 10:17:30 +03:00
Alan Orth 689ee184f7
Add unsafe check to add missing regions 2022-07-28 16:52:43 +03:00
Alan Orth 7763a021c5
csv_metadata_quality/fix.py: sort imports with isort
2021-12-15 23:15:02 +02:00
Alan Orth ff49a80432
csv_metadata_quality/fix.py: configure ftfy
Don't replace smart quotes in ftfy. If our text has them we should
keep them.
2021-12-15 21:51:51 +02:00
Alan Orth 95015febbd
csv_metadata_quality/fix.py: fix thin spaces
Replace thin spaces with normal spaces. Sometimes I see these get
mishandled on Windows machines and they end up as "?" or so.
2021-12-09 23:22:53 +02:00
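For 95015febbd, a minimal sketch assuming "thin space" means U+2009 THIN SPACE. `replace_thin_spaces` is a hypothetical name:

```python
THIN_SPACE = "\u2009"  # U+2009 THIN SPACE


def replace_thin_spaces(value):
    # Replace rather than remove, so adjacent words stay separated
    return value.replace(THIN_SPACE, " ")
```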
Alan Orth 787fa9e8d9
Add field name to fix.newlines output 2021-10-08 14:36:43 +03:00
Alan Orth cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent
"fixes text for you" (ftfy) Python library, then check if a weird
string encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
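For 898bb412c3, the CP-1252 round trip can be sketched with the standard library alone. This is a simplification: it does not use ftfy or the "weird string" detection described above, and `fix_cp1252_mojibake` is a hypothetical name:

```python
def fix_cp1252_mojibake(value):
    """Undo one round of UTF-8 text having been decoded as CP-1252."""
    try:
        # Recover the original bytes, then decode them as UTF-8
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP-1252, or not valid UTF-8 afterwards:
        # assume the text was fine and leave it alone
        return value


# "\u00c3\u00b3" is what a UTF-8 "ó" looks like after being read as CP-1252
garbled = "CIAT Publicaci\u00c3\u00b3n"
repaired = fix_cp1252_mojibake(garbled)
```

A string that is already correct, such as "CIAT Publicación", fails the UTF-8 decode step and is returned unchanged.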
Alan Orth a0ea829f5c
csv_metadata_quality/fix.py: Fixes should be green 2021-03-11 11:47:24 +02:00
Alan Orth a7fc5a246c
Colorize output
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
Alan Orth 0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
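For 0dc66c5c4e, a sketch of why the trailing separator yields a blank value and one way to strip it. `strip_trailing_separators` is a hypothetical name, and the pattern is an assumption rather than the project's own:

```python
import re


def strip_trailing_separators(value):
    # Remove one or more "||" at the very end of the field
    return re.sub(r"(\|\|)+$", "", value)


# Splitting the unfixed value produces an empty last element
values_before = "Kenya||Tanzania||".split("||")
values_after = strip_trailing_separators("Kenya||Tanzania||").split("||")
```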
Alan Orth 7cfd4c0b59
csv_metadata_quality: Move scoped imports to global
According to PEP8 we should avoid scoped imports unless you have a
good reason. Here there are two cases where we do (issn and isbn),
but I will move the others to the global scope.
2020-10-06 17:11:39 +03:00
Alan Orth 28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
Alan Orth 365ecda324
Add utility function to check normalization
Python's built-in unicodedata library includes the is_normalized()
function starting with Python 3.8. This utility function allows us
to do the same thing with earlier Python versions.

See: https://docs.python.org/3/library/unicodedata.html
2020-01-15 12:17:52 +02:00
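For 365ecda324, one way such a fallback could look. `is_nfc` is a hypothetical name, and comparing against the normalized copy is an assumed approach, not necessarily the one in the commit:

```python
import unicodedata


def is_nfc(value):
    try:
        # Available in Python 3.8 and later
        return unicodedata.is_normalized("NFC", value)
    except AttributeError:
        # Older Pythons: a string is normalized if normalizing changes nothing
        return unicodedata.normalize("NFC", value) == value
```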
Alan Orth 49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
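For 49e3543878, a short illustration of what NFC normalization changes. The sample string is made up for this example:

```python
import unicodedata

# "o" followed by U+0301 COMBINING ACUTE ACCENT (two code points)
decomposed = "Publicacio\u0301n"
# NFC composes them into the single code point U+00F3
composed = unicodedata.normalize("NFC", decomposed)
```

The two strings render identically but compare as different, which is why un-normalized values can look like duplicates that never match.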
Alan Orth e55380b4d5
csv_metadata_quality/fix.py: Harmonize language in fix output
We should always say if we're removing or replacing something.
2019-10-01 17:09:49 +03:00
Alan Orth c42f8b4812
csv_metadata_quality/fix.py: Replace non-breaking spaces
We should be replacing non-breaking spaces (U+00A0) with normal
spaces instead of removing them.
2019-10-01 16:55:04 +03:00
Alan Orth 280a99c8a8
Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
Alan Orth d97dcd19db
Format with black 2019-08-29 01:10:39 +03:00
Alan Orth 81190d56bb
Add fix for missing space after commas
This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
2019-08-28 00:05:52 +03:00
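For 81190d56bb, a sketch of the idea using a single substitution. The pattern and the name `add_space_after_commas` are assumptions for illustration:

```python
import re


def add_space_after_commas(value):
    # A comma followed directly by a word character gets a space inserted
    return re.sub(r",(\w)", r", \1", value)
```

Because `\w` also matches digits, a value like "1,000" would be changed as well, which is one reason to restrict such a fix to name-like fields and keep it behind an opt-in flag.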
Alan Orth 232ff99898
csv_metadata_quality/fix.py: Add more unnecessary Unicode fixes
Add a check for soft hyphens (U+00AD). In one sample CSV I have a
normal hyphen followed by a soft hyphen in an ISBN. This causes the
ISBN validation to fail.
2019-08-11 00:07:21 +03:00
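For 232ff99898, a minimal sketch of removing soft hyphens. `remove_soft_hyphens` is a hypothetical name and the ISBN is only sample data:

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, normally invisible


def remove_soft_hyphens(value):
    # Removed outright rather than replaced: it carries no content
    return value.replace(SOFT_HYPHEN, "")


# A normal hyphen followed by a soft hyphen, as in the case described above
isbn = "978-\u00ad0-306-40615-7"
cleaned = remove_soft_hyphens(isbn)
```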
Alan Orth 40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't figure
out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A) directly
nor typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
Alan Orth 1e444cf040
Add fix for duplicate metadata values 2019-07-29 18:05:03 +03:00
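For 1e444cf040, one possible way to drop duplicates within a multi-value field, assuming "||" as the separator. `drop_duplicate_values` is a hypothetical name:

```python
def drop_duplicate_values(value):
    # dict.fromkeys() keeps the first occurrence of each value, in order
    return "||".join(dict.fromkeys(value.split("||")))
```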
Alan Orth 50ae4e17f2
csv_metadata_quality/fix.py: Fix indent 2019-07-29 17:14:48 +03:00
Alan Orth 8047a57cc5
Add support for fixing "unnecessary" Unicode
These are things like non-breaking spaces, "replacement" characters,
etc that add nothing to the metadata and often cause errors during
parsing or displaying in a UI.
2019-07-29 16:38:10 +03:00
Alan Orth 42920e9c7c
Test Python regular expression matches directly
Match objects always have a boolean value of True.

See: https://docs.python.org/3.7/library/re.html
2019-07-29 16:16:30 +03:00
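For 42920e9c7c, a small illustration of testing a match directly. The function name and pattern are made up for this example:

```python
import re


def has_double_space(value):
    # re.search() returns a Match object or None, so it can be
    # tested directly instead of comparing against None
    if re.search(r"\s{2,}", value):
        return True
    return False


# Even a match on the empty string is truthy
empty_match = re.search(r"x*", "abc")
```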
Alan Orth 40e77db713
Add "unsafe fixes" runtime option
In this case it fixes occurrences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
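For 40e77db713, a sketch of turning a lone "|" into "||" while leaving correct separators alone. The lookaround pattern and the name `fix_invalid_separators` are assumptions, not the project's code:

```python
import re


def fix_invalid_separators(value):
    # Match a "|" that has no "|" immediately before or after it
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)
```

As the message notes, this is only safe when "|" is not used for any legitimate purpose in the metadata.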
Alan Orth 87b1997051
Fix whitespace errors found by flake8 2019-07-28 17:47:28 +03:00
Alan Orth c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
Alan Orth 2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline 2019-07-27 01:29:22 +03:00
Alan Orth 30a4b0005f
csv_metadata_quality/fix.py: Remove test function 2019-07-26 22:56:40 +03:00
Alan Orth 232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, and
docs in the future. Currently can be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00