1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-10 00:05:51 +01:00
Commit Graph

9 Commits

Author SHA1 Message Date
a21ffb0fa8
Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.

See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
f6018c51b6
Apply fixes from fixit
Apply recommended fix from fixit:

    RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must
    be looked up in the global scope in case it has been rebound.
2023-11-22 21:54:50 +03:00
040e56fc76
Improve exclude function
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.

Now the excludes are respected for item-based checks.
2022-09-02 15:59:22 +03:00
54ab869297
csv_metadata_quality/experimental.py: Adjust citation match
We need to match both of these citation fields:

- dc.identifier.citation
- dcterms.bibliographicCitation
2021-10-06 16:13:10 +03:00
cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
14010896a5
csv_metadata_quality/experimental.py: Move all imports to top of file
All checks were successful
continuous-integration/drone/push Build is passing
PEP8 recommends keeping imports at the top of the file. Also, I had
to re-work the issn/isbn so they didn't conflict with the functions
in check.py (flake8 warned about them being redefined).

Imports sorted with isort.

See: https://www.python.org/dev/peps/pep-0008/#imports
2021-03-16 16:13:34 +02:00
330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actuall
y need to return a value at all, as the point of checks is to print
a warning to the screen. In Python a "return" statement without a v
ariable returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
a7fc5a246c
Colorize output
Some checks failed
continuous-integration/drone/push Build is failing
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
8435ee242d
Experimental language detection using langid
Works decenty well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00