mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-17 19:47:03 +01:00
Commit Graph

94 Commits

Author SHA1 Message Date
a21ffb0fa8
Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.

See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
1f637f32cd
Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
f3fb1ff7fb
Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
8d4295b2b3
CHANGELOG.md: add note about description field 2023-04-22 12:17:44 -07:00
c64b7eb1f1
CHANGELOG.md: add note about Pandas 2.0.0 2023-04-05 11:17:48 +03:00
20a2cce34b
CHANGELOG.md: add fixes
Some checks failed
continuous-integration/drone/push Build is failing
2023-03-10 16:17:20 +03:00
fdccdf7318
Version 0.6.1
Some checks failed
continuous-integration/drone/push Build is failing
2023-02-23 13:46:56 +03:00
8bc4cd419c
Strip filename descriptions before checking
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
Some checks failed
continuous-integration/drone/push Build is failing
2023-02-13 11:00:57 +03:00
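
The idea in the commit above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the `__description:` marker constant, and the list of "common" extensions are all assumptions made for the example. Only the sample value Annual_Report_2020.pdf__description:Report comes from the commit message.

```python
from pathlib import PurePath

# Assumed marker, inferred from the sample value in the commit message.
DESCRIPTION_MARKER = "__description:"

# Hypothetical allow-list of "common" extensions, for illustration only.
COMMON_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".jpg", ".png"}


def strip_description(filename: str) -> str:
    """Drop a trailing SAF Bundler description from a filename value."""
    return filename.split(DESCRIPTION_MARKER, 1)[0].strip()


def has_uncommon_extension(filename: str) -> bool:
    """Return True if the extension, after stripping, is not in the allow-list."""
    suffix = PurePath(strip_description(filename)).suffix.lower()
    return suffix not in COMMON_EXTENSIONS
```

Stripping before the extension check means the description text never reaches the extension comparison, which is what avoids the false positive described in the commit.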
bde38e9ed4
CHANGELOG.md: add notes about abstracts 2023-02-13 10:39:03 +03:00
fbb625be5c
Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
084b970798
CHANGELOG.md: add note about abstract field 2023-02-07 16:52:34 +03:00
c4a2ee8563
CHANGELOG.md: add note about fix.separators() 2023-01-24 14:16:23 +03:00
5abd32a41f
CHANGELOG.md: run poetry update 2022-12-20 15:09:58 +02:00
f640161d87
CHANGELOG.md: add notes about SPDX and Python 2022-12-13 10:45:36 +03:00
051777bcec
Ignore subregion field for missing region checks
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
All checks were successful
continuous-integration/drone/push Build is passing
2022-12-07 23:18:47 +01:00
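
A small sketch of the kind of mismatch the commit above describes. The column names and both patterns are hypothetical and chosen only to show how a loose pattern for "region" also catches "subregion". They are not the regexes used in the project.

```python
import re

# Hypothetical column names in the style of DSpace metadata fields.
columns = ["cg.coverage.country", "cg.coverage.region", "cg.coverage.subregion"]

# A loose pattern: matches any column whose name contains "region".
loose = re.compile(r"region")

# A stricter pattern: "region" must be a whole dot-separated element,
# optionally followed by a language qualifier such as [en_US].
strict = re.compile(r"(^|\.)region($|\.|\[)")

loose_matches = [c for c in columns if loose.search(c)]
strict_matches = [c for c in columns if strict.search(c)]
```

Here the loose pattern selects both the region and subregion columns, while the stricter one selects only the region column.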
8f3db86a36
CHANGELOG.md: fix header
All checks were successful
continuous-integration/drone/push Build is passing
2022-10-31 11:43:14 +03:00
58b7b6e9d8
Version 0.6.0
All checks were successful
continuous-integration/drone/push Build is passing
2022-09-02 16:35:58 +03:00
566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx
anymore anyway. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
41b813be6e
CHANGELOG.md: add note about exclude logic 2022-09-02 16:03:51 +03:00
da87531779
CHANGELOG.md: Add note about adding missing regions 2022-07-28 16:54:05 +03:00
e1b270cf83
CHANGELOG.md: add note about dropping invalid AGROVOC values
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-23 12:47:42 +02:00
a351ba9706
CHANGELOG.md: add notes about ftfy 2021-12-15 22:09:01 +02:00
5854f8e865
CHANGELOG.md: add note about unnecessary Unicode 2021-12-15 13:56:31 +02:00
cef6c66b30
CHANGELOG.md: start next changes 2021-12-09 23:21:58 +02:00
cc34db7ff8
Version 0.5.0
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-08 15:29:46 +02:00
b79e07b814
CHANGELOG.md: Add note about countries without regions 2021-12-08 15:21:45 +02:00
f5fa33bbc6
CHANGELOG.md: add title in citation note 2021-12-05 16:23:39 +02:00
c95261f522
CHANGELOG.md: Add note about fix.newlines
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-08 14:37:12 +03:00
831ce979c3
CHANGELOG.md: Clarify regex fixes 2021-10-06 21:23:35 +03:00
72dd3e7272
CHANGELOG.md: Add notes about regexes 2021-10-06 19:35:59 +03:00
81069259ba
CHANGELOG.md: Add note about bibliographicCitation
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-06 16:16:51 +03:00
dbc0437d59
CHANGELOG.md: Add note about Python deps
All checks were successful
continuous-integration/drone/push Build is passing
2021-04-14 16:16:02 +03:00
a04dbc50db
Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
f816e17fe7
Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:00:34 +02:00
652b7ea98c
CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
a313b7527a
CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
1aa2084230
CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
ed084da08c
CHANGELOG.md: Add note about multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 21:04:19 +02:00
fb35afd937
CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
1008acf35e
Always fix invalid multi-value separators
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data into a DSpace repository.
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-13 12:59:45 +02:00
1554cfd5c9
Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d
CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
7ad821dcad
CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
e0e3ca6c58
CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
d7d4d4efca
CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
202bda862a
Bump version to 0.4.5
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-04 21:38:10 +02:00
fc5bedcc5c
CHANGELOG.md: Add poetry update 2021-03-04 21:32:46 +02:00
27b2d81ca8
CHANGELOG.md: Add note about dcterms.issued
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-28 15:14:39 +02:00
d76e72532a
Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde
CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00