mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-05-09 22:56:01 +02:00
Commit Graph

94 Commits

Author SHA1 Message Date
a21ffb0fa8 Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.

See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
1f637f32cd Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
f3fb1ff7fb Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
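As a minimal sketch of the idea in the commit above (not the project's actual code), a status message can fall back to a placeholder when the title is absent. The column name `dc.title`, the function name `status_line`, and the placeholder text are all hypothetical.

```python
def status_line(row: dict, action: str) -> str:
    """Build a status message for one metadata row.

    The title is only used for display, so a missing or empty title
    falls back to a placeholder instead of raising KeyError.
    """
    # "dc.title" is an assumed column name, used here for illustration only.
    title = row.get("dc.title") or "<title field missing>"
    return f"{action}: {title}"


print(status_line({"dc.title": "Annual Report 2020"}, "Checking regions"))
print(status_line({"cg.coverage.country": "Kenya"}, "Checking regions"))
```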
8d4295b2b3 CHANGELOG.md: add note about description field 2023-04-22 12:17:44 -07:00
c64b7eb1f1 CHANGELOG.md: add note about Pandas 2.0.0 2023-04-05 11:17:48 +03:00
20a2cce34b CHANGELOG.md: add fixes
Some checks failed
continuous-integration/drone/push Build is failing
2023-03-10 16:17:20 +03:00
fdccdf7318 Version 0.6.1
Some checks failed
continuous-integration/drone/push Build is failing
2023-02-23 13:46:56 +03:00
8bc4cd419c Strip filename descriptions before checking
Some checks failed
continuous-integration/drone/push Build is failing
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
2023-02-13 11:00:57 +03:00
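The false positive described in the commit above can be illustrated with a short sketch. This is an assumption-laden example, not the project's implementation: the helper names and the `COMMON_EXTENSIONS` allow-list are hypothetical, and only the `__description:` suffix format comes from the commit message.

```python
from pathlib import PurePath

# Hypothetical allow-list of common extensions, for illustration only.
COMMON_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".jpg", ".png"}


def strip_description(filename: str) -> str:
    """Drop a trailing SAF Bundler description such as '__description:Report'."""
    return filename.split("__description:", 1)[0]


def has_uncommon_extension(filename: str) -> bool:
    """Return True if the extension (after stripping) is not in the allow-list."""
    suffix = PurePath(strip_description(filename)).suffix.lower()
    return suffix not in COMMON_EXTENSIONS


value = "Annual_Report_2020.pdf__description:Report"
# Without stripping, the description becomes part of the "extension".
print(PurePath(value).suffix)
print(strip_description(value))
print(has_uncommon_extension(value))
```

Stripping before the check means the extension comparison sees only the real filename, so the description no longer triggers a warning.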
bde38e9ed4 CHANGELOG.md: add notes about abstracts 2023-02-13 10:39:03 +03:00
fbb625be5c Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
084b970798 CHANGELOG.md: add note about abstract field 2023-02-07 16:52:34 +03:00
c4a2ee8563 CHANGELOG.md: add note about fix.separators() 2023-01-24 14:16:23 +03:00
5abd32a41f CHANGELOG.md: run poetry update 2022-12-20 15:09:58 +02:00
f640161d87 CHANGELOG.md: add notes about SPDX and Python 2022-12-13 10:45:36 +03:00
051777bcec Ignore subregion field for missing region checks
All checks were successful
continuous-integration/drone/push Build is passing
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
8f3db86a36 CHANGELOG.md: fix header
All checks were successful
continuous-integration/drone/push Build is passing
2022-10-31 11:43:14 +03:00
58b7b6e9d8 Version 0.6.0
All checks were successful
continuous-integration/drone/push Build is passing
2022-09-02 16:35:58 +03:00
566c2b45cf Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx anymore
anyways. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
41b813be6e CHANGELOG.md: add note about exclude logic 2022-09-02 16:03:51 +03:00
da87531779 CHANGELOG.md: Add note about adding missing regions 2022-07-28 16:54:05 +03:00
e1b270cf83 CHANGELOG.md: add note about dropping invalid AGROVOC values
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-23 12:47:42 +02:00
a351ba9706 CHANGELOG.md: add notes about ftfy 2021-12-15 22:09:01 +02:00
5854f8e865 CHANGELOG.md: add note about unnecessary Unicode 2021-12-15 13:56:31 +02:00
cef6c66b30 CHANGELOG.md: start next changes 2021-12-09 23:21:58 +02:00
cc34db7ff8 Version 0.5.0
All checks were successful
continuous-integration/drone/push Build is passing
2021-12-08 15:29:46 +02:00
b79e07b814 CHANGELOG.md: Add note about countries without regions 2021-12-08 15:21:45 +02:00
f5fa33bbc6 CHANGELOG.md: add title in citation note 2021-12-05 16:23:39 +02:00
c95261f522 CHANGELOG.md: Add note about fix.newlines
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-08 14:37:12 +03:00
831ce979c3 CHANGELOG.md: Clarify regex fixes 2021-10-06 21:23:35 +03:00
72dd3e7272 CHANGELOG.md: Add notes about regexes 2021-10-06 19:35:59 +03:00
81069259ba CHANGELOG.md: Add note about bibliographicCitation
All checks were successful
continuous-integration/drone/push Build is passing
2021-10-06 16:16:51 +03:00
dbc0437d59 CHANGELOG.md: Add note about Python deps
All checks were successful
continuous-integration/drone/push Build is passing
2021-04-14 16:16:02 +03:00
a04dbc50db Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
f816e17fe7 Version 0.4.7
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:00:34 +02:00
652b7ea98c CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
a313b7527a CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
1aa2084230 CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
ed084da08c CHANGELOG.md: Add note about multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-14 21:04:19 +02:00
fb35afd937 CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
1008acf35e Always fix invalid multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
1554cfd5c9 Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
7ad821dcad CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
e0e3ca6c58 CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
d7d4d4efca CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
202bda862a Bump version to 0.4.5
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-04 21:38:10 +02:00
fc5bedcc5c CHANGELOG.md: Add poetry update 2021-03-04 21:32:46 +02:00
27b2d81ca8 CHANGELOG.md: Add note about dcterms.issued
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-28 15:14:39 +02:00
d76e72532a Move unreleased changes to v0.4.4
All checks were successful
continuous-integration/drone/push Build is passing
2021-02-21 13:25:22 +02:00
13980d2dde CHANGELOG.md: Add note about colored output 2021-02-21 13:12:26 +02:00