a21ffb0fa8
Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.
See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
1f637f32cd
Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
f3fb1ff7fb
Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
8d4295b2b3
CHANGELOG.md: add note about description field
2023-04-22 12:17:44 -07:00
c64b7eb1f1
CHANGELOG.md: add note about Pandas 2.0.0
2023-04-05 11:17:48 +03:00
20a2cce34b
CHANGELOG.md: add fixes
2023-03-10 16:17:20 +03:00
fdccdf7318
Version 0.6.1
2023-02-23 13:46:56 +03:00
8bc4cd419c
Strip filename descriptions before checking
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
2023-02-13 11:00:57 +03:00
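The stripping described in 8bc4cd419c can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `strip_description`, `has_uncommon_extension`, and the `COMMON_EXTENSIONS` set are hypothetical, and the sketch assumes the description always follows the `__description:` marker shown in the example above.

```python
import os
import re

# Assumption: SAF Bundler descriptions are appended as "__description:<text>".
DESCRIPTION_SUFFIX = re.compile(r"__description:.*$")

# Hypothetical allow-list of extensions considered "common".
COMMON_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".jpg", ".png"}


def strip_description(filename: str) -> str:
    """Remove a trailing "__description:..." suffix, if present."""
    return DESCRIPTION_SUFFIX.sub("", filename)


def has_uncommon_extension(filename: str) -> bool:
    """Check the extension only after the description has been stripped."""
    _, extension = os.path.splitext(strip_description(filename))
    return extension.lower() not in COMMON_EXTENSIONS
```

Without the stripping step, `os.path.splitext` would treat `.pdf__description:Report` as the extension, which is the false positive the commit message describes.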
bde38e9ed4
CHANGELOG.md: add notes about abstracts
2023-02-13 10:39:03 +03:00
fbb625be5c
Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
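One way to picture the behaviour described in fbb625be5c is a small ignore list consulted before warning. This is only a sketch under stated assumptions: `SPDX_IDS` is a tiny illustrative subset rather than the full SPDX list, `IGNORED_LICENSES` holds just the two free-text values named in the commit message, and `check_license` is a hypothetical helper name.

```python
# Illustrative subset only; a real check would load the full SPDX license list.
SPDX_IDS = {"CC-BY-4.0", "CC-BY-NC-4.0", "MIT", "GPL-3.0-only"}

# Free-text values that are never meant to be SPDX identifiers (from the commit message).
IGNORED_LICENSES = {"Copyrighted", "Other"}


def check_license(value: str):
    """Return a warning string for a suspicious license value, else None."""
    value = value.strip()
    if value in SPDX_IDS or value in IGNORED_LICENSES:
        return None
    return f"Non-SPDX license identifier: {value}"
```

With this arrangement, a near miss such as `CC-BY 4.0` still produces a warning, while `Copyrighted` and `Other` pass silently.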
084b970798
CHANGELOG.md: add note about abstract field
2023-02-07 16:52:34 +03:00
c4a2ee8563
CHANGELOG.md: add note about fix.separators()
2023-01-24 14:16:23 +03:00
5abd32a41f
CHANGELOG.md: run poetry update
2022-12-20 15:09:58 +02:00
f640161d87
CHANGELOG.md: add notes about SPDX and Python
2022-12-13 10:45:36 +03:00
051777bcec
Ignore subregion field for missing region checks
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
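The kind of regex slip described in 051777bcec might look like the following. The column names `cg.coverage.region` and `cg.coverage.subregion` and both patterns are assumptions made for illustration; the commit message does not show the actual expressions.

```python
import re

# Sloppy: "region" anywhere in the column name, so "subregion" matches too.
SLOPPY_REGION = re.compile(r"region")

# Stricter: "region" must directly follow a dot and end the name,
# optionally followed by a language qualifier such as "[en_US]".
STRICT_REGION = re.compile(r"^.*\.region(\[.*\])?$")


def is_region_column(column: str) -> bool:
    """Return True only for region columns, never for subregion columns."""
    return STRICT_REGION.match(column) is not None
```

Anchoring on the preceding dot is what keeps `subregion` from being picked up as a region field.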
8f3db86a36
CHANGELOG.md: fix header
2022-10-31 11:43:14 +03:00
58b7b6e9d8
Version 0.6.0
2022-09-02 16:35:58 +03:00
566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx
anymore anyway. If this was needed I could theoretically use openpyxl,
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
41b813be6e
CHANGELOG.md: add note about exclude logic
2022-09-02 16:03:51 +03:00
da87531779
CHANGELOG.md: Add note about adding missing regions
2022-07-28 16:54:05 +03:00
e1b270cf83
CHANGELOG.md: add note about dropping invalid AGROVOC values
2021-12-23 12:47:42 +02:00
a351ba9706
CHANGELOG.md: add notes about ftfy
2021-12-15 22:09:01 +02:00
5854f8e865
CHANGELOG.md: add note about unnecessary Unicode
2021-12-15 13:56:31 +02:00
cef6c66b30
CHANGELOG.md: start next changes
2021-12-09 23:21:58 +02:00
cc34db7ff8
Version 0.5.0
2021-12-08 15:29:46 +02:00
b79e07b814
CHANGELOG.md: Add note about countries without regions
2021-12-08 15:21:45 +02:00
f5fa33bbc6
CHANGELOG.md: add title in citation note
2021-12-05 16:23:39 +02:00
c95261f522
CHANGELOG.md: Add note about fix.newlines
2021-10-08 14:37:12 +03:00
831ce979c3
CHANGELOG.md: Clarify regex fixes
2021-10-06 21:23:35 +03:00
72dd3e7272
CHANGELOG.md: Add notes about regexes
2021-10-06 19:35:59 +03:00
81069259ba
CHANGELOG.md: Add note about bibliographicCitation
2021-10-06 16:16:51 +03:00
dbc0437d59
CHANGELOG.md: Add note about Python deps
2021-04-14 16:16:02 +03:00
a04dbc50db
Add notes about checking and fixing mojibake
2021-03-19 11:48:27 +02:00
f816e17fe7
Version 0.4.7
2021-03-17 10:00:34 +02:00
652b7ea98c
CHANGELOG.md: Add note about poetry dependencies
2021-03-17 09:58:02 +02:00
a313b7527a
CHANGELOG.md: Add note about duplicate items
2021-03-17 09:55:07 +02:00
1aa2084230
CHANGELOG.md: Add note about checks
2021-03-16 16:11:24 +02:00
ed084da08c
CHANGELOG.md: Add note about multi-value separators
2021-03-14 21:04:19 +02:00
fb35afd937
CHANGELOG.md: Add note about requests cache
2021-03-14 09:13:51 +02:00
1008acf35e
Always fix invalid multi-value separators
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
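A fix like the one described in 1008acf35e could be sketched as below. The commit message does not say what the separators look like, so this assumes that `||` is the valid multi-value separator and that a lone `|` is the invalid form; `fix_separators` is a hypothetical name.

```python
import re

# Assumption: "||" separates multiple values; a single "|" is a mistake.
# Match a pipe that has no other pipe directly before or after it.
LONE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")


def fix_separators(value: str) -> str:
    """Replace each lone "|" with "||", leaving valid "||" untouched."""
    return LONE_PIPE.sub("||", value)
```

Runs of three or more pipes are left alone in this sketch; handling them would need a separate rule.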
1554cfd5c9
Version 0.4.6
2021-03-11 12:14:54 +02:00
00b8faad6d
CHANGELOG.md: Fix headers
2021-03-11 12:13:22 +02:00
7ad821dcad
CHANGELOG.md: Add note about poetry dependencies
2021-03-11 11:10:27 +02:00
e0e3ca6c58
CHANGELOG.md: Add notes about DCTERMS in data/test.csv
2021-03-11 10:50:52 +02:00
d7d4d4efca
CHANGELOG.md: Add note about SPDX license identifiers
2021-03-11 10:37:27 +02:00
202bda862a
Bump version to 0.4.5
2021-03-04 21:38:10 +02:00
fc5bedcc5c
CHANGELOG.md: Add poetry update
2021-03-04 21:32:46 +02:00
27b2d81ca8
CHANGELOG.md: Add note about dcterms.issued
2021-02-28 15:14:39 +02:00
d76e72532a
Move unreleased changes to v0.4.4
2021-02-21 13:25:22 +02:00
13980d2dde
CHANGELOG.md: Add note about colored output
2021-02-21 13:12:26 +02:00