Commit Graph

95 Commits

Author SHA1 Message Date
Alan Orth 4f3174a543
CHANGELOG.md: add note about SPDX license list
continuous-integration/drone/push Build is passing Details
2024-03-02 10:39:00 +03:00
Alan Orth a21ffb0fa8
Use py3langid instead of langid
Faster and more modern code for Python 3 as a drop-in replacement.

See: https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html
2023-12-28 14:11:21 +03:00
Alan Orth 1f637f32cd
Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
Alan Orth f3fb1ff7fb Don't crash when title is missing
We shouldn't crash the country/region checker/fixer when the title
field is missing, since we only use it to show status to the user.
2023-06-12 10:42:50 +03:00
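
A minimal sketch of the idea in this commit, assuming each row is available as a dictionary. The function name `describe_row` and the column name `dc.title` are hypothetical and are not taken from the project:

```python
def describe_row(row: dict) -> str:
    """Build a status message without assuming the title field exists."""
    # "dc.title" is an assumed column name used only for illustration.
    title = row.get("dc.title")
    if title:
        return f"Checking: {title}"
    # Fall back to a placeholder instead of raising KeyError.
    return "Checking: <title missing>"
```

Because the title is only used for status output, falling back to a placeholder keeps the check running on rows that lack it.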
Alan Orth 8d4295b2b3
CHANGELOG.md: add note about description field 2023-04-22 12:17:44 -07:00
Alan Orth c64b7eb1f1
CHANGELOG.md: add note about Pandas 2.0.0 2023-04-05 11:17:48 +03:00
Alan Orth 20a2cce34b
CHANGELOG.md: add fixes
continuous-integration/drone/push Build is failing Details
2023-03-10 16:17:20 +03:00
Alan Orth fdccdf7318
Version 0.6.1
continuous-integration/drone/push Build is failing Details
2023-02-23 13:46:56 +03:00
Alan Orth 8bc4cd419c
Strip filename descriptions before checking
continuous-integration/drone/push Build is failing Details
When checking for uncommon file extensions in the filename field
we should strip descriptions that are meant for SAF Bundler, for
example: Annual_Report_2020.pdf__description:Report. This ends up
as a false positive that spams the output with warnings.
2023-02-13 11:00:57 +03:00
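
The stripping step described in this commit could look roughly like the following. This is a sketch under stated assumptions, not the project's actual code: the function names and the set of "common" extensions are made up for the example.

```python
import os

# Hypothetical set of extensions treated as "common"; not from the project.
COMMON_EXTENSIONS = {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".jpg", ".png"}


def strip_saf_description(filename: str) -> str:
    """Drop a trailing SAF Bundler description such as '__description:Report'."""
    return filename.split("__description:", 1)[0]


def has_uncommon_extension(filename: str) -> bool:
    """Return True if the cleaned file name ends in an unexpected extension."""
    cleaned = strip_saf_description(filename.strip())
    extension = os.path.splitext(cleaned)[1].lower()
    return extension not in COMMON_EXTENSIONS
```

Without the stripping step, `os.path.splitext` would report `.pdf__description:Report` as the extension of the example file name, which is the false positive the commit message describes.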
Alan Orth bde38e9ed4
CHANGELOG.md: add notes about abstracts 2023-02-13 10:39:03 +03:00
Alan Orth fbb625be5c
Ignore common non-SPDX licenses
This is meant to catch licenses that are supposed to be SPDX but
aren't, not licenses that *aren't* supposed to be SPDX. We have so
many free-text license descriptions like "Copyrighted" and "Other"
that I'm sick of seeing warnings for them!
2023-02-07 17:01:56 +03:00
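
One way to express this filtering, as a hedged sketch: the function name `license_warning` is hypothetical, `KNOWN_SPDX` is a tiny illustrative subset rather than the real SPDX list, and the ignore list contains only the two examples named in the commit message.

```python
# Free-text values that are not meant to be SPDX identifiers (from the commit message).
IGNORED_LICENSES = {"copyrighted", "other"}

# Illustrative subset only; a real check would use the full SPDX license list.
KNOWN_SPDX = {"CC-BY-4.0", "MIT", "GPL-3.0-only"}


def license_warning(value: str):
    """Return a warning string for a suspicious license value, else None."""
    cleaned = value.strip()
    if cleaned.lower() in IGNORED_LICENSES:
        return None  # deliberately free text, so stay quiet
    if cleaned in KNOWN_SPDX:
        return None  # valid identifier
    return f"Non-SPDX license identifier: {cleaned}"
```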
Alan Orth 084b970798
CHANGELOG.md: add note about abstract field 2023-02-07 16:52:34 +03:00
Alan Orth c4a2ee8563
CHANGELOG.md: add note about fix.separators() 2023-01-24 14:16:23 +03:00
Alan Orth 5abd32a41f
CHANGELOG.md: run poetry update 2022-12-20 15:09:58 +02:00
Alan Orth f640161d87
CHANGELOG.md: add notes about SPDX and Python 2022-12-13 10:45:36 +03:00
Alan Orth 051777bcec
Ignore subregion field for missing region checks
continuous-integration/drone/push Build is passing Details
Due to a sloppy regex I was sometimes matching the subregion field
when checking for missing UN M.49 regions in the region field.
2022-12-07 23:18:47 +01:00
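
The kind of regex mistake described here can be illustrated as follows. The column names and both patterns are assumptions chosen for the example, not the expressions used in the project.

```python
import re

# A loose pattern: matches any column name containing "region".
SLOPPY = re.compile(r"region")

# A stricter pattern: "region" must be a whole word, so "subregion" is excluded.
STRICT = re.compile(r"\bregion\b")


def matching_columns(columns, pattern):
    """Return the column names that the given pattern matches."""
    return [name for name in columns if pattern.search(name)]


# Hypothetical column names for illustration.
columns = ["dc.title", "cg.coverage.region", "cg.coverage.subregion"]
```

With these inputs the loose pattern selects both coverage columns, while the stricter one selects only `cg.coverage.region`.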
Alan Orth 8f3db86a36
CHANGELOG.md: fix header
continuous-integration/drone/push Build is passing Details
2022-10-31 11:43:14 +03:00
Alan Orth 58b7b6e9d8
Version 0.6.0
continuous-integration/drone/push Build is passing Details
2022-09-02 16:35:58 +03:00
Alan Orth 566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx
anymore anyway. If this were needed I could theoretically use
openpyxl, but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
Alan Orth 41b813be6e
CHANGELOG.md: add note about exclude logic 2022-09-02 16:03:51 +03:00
Alan Orth da87531779
CHANGELOG.md: Add note about adding missing regions 2022-07-28 16:54:05 +03:00
Alan Orth e1b270cf83
CHANGELOG.md: add note about dropping invalid AGROVOC values
continuous-integration/drone/push Build is passing Details
2021-12-23 12:47:42 +02:00
Alan Orth a351ba9706
CHANGELOG.md: add notes about ftfy 2021-12-15 22:09:01 +02:00
Alan Orth 5854f8e865
CHANGELOG.md: add note about unnecessary Unicode 2021-12-15 13:56:31 +02:00
Alan Orth cef6c66b30
CHANGELOG.md: start next changes 2021-12-09 23:21:58 +02:00
Alan Orth cc34db7ff8
Version 0.5.0
continuous-integration/drone/push Build is passing Details
2021-12-08 15:29:46 +02:00
Alan Orth b79e07b814
CHANGELOG.md: Add note about countries without regions 2021-12-08 15:21:45 +02:00
Alan Orth f5fa33bbc6
CHANGELOG.md: add title in citation note 2021-12-05 16:23:39 +02:00
Alan Orth c95261f522
CHANGELOG.md: Add note about fix.newlines
continuous-integration/drone/push Build is passing Details
2021-10-08 14:37:12 +03:00
Alan Orth 831ce979c3
CHANGELOG.md: Clarify regex fixes 2021-10-06 21:23:35 +03:00
Alan Orth 72dd3e7272
CHANGELOG.md: Add notes about regexes 2021-10-06 19:35:59 +03:00
Alan Orth 81069259ba
CHANGELOG.md: Add note about bibliographicCitation
continuous-integration/drone/push Build is passing Details
2021-10-06 16:16:51 +03:00
Alan Orth dbc0437d59
CHANGELOG.md: Add note about Python deps
continuous-integration/drone/push Build is passing Details
2021-04-14 16:16:02 +03:00
Alan Orth a04dbc50db
Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
Alan Orth f816e17fe7
Version 0.4.7
continuous-integration/drone/push Build is passing Details
2021-03-17 10:00:34 +02:00
Alan Orth 652b7ea98c
CHANGELOG.md: Add note about poetry dependencies 2021-03-17 09:58:02 +02:00
Alan Orth a313b7527a
CHANGELOG.md: Add note about duplicate items 2021-03-17 09:55:07 +02:00
Alan Orth 1aa2084230
CHANGELOG.md: Add note about checks 2021-03-16 16:11:24 +02:00
Alan Orth ed084da08c
CHANGELOG.md: Add note about multi-value separators
continuous-integration/drone/push Build is passing Details
2021-03-14 21:04:19 +02:00
Alan Orth fb35afd937
CHANGELOG.md: Add note about requests cache 2021-03-14 09:13:51 +02:00
Alan Orth 1008acf35e
Always fix invalid multi-value separators
continuous-integration/drone/push Build is passing Details
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
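
A minimal sketch of such a fix, assuming that `||` is the intended multi-value separator and that any other run of pipe characters is a mistake. The function name `fix_separators` is used here for illustration and this is not necessarily how the project implements it.

```python
import re


def fix_separators(value: str) -> str:
    """Normalize any run of pipe characters to the assumed separator '||'."""
    # A single "|" or a triple "|||" both become "||"; a valid "||" is unchanged.
    return re.sub(r"\|+", "||", value)
```

Note that this sketch would also rewrite a pipe that is a legitimate part of a value, which is the trade-off the commit message accepts.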
Alan Orth 1554cfd5c9
Version 0.4.6 2021-03-11 12:14:54 +02:00
Alan Orth 00b8faad6d
CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
Alan Orth 7ad821dcad
CHANGELOG.md: Add note about poetry dependencies 2021-03-11 11:10:27 +02:00
Alan Orth e0e3ca6c58
CHANGELOG.md: Add notes about DCTERMS in data/test.csv 2021-03-11 10:50:52 +02:00
Alan Orth d7d4d4efca
CHANGELOG.md: Add note about SPDX license identifiers 2021-03-11 10:37:27 +02:00
Alan Orth 202bda862a
Bump version to 0.4.5
continuous-integration/drone/push Build is passing Details
2021-03-04 21:38:10 +02:00
Alan Orth fc5bedcc5c
CHANGELOG.md: Add poetry update 2021-03-04 21:32:46 +02:00
Alan Orth 27b2d81ca8
CHANGELOG.md: Add note about dcterms.issued
continuous-integration/drone/push Build is passing Details
2021-02-28 15:14:39 +02:00
Alan Orth d76e72532a
Move unreleased changes to v0.4.4
continuous-integration/drone/push Build is passing Details
2021-02-21 13:25:22 +02:00