Commit Graph

73 Commits

Author SHA1 Message Date
Alan Orth 5be2195325
Add fix for normalizing DOIs 2024-04-25 12:49:19 +03:00
Alan Orth f6018c51b6
Apply fixes from fixit
Apply recommended fix from fixit:

    RewriteToLiteral: It's slower to call list() than using the empty literal, because the name list must
    be looked up in the global scope in case it has been rebound.
2023-11-22 21:54:50 +03:00
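The name lookup that fixit describes in the commit above shows up in the bytecode. This is a minimal sketch using the standard `dis` module; the function names are illustrative and do not come from this repository.

```python
import dis


def with_call():
    # `list` has to be resolved by name at run time, since it may have been rebound.
    return list()


def with_literal():
    # The empty literal compiles straight to a list-building instruction.
    return []


def opnames(func):
    """Return the names of the bytecode instructions compiled for func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


print(opnames(with_call))
print(opnames(with_literal))
```

The first listing contains a global name lookup followed by a call, while the second only builds the list.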
Alan Orth 1f637f32cd
Rework requests-cache
We should only be running this once per invocation, not for every
row we check. This should be more efficient, but it means that we
don't cache responses when running via pytest, which is actually
probably a good thing.
2023-10-15 23:37:38 +03:00
Alan Orth d21d2621e3
csv_metadata_quality/app.py: read fields as strings
I suspect this undermines the PyArrow backend performance gains in
recent Pandas 2.0.0, but we are dealing with messy data sometimes
and we must rely on data being strings.
2023-06-12 10:42:50 +03:00
Alan Orth e2d46e9495
csv_metadata_quality/app.py: skip newline fix on description
The description field often has free-form text like the abstract and
there are too many legitimate newlines here to be correcting them
automatically.
2023-04-22 12:16:13 -07:00
Alan Orth d4aed378cf
Switch to pandas 2.0.0rc1
Seems to work fine with the new PyArrow datatypes.
2023-03-22 12:16:56 +03:00
Alan Orth d661ffe439
Check comma space on bibliographicCitation too
The regex was only matching `dc.identifier.citation`, but we need
to match `dcterms.bibliographicCitation` too.
2023-03-10 16:13:16 +03:00
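One way to match both field names from the commit above is a pattern that accepts either capitalization of "citation". This is a sketch only; the pattern and the helper name are assumptions, not the regex used in app.py.

```python
import re

# Hypothetical pattern: "citation" or "Citation" anywhere in the column name.
CITATION_FIELD = re.compile(r"[Cc]itation")


def is_citation_field(column):
    """Return True if the column name looks like a citation field."""
    return CITATION_FIELD.search(column) is not None


for name in ("dc.identifier.citation", "dcterms.bibliographicCitation", "dc.title"):
    print(name, is_citation_field(name))
```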
Alan Orth 45a310387a
Don't fix multi-value separators on citations 2023-03-10 16:12:30 +03:00
Alan Orth 8db1e36a6d
csv_metadata_quality/app.py: skip abstract in separator check
Also skip abstract in the separator check, since it's rare to have
any "|" here, but more likely that if one is present then it's for
a reason.
2023-02-13 10:37:33 +03:00
Alan Orth 545bb8cd0c
csv_metadata_quality/app.py: disable whitespace on abstracts
It's too aggressive on abstracts. If people paste in text from a
PDF there are often newlines, and most of the time this is what
they want.
2023-02-07 16:48:40 +03:00
Alan Orth 3596381d03
csv_metadata_quality/app.py: separators fix
Don't run the invalid separators fix on title fields because some
items use "|" in the title to indicate something like a subtitle.

For example:

    Progress Review and Work Planning Meeting | Day 1
2023-01-24 14:13:55 +03:00
Alan Orth 566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx anymore
anyway. If this was needed I could theoretically use openpyxl but I'd
rather just stick to CSV.
2022-09-02 16:14:24 +03:00
Alan Orth 040e56fc76
Improve exclude function
When a user explicitly requests that a field be excluded with -x we
skip that field in most checks. Up until now that did not include
the item-based checks using a transposed dataframe because we don't
know the metadata field names (labels) until we iterate over them.

Now the excludes are respected for item-based checks.
2022-09-02 15:59:22 +03:00
Alan Orth 1f76247353
csv_metadata_quality/app.py: rework exclude/skip
Instead of processing the excludes inside the for column loop we do
it once before and then only need to check if the current column is
in the list.
2022-09-02 10:35:04 +03:00
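The rework in the commit above amounts to parsing the exclude option once and then doing a membership test per column. A rough sketch, assuming the -x value is a comma-separated string; both function names are hypothetical.

```python
def parse_excludes(exclude_arg):
    """Split a comma-separated exclude value into a list of field names."""
    if not exclude_arg:
        return []
    return [name.strip() for name in exclude_arg.split(",") if name.strip()]


def columns_to_process(columns, exclude_arg):
    """Return the columns that are not excluded."""
    # The excludes are parsed once, before the column loop.
    excludes = parse_excludes(exclude_arg)
    # Inside the loop we only need a membership test.
    return [column for column in columns if column not in excludes]


columns = ["dc.title", "dc.contributor.author", "dcterms.issued"]
print(columns_to_process(columns, "dc.contributor.author"))
```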
Alan Orth 689ee184f7
Add unsafe check to add missing regions 2022-07-28 16:52:43 +03:00
Alan Orth a7727b8431
Add support for dropping invalid AGROVOC terms
Requires --agrovoc-fields <field.name> to do the actual validation,
and -d to drop invalid ones.
2021-12-23 12:43:55 +02:00
Alan Orth e7322efadd
csv_metadata_quality/app.py: move unnecessary Unicode fix
We actually want to do this after we try to fix mojibake with ftfy.
These "unnecessary" Unicode characters could actually help ftfy in
some cases because often times they indicate that some character
from another encoding was there before (like an accent, dash, or
smart quote).
2021-12-15 13:53:25 +02:00
Alan Orth ccc2a73456
Add check for countries without matching regions
If we have country "Kenya" we should have region "Eastern Africa"
according to the UN M.49 geolocation scheme.
2021-12-08 15:02:20 +02:00
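The country and region check in the commit above can be pictured as a lookup from country to expected region. This sketch uses a tiny hand-written mapping for illustration; a real check would need the full UN M.49 data, and the function name is hypothetical.

```python
# Illustrative excerpt only, not the complete UN M.49 list.
COUNTRY_TO_REGION = {
    "Kenya": "Eastern Africa",
    "Colombia": "South America",
}


def missing_regions(countries, regions):
    """Return the regions implied by the countries but absent from regions."""
    expected = {
        COUNTRY_TO_REGION[country]
        for country in countries
        if country in COUNTRY_TO_REGION
    }
    return sorted(expected - set(regions))


print(missing_regions(["Kenya"], []))
print(missing_regions(["Kenya"], ["Eastern Africa"]))
```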
Alan Orth 3b40a68279
Add check for title in citation
This checks if the item title exists in the citation. If it is not
present it could just be missing, or could have minor differences
in the whitespace, accents, etc.
2021-12-05 15:52:42 +02:00
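A plain substring test is enough to illustrate the check in the commit above, and also why small differences trigger it. This is a sketch with a hypothetical function name and made-up sample values.

```python
def title_in_citation(title, citation):
    """Return True if the exact title string appears in the citation."""
    return title in citation


title = "Progress Review and Work Planning Meeting"
good = "Author, A. 2021. Progress Review and Work Planning Meeting. Nairobi."
# Same citation with a doubled space inside the title.
off = "Author, A. 2021. Progress Review and Work  Planning Meeting. Nairobi."

print(title_in_citation(title, good))
print(title_in_citation(title, off))
```

The second citation fails the test even though only the whitespace differs.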
Alan Orth 999cc65097
csv_metadata_quality/app.py: adjust mojibake check
If unsafe fixes (-u) are enabled then we don't need to do the check
first before actually fixing them. Doing the check first creates
extra output that needs to be reviewed by the user.
2021-12-05 15:18:35 +02:00
Alan Orth 787fa9e8d9
Add field name to fix.newlines output 2021-10-08 14:36:43 +03:00
Alan Orth 8a27fb2589
Add check for missing DOIs
Sometimes an editor includes a DOI in the citation field, but does
not add a standalone DOI field.
2021-10-06 21:25:39 +03:00
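The situation described in the commit above can be sketched as a regex search on the citation combined with an empty DOI field. The pattern is a common approximation of DOI syntax, not necessarily what the project uses; the function name and the sample DOI are made up.

```python
import re

# Approximate DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_missing(citation, doi):
    """Return True if the citation mentions a DOI but the DOI field is empty."""
    return bool(DOI_PATTERN.search(citation)) and not doi.strip()


citation = "Author, A. 2021. Some title. https://doi.org/10.1234/example.5678"
print(doi_missing(citation, ""))
print(doi_missing(citation, "https://doi.org/10.1234/example.5678"))
```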
Alan Orth bd8943f36a
csv_metadata_quality/app.py: Don't crash if fields are missing
We don't need to crash if someone feeds us a CSV file that is missing
common DSpace fields like title, type, and subject.
2021-03-21 19:47:29 +02:00
Alan Orth cfe09f7126
Add SPDX short license identifier to all Python files
See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/
2021-03-19 16:04:40 +02:00
Alan Orth 898bb412c3
Add checks and unsafe fixes for mojibake
This detects whether text has likely been encoded in one encoding
and decoded in another, perhaps multiple times. This often results
in display of "mojibake" characters.

For example, a file encoded in UTF-8 is opened as CP-1252 (Windows
Latin codepage) in Microsoft Excel, and saved again as UTF-8. You
will see strings like this in the resulting file:

    - CIAT PublicaÃ§ao
    - CIAT PublicaciÃ³n

The correct version of these in UTF-8 would be:

    - CIAT Publicaçao
    - CIAT Publicación

I use a code snippet from Martijn Pieters on StackOverflow to detect
whether a string is "weird" as determined by the excellent "fixes
text for you" (ftfy) Python library, then check if a weird string
encodes as CP-1252 or not. If so, I can try to fix it.

See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python
2021-03-19 10:22:21 +02:00
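The encoding round trip from the commit above can be reproduced with the standard codecs alone. Note that this sketch does not use ftfy or the StackOverflow snippet; it only shows how the mojibake arises and that re-encoding as CP-1252 reverses it. The function names are hypothetical.

```python
def make_mojibake(text):
    """Simulate UTF-8 bytes being misread as CP-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text):
    """Reverse the misreading: back to bytes as CP-1252, then read as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "CIAT Publicación"
garbled = make_mojibake(original)

print(garbled)
print(repair_mojibake(garbled))
```

A string that cannot be encoded as CP-1252 raises UnicodeEncodeError in `repair_mojibake`, which is one signal that it is not this kind of mojibake.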
Alan Orth 9f2dc0a0f5
Add support for detecting duplicate items
This uses the title, type, and date issued as a sort of "key" when
determining if an item already exists in the data set.
2021-03-17 09:53:07 +02:00
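The "key" idea in the commit above can be sketched with a set of tuples. The dictionary keys and function name below are placeholders, not the field names the tool actually reads.

```python
def find_duplicates(items):
    """Return items whose (title, type, date issued) key was already seen."""
    seen = set()
    duplicates = []
    for item in items:
        key = (item["title"], item["type"], item["date_issued"])
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
    return duplicates


items = [
    {"title": "Report A", "type": "Report", "date_issued": "2021-03"},
    {"title": "Report A", "type": "Report", "date_issued": "2021-03"},
    {"title": "Report A", "type": "Report", "date_issued": "2020-01"},
]
print(len(find_duplicates(items)))
```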
Alan Orth 330a7b7b9c
Don't unnecessarily rewrite DataFrames for checks
By using df[column] = df[column].apply(check...) we were re-writing
the DataFrame every time we returned from a check. We don't actually
need to return a value at all, as the point of checks is to print a
warning to the screen. In Python a "return" statement without a
value returns None.

I haven't measured the impact of this, but I assume it will mean we
are faster and use less memory.
2021-03-16 16:04:19 +02:00
Alan Orth 10612cf891
Remove checks for invalid multi-value separators
Now that I no longer treat the fix for these as "unsafe" I don't
actually need to check for them; I can just fix them when I see them.
2021-03-14 21:01:21 +02:00
Alan Orth c9c277f8df
csv_metadata_quality/app.py: Update help text
Use DCTERMS fields where possible.
2021-03-14 10:52:58 +02:00
Alan Orth 1008acf35e
Always fix invalid multi-value separators
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data in a DSpace repository.
2021-03-13 12:59:45 +02:00
Alan Orth 6e4b0e5c1b
Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
Alan Orth dd2cfae047
csv_metadata_quality/app.py: Match dcterms.issued for dates
We used to only check fields that had "date" in their name because
we were using DSpace's default dc.date.* fields. Now we are using
dcterms.issued so I will add that one as well.
2021-02-28 15:11:06 +02:00
Alan Orth a7fc5a246c
Colorize output
Messages will be colorized:

- Red for errors
- Yellow for warnings or information
- Green for fixes
2021-02-21 13:01:25 +02:00
Alan Orth 0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
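One way to handle the trailing separator from the commit above is to split on "||" and drop the blank values. This is a sketch with a hypothetical function name, not the project's actual fix.

```python
def drop_empty_values(field):
    """Remove blank values left behind by stray multi-value separators."""
    values = [value for value in field.split("||") if value.strip() != ""]
    return "||".join(values)


print(drop_empty_values("Kenya||Tanzania||"))
```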
Alan Orth 28b5996aa6
Output field name for more fixes and checks
This helps identify which field has the error.
2020-01-16 12:35:11 +02:00
Alan Orth 87181bc7b8
Run black, isort, and flake8. 2020-01-15 11:41:31 +02:00
Alan Orth 49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
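NFC normalization as mentioned in the commit above is available in the standard library. A minimal sketch; the helper name and the sample string are illustrative.

```python
import unicodedata


def is_nfc(text):
    """Return True if the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


# "e" followed by a combining acute accent (two code points).
decomposed = "Ame\u0301lie"
composed = unicodedata.normalize("NFC", decomposed)

print(is_nfc(decomposed), len(decomposed))
print(is_nfc(composed), len(composed))
```

Both strings render the same on screen, but only the composed one compares equal to a precomposed "é".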
Alan Orth 8435ee242d
Experimental language detection using langid
Works decently well assuming the title, abstract, and citation fields
are an accurate representation of the language as identified by the
language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3)
values seamlessly.

This includes updated pipenv environment, test data, pytest tests
for both correct and incorrect ISO 639-1 and ISO 639-3 languages,
and a new command line option "-e".
2019-09-26 13:46:32 +03:00
Alan Orth f304ca6a33
csv_metadata_quality/app.py: Use simpler column iteration
I don't know where I got the other one...
2019-09-21 17:19:39 +03:00
Alan Orth 280a99c8a8
Sort imports with isort
See: https://sourcery.ai/blog/python-best-practices/
2019-08-29 01:15:04 +03:00
Alan Orth d97dcd19db
Format with black 2019-08-29 01:10:39 +03:00
Alan Orth 81190d56bb
Add fix for missing space after commas
This happens in names very often, for example in the contributor
and citation fields. I will limit this to those fields for now and
hide this fix behind the "unsafe fixes" option until I test it more.
2019-08-28 00:05:52 +03:00
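The fix in the commit above can be approximated with a single substitution. The regex and function name are assumptions for illustration, not necessarily what the tool uses.

```python
import re


def fix_comma_space(value):
    """Insert a space after any comma that is directly followed by a word character."""
    # Note: \w also matches digits, so "1,000" would become "1, 000".
    return re.sub(r",(\w)", r", \1", value)


print(fix_comma_space("Orth,Alan"))
print(fix_comma_space("Orth, Alan"))
```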
Alan Orth 113e7cd8b6
csv_metadata_quality/app.py: Add ability to skip fields
The user may want to skip the checking and fixing of certain fields
in the input file.
2019-08-27 00:10:07 +03:00
Alan Orth ed5612fbcf
Add column name to output in date checks
This makes it easier to understand where the error is in case a CSV
has multiple date fields, for example:

    Missing date (dc.date.issued).
    Missing date (dc.date.issued[]).

If you have 126 items and you get 126 "Missing date" messages then
it's likely that 100 of the items have dates in one field, and the
others have dates in another field.
2019-08-21 15:31:12 +03:00
Alan Orth 9ce7dc6716
Add check for uncommon filenames
Generally we want people to upload documents in accessible formats
like PDF, Word, Excel, and PowerPoint. This check warns if a file
is using an uncommon extension.
2019-08-10 23:41:16 +03:00
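The filename check in the commit above can be sketched as an extension lookup. The extension set below is a guess at "PDF, Word, Excel, and PowerPoint" and the function name is hypothetical.

```python
import os

# Assumed list of common extensions; the project's own list may differ.
COMMON_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def has_uncommon_extension(filename):
    """Return True if the file extension is not in the common set."""
    _, extension = os.path.splitext(filename)
    return extension.lower() not in COMMON_EXTENSIONS


print(has_uncommon_extension("report.PDF"))
print(has_uncommon_extension("data.zip"))
```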
Alan Orth 62fea95087
Improve suspicious character detection
Now it will print just the part of the metadata value that contains
the suspicious character (up to 80 characters, so we don't make the
line break on terminals that use 80 character width by default).

Also, print the name of the field in which the metadata value is so
that it is easier for the user to locate.
2019-08-09 01:25:40 +03:00
Alan Orth 8772bdec51
csv_metadata_quality/app.py: Explicitly exit with success 2019-08-04 09:10:37 +03:00
Alan Orth 6d4ecd75aa
csv_metadata_quality/app.py: Close files before exit 2019-08-04 09:10:19 +03:00
Alan Orth f4e7fd73f5
csv_metadata_quality/app.py: Handle Ctrl-C
Instead of printing an ugly two-page stack trace.
2019-08-03 21:11:57 +03:00
Alan Orth 0561300ebe
Add option to print version with `--version` or `-V`
I guess `-v` is more commonly used for "verbose" so I will use the
short option of `-V` for version.
2019-08-02 00:09:54 +03:00
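With argparse, the option from the commit above is a single argument using the built-in "version" action. A sketch only: the program name and the version string are placeholders, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser that prints the version for --version or -V."""
    parser = argparse.ArgumentParser(prog="csv-metadata-quality")
    parser.add_argument(
        "--version",
        "-V",
        action="version",
        version="%(prog)s 0.0.0",  # placeholder version string
    )
    return parser


if __name__ == "__main__":
    build_parser().parse_args()
```

The "version" action prints the string and exits with status 0, so no further handling is needed.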