csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-27 16:18:19 +01:00

Author	SHA1	Message	Date
Alan Orth	d21d2621e3	csv_metadata_quality/app.py: read fields as strings I suspect this undermines the PyArrow backend performance gains in recent Pandas 2.0.0, but we are dealing with messy data sometimes and we must rely on data being strings.	2023-06-12 10:42:50 +03:00
Alan Orth	e2d46e9495	csv_metadata_quality/app.py: skip newline fix on description The description field often has free-form text like the abstract and there are too many legitimate newlines here to be correcting them automatically.	2023-04-22 12:16:13 -07:00
Alan Orth	d4aed378cf	Switch to pandas 2.0.0rc1 Seems to work fine with the new PyArrow datatypes.	2023-03-22 12:16:56 +03:00
Alan Orth	d661ffe439	Check comma space on bibliographicCitation too The regex was only matching `dc.identifier.citation`, but we need to match `dcterms.bibliographicCitation` too.	2023-03-10 16:13:16 +03:00
Alan Orth	45a310387a	Don't fix multi-value separators on citations	2023-03-10 16:12:30 +03:00
Alan Orth	8db1e36a6d	csv_metadata_quality/app.py: skip abstract in separator check Also skip abstract in the separator check, since it's rare to have any "|" here, but more likely that if one is present then it's for a reason.	2023-02-13 10:37:33 +03:00
Alan Orth	545bb8cd0c	csv_metadata_quality/app.py: disable whitespace on abstracts It's too aggressive on abstracts. If people paste in text from a PDF there are often newlines, and most of the time this is what they want.	2023-02-07 16:48:40 +03:00
Alan Orth	3596381d03	csv_metadata_quality/app.py: separators fix Don't run the invalid separators fix on title fields because some items use "|" in the title to indicate something like a subtitle. For example: Progress Review and Work Planning Meeting | Day 1	2023-01-24 14:13:55 +03:00
Alan Orth	566c2b45cf	Remove Excel support I never used this and it seems xlrd doesn't even support .xlsx anymore anyways. If this was needed I could theoretically use openpyxl but I'd rather just stick to CSV.	2022-09-02 16:14:24 +03:00
Alan Orth	040e56fc76	Improve exclude function When a user explicitly requests that a field be excluded with -x we skip that field in most checks. Up until now that did not include the item-based checks using a transposed dataframe because we don't know the metadata field names (labels) until we iterate over them. Now the excludes are respected for item-based checks.	2022-09-02 15:59:22 +03:00
Alan Orth	1f76247353	csv_metadata_quality/app.py: rework exclude/skip Instead of processing the excludes inside the for column loop we do it once before and then only need to check if the current column is in the list.	2022-09-02 10:35:04 +03:00
Alan Orth	689ee184f7	Add unsafe check to add missing regions	2022-07-28 16:52:43 +03:00
Alan Orth	a7727b8431	Add support for dropping invalid AGROVOC terms Requires --agrovoc-fields <field.name> to do the actual validation, and -d to drop invalid ones.	2021-12-23 12:43:55 +02:00
Alan Orth	e7322efadd	csv_metadata_quality/app.py: move unnecessary Unicode fix We actually want to do this after we try to fix mojibake with ftfy. These "unnecessary" Unicode characters could actually help ftfy in some cases because often times they indicate that some character from another encoding was there before (like an accent, dash, or smart quote).	2021-12-15 13:53:25 +02:00
Alan Orth	ccc2a73456	Add check for countries without matching regions If we have country "Kenya" we should have region "Eastern Africa" according to the UN M.49 geolocation scheme.	2021-12-08 15:02:20 +02:00
Alan Orth	3b40a68279	Add check for title in citation This checks if the item title exists in the citation. If it is not present it could just be missing, or could have minor differences in the whitespace, accents, etc.	2021-12-05 15:52:42 +02:00
Alan Orth	999cc65097	csv_metadata_quality/app.py: adjust mojibake check If unsafe fixes (-u) are enabled then we don't need to do the check first before actually fixing them. Doing the check first creates extra output that needs to be reviewed by the user.	2021-12-05 15:18:35 +02:00
Alan Orth	787fa9e8d9	Add field name to fix.newlines output	2021-10-08 14:36:43 +03:00
Alan Orth	8a27fb2589	Add check for missing DOIs Sometimes an editor includes a DOI in the citation field, but does not add a standalone DOI field.	2021-10-06 21:25:39 +03:00
Alan Orth	bd8943f36a	csv_metadata_quality/app.py: Don't crash if fields are missing We don't need to crash if someone feeds us a CSV file that is missing common DSpace fields like title, type, and subject.	2021-03-21 19:47:29 +02:00
Alan Orth	cfe09f7126	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-19 16:04:40 +02:00
Alan Orth	898bb412c3	Add checks and unsafe fixes for mojibake This detects whether text has likely been encoded in one encoding and decoded in another, perhaps multiple times. This often results in display of "mojibake" characters. For example, a file encoded in UTF-8 is opened as CP-1252 (Windows Latin codepage) in Microsoft Excel, and saved again as UTF-8. You will see strings like this in the resulting file: - CIAT PublicaÃ§ao - CIAT PublicaciÃ³n The correct version of these in UTF-8 would be: - CIAT Publicaçao - CIAT Publicación I use a code snippet from Martijn Pieters on StackOverflow to detect whether a string is "weird" as determined by the excellent "fixes text for you" (ftfy) Python library, then check if a weird string encodes as CP-1252 or not. If so, I can try to fix it. See: https://stackoverflow.com/questions/29071995/identify-garbage-unicode-string-using-python	2021-03-19 10:22:21 +02:00
Alan Orth	9f2dc0a0f5	Add support for detecting duplicate items This uses the title, type, and date issued as a sort of "key" when determining if an item already exists in the data set.	2021-03-17 09:53:07 +02:00
Alan Orth	330a7b7b9c	Don't unnecessarily rewrite DataFrames for checks By using df[column] = df[column].apply(check...) we were re-writing the DataFrame every time we returned from a check. We don't actually need to return a value at all, as the point of checks is to print a warning to the screen. In Python a "return" statement without a variable returns None. I haven't measured the impact of this, but I assume it will mean we are faster and use less memory.	2021-03-16 16:04:19 +02:00
Alan Orth	10612cf891	Remove checks for invalid multi-value separators Now that I no longer treat the fix for these as "unsafe" I don't actually need to check for them; I can just fix them when I see them.	2021-03-14 21:01:21 +02:00
Alan Orth	c9c277f8df	csv_metadata_quality/app.py: Update help text Use DCTERMS fields where possible.	2021-03-14 10:52:58 +02:00
Alan Orth	1008acf35e	Always fix invalid multi-value separators This is no longer classified as "unsafe" as I have yet to see a case where this was intentional, and it always causes issues when you import the data in a DSpace repository.	2021-03-13 12:59:45 +02:00
Alan Orth	6e4b0e5c1b	Add validation of SPDX license identifiers Currently this only checks the dcterms.license field and the result will only be a warning.	2021-03-11 10:33:16 +02:00
Alan Orth	dd2cfae047	csv_metadata_quality/app.py: Match dcterms.issued for dates We used to only check fields that had "date" in their name because we were using DSpace's default dc.date.* fields. Now we are using dcterms.issued so I will add that one as well.	2021-02-28 15:11:06 +02:00
Alan Orth	a7fc5a246c	Colorize output Messages will be colorized: - Red for errors - Yellow for warnings or information - Green for fixes	2021-02-21 13:01:25 +02:00
Alan Orth	0dc66c5c4e	Expand check/fix for multi-value separators I just came across some metadata that had unnecessary multi-value separators at the end of a field, causing a blank value to be used. For example: "Kenya||Tanzania||"	2021-01-03 15:30:03 +02:00
Alan Orth	28b5996aa6	Output field name for more fixes and checks This helps identify which field has the error.	2020-01-16 12:35:11 +02:00
Alan Orth	87181bc7b8	Run black, isort, and flake8.	2020-01-15 11:41:31 +02:00
Alan Orth	49e3543878	Add Unicode normalization This will check all strings for un-normalized Unicode characters. Normalization is done using NFC. This includes tests and updated sample data (data/test.csv). See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html	2020-01-15 11:37:54 +02:00
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	f304ca6a33	csv_metadata_quality/app.py: Use simpler column iteration I don't know where I got the other one...	2019-09-21 17:19:39 +03:00
Alan Orth	280a99c8a8	Sort imports with isort See: https://sourcery.ai/blog/python-best-practices/	2019-08-29 01:15:04 +03:00
Alan Orth	d97dcd19db	Format with black	2019-08-29 01:10:39 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	113e7cd8b6	csv_metadata_quality/app.py: Add ability to skip fields The user may want to skip the checking and fixing of certain fields in the input file.	2019-08-27 00:10:07 +03:00
Alan Orth	ed5612fbcf	Add column name to output in date checks This makes it easier to understand where the error is in case a CSV has multiple date fields, for example: Missing date (dc.date.issued). Missing date (dc.date.issued[]). If you have 126 items and you get 126 "Missing date" messages then it's likely that 100 of the items have dates in one field, and the others have dates in another field.	2019-08-21 15:31:12 +03:00
Alan Orth	9ce7dc6716	Add check for uncommon filenames Generally we want people to upload documents in accessible formats like PDF, Word, Excel, and PowerPoint. This check warns if a file is using an uncommon extension.	2019-08-10 23:41:16 +03:00
Alan Orth	62fea95087	Improve suspicious character detection Now it will print just the part of the metadata value that contains the suspicious character (up to 80 characters, so we don't make the line break on terminals that use 80 character width by default). Also, print the name of the field in which the metadata value is so that it is easier for the user to locate.	2019-08-09 01:25:40 +03:00
Alan Orth	8772bdec51	csv_metadata_quality/app.py: Explicitly exit with success	2019-08-04 09:10:37 +03:00
Alan Orth	6d4ecd75aa	csv_metadata_quality/app.py: Close files before exit	2019-08-04 09:10:19 +03:00
Alan Orth	f4e7fd73f5	csv_metadata_quality/app.py: Handle Ctrl-C Instead of printing an ugly two-page stack trace.	2019-08-03 21:11:57 +03:00
Alan Orth	0561300ebe	Add option to print version with `--version` or `-V` I guess `-v` is more commonly used for "verbose" so I will use the short option of `-V` for version.	2019-08-02 00:09:54 +03:00
Alan Orth	bf876a046a	Rework AGROVOC validation AGROVOC validation is now disabled by default, but can be enabled on a field-by-field basis. For example, countries and regions are also present in AGROVOC. Fields with these values can be enabled using the new `--agrovoc-fields` option. I reworked the script output to show the field name when printing an invalid term so that the user knows in which field the term is.	2019-08-01 23:51:58 +03:00
Alan Orth	9100efdf50	Re-work as a proper standalone Python package Add a setup.py so that installation is easier and a standalone CLI script called csv-metadata-quality is provided. Now the user only needs to run this from a virtual environment inside the project directory: $ pip install . Eventually I could publish this on PyPI when I settle on a more appropriate package name. See: https://packaging.python.org/tutorials/packaging-projects/ See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/	2019-07-31 17:34:36 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
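
The mojibake commit above (898bb412c3) describes UTF-8 text that was read as CP-1252 and saved again. The round trip it describes can be sketched with the standard library alone. This is only an illustration, not the project's code: the function name `fix_cp1252_mojibake` is hypothetical, and the ftfy "weirdness" test that the commit mentions is deliberately left out and replaced here by a plain encode/decode attempt.

```python
from typing import Optional


def fix_cp1252_mojibake(value: str) -> Optional[str]:
    """Return a repaired string, or None if no repair applies.

    Hypothetical helper: it assumes the damage came from UTF-8 bytes
    being decoded as CP-1252, and undoes exactly that one step.
    """
    try:
        # Reverse the bad decode: get the original bytes back, then
        # decode them as the UTF-8 they were meant to be.
        repaired = value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP-1252, or the bytes are not valid
        # UTF-8: the value is probably not this kind of mojibake.
        return None
    # Plain ASCII survives the round trip unchanged; report no fix.
    return repaired if repaired != value else None


if __name__ == "__main__":
    for sample in ["CIAT PublicaÃ§ao", "CIAT PublicaciÃ³n", "CIAT Publicación"]:
        print(repr(sample), "->", repr(fix_cp1252_mojibake(sample)))
```

With the two strings from the commit message this yields "CIAT Publicaçao" and "CIAT Publicación", while an already correct value returns None. A real check would likely need to be more conservative, which is presumably why the commit keeps the fix behind the unsafe option.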

70 Commits