csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-09-10 21:57:02 +02:00

Author	SHA1	Message	Date
Alan Orth	8435ee242d	Experimental language detection using langid Works decently well assuming the title, abstract, and citation fields are an accurate representation of the language as identified by the language field. Handles ISO 639-1 (alpha 2) and ISO 639-3 (alpha 3) values seamlessly. This includes updated pipenv environment, test data, pytest tests for both correct and incorrect ISO 639-1 and ISO 639-3 languages, and a new command line option "-e".	2019-09-26 13:46:32 +03:00
Alan Orth	86d4623fd3	More ISO 639-1 and ISO 639-3 fixes ISO 639-1 uses two-letter codes and ISO 639-3 uses three-letter codes. Technically there are ISO 639-2/T and ISO 639-2/B, which also use three-letter codes, but those are not supported by the pycountry library so I won't even worry about them. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-26 07:44:39 +03:00
Alan Orth	f304ca6a33	csv_metadata_quality/app.py: Use simpler column iteration I don't know where I got the other one...	2019-09-21 17:19:39 +03:00
Alan Orth	d9fc09f121	Fix references to ISO 639 It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2 is the three-letter codes, aka alpha2 and alpha3. See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes	2019-09-11 16:36:53 +03:00
Alan Orth	280a99c8a8	Sort imports with isort See: https://sourcery.ai/blog/python-best-practices/	2019-08-29 01:15:04 +03:00
Alan Orth	d97dcd19db	Format with black	2019-08-29 01:10:39 +03:00
Alan Orth	c354a3687c	Release version 0.2.2	2019-08-28 00:10:17 +03:00
Alan Orth	81190d56bb	Add fix for missing space after commas This happens in names very often, for example in the contributor and citation fields. I will limit this to those fields for now and hide this fix behind the "unsafe fixes" option until I test it more.	2019-08-28 00:05:52 +03:00
Alan Orth	113e7cd8b6	csv_metadata_quality/app.py: Add ability to skip fields The user may want to skip the checking and fixing of certain fields in the input file.	2019-08-27 00:10:07 +03:00
Alan Orth	884e8f970d	csv_metadata_quality/check.py: Simplify AGROVOC check I recycled this code from a separate agrovoc-lookup.py script that checks lines in a text file to see if they are valid AGROVOC terms or not. There I was concerned about skipping comments or something I think, but we don't need to check that here. We simply check the term that is in the field and inform the user if it's valid or not.	2019-08-21 16:35:29 +03:00
Alan Orth	ed5612fbcf	Add column name to output in date checks This makes it easier to understand where the error is in case a CSV has multiple date fields, for example: Missing date (dc.date.issued). Missing date (dc.date.issued[]). If you have 126 items and you get 126 "Missing date" messages then it's likely that 100 of the items have dates in one field, and the others have dates in another field.	2019-08-21 15:31:12 +03:00
Alan Orth	7255bf4707	Version 0.2.1	2019-08-11 10:39:39 +03:00
Alan Orth	232ff99898	csv_metadata_quality/fix.py: Add more unnecessary Unicode fixes Add a check for soft hyphens (U+00AD). In one sample CSV I have a normal hyphen followed by a soft hyphen in an ISBN. This causes the ISBN validation to fail.	2019-08-11 00:07:21 +03:00
Alan Orth	13d5221378	csv_metadata_quality/check.py: Fix test for False	2019-08-10 23:52:53 +03:00
Alan Orth	9ce7dc6716	Add check for uncommon filenames Generally we want people to upload documents in accessible formats like PDF, Word, Excel, and PowerPoint. This check warns if a file is using an uncommon extension.	2019-08-10 23:41:16 +03:00
Alan Orth	5ff584a8d7	Version 0.2.0	2019-08-09 01:39:51 +03:00
Alan Orth	62fea95087	Improve suspicious character detection Now it will print just the part of the metadata value that contains the suspicious character (up to 80 characters, so we don't make the line break on terminals that use 80 character width by default). Also, print the name of the field in which the metadata value is so that it is easier for the user to locate.	2019-08-09 01:25:40 +03:00
Alan Orth	8772bdec51	csv_metadata_quality/app.py: Explicitly exit with success	2019-08-04 09:10:37 +03:00
Alan Orth	6d4ecd75aa	csv_metadata_quality/app.py: Close files before exit	2019-08-04 09:10:19 +03:00
Alan Orth	f4e7fd73f5	csv_metadata_quality/app.py: Handle Ctrl-C Instead of printing an ugly two-page stack trace.	2019-08-03 21:11:57 +03:00
Alan Orth	85ae7bdc5a	Increment version to 0.1.0	2019-08-02 00:12:47 +03:00
Alan Orth	0561300ebe	Add option to print version with `--version` or `-V` I guess `-v` is more commonly used for "verbose" so I will use the short option of `-V` for version.	2019-08-02 00:09:54 +03:00
Alan Orth	bf876a046a	Rework AGROVOC validation AGROVOC validation is now disabled by default, but can be enabled on a field-by-field basis. For example, countries and regions are also present in AGROVOC. Fields with these values can be enabled using the new `--agrovoc-fields` option. I reworked the script output to show the field name when printing an invalid term so that the user knows in which field the term is.	2019-08-01 23:51:58 +03:00
Alan Orth	576b3a3638	csv_metadata_quality/__main__.py: Fix spacing Identified by flake8.	2019-08-01 23:28:16 +03:00
Alan Orth	9100efdf50	Re-work as a proper standalone Python package Add a setup.py so that installation is easier and a standalone CLI script called csv-metadata-quality is provided. Now the user only needs to run this from a virtual environment inside the project directory: $ pip install . Eventually I could publish this on PyPi when I settle on a more appropriate package name. See: https://packaging.python.org/tutorials/packaging-projects/ See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/	2019-07-31 17:34:36 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	3c798fb504	Use pycountry instead of iso-639 for languages The latter is a fork that hasn't been updated since 2016 and the original still seems to be well maintained, with recent database updates as well as tests for Python 3.7. Also, pycountry supports ISO 3166-2 (administrative zones), which we could eventually use for sub regions.	2019-07-30 16:39:26 +03:00
Alan Orth	4e3511cd55	csv_metadata_quality/check.py: Fix AGROVOC lookup We actually only need to see if there are more than zero matches because a term like "Nigeria" will match in English, Spanish, etc, whereas terms that really don't match will have zero results.	2019-07-30 14:51:44 +03:00
Alan Orth	1f65a28307	Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py	2019-07-30 00:30:31 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	d7888d59a8	csv_metadata_quality/check.py: Return date even if it is invalid Otherwise it is missing from the final CSV and then we can't even fix it. :)	2019-07-29 17:40:14 +03:00
Alan Orth	50ae4e17f2	csv_metadata_quality/fix.py: Fix indent	2019-07-29 17:14:48 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	d73f7b54b1	csv_metadata_quality/app.py: Improve comments	2019-07-29 16:24:35 +03:00
Alan Orth	42920e9c7c	Test Python regular expression matches directly Match objects always have a boolean value of True. See: https://docs.python.org/3.7/library/re.html	2019-07-29 16:16:30 +03:00
Alan Orth	7b5db1f5d9	csv_metadata_quality/app.py: Remove erroneous comment	2019-07-29 16:15:25 +03:00
Alan Orth	40e77db713	Add "unsafe fixes" runtime option In this case it fixes occurrences of invalid multi-value separators. DSpace uses "||" to separate multiple values in one field, but our editors sometimes give us files with mistakes like "|". We can fix these to be correct multi-value separators if we are sure that the metadata is not actually using "|" for some legitimate purpose.	2019-07-28 22:53:39 +03:00
Alan Orth	a93b5b31c5	Add support for command line arguments Currently only supports specifying input and output files with -i and -o. Eventually I'll add more options like dry run, debug, and maybe things like forcing unsafe fixes.	2019-07-28 20:31:57 +03:00
Alan Orth	87b1997051	Fix whitespace errors found by flake8	2019-07-28 17:47:28 +03:00
Alan Orth	aadb3117eb	csv_metadata_quality/app.py: Remove unused test input files	2019-07-28 17:45:05 +03:00
Alan Orth	e88d35ace3	csv_metadata_quality/app.py: Use regex in column match Check for a column that has "issn" or "isbn" in the name rather than by its explicit name, as the column is dc.identifier.issn now, but will be cg.issn in the future if CG Core v2 happens.	2019-07-28 17:27:20 +03:00
Alan Orth	196bb434fa	Add date validation I'm only concerned with validating issue dates here. In DSpace they are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory they could be any valid ISO8601 format). This also checks for cases where the date is missing and where the metadata has specified multiple dates like "1990||1991", as this is valid, but there is no practical value for it in our system.	2019-07-28 16:11:36 +03:00
Alan Orth	e2bb2d4df9	Main function should be "main()"	2019-07-27 23:09:16 +03:00
Alan Orth	c47c064a13	Make output less debuggy	2019-07-27 09:21:13 +03:00
Alan Orth	2b41f9416b	csv_metadata_quality/fix.py: Remove extra newline	2019-07-27 01:29:22 +03:00
Alan Orth	3cf9f9452b	csv_metadata_quality/check.py: Always return field We always need to return the field back so apply doesn't set it to null when creating the new data frame.	2019-07-27 01:28:08 +03:00
Alan Orth	18f26c343d	csv_metadata_quality/app.py: Fix path to test.csv	2019-07-27 00:25:30 +03:00
Alan Orth	84c3b17678	csv_metadata_quality/app.py: Add comment	2019-07-26 23:49:13 +03:00
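
Commits 40e77db713 and 196bb434fa above describe two of the core behaviours: repairing a lone "|" into the DSpace multi-value separator, and checking that issue dates are YYYY, YYYY-MM, or YYYY-MM-DD. The sketch below is only an illustration of those two ideas using the Python standard library. It is not the project's actual code: the function names `fix_separators` and `check_date`, the regular expressions, and every message other than "Missing date (...)" are assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical pattern: a single "|" that is not part of a "||" pair.
LONE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")

# Hypothetical pattern for the three date shapes named in the commit message.
DATE_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")

# Map the string length of each accepted shape to a strptime format.
DATE_FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def fix_separators(value: str) -> str:
    """Replace each lone "|" with "||" (an "unsafe" fix: only sensible
    when "|" is not used for some legitimate purpose in the metadata)."""
    return LONE_PIPE.sub("||", value)


def check_date(value: str, column: str):
    """Return a warning string for a problematic date, or None if it is fine."""
    if not value:
        return f"Missing date ({column})."
    if "||" in value:
        # Assumed wording; the commit only says multiple dates are flagged.
        return f"Multiple dates not allowed ({column}): {value}"
    if DATE_SHAPE.fullmatch(value):
        try:
            # Reject impossible values such as month 13 or day 32.
            datetime.strptime(value, DATE_FORMATS[len(value)])
            return None
        except ValueError:
            pass
    # Assumed wording for any other malformed value.
    return f"Invalid date ({column}): {value}"
```

For instance, `fix_separators("Kenya|Uganda")` yields `"Kenya||Uganda"`, and `check_date("", "dc.date.issued")` yields the "Missing date (dc.date.issued)." message quoted in commit ed5612fbcf. Checking the shape with a regular expression before calling `strptime` is a design choice of this sketch, since `strptime` alone would accept unpadded values like "1990-1".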

156 Commits