csv-metadata-quality

mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-08-24 05:41:50 +02:00

Author	SHA1	Message	Date
Alan Orth	576b3a3638	csv_metadata_quality/__main__.py: Fix spacing Identified by flake8.	2019-08-01 23:28:16 +03:00
Alan Orth	a6eba0fc1a	.build.yml: Use pipenv run instead of pipenv shell The latter creates a shell and of course it doesn't ever exit!	2019-07-31 17:58:21 +03:00
Alan Orth	857492cd93	.build.yml: Try fix CLI test I was writing the CLI output to /dev/null because I was lazy, but actually this might break the SourceHut build.	2019-07-31 17:54:49 +03:00
Alan Orth	db42fbfea9	.build.yml: Try to run the script itself	2019-07-31 17:48:17 +03:00
Alan Orth	db42fbfea9	.build.yml: Try to run the script itself	2019-07-31 17:48:17 +03:00
Alan Orth	734ae7f011	setup.py: Add "DSpace" to description I suppose someone searching PyPI (eventually) would be happy to see that this was written specifically with DSpace in mind.	2019-07-31 17:42:27 +03:00
Alan Orth	fd3861e7cd	README.md: Update installation and usage instructions It is much easier now that I have created a proper package.	2019-07-31 17:41:18 +03:00
Alan Orth	cec1a34dfe	.gitignore: Ignore egg cache from distutils	2019-07-31 17:38:54 +03:00
Alan Orth	9100efdf50	Re-work as a proper standalone Python package Add a setup.py so that installation is easier and a standalone CLI script called csv-metadata-quality is provided. Now the user only needs to run this from a virtual environment inside the project directory: $ pip install . Eventually I could publish this on PyPI when I settle on a more appropriate package name. See: https://packaging.python.org/tutorials/packaging-projects/ See: https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/	2019-07-31 17:34:36 +03:00
Alan Orth	4c4f4a3ba2	README.md: Update todos	2019-07-31 16:33:49 +03:00
Alan Orth	22cc7bc793	README.md: Improve section on unsafe fixes	2019-07-31 16:00:05 +03:00
Alan Orth	915327539a	pytest.ini: Ignore deprecation warnings These come from third-party libraries I have no control over. See: https://docs.pytest.org/en/latest/warnings.html#deprecationwarning-and-pendingdeprecationwarning	2019-07-31 13:33:20 +03:00
Alan Orth	63ffd77723	data/test.csv: Clarify that newline is a line feed	2019-07-31 13:03:43 +03:00
Alan Orth	b9d041927e	Update python requirements Generated using pipenv: $ pipenv lock -r > requirements.txt $ pipenv lock -r -d > requirements-dev.txt	2019-07-30 23:06:26 +03:00
Alan Orth	020c87768e	data/test.csv: Add item with missing date	2019-07-30 21:02:51 +03:00
Alan Orth	abf74909ee	data/test.csv: Add missing date I was meaning to test for an invalid multi-value separator here!	2019-07-30 21:01:42 +03:00
Alan Orth	40d5f7d81b	Add support for removing newlines This was tricky because of the nature of newlines. In actuality we are removing Unix line feeds here (U+000A) because Windows carriage returns are actually already removed by the string stripping in the whitespace fix. Creating the test case in Vim was difficult because I couldn't figure out how to manually enter a line feed character. In the end I used a search and replace on a known pattern like "ALAN", replacing it with \r. Neither entering the Unicode code point (U+000A) directly or typing an "Enter" character after ^V worked. Grrr.	2019-07-30 20:05:12 +03:00
Alan Orth	346e66ca98	README.md: Add more information to introduction	2019-07-30 17:44:30 +03:00
Alan Orth	bad2fce124	pytest.ini: Don't print captured output It makes the summary of passes and fails more annoying to read due to the lines being long and including newlines. I am actually not sure why this fixes it, though... See: https://pytest.readthedocs.io/en/latest/capture.html	2019-07-30 16:43:31 +03:00
Alan Orth	3c798fb504	Use pycountry instead of iso-639 for languages The latter is a fork that hasn't been updated since 2016 and the original still seems to be well maintained, with recent database updates as well as tests for Python 3.7. Also, pycountry supports ISO 3166-2 (administrative zones), which we could eventually use for sub regions.	2019-07-30 16:39:26 +03:00
Alan Orth	a85b410ab9	README.md: Improve introduction and functionality	2019-07-30 16:09:15 +03:00
Alan Orth	4e3511cd55	csv_metadata_quality/check.py: Fix AGROVOC lookup We actually only need to see if there are more than zero matches because a term like "Nigeria" will match in English, Spanish, etc, whereas terms that really don't match will have zero results.	2019-07-30 14:51:44 +03:00
Alan Orth	1fd3d8bc2f	Add pytest.ini from responder This seems to produce a more informative output, though I'm not sure how to filter the annoying deprecation warnings pytest throws about things that are usually done by modules I'm using, not by me. From: https://github.com/taoufik07/responder/blob/master/pytest.ini	2019-07-30 00:45:18 +03:00
Alan Orth	b4aaffd6f1	Update python requirements Generated using pipenv: $ pipenv lock -r > requirements.txt	2019-07-30 00:34:32 +03:00
Alan Orth	5ea2e856e4	data/test.csv: Add dc.subject column for AGROVOC tests	2019-07-30 00:33:31 +03:00
Alan Orth	1f65a28307	Add support for validating subjects against AGROVOC Checks values in the dc.subject or dcterms.subject field against the AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py. See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/ See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py	2019-07-30 00:30:31 +03:00
Alan Orth	bb882315f1	Update python requirements Generated using pipenv: $ pipenv lock -r > requirements.txt $ pipenv lock -r -d > requirements-dev.txt	2019-07-29 19:10:49 +03:00
Alan Orth	a36454a3ac	Add support for validating languages Will validate against ISO 639-2 or ISO 639-3 depending on how long the language field is. Otherwise will return that the language is invalid. Does not currently have any support for generic values like "Other".	2019-07-29 18:59:42 +03:00
Alan Orth	1978fa7b48	Add iso-639 library to validate languages	2019-07-29 18:58:50 +03:00
Alan Orth	e49b4e8f22	README.md: Try to simplify list of functionality	2019-07-29 18:25:38 +03:00
Alan Orth	0eb852a65b	README.md: Improve note about unsafe options	2019-07-29 18:14:50 +03:00
Alan Orth	8d92498bbf	Add .gitignore	2019-07-29 18:10:29 +03:00
Alan Orth	8c34c2d6e6	README.md: Add note about removing duplicate values	2019-07-29 18:09:48 +03:00
Alan Orth	1e444cf040	Add fix for duplicate metadata values	2019-07-29 18:05:03 +03:00
Alan Orth	d7888d59a8	csv_metadata_quality/check.py: Return date even if it is invalid Otherwise it is missing from the final CSV and then we can't even fix it. :)	2019-07-29 17:40:14 +03:00
Alan Orth	8509006165	data/test.csv: Use more descriptive tests To make it obvious what each item is testing.	2019-07-29 17:38:46 +03:00
Alan Orth	e33551776c	README.md: Update note about unsafe options	2019-07-29 17:25:42 +03:00
Alan Orth	7f781d7077	README.md: Finish writing usage section	2019-07-29 17:21:34 +03:00
Alan Orth	50ae4e17f2	csv_metadata_quality/fix.py: Fix indent	2019-07-29 17:14:48 +03:00
Alan Orth	fa4fa3491b	Add check for "suspicious" characters These standalone characters often indicate issues with encoding or copy/paste in languages with accents like French and Spanish. For example: foreˆt should be forêt. It is not possible to fix these issues automatically, but this will print a warning so you can notify the owner of the data.	2019-07-29 17:08:49 +03:00
Alan Orth	8047a57cc5	Add support for fixing "unnecessary" Unicode These are things like non-breaking spaces, "replacement" characters, etc that add nothing to the metadata and often cause errors during parsing or displaying in a UI.	2019-07-29 16:38:10 +03:00
Alan Orth	ae66382046	README.md: Add note about unnecessary Unicode fixes	2019-07-29 16:34:39 +03:00
Alan Orth	d73f7b54b1	csv_metadata_quality/app.py: Improve comments	2019-07-29 16:24:35 +03:00
Alan Orth	42920e9c7c	Test Python regular expression matches directly Match objects always have a boolean value of True. See: https://docs.python.org/3.7/library/re.html	2019-07-29 16:16:30 +03:00
Alan Orth	7b5db1f5d9	csv_metadata_quality/app.py: Remove erroneous comment	2019-07-29 16:15:25 +03:00
Alan Orth	9ac9474c69	README.md: Add todo about duplicates	2019-07-29 12:40:53 +03:00
Alan Orth	e000bd1f88	README.md: Add Travis CI badge	2019-07-29 12:19:10 +03:00
Alan Orth	3554c2991f	README.md: Add note about Python version	2019-07-29 12:15:09 +03:00
Alan Orth	8e3e7a3573	.travis.yml: Only test Python 3.6 and 3.7 I'm using f-strings, which are only supported in Python 3.6+.	2019-07-29 12:13:55 +03:00
Alan Orth	913500cfff	.travis.yml: Remove env This seems to make Travis run one job for each variable.	2019-07-29 11:51:12 +03:00
Alan Orth	17f4a26ac0	Add Travis CI support I'm happy with SourceHut's build support, but Travis allows us to test other Python versions.	2019-07-29 11:45:58 +03:00
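
Several of the commits above describe small string-level fixes and checks: removing line feeds, dropping duplicate metadata values, warning about "suspicious" characters, and testing regular expression matches directly. The following is a minimal sketch of how such helpers might look, not the code from the repository. The function names, the "||" multi-value separator, and the exact set of suspicious characters are assumptions made for illustration only.

```python
import re

# Assumed set of standalone accent characters that often signal encoding or
# copy/paste problems: acute, diaeresis, modifier circumflex, small tilde.
SUSPICIOUS = re.compile("[\u00b4\u00a8\u02c6\u02dc]")


def remove_newlines(value: str) -> str:
    """Remove Unix line feeds (U+000A) from a metadata value."""
    return value.replace("\n", "")


def remove_duplicates(field: str, separator: str = "||") -> str:
    """Drop repeated values from a multi-value field, keeping first-seen order.

    The "||" separator is an assumption, not taken from the commit log.
    """
    seen = []
    for value in field.split(separator):
        if value not in seen:
            seen.append(value)
    return separator.join(seen)


def find_suspicious(value: str):
    """Return the first suspicious character found, or None."""
    match = SUSPICIOUS.search(value)
    # Test the match directly: a Match object is always truthy,
    # and re.search returns None when nothing matches.
    if match:
        return match.group()
    return None
```

A check like `find_suspicious` can only warn, since, as the commit message notes, such problems cannot be fixed automatically; the two fix helpers return a cleaned string that can be written back to the CSV.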

551 Commits