42 Commits

Author SHA1 Message Date
Alan Orth c6e7d6d9b5
Add flake8 to pipenv dev environment
To help check PEP8 formatting/style compliance.
2019-07-28 17:46:30 +03:00
Alan Orth aadb3117eb
csv_metadata_quality/app.py: Remove unused test input files 2019-07-28 17:45:05 +03:00
Alan Orth ce8f140c66
tests: Remove unused pytest import 2019-07-28 17:42:54 +03:00
Alan Orth e88d35ace3
csv_metadata_quality/app.py: Use regex in column match
Check for a column that has "issn" or "isbn" in the name rather
than by its explicit name, as the column is dc.identifier.issn now,
but will be cg.issn in the future if CG Core v2 happens.
2019-07-28 17:27:20 +03:00
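
Note on e88d35ace3: a minimal sketch of matching columns by pattern rather than by exact name. The helper `is_identifier_column`, the sample column list, and the case-insensitive match are assumptions for illustration, not the code in app.py; only dc.identifier.issn and cg.issn come from the commit message.

```python
import re

# Sample column names (hypothetical apart from the two ISSN names above).
columns = [
    "dc.title",
    "dc.identifier.issn",
    "cg.issn",
    "dc.identifier.isbn",
    "dc.date.issued",
]


def is_identifier_column(column, kind):
    """Return True if the column name contains kind ("issn" or "isbn")."""
    # Case-insensitive matching is an assumption made for this sketch.
    return re.search(kind, column, flags=re.IGNORECASE) is not None


issn_columns = [c for c in columns if is_identifier_column(c, "issn")]
isbn_columns = [c for c in columns if is_identifier_column(c, "isbn")]
```

Both the current and the possible future ISSN column name match, so a rename would not require a code change.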
Alan Orth 4687e2f5fa
README.md: Remove todo for date validation 2019-07-28 17:26:39 +03:00
Alan Orth 209a290de8
Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-07-28 17:12:20 +03:00
Alan Orth 646146d59f
Add Excel version of test file 2019-07-28 17:07:33 +03:00
Alan Orth c2f28194eb
Add xlrd to pipenv for Pandas read_excel support 2019-07-28 17:05:17 +03:00
Alan Orth 5771764ad2
data/test.csv: Add some new records to test dates
Test invalid, missing, and multiple dates.
2019-07-28 16:23:55 +03:00
Alan Orth 196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are almost always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO 8601 format).

This also checks for cases where the date is missing and where the
metadata specifies multiple dates like "1990||1991", which is valid
but has no practical value in our system.
2019-07-28 16:11:36 +03:00
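
Note on 196bb434fa: a rough sketch of the three cases described above (missing, multiple, invalid), assuming dates are limited to YYYY, YYYY-MM, and YYYY-MM-DD. The function `date_problem` and its return values are hypothetical; the check in the project prints its findings, while this version returns a short description so it is easy to try out.

```python
import re
from datetime import datetime

# Accept only the three shapes named in the commit message.
DATE_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
# Map the length of a well-shaped date to its strptime format.
FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}


def date_problem(field):
    """Return a description of the problem, or None if the date looks fine."""
    if field is None or field.strip() == "":
        return "missing date"
    # "||" separates multiple values, as in "1990||1991".
    if "||" in field:
        return "multiple dates"
    if not DATE_SHAPE.fullmatch(field):
        return "invalid date"
    try:
        # Rejects impossible values such as month 13 or February 30.
        datetime.strptime(field, FORMATS[len(field)])
    except ValueError:
        return "invalid date"
    return None
```

The regular expression checks the shape and `strptime` checks the values, because `strptime` on its own would also accept a one-digit month such as "1990-1".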
Alan Orth 73b4061c7b
Add ipython to pipenv dev packages 2019-07-28 10:06:41 +03:00
Alan Orth e2bb2d4df9
Main function should be "main()" 2019-07-27 23:09:16 +03:00
Alan Orth 7e16968bf2
README.md: Add todos 2019-07-27 19:24:35 +03:00
Alan Orth c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
Alan Orth a849615b41
Add tests for check functions
Relies on capturing stdout.

See: https://docs.pytest.org/en/5.0.1/capture.html
2019-07-27 02:10:13 +03:00
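
Note on a849615b41: the idea of testing a check through what it prints can be sketched as follows. In a pytest test the capture would come from pytest's capture support (see the link above); here the standard library's `contextlib.redirect_stdout` is swapped in so the example runs without pytest. `check_missing` and `run_and_capture` are hypothetical names, not functions from the project.

```python
import contextlib
import io


def check_missing(field):
    """Hypothetical check: print a warning, then return the field unchanged."""
    if not field:
        print("Missing value.")
    return field


def run_and_capture(func, *args):
    """Call func and return (result, everything it printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args)
    return result, buffer.getvalue()


result, output = run_and_capture(check_missing, "")
```

A test can then assert on `output` even though the check returns no verdict of its own.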
Alan Orth 2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline 2019-07-27 01:29:22 +03:00
Alan Orth 3cf9f9452b
csv_metadata_quality/check.py: Always return field
We always need to return the field back so apply doesn't set it to
null when creating the new data frame.
2019-07-27 01:28:08 +03:00
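
Note on 3cf9f9452b: the underlying behaviour is plain Python. A function without a return statement returns None, and an apply-style call builds the new column from whatever the function returns. The sketch below uses the built-in `map` as a stand-in for the data frame apply, and both check functions are hypothetical.

```python
def check_without_return(field):
    """Prints a warning but forgets to hand the field back."""
    if field == "":
        print("Missing value.")
    # No return statement, so Python returns None.


def check_with_return(field):
    """Prints a warning and always returns the field."""
    if field == "":
        print("Missing value.")
    return field


column = ["1990", "", "2019-07"]

# map() stands in for apply: the new column is made of the return values.
lost = list(map(check_without_return, column))
kept = list(map(check_with_return, column))
```

With the first function every value in the new column is None; with the second the column comes back unchanged.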
Alan Orth 1d861f263b
.build.yml: Fix setup script
I wasn't changing into the project directory so the pipenv virtual
environment was not getting created in the correct place.
2019-07-27 00:41:57 +03:00
Alan Orth 33121f8a01
.build.yml: Add tests 2019-07-27 00:38:34 +03:00
Alan Orth 41a30f1b07
Add initial tests
For now only test fixes because they return changed data. I'm not
sure how to test the checks, because they don't return data and I
can't modify them to return boolean values without breaking the app.
2019-07-27 00:36:40 +03:00
Alan Orth 103e630f6e
Add requirements-dev.txt
Generated with:

  $ pipenv lock -r -d > requirements-dev.txt
2019-07-27 00:33:52 +03:00
Alan Orth 99f00fcb85
Add pytest to pipenv dev environment 2019-07-27 00:32:53 +03:00
Alan Orth 18f26c343d
csv_metadata_quality/app.py: Fix path to test.csv 2019-07-27 00:25:30 +03:00
Alan Orth f2060adadf
Move tests.csv to data directory 2019-07-27 00:02:47 +03:00
Alan Orth 0a751c1f25
README.md: Add SourceHut build badge 2019-07-26 23:59:31 +03:00
Alan Orth 2eb48d8ed0
Add SourceHut build file
For now it only attempts to install the Python requirements using
pipenv. Later it will run tests with pytest.
2019-07-26 23:56:16 +03:00
Alan Orth 7fb7f7e03c
Add requirements.txt
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
2019-07-26 23:54:07 +03:00
Alan Orth df1087b26f
README.md: Improve introduction, checks, and todo 2019-07-26 23:50:41 +03:00
Alan Orth 84c3b17678
csv_metadata_quality/app.py: Add comment 2019-07-26 23:49:13 +03:00
Alan Orth 844b968098
tests/test.csv: Add invalid multi-value field separator 2019-07-26 23:48:45 +03:00
Alan Orth aaf3537ba4
Add check for invalid multi-value separators 2019-07-26 23:48:24 +03:00
Alan Orth 02f9d8a736
csv_metadata_quality/check.py: Add check for missing isbn values 2019-07-26 23:45:18 +03:00
Alan Orth 64e7a73417
README.md: Add information about checks and fixes 2019-07-26 23:20:16 +03:00
Alan Orth dfd961d720
Bring test.csv into project 2019-07-26 23:14:37 +03:00
Alan Orth e160b17fb0
Add ISSN and ISBN checks using python-stdnum 2019-07-26 23:14:10 +03:00
Alan Orth 30a4b0005f
csv_metadata_quality/fix.py: Remove test function 2019-07-26 22:56:40 +03:00
Alan Orth b657c51fd2
Add initial README.md with intro, license, and todo 2019-07-26 22:18:38 +03:00
Alan Orth 5c6453b397
Add GPLv3 license 2019-07-26 22:16:16 +03:00
Alan Orth 232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, and docs
in the future. It can currently be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard-coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00
Alan Orth ef5b8f7244
fix.py: Massive improvements
Use Python's str.strip() instead of kludgy regular expressions and
use split/join to handle multi-value fields more cleanly.
2019-07-26 19:31:55 +03:00
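
Note on ef5b8f7244: a minimal sketch of the split/strip/join approach for multi-value fields, assuming values are separated by "||" as in the "1990||1991" example in this log. The function name and the sample value are illustrative, not taken from fix.py.

```python
def strip_whitespace(field, separator="||"):
    """Strip leading and trailing whitespace from each value in a field."""
    # Split into individual values, clean each one, then rejoin.
    values = field.split(separator)
    return separator.join(value.strip() for value in values)


cleaned = strip_whitespace("  Kenya || Uganda  ||Tanzania")
# cleaned == "Kenya||Uganda||Tanzania"
```

A single-value field passes through the same code path, since splitting a string that has no separator yields a one-item list.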
Alan Orth 801870e0ba
Add fix.py
Initial working version of metadata cleaning script that fixes leading
and trailing whitespace (even in DSpace multi-value fields).
2019-07-26 19:08:28 +03:00
Alan Orth 21b78b9519
Initial commit
Pipenv environment with Pandas.
2019-07-26 17:54:13 +03:00