1
0
mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-01-05 18:34:55 +01:00
Commit Graph

338 Commits

Author SHA1 Message Date
4687e2f5fa
README.md: Remove todo for date validation 2019-07-28 17:26:39 +03:00
209a290de8
Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-07-28 17:12:20 +03:00
646146d59f
Add Excel verion of test file 2019-07-28 17:07:33 +03:00
c2f28194eb
Add xlrd to pipenv for Pandas read_excel support 2019-07-28 17:05:17 +03:00
5771764ad2
data/test.csv: Add some new records to test dates
Test invalid, missing, and multiple dates.
2019-07-28 16:23:55 +03:00
196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).

This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
73b4061c7b
Add ipython to pipenv dev packages 2019-07-28 10:06:41 +03:00
e2bb2d4df9
Main function should be "main()" 2019-07-27 23:09:16 +03:00
7e16968bf2
README.md: Add todos 2019-07-27 19:24:35 +03:00
c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
a849615b41
Add tests for check functions
Relies on capturing stdout.

See: https://docs.pytest.org/en/5.0.1/capture.html
2019-07-27 02:10:13 +03:00
2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline 2019-07-27 01:29:22 +03:00
3cf9f9452b
csv_metadata_quality/check.py: Always return field
We always need to return the field back so apply doesn't set it to
null when creating the new data frame.
2019-07-27 01:28:08 +03:00
1d861f263b
.build.yml: Fix setup script
I wasn't chaning into the project directory so the pipenv virtual
environment was not getting created in the correct place.
2019-07-27 00:41:57 +03:00
33121f8a01
.build.yml: Add tests 2019-07-27 00:38:34 +03:00
41a30f1b07
Add initial tests
For now only test fixes because they return changed data. I'm not
sure how to test the checks, because they don't return data and I
can't modify them to return boolean values without breaking the app.
2019-07-27 00:36:40 +03:00
103e630f6e
Add requirements-dev.txt
Generated with:

  $ pipenv lock -r -d > requirements-dev.txt
2019-07-27 00:33:52 +03:00
99f00fcb85
Add pytest to pipenv dev environment 2019-07-27 00:32:53 +03:00
18f26c343d
csv_metadata_quality/app.py: Fix path to test.csv 2019-07-27 00:25:30 +03:00
f2060adadf
Move tests.csv to data directory 2019-07-27 00:02:47 +03:00
0a751c1f25
README.md: Add SourceHut build badge 2019-07-26 23:59:31 +03:00
2eb48d8ed0
Add SourceHut build file
For now it only attempts to install the Python requirements using
pipenv. Later it will run tests with pytest.
2019-07-26 23:56:16 +03:00
7fb7f7e03c
Add requirements.txt
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
2019-07-26 23:54:07 +03:00
df1087b26f
README.md: Improve introduction, checks, and todo 2019-07-26 23:50:41 +03:00
84c3b17678
csv_metadata_quality/app.py: Add comment 2019-07-26 23:49:13 +03:00
844b968098
tests/test.csv: Add invalid multi-value field separator 2019-07-26 23:48:45 +03:00
aaf3537ba4
Add check for invalid multi-value separators 2019-07-26 23:48:24 +03:00
02f9d8a736
csv_metadata_quality/check.py: Add check for missing isbn values 2019-07-26 23:45:18 +03:00
64e7a73417
README.md: Add information about checks and fixes 2019-07-26 23:20:16 +03:00
dfd961d720
Bring test.csv into project 2019-07-26 23:14:37 +03:00
e160b17fb0
Add ISSN and ISBN checks using python-stdnum 2019-07-26 23:14:10 +03:00
30a4b0005f
csv_metadata_quality/fix.py: Remove test function 2019-07-26 22:56:40 +03:00
b657c51fd2
Add initial README.md with intro, license, and todo 2019-07-26 22:18:38 +03:00
5c6453b397
Add GPLv3 license 2019-07-26 22:16:16 +03:00
232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, docs,
and tests in the future. Currently can be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00
ef5b8f7244
fix.py: Massive improvements
Use Python's str.strip() instead of kludgy regular expressions and
use split/join to handle multi-value fields more cleanly.
2019-07-26 19:31:55 +03:00
801870e0ba
Add fix.py
Initial working version of metadata cleaning script that fixes lea-
ding and trailing whitespace (even in DSpace multi-value fields).
2019-07-26 19:08:28 +03:00
21b78b9519
Initial commit
Pipenv environment with Pandas.
2019-07-26 17:54:13 +03:00