mirror of https://github.com/ilri/csv-metadata-quality.git synced 2025-01-07 03:04:54 +01:00
Commit Graph

449 Commits

Author SHA1 Message Date
a8a41d60b6
README.md: Add note about Pandas 2019-07-29 10:56:02 +03:00
cf6c01caaf
README.md: Add notes about unsafe fixes 2019-07-28 23:01:00 +03:00
40e77db713
Add "unsafe fixes" runtime option
In this case it fixes occurrences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
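A minimal sketch of the "unsafe" separator fix described in this commit, assuming a hypothetical function name `fix_invalid_separators` and a hypothetical `unsafe_fixes` flag. The project's actual implementation may differ. It rewrites a lone "|" to "||" only when the caller opts in.

```python
import re


def fix_invalid_separators(field, unsafe_fixes=False):
    """Replace a lone "|" with the DSpace multi-value separator "||".

    Hypothetical helper for illustration; names are assumptions.
    The fix is skipped unless unsafe_fixes is True, because "|" might be
    used legitimately inside a metadata value.
    """
    if field is None or not unsafe_fixes:
        return field
    # Match a "|" that is neither preceded nor followed by another "|".
    return re.sub(r"(?<!\|)\|(?!\|)", "||", field)


print(fix_invalid_separators("Kenya|Uganda||Tanzania", unsafe_fixes=True))
```

Existing "||" separators are left alone, so the function can be applied to a field more than once without changing the result.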
a93b5b31c5
Add support for command line arguments
Currently only supports specifying input and output files with -i
and -o. Eventually I'll add more options like dry run, debug, and
maybe things like forcing unsafe fixes.
2019-07-28 20:31:57 +03:00
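A rough sketch of how the -i and -o options could be parsed with the standard library's argparse. The long option names (`--input-file`, `--output-file`) and the function name `parse_args` are assumptions for illustration and are not taken from the commit.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output file paths (hypothetical option names)."""
    parser = argparse.ArgumentParser(
        description="Clean and validate a metadata CSV file."
    )
    parser.add_argument("-i", "--input-file", required=True,
                        help="path to the input CSV file")
    parser.add_argument("-o", "--output-file", required=True,
                        help="path to the output CSV file")
    return parser.parse_args(argv)


args = parse_args(["-i", "data/test.csv", "-o", "/tmp/test.csv"])
print(args.input_file, args.output_file)
```

Passing `argv` explicitly keeps the parser easy to exercise from tests. With `argv=None`, argparse falls back to the process arguments.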
4e6225c0a9
README.md: Add note about Excel files 2019-07-28 18:38:36 +03:00
d293214d3a
README.md: Add note about checking dates 2019-07-28 18:37:46 +03:00
87b1997051
Fix whitespace errors found by flake8 2019-07-28 17:47:28 +03:00
c6e7d6d9b5
Add flake8 to pipenv dev environment
To help check PEP8 formatting/style compliance.
2019-07-28 17:46:30 +03:00
aadb3117eb
csv_metadata_quality/app.py: Remove unused test input files 2019-07-28 17:45:05 +03:00
ce8f140c66
tests: Remove unused pytest import 2019-07-28 17:42:54 +03:00
e88d35ace3
csv_metadata_quality/app.py: Use regex in column match
Check for a column that has "issn" or "isbn" in the name rather
than by its explicit name, as the column is dc.identifier.issn now,
but will be cg.issn in the future if CG Core v2 happens.
2019-07-28 17:27:20 +03:00
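The column match described above might look roughly like this. The helper name `is_identifier_column` is hypothetical, and the exact pattern used by the project is not shown in the commit message.

```python
import re


def is_identifier_column(column_name):
    """Return True if the column name contains "issn" or "isbn".

    Hypothetical helper: matching on a substring means both
    dc.identifier.issn and a future cg.issn would be picked up.
    """
    return re.search(r"issn|isbn", column_name) is not None


for name in ["dc.identifier.issn", "cg.issn", "dc.identifier.isbn", "dc.title"]:
    print(name, is_identifier_column(name))
```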
4687e2f5fa
README.md: Remove todo for date validation 2019-07-28 17:26:39 +03:00
209a290de8
Update python requirements
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
  $ pipenv lock -r -d > requirements-dev.txt
2019-07-28 17:12:20 +03:00
646146d59f
Add Excel version of test file 2019-07-28 17:07:33 +03:00
c2f28194eb
Add xlrd to pipenv for Pandas read_excel support 2019-07-28 17:05:17 +03:00
5771764ad2
data/test.csv: Add some new records to test dates
Test invalid, missing, and multiple dates.
2019-07-28 16:23:55 +03:00
196bb434fa
Add date validation
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).

This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
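One way the date check could be sketched with only the standard library, assuming hypothetical function names `is_valid_date` and `check_issue_date` and made-up message wording. It reports missing, multiple, and invalid dates, and always returns the field unchanged.

```python
from datetime import datetime

# The three formats named in the commit message.
DATE_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


def is_valid_date(value):
    """Return True if value parses as YYYY, YYYY-MM, or YYYY-MM-DD."""
    for date_format in DATE_FORMATS:
        try:
            datetime.strptime(value, date_format)
            return True
        except ValueError:
            continue
    return False


def check_issue_date(field):
    """Print a warning for problem dates; always return the field as-is."""
    if field is None or field == "":
        print("Missing date.")
    elif len(field.split("||")) > 1:
        # Valid DSpace metadata, but of no practical value for issue dates.
        print(f"Multiple dates not allowed: {field}")
    elif not is_valid_date(field):
        print(f"Invalid date: {field}")
    return field


for sample in ["1990", "1990-07", "1990-13-01", "1990||1991", ""]:
    check_issue_date(sample)
```

Note that `datetime.strptime` is lenient about zero padding for months and days, so a stricter check would need an extra pattern match.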
73b4061c7b
Add ipython to pipenv dev packages 2019-07-28 10:06:41 +03:00
e2bb2d4df9
Main function should be "main()" 2019-07-27 23:09:16 +03:00
7e16968bf2
README.md: Add todos 2019-07-27 19:24:35 +03:00
c47c064a13
Make output less debuggy 2019-07-27 09:21:13 +03:00
a849615b41
Add tests for check functions
Relies on capturing stdout.

See: https://docs.pytest.org/en/5.0.1/capture.html
2019-07-27 02:10:13 +03:00
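The idea of testing a check by capturing its printed output can be illustrated with the standard library alone. Under pytest, the `capsys` fixture described at the link above plays the role of the `capture_output` helper here. The check function `check_not_empty` and its message are hypothetical.

```python
import io
from contextlib import redirect_stdout


def check_not_empty(field):
    """Hypothetical check: report an empty value, always return the field."""
    if field is None or field.strip() == "":
        print("Missing value.")
    return field


def capture_output(func, *args):
    """Run func and return its result together with whatever it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        result = func(*args)
    return result, buffer.getvalue()


result, printed = capture_output(check_not_empty, "")
assert result == ""
assert printed == "Missing value.\n"
```

Because the check returns the field untouched, the only observable effect to assert on is what it wrote to stdout.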
2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline 2019-07-27 01:29:22 +03:00
3cf9f9452b
csv_metadata_quality/check.py: Always return field
We always need to return the field back so apply doesn't set it to
null when creating the new data frame.
2019-07-27 01:28:08 +03:00
1d861f263b
.build.yml: Fix setup script
I wasn't changing into the project directory so the pipenv virtual
environment was not getting created in the correct place.
2019-07-27 00:41:57 +03:00
33121f8a01
.build.yml: Add tests 2019-07-27 00:38:34 +03:00
41a30f1b07
Add initial tests
For now only test fixes because they return changed data. I'm not
sure how to test the checks, because they don't return data and I
can't modify them to return boolean values without breaking the app.
2019-07-27 00:36:40 +03:00
103e630f6e
Add requirements-dev.txt
Generated with:

  $ pipenv lock -r -d > requirements-dev.txt
2019-07-27 00:33:52 +03:00
99f00fcb85
Add pytest to pipenv dev environment 2019-07-27 00:32:53 +03:00
18f26c343d
csv_metadata_quality/app.py: Fix path to test.csv 2019-07-27 00:25:30 +03:00
f2060adadf
Move tests.csv to data directory 2019-07-27 00:02:47 +03:00
0a751c1f25
README.md: Add SourceHut build badge 2019-07-26 23:59:31 +03:00
2eb48d8ed0
Add SourceHut build file
For now it only attempts to install the Python requirements using
pipenv. Later it will run tests with pytest.
2019-07-26 23:56:16 +03:00
7fb7f7e03c
Add requirements.txt
Generated using pipenv:

  $ pipenv lock -r > requirements.txt
2019-07-26 23:54:07 +03:00
df1087b26f
README.md: Improve introduction, checks, and todo 2019-07-26 23:50:41 +03:00
84c3b17678
csv_metadata_quality/app.py: Add comment 2019-07-26 23:49:13 +03:00
844b968098
tests/test.csv: Add invalid multi-value field separator 2019-07-26 23:48:45 +03:00
aaf3537ba4
Add check for invalid multi-value separators 2019-07-26 23:48:24 +03:00
02f9d8a736
csv_metadata_quality/check.py: Add check for missing isbn values 2019-07-26 23:45:18 +03:00
64e7a73417
README.md: Add information about checks and fixes 2019-07-26 23:20:16 +03:00
dfd961d720
Bring test.csv into project 2019-07-26 23:14:37 +03:00
e160b17fb0
Add ISSN and ISBN checks using python-stdnum 2019-07-26 23:14:10 +03:00
30a4b0005f
csv_metadata_quality/fix.py: Remove test function 2019-07-26 22:56:40 +03:00
b657c51fd2
Add initial README.md with intro, license, and todo 2019-07-26 22:18:38 +03:00
5c6453b397
Add GPLv3 license 2019-07-26 22:16:16 +03:00
232d28e13e
Refactor as package with subpackages
This makes it cleaner for introducing checks, fixes, tests, and docs
in the future. Currently can be run like this:

  python -m csv_metadata_quality

CSV input and output paths are still hard coded.

See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00
ef5b8f7244
fix.py: Massive improvements
Use Python's str.strip() instead of kludgy regular expressions and
use split/join to handle multi-value fields more cleanly.
2019-07-26 19:31:55 +03:00
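A minimal sketch of the strip and split/join approach mentioned in this commit, assuming a hypothetical function name `fix_whitespace`. Each value in a DSpace multi-value field is stripped separately, then the values are rejoined with "||".

```python
def fix_whitespace(field):
    """Strip leading and trailing whitespace from each value in a field.

    Hypothetical helper for illustration; the project's own function
    may be named and structured differently.
    """
    if field is None:
        return field
    # Split on the DSpace multi-value separator, clean each value, rejoin.
    values = [value.strip() for value in field.split("||")]
    return "||".join(values)


print(repr(fix_whitespace("  Kenya || Uganda  ")))
```

A field without "||" splits into a single value, so single-value and multi-value fields go through the same code path.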
801870e0ba
Add fix.py
Initial working version of metadata cleaning script that fixes
leading and trailing whitespace (even in DSpace multi-value fields).
2019-07-26 19:08:28 +03:00
21b78b9519
Initial commit
Pipenv environment with Pandas.
2019-07-26 17:54:13 +03:00