ac127e7f8a
README.md: Add usage section
2019-07-29 11:30:06 +03:00
aabb57321c
README.md: Improve
...
Reorganize functionality section and add installation section.
2019-07-29 11:15:51 +03:00
a8a41d60b6
README.md: Add note about Pandas
2019-07-29 10:56:02 +03:00
cf6c01caaf
README.md: Add notes about unsafe fixes
2019-07-28 23:01:00 +03:00
40e77db713
Add "unsafe fixes" runtime option
...
In this case it fixes occurrences of invalid multi-value separators.
DSpace uses "||" to separate multiple values in one field, but our
editors sometimes give us files with mistakes like "|". We can fix
these to be correct multi-value separators if we are sure that the
metadata is not actually using "|" for some legitimate purpose.
2019-07-28 22:53:39 +03:00
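A minimal sketch of the separator fix described in the "unsafe fixes" commit above. The function name and the regular expression are assumptions for illustration, not the project's actual code.

```python
import re


# Hypothetical helper; the name and the pattern are assumptions.
def fix_invalid_separators(field: str) -> str:
    """Replace each lone "|" with DSpace's "||" multi-value separator."""
    # Match a pipe that is neither preceded nor followed by another pipe,
    # so existing "||" separators are left untouched.
    return re.sub(r"(?<!\|)\|(?!\|)", "||", field)


print(fix_invalid_separators("Kenya|Uganda||Tanzania"))
# prints: Kenya||Uganda||Tanzania
```

This is "unsafe" in the sense the commit describes: a value that legitimately contains "|" would be split in two.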
a93b5b31c5
Add support for command line arguments
...
Currently only supports specifying input and output files with -i
and -o. Eventually I'll add more options like dry run, debug, and
maybe things like forcing unsafe fixes.
2019-07-28 20:31:57 +03:00
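One way the -i and -o options from the commit above could be wired up with the standard library's argparse. The long option names, help strings, and the parse_args wrapper are assumptions; only -i and -o come from the commit message.

```python
import argparse


def parse_args(argv=None):
    """Parse input and output file paths (sketch; long names are assumed)."""
    parser = argparse.ArgumentParser(description="Metadata quality checker.")
    parser.add_argument("-i", "--input-file", required=True,
                        help="Path to the input CSV file.")
    parser.add_argument("-o", "--output-file", required=True,
                        help="Path to the output CSV file.")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


args = parse_args(["-i", "data/test.csv", "-o", "/tmp/test.csv"])
print(args.input_file, args.output_file)
```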
4e6225c0a9
README.md: Add note about Excel files
2019-07-28 18:38:36 +03:00
d293214d3a
README.md: Add note about checking dates
2019-07-28 18:37:46 +03:00
87b1997051
Fix whitespace errors found by flake8
2019-07-28 17:47:28 +03:00
c6e7d6d9b5
Add flake8 to pipenv dev environment
...
To help check PEP8 formatting/style compliance.
2019-07-28 17:46:30 +03:00
aadb3117eb
csv_metadata_quality/app.py: Remove unused test input files
2019-07-28 17:45:05 +03:00
ce8f140c66
tests: Remove unused pytest import
2019-07-28 17:42:54 +03:00
e88d35ace3
csv_metadata_quality/app.py: Use regex in column match
...
Check for a column that has "issn" or "isbn" in the name rather
than by its explicit name, as the column is dc.identifier.issn now,
but will be cg.issn in the future if CG Core v2 happens.
2019-07-28 17:27:20 +03:00
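A small sketch of the column match described in the commit above: test whether a column name contains "issn" or "isbn" instead of comparing against one fixed name. The helper name is hypothetical.

```python
import re


# Hypothetical helper; the real code in app.py may be structured differently.
def is_issn_or_isbn_column(column_name: str) -> bool:
    """Return True if the column name contains "issn" or "isbn"."""
    return re.search(r"issn|isbn", column_name) is not None


for name in ["dc.identifier.issn", "cg.issn", "dc.title"]:
    print(name, is_issn_or_isbn_column(name))
```

Because the match is on a substring, the check keeps working if the field is renamed from dc.identifier.issn to cg.issn.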
4687e2f5fa
README.md: Remove todo for date validation
2019-07-28 17:26:39 +03:00
209a290de8
Update python requirements
...
Generated using pipenv:
$ pipenv lock -r > requirements.txt
$ pipenv lock -r -d > requirements-dev.txt
2019-07-28 17:12:20 +03:00
646146d59f
Add Excel version of test file
2019-07-28 17:07:33 +03:00
c2f28194eb
Add xlrd to pipenv for Pandas read_excel support
2019-07-28 17:05:17 +03:00
5771764ad2
data/test.csv: Add some new records to test dates
...
Test invalid, missing, and multiple dates.
2019-07-28 16:23:55 +03:00
196bb434fa
Add date validation
...
I'm only concerned with validating issue dates here. In DSpace they
are generally always YYYY, YYYY-MM, or YYYY-MM-DD (though in theory
they could be any valid ISO8601 format).
This also checks for cases where the date is missing and where the
metadata has specified multiple dates like "1990||1991", as this is
valid, but there is no practical value for it in our system.
2019-07-28 16:11:36 +03:00
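The three cases in the date validation commit above (missing, multiple, and invalid dates) could be distinguished roughly as follows. This is a sketch using datetime.strptime; the function name and the returned labels are assumptions, and the project's own check may report problems differently.

```python
from datetime import datetime

# The three layouts named in the commit message: YYYY, YYYY-MM, YYYY-MM-DD.
ACCEPTED_FORMATS = ("%Y", "%Y-%m", "%Y-%m-%d")


# Hypothetical helper; returns a label instead of printing.
def classify_issue_date(field):
    """Classify a date field as missing, multiple, valid, or invalid."""
    if field is None or field.strip() == "":
        return "missing"
    # DSpace separates multiple values with "||", e.g. "1990||1991".
    if "||" in field:
        return "multiple"
    for date_format in ACCEPTED_FORMATS:
        try:
            datetime.strptime(field, date_format)
            return "valid"
        except ValueError:
            continue
    return "invalid"


for value in ["1990", "2019-07-28", "1990||1991", "", "1990-13"]:
    print(repr(value), classify_issue_date(value))
```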
73b4061c7b
Add ipython to pipenv dev packages
2019-07-28 10:06:41 +03:00
e2bb2d4df9
Main function should be "main()"
2019-07-27 23:09:16 +03:00
7e16968bf2
README.md: Add todos
2019-07-27 19:24:35 +03:00
c47c064a13
Make output less debuggy
2019-07-27 09:21:13 +03:00
a849615b41
Add tests for check functions
...
Relies on capturing stdout.
See: https://docs.pytest.org/en/5.0.1/capture.html
2019-07-27 02:10:13 +03:00
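The idea behind the tests in the commit above is that a check prints a message and the test inspects what was printed. The commit uses pytest's capture support; the sketch below swaps in the standard library's contextlib.redirect_stdout so it runs without pytest. The check function and its message are hypothetical.

```python
import contextlib
import io


# Hypothetical check: report a lone "|" and return the field unchanged.
def check_separator(field):
    if "|" in field.replace("||", ""):
        print(f"Invalid multi-value separator: {field}")
    return field


def capture_stdout(func, *args):
    """Call func and return its result together with everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args)
    return result, buffer.getvalue()


result, output = capture_stdout(check_separator, "Kenya|Uganda")
assert result == "Kenya|Uganda"
assert output == "Invalid multi-value separator: Kenya|Uganda\n"
```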
2b41f9416b
csv_metadata_quality/fix.py: Remove extra newline
2019-07-27 01:29:22 +03:00
3cf9f9452b
csv_metadata_quality/check.py: Always return field
...
We always need to return the field back so apply doesn't set it to
null when creating the new data frame.
2019-07-27 01:28:08 +03:00
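To illustrate the problem fixed in the commit above: a function applied to every value builds the new column from its return values, so a check that only prints leaves None behind. The sketch uses Python's built-in map as a stand-in for pandas' apply, and the function names are hypothetical.

```python
def check_missing_broken(field):
    """Report an empty value but forget to return the field."""
    if field == "":
        print("Missing value")
    # No return statement: Python implicitly returns None.


def check_missing_fixed(field):
    """Report an empty value and always hand the field back."""
    if field == "":
        print("Missing value")
    return field


column = ["1990", "", "2019-07"]
broken = list(map(check_missing_broken, column))
intact = list(map(check_missing_fixed, column))
print(broken)
print(intact)
```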
1d861f263b
.build.yml: Fix setup script
...
I wasn't changing into the project directory so the pipenv virtual
environment was not getting created in the correct place.
2019-07-27 00:41:57 +03:00
33121f8a01
.build.yml: Add tests
2019-07-27 00:38:34 +03:00
41a30f1b07
Add initial tests
...
For now only test fixes because they return changed data. I'm not
sure how to test the checks, because they don't return data and I
can't modify them to return boolean values without breaking the app.
2019-07-27 00:36:40 +03:00
103e630f6e
Add requirements-dev.txt
...
Generated with:
$ pipenv lock -r -d > requirements-dev.txt
2019-07-27 00:33:52 +03:00
99f00fcb85
Add pytest to pipenv dev environment
2019-07-27 00:32:53 +03:00
18f26c343d
csv_metadata_quality/app.py: Fix path to test.csv
2019-07-27 00:25:30 +03:00
f2060adadf
Move tests.csv to data directory
2019-07-27 00:02:47 +03:00
0a751c1f25
README.md: Add SourceHut build badge
2019-07-26 23:59:31 +03:00
2eb48d8ed0
Add SourceHut build file
...
For now it only attempts to install the Python requirements using
pipenv. Later it will run tests with pytest.
2019-07-26 23:56:16 +03:00
7fb7f7e03c
Add requirements.txt
...
Generated using pipenv:
$ pipenv lock -r > requirements.txt
2019-07-26 23:54:07 +03:00
df1087b26f
README.md: Improve introduction, checks, and todo
2019-07-26 23:50:41 +03:00
84c3b17678
csv_metadata_quality/app.py: Add comment
2019-07-26 23:49:13 +03:00
844b968098
tests/test.csv: Add invalid multi-value field separator
2019-07-26 23:48:45 +03:00
aaf3537ba4
Add check for invalid multi-value separators
2019-07-26 23:48:24 +03:00
02f9d8a736
csv_metadata_quality/check.py: Add check for missing isbn values
2019-07-26 23:45:18 +03:00
64e7a73417
README.md: Add information about checks and fixes
2019-07-26 23:20:16 +03:00
dfd961d720
Bring test.csv into project
2019-07-26 23:14:37 +03:00
e160b17fb0
Add ISSN and ISBN checks using python-stdnum
2019-07-26 23:14:10 +03:00
30a4b0005f
csv_metadata_quality/fix.py: Remove test function
2019-07-26 22:56:40 +03:00
b657c51fd2
Add initial README.md with intro, license, and todo
2019-07-26 22:18:38 +03:00
5c6453b397
Add GPLv3 license
2019-07-26 22:16:16 +03:00
232d28e13e
Refactor as package with subpackages
...
This makes it cleaner to introduce checks, fixes, tests, and docs
in the future. It can currently be run like this:
python -m csv_metadata_quality
CSV input and output paths are still hard coded.
See: https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
2019-07-26 22:11:10 +03:00
ef5b8f7244
fix.py: Massive improvements
...
Use Python's str.strip() instead of kludgy regular expressions and
use split/join to handle multi-value fields more cleanly.
2019-07-26 19:31:55 +03:00
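A minimal sketch of the approach named in the fix.py commit above: split a multi-value field on "||", strip each value with str.strip(), and join the values again. The function name is an assumption.

```python
# Hypothetical helper; the actual function in fix.py may differ.
def fix_whitespace(field: str) -> str:
    """Strip leading and trailing whitespace from every value in a field."""
    # DSpace separates multiple values in one field with "||".
    values = [value.strip() for value in field.split("||")]
    return "||".join(values)


print(fix_whitespace(" Kenya || Uganda "))
# prints: Kenya||Uganda
```

A single-value field passes through the same code path, since splitting on a separator that is absent yields a one-element list.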
801870e0ba
Add fix.py
...
Initial working version of metadata cleaning script that fixes
leading and trailing whitespace (even in DSpace multi-value fields).
2019-07-26 19:08:28 +03:00