# CSV Metadata Quality [![Build Status](https://travis-ci.org/alanorth/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/alanorth/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
A simple but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. It supports multi-value fields using the standard DSpace value separator ("||"). Despite the name, it also supports reading Excel files.

Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library.
## Functionality
- Read/write CSV files ✓
- Read Excel files ✓
- Validate dates, ISSNs, ISBNs, and multi-value separators ("||") ✓
- Fix leading, trailing, and excessive whitespace ✓
- Fix invalid multi-value separators ("|") using `--unsafe-fixes` ✓
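To make the whitespace and separator checks listed above concrete, here is a minimal sketch in plain Python. It is not the project's actual implementation: the function names `fix_whitespace` and `has_invalid_separator` are hypothetical, and the rule that any run of pipes other than exactly "||" counts as invalid is an assumption made for illustration.

```python
import re

# Standard DSpace multi-value separator, as described in this README.
SEPARATOR = "||"


def fix_whitespace(field: str) -> str:
    """Trim each value and collapse runs of whitespace to a single space.

    Hypothetical helper for illustration, not the tool's own function.
    """
    values = field.split(SEPARATOR)
    cleaned = [re.sub(r"\s+", " ", value.strip()) for value in values]
    return SEPARATOR.join(cleaned)


def has_invalid_separator(field: str) -> bool:
    """Return True if the field contains a run of pipes that is not exactly "||".

    Assumption: a lone "|" (or three or more in a row) is treated as invalid.
    """
    return any(len(run) != 2 for run in re.findall(r"\|+", field))


print(fix_whitespace(" Alan  Orth ||  Kenya "))
print(has_invalid_separator("Kenya|Uganda"))
```

Splitting on the separator before trimming means whitespace around each individual value is cleaned, not just whitespace at the ends of the whole field.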
## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):
```
$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
$ cd csv-metadata-quality
$ pipenv install
$ pipenv shell
```
Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:
```
$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
$ cd csv-metadata-quality
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```
## Usage
Run CSV Metadata Quality with the `--help` flag to see available options:
```
$ python -m csv_metadata_quality --help
```
To validate and clean a CSV file you must specify input and output files using the `-i` and `-o` options. For example, using the included test file:
```
$ python -m csv_metadata_quality -i data/test.csv -o /tmp/test.csv
```
You can enable "unsafe fixes" with the `--unsafe-fixes` option. This will attempt to fix invalid multi-value separators ("|").
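As a rough illustration of what a separator fix of this kind could look like, the sketch below rewrites every run of pipe characters to the standard "||" separator. The function name `fix_invalid_separators` is hypothetical, and handling runs of three or more pipes is an assumption; the tool's real behavior may differ.

```python
import re


def fix_invalid_separators(field: str) -> str:
    """Rewrite any run of one or more pipes as the DSpace separator "||".

    Hypothetical helper for illustration. Valid "||" separators are
    matched too, but replacing them with "||" leaves them unchanged.
    """
    return re.sub(r"\|+", "||", field)


print(fix_invalid_separators("Kenya|Uganda||Tanzania"))
```

One plausible reason to treat such a fix as unsafe is that a single "|" may be a legitimate part of a value, in which case rewriting it would split that value in two.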
## Todo
- Reporting / summary
- Real logging
- Detect and fix duplicate values like "Alan||Alan"
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
The license allows you to use and modify the work for personal and commercial purposes, but if you distribute the work you must provide users with a means to access the source code for the version you are distributing. Read more about the [GPLv3 at TL;DR Legal](https://tldrlegal.com/license/gnu-general-public-license-v3-(gpl-3)).