
# CSV Metadata Quality [![Build Status](https://travis-ci.org/alanorth/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/alanorth/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
A simple but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes: it begins by splitting multi-value fields on the standard DSpace "||" separator and trimming leading/trailing whitespace, then proceeds to more specialized cases like ISSNs, ISBNs, languages, etc.

Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because it is much less tested than CSV.
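
To make the first two pipeline steps concrete, here is a minimal sketch of splitting a multi-value field and trimming each value. The function name `split_and_trim` is hypothetical and is not taken from the project's code; the real implementation may be structured differently.

```python
# Hypothetical helper for illustration only; not the project's actual API.
def split_and_trim(field):
    """Split a multi-value field on "||" and strip each value."""
    # Treat empty or missing fields as having no values at all
    if not field:
        return []

    # Split on the DSpace multi-value separator, then trim each value
    return [value.strip() for value in field.split("||")]


print(split_and_trim(" Kenya || Tanzania "))  # ['Kenya', 'Tanzania']
```

Each later check or fix can then operate on one clean value at a time before the values are joined with "||" again.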

## Functionality

- Validate dates, ISSNs, ISBNs, and multi-value separators ("||")
- Validate languages against ISO 639-2 and ISO 639-3
- Validate subjects against the AGROVOC REST API
- Fix leading, trailing, and excessive (i.e., more than one) whitespace
- Fix invalid multi-value separators (`|`) using `--unsafe-fixes`
- Fix problematic newlines (line feeds) using `--unsafe-fixes`
- Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc.
- Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
- Remove duplicate metadata values
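
As a rough illustration of what two of the items above involve, the sketch below validates an ISSN by its check digit and collapses excessive whitespace. Both function names are hypothetical, and the project may well rely on a library rather than hand-written logic; this only shows the idea using the standard library.

```python
import re


# Hypothetical helpers for illustration only; not the project's actual API.
def is_valid_issn(issn):
    """Return True if the string looks like NNNN-NNNC with a correct check digit."""
    match = re.fullmatch(r"([0-9]{4})-([0-9]{3})([0-9X])", issn)
    if match is None:
        return False

    # Weight the first seven digits 8, 7, ..., 2 and sum the products
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))

    # The check digit brings the weighted sum to a multiple of 11;
    # a value of 10 is written as "X"
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)

    return match.group(3) == expected


def fix_whitespace(value):
    """Strip leading/trailing whitespace and collapse runs of spaces."""
    return re.sub(r" {2,}", " ", value.strip())


print(is_valid_issn("0378-5955"))  # True
print(fix_whitespace("  Kenya   Tanzania "))  # Kenya Tanzania
```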

## Installation
The easiest way to install CSV Metadata Quality is with [pipenv](https://github.com/pypa/pipenv):

```
$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
$ cd csv-metadata-quality
$ pipenv install
$ pipenv shell
```

Otherwise, if you don't have pipenv, you can use a vanilla Python virtual environment:

```
$ git clone https://git.sr.ht/~alanorth/csv-metadata-quality
$ cd csv-metadata-quality
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```

## Usage
Run CSV Metadata Quality with the `--help` flag to see available options:

```
$ python -m csv_metadata_quality --help
```

To validate and clean a CSV file you must specify input and output files using the `-i` and `-o` options. For example, using the included test file:

```
$ python -m csv_metadata_quality -i data/test.csv -o /tmp/test.csv
```

You can enable "unsafe fixes" with the `--unsafe-fixes` option. Currently this will attempt to fix things like invalid multi-value separators (`|`). This is considered "unsafe" because it's theoretically possible for the `|` to be used legitimately in a metadata value, but in my experience it's always a typo where the user was attempting to use multiple metadata values, for example: `Kenya|Tanzania`.
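
The separator fix described above can be sketched with a regular expression that only touches a lone `|`, leaving valid `||` separators alone. The function name `fix_separators` is hypothetical, and this is not necessarily how the tool implements the fix.

```python
import re


# Hypothetical helper for illustration only; not the project's actual API.
def fix_separators(value):
    """Replace a single "|" with the DSpace "||" multi-value separator."""
    # Match a pipe that has no other pipe directly before or after it
    return re.sub(r"(?<!\|)\|(?!\|)", "||", value)


print(fix_separators("Kenya|Tanzania"))   # Kenya||Tanzania
print(fix_separators("Kenya||Tanzania"))  # Kenya||Tanzania
```

A value that legitimately contains a single `|` would also be rewritten, which is exactly why the fix is only applied when `--unsafe-fixes` is given.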

## Todo

- Reporting / summary
- Real logging

## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

The license allows you to use and modify the work for personal and commercial purposes, but if you distribute the work you must provide users with a means to access the source code for the version you are distributing. Read more about the [GPLv3 at TL;DR Legal](https://tldrlegal.com/license/gnu-general-public-license-v3-(gpl-3)).