mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-25 23:28:18 +01:00

Compare commits

8 Commits

Author SHA1 Message Date
ad2cda8a41
README.md: Add note about SPDX license identifiers
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:21:34 +02:00
dc6920802e
.github/workflows/python-app.yml: Use Python 3.9
I now use this version in my development environment. Eventually I
should add a matrix of versions to use, but I don't know the GitHub
Actions syntax well enough yet.
2021-03-11 12:17:57 +02:00
6ca449d8ed
README.md: Update note about Python 3.8 to 3.8+
Currently the lower bound on Python version support is 3.7 because
of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.
2021-03-11 12:16:07 +02:00
1554cfd5c9
Version 0.4.6 2021-03-11 12:14:54 +02:00
00b8faad6d
CHANGELOG.md: Fix headers 2021-03-11 12:13:22 +02:00
b19d81abdd
.drone.yml: We need some stuff to build pyicu now
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:07:28 +02:00
a0ea829f5c
csv_metadata_quality/fix.py: Fixes should be green 2021-03-11 11:47:24 +02:00
0089efa914
tests/test_check.py: Use dcterms.subject instead of dc.subject
Trying to move some old DC fields to DCTERMS.
2021-03-11 11:45:25 +02:00
9 changed files with 19 additions and 12 deletions

@@ -9,6 +9,7 @@ steps:
  commands:
  - id
  - python -V
+ - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
@@ -25,6 +26,7 @@ steps:
  commands:
  - id
  - python -V
+ - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install
@@ -41,6 +43,7 @@ steps:
  commands:
  - id
  - python -V
+ - apt update && apt install -y gcc g++ libicu-dev pkg-config
  - pip install -r requirements-dev.txt
  - pytest
  - python setup.py install

@@ -16,10 +16,10 @@ jobs:
  steps:
  - uses: actions/checkout@v2
- - name: Set up Python 3.8
+ - name: Set up Python 3.9
  uses: actions/setup-python@v2
  with:
- python-version: 3.8
+ python-version: 3.9
  - name: Install dependencies
  run: |
  python -m pip install --upgrade pip

@@ -4,16 +4,19 @@ All notable changes to this project will be documented in this file.
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- ## Unreleased
- ## Added
+ ## [0.4.6] - 2021-03-11
+ ### Added
  - Validation of dcterms.license field against SPDX license identifiers
- ## Changed
+ ### Changed
  - Use DCTERMS fields where possible in `data/test.csv`
  ### Updated
  - Run `poetry update` to update project dependencies
+ ### Fixed
+ - Output for all fixes should be green, because it is good
  ## [0.4.5] - 2021-03-04
  ### Added
  - Check dates in dcterms.issued field as well, not just fields that have the

@@ -1,7 +1,7 @@
  # DSpace CSV Metadata Quality Checker ![GitHub Actions](https://github.com/ilri/csv-metadata-quality/workflows/Build%20and%20Test/badge.svg) [![Build Status](https://ci.mjanja.ch/api/badges/alanorth/csv-metadata-quality/status.svg)](https://ci.mjanja.ch/alanorth/csv-metadata-quality)
  A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, unnecessary Unicode, AGROVOC terms, etc.
- Requires Python 3.7 or greater (3.8 recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
+ Requires Python 3.7 or greater (3.8+ recommended). CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.
  If you use the DSpace CSV metadata quality checker please cite:
@@ -13,6 +13,7 @@ If you use the DSpace CSV metadata quality checker please cite:
  - Validate languages against ISO 639-1 (alpha2) and ISO 639-3 (alpha3)
  - Experimental validation of titles and abstracts against item's Dublin Core language field
  - Validate subjects against the AGROVOC REST API (see the `--agrovoc-fields` option)
+ - Validation of licenses against the list of [SPDX license identifiers](https://spdx.org/licenses)
  - Fix leading, trailing, and excessive (ie, more than one) whitespace
  - Fix invalid and unnecessary multi-value separators (`|`) using `--unsafe-fixes`
  - Fix problematic newlines (line feeds) using `--unsafe-fixes`
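
Reviewer note: the README text in this diff describes the pipeline as starting by splitting multi-value fields on "||" and then fixing leading, trailing, and excessive whitespace. A minimal sketch of those first two steps follows. It is not the project's code: the function name `normalize_field` and the constant `SEPARATOR` are hypothetical, and treating "excessive" as two or more consecutive whitespace characters is an assumption based on the README's "more than one" wording.

```python
import re

# Standard DSpace multi-value separator, as named in the README.
SEPARATOR = "||"


def normalize_field(field: str) -> str:
    """Hypothetical helper: split on "||", trim each value, collapse runs of whitespace."""
    cleaned = []
    for value in field.split(SEPARATOR):
        # Remove leading and trailing whitespace from this value.
        value = value.strip()
        # Assumption: "excessive" means two or more whitespace characters in a row.
        value = re.sub(r"\s{2,}", " ", value)
        cleaned.append(value)
    # Rejoin with the same separator so the cell keeps its multi-value shape.
    return SEPARATOR.join(cleaned)


print(normalize_field("  Kenya ||Uganda  and   Tanzania "))
# prints: Kenya||Uganda and Tanzania
```

Working per value rather than on the whole cell keeps whitespace next to a separator from surviving the trim.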

@@ -77,7 +77,7 @@ def separators(field, field_name):
  if match:
  print(
- f"{Fore.RED}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
+ f"{Fore.GREEN}Fixing invalid multi-value separator ({field_name}): {Fore.RESET}{value}"
  )
  value = re.sub(pattern, "||", value)
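
Reviewer note: the hunk above shows only the middle of `separators()`. The sketch below is one way the surrounding loop could look, assuming the field is split on "||" and any stray single "|" left inside a value is the "invalid separator". The function name `fix_separators`, the regular expression, and the return value are assumptions, not the project's code. To stay free of third-party imports the sketch uses raw ANSI escape codes, which are the same sequences colorama exposes as `Fore.GREEN` and `Fore.RESET`.

```python
import re

GREEN = "\x1b[32m"  # same escape sequence as colorama's Fore.GREEN
RESET = "\x1b[39m"  # same escape sequence as colorama's Fore.RESET

# Assumption: a single "|" remaining after splitting on "||" is an invalid separator.
PATTERN = re.compile(r"\|")


def fix_separators(field: str, field_name: str) -> str:
    """Hypothetical sketch: replace stray "|" separators with "||" and report in green."""
    values = []
    for value in field.split("||"):
        if PATTERN.search(value):
            # Fixes are reported in green, matching the change in this commit.
            print(
                f"{GREEN}Fixing invalid multi-value separator ({field_name}): {RESET}{value}"
            )
            value = PATTERN.sub("||", value)
        values.append(value)
    return "||".join(values)


print(fix_separators("Kenya|Uganda||Tanzania", "dcterms.subject"))
# last line printed: Kenya||Uganda||Tanzania
```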

@@ -1 +1 @@
- VERSION = "0.4.5"
+ VERSION = "0.4.6"

@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "csv-metadata-quality"
- version = "0.4.5"
+ version = "0.4.6"
  description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem."
  authors = ["Alan Orth <alan.orth@gmail.com>"]
  license="GPL-3.0-only"

@@ -14,7 +14,7 @@ install_requires = [
  setuptools.setup(
  name="csv-metadata-quality",
- version="0.4.5",
+ version="0.4.6",
  author="Alan Orth",
  author_email="aorth@mjanja.ch",
  description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",

@@ -224,7 +224,7 @@ def test_check_invalid_agrovoc(capsys):
  """Test invalid AGROVOC subject."""
  value = "FOREST"
- field_name = "dc.subject"
+ field_name = "dcterms.subject"
  check.agrovoc(value, field_name)
@@ -239,7 +239,7 @@ def test_check_valid_agrovoc():
  """Test valid AGROVOC subject."""
  value = "FORESTS"
- field_name = "dc.subject"
+ field_name = "dcterms.subject"
  result = check.agrovoc(value, field_name)