CHANGELOG.md: Move unreleased changes to v0.3.1

7 changed files with 17 additions and 13 deletions
--- a/.build.yml
+++ b/.build.yml
@@ -13,7 +13,7 @@ tasks:
  - testcli: |
      cd csv-metadata-quality
      pipenv run pip install .
-      pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -u --agrovoc-fields dc.subject,cg.coverage.country
+      pipenv run csv-metadata-quality -i data/test.csv -o /tmp/test.csv -e -u --agrovoc-fields dc.subject,cg.coverage.country
 environment:
  PIPENV_NOSPIN: 'True'
  PIPENV_HIDE_EMOJIS: 'True'
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,14 +4,19 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.3.1] - 2019-10-01
+## Changed
+- Replace non-breaking spaces (U+00A0) with space instead of removing them
+- Harmonize language of script output when fixing various issues
+
 ## [0.3.0] - 2019-09-26
 ### Updated
 - Update python dependencies to latest versions, including numpy 1.17.2, pandas
 0.25.1, pytest 5.1.3, and requests-cache 0.5.2

-## Added
+### Added
 - csvkit to dev requirements (csvcut etc are useful during development)
-- Experimental language validation using `-e` (see README.md)
+- Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)

 ### Changed
 - Re-formatted code with black and isort
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # CSV Metadata Quality [![Build Status](https://travis-ci.org/ilri/csv-metadata-quality.svg?branch=master)](https://travis-ci.org/ilri/csv-metadata-quality) [![builds.sr.ht status](https://builds.sr.ht/~alanorth/csv-metadata-quality.svg)](https://builds.sr.ht/~alanorth/csv-metadata-quality?)
-A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem. The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.
+A simple, but opinionated metadata quality checker and fixer designed to work with CSVs in the DSpace ecosystem (though it could theoretically work on any CSV that uses Dublin Core fields as columns). The implementation is essentially a pipeline of checks and fixes that begins with splitting multi-value fields on the standard DSpace "||" separator, trimming leading/trailing whitespace, and then proceeding to more specialized cases like ISSNs, ISBNs, languages, etc.

 Requires Python 3.6 or greater. CSV and Excel support comes from the [Pandas](https://pandas.pydata.org/) library, though your mileage may vary with Excel because this is much less tested.

@@ -92,7 +92,6 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
 - Validate DOIs? Normalize to https://doi.org format? Or use just the DOI part: 10.1016/j.worlddev.2010.06.006
 - Warn if two items use the same file in `filename` column
 - Add an option to drop invalid AGROVOC subjects?
-- Add check for author names with incorrect spacing after commas, ie "Orth,Alan S."
 - Add tests for application invocation, ie `tests/test_app.py`?

 ## License
--- a/csv_metadata_quality/fix.py
+++ b/csv_metadata_quality/fix.py
@@ -26,7 +26,7 @@ def whitespace(field):
        match = re.findall(pattern, value)

        if match:
-            print(f"Excessive whitespace: {value}")
+            print(f"Removing excessive whitespace: {value}")
            value = re.sub(pattern, " ", value)

        # Save cleaned value
@@ -74,10 +74,10 @@ def unnecessary_unicode(field):
    Removes unnecessary Unicode characters like:
        - Zero-width space (U+200B)
        - Replacement character (U+FFFD)
-        - No-break space (U+00A0)

    Replaces unnecessary Unicode characters like:
        - Soft hyphen (U+00AD) → hyphen
+        - No-break space (U+00A0) → space

    Return string with characters removed or replaced.
    """
@@ -107,8 +107,8 @@ def unnecessary_unicode(field):
    match = re.findall(pattern, field)

    if match:
-        print(f"Removing unnecessary Unicode (U+00A0): {field}")
-        field = re.sub(pattern, "", field)
+        print(f"Replacing unnecessary Unicode (U+00A0): {field}")
+        field = re.sub(pattern, " ", field)

    # Check for soft hyphens (U+00AD), sometimes preceeded with a normal hyphen
    pattern = re.compile(r"\u002D*?\u00AD")
@@ -140,7 +140,7 @@ def duplicates(field):
        if value not in new_values:
            new_values.append(value)
        else:
-            print(f"Dropping duplicate value: {value}")
+            print(f"Removing duplicate value: {value}")

    # Create a new field consisting of all values joined with "||"
    new_field = "||".join(new_values)
--- a/csv_metadata_quality/version.py
+++ b/csv_metadata_quality/version.py
@@ -1 +1 @@
-VERSION = "0.3.0"
+VERSION = "0.3.1"
--- a/data/test.csv
+++ b/data/test.csv
@@ -1,4 +1,4 @@
-dc.title,birthdate,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
+dc.title,dc.date.issued,dc.identifier.issn,dc.identifier.isbn,dc.language.iso,dc.subject,cg.coverage.country,filename
 Leading space,2019-07-29,,,,,,
 Trailing space ,2019-07-29,,,,,,
 Excessive  space,2019-07-29,,,,,,
--- a/setup.py
+++ b/setup.py
@@ -14,7 +14,7 @@ install_requires = [

 setuptools.setup(
    name="csv-metadata-quality",
-    version="0.3.0",
+    version="0.3.1",
    author="Alan Orth",
    author_email="aorth@mjanja.ch",
    description="A simple, but opinionated CSV quality checking and fixing pipeline for CSVs in the DSpace ecosystem.",
Author	SHA1	Message	Date
Alan Orth	36d0474b95	CHANGELOG.md: Move unreleased changes to v0.3.1	2019-10-01 17:11:52 +03:00
Alan Orth	efdc3a841a	Version 0.3.1	2019-10-01 17:11:13 +03:00
Alan Orth	fd2ba6845d	CHANGELOG.md: Update unreleased notes	2019-10-01 17:10:23 +03:00
Alan Orth	e55380b4d5	csv_metadata_quality/fix.py: Harmonize language in fix output We should always say if we're removing or replacing something.	2019-10-01 17:09:49 +03:00
Alan Orth	85ae16d9b7	CHANGELOG.md: Add note about non-breaking spaces	2019-10-01 16:56:37 +03:00
Alan Orth	c42f8b4812	csv_metadata_quality/fix.py: Replace non-breaking spaces We should be replacing non-breaking spaces (U+00A0) with normal spaces instead of removing them.	2019-10-01 16:55:04 +03:00
Alan Orth	1c75608d54	README.md: Update introduction text We should mention that this is not DSpace specific. Rather, it is much more realistically Dublin Core specific.	2019-09-26 14:19:13 +03:00
Alan Orth	0b15a8ed3b	README.md: Remove TODO about lack of space after comma This was added as an automatic global fix a few weeks ago.	2019-09-26 14:16:33 +03:00
Alan Orth	9ca266f5f0	data/test.csv: Change birthdate column to dc.date.issued More accurately reflects actual data we will be validating.	2019-09-26 14:15:48 +03:00
Alan Orth	0d3f948708	CHANGELOG.md: Update comment about language validation	2019-09-26 14:14:57 +03:00
Alan Orth	c04207fcfc	CHANGELOG.md: Fix header formatting	2019-09-26 14:13:50 +03:00
Alan Orth	9d4eceddc7	.build.yml: Enable experimental CLI checks on SourceHut	2019-09-26 14:11:35 +03:00
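
To illustrate the behavioral change in c42f8b4812, here is a minimal standalone sketch of the replacement step. The function name `replace_nbsp` is hypothetical and exists only for this example; in the diff above the logic lives inside `unnecessary_unicode()` in `csv_metadata_quality/fix.py`, alongside other checks that are not reproduced here.

```python
import re


def replace_nbsp(field: str) -> str:
    """Replace no-break spaces (U+00A0) with a normal space.

    Hypothetical helper for illustration only; it mirrors the
    print/re.sub pair shown in the fix.py hunk above.
    """
    pattern = re.compile(r"\u00A0")

    if pattern.search(field):
        print(f"Replacing unnecessary Unicode (U+00A0): {field}")
        # Substituting " " keeps adjacent words apart; the previous
        # behavior substituted "" and would have joined them.
        field = pattern.sub(" ", field)

    return field


cleaned = replace_nbsp("Alan\u00a0Orth")
```

With the earlier removal behavior the same input would have collapsed to `AlanOrth`, which is presumably why the replacement was preferred.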