# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased
### Fixed
- Missing region check should ignore subregion field, if it exists

### Changed
- Use SPDX license data from SPDX themselves instead of spdx-license-list
because it is deprecated and outdated
- Require Python 3.9+

### Updated
- Python dependencies

## [0.6.0] - 2022-09-02
### Changed
- Perform fix for "unnecessary" Unicode characters after we try to fix encoding
issues with ftfy
- ftfy heuristics to use `is_bad()` instead of `sequence_weirdness()`
- ftfy `fix_text()` to *not* change “smart quotes” to "ASCII quotes"

### Updated
- Python dependencies
- Metadata field exclude logic

### Added
- Ability to drop invalid AGROVOC values with `-d` when checking AGROVOC values
with `-a <field.name>`
- Ability to add missing UN M.49 regions when both country and region columns
are present. Enable with `-u` (unsafe fixes) for now.

### Removed
- Support for reading Excel files (both `.xls` and `.xlsx`) as it was completely
untested

## [0.5.0] - 2021-12-08
### Added
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
- Ability to check if the item's title exists in the citation
- Ability to check if an item has countries, but no matching regions (only
suggests missing regions if there is a region field in the CSV)

### Updated
- Python dependencies

### Fixed
- Regular expression to match all citation fields (dc.identifier.citation as
well as dcterms.bibliographicCitation) in `experimental.correct_language()`
- Regular expression to match dc.title and dcterms.title, but
ignore dc.title.alternative in `check.duplicate_items()`
- Missing field name in `fix.newlines()` output

## [0.4.7] - 2021-03-17
### Changed
- Fixing invalid multi-value separators like `|` and `|||` is no longer
classified as "unsafe" as I have yet to see a case where this was intentional
- Not user visible, but now checks only print a warning to the screen instead
of returning a value and re-writing the DataFrame, which should be faster and
use less memory

### Added
- Configurable directory for AGROVOC requests cache (to allow running the web
version from Google App Engine where we can only write to /tmp)
- Ability to check for duplicate items in the data set (uses a combination of
the title, type, and date issued to determine uniqueness)

### Removed
- Checks for invalid and unnecessary multi-value separators because now I fix
them whenever I see them, so there is no need to have checks for them

### Updated
- Run `poetry update` to update project dependencies

## [0.4.6] - 2021-03-11
### Added
- Validation of dcterms.license field against SPDX license identifiers 

### Changed
- Use DCTERMS fields where possible in `data/test.csv`

### Updated
- Run `poetry update` to update project dependencies

### Fixed
- Output for all fixes should be green, because it is good

## [0.4.5] - 2021-03-04
### Added
- Check dates in dcterms.issued field as well, not just fields that have the
word "date" in them

### Updated
- Run `poetry update` to update project dependencies

## [0.4.4] - 2021-02-21
### Added
- Accept dates formatted in ISO 8601 extended with combined date and time, for
example: 2020-08-31T11:04:56Z
- Colorized output: red for errors, yellow for warnings and information, green
for changes

### Updated
- Run `poetry update` to update project dependencies

## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0

### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
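
A minimal sketch of what stripping separators at the end of a value can look like. The helper name `strip_trailing_separators` and the regular expression are illustrative assumptions, not the project's actual code:

```python
import re


def strip_trailing_separators(value: str) -> str:
    """Remove any run of "|" characters at the very end of a value.

    Hypothetical helper for illustration only; separators between
    values are left untouched.
    """
    return re.sub(r"\|+$", "", value)


print(strip_trailing_separators("Kenya||Tanzania||"))
```

The valid `||` between the two values is kept; only the dangling separators at the end are dropped.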

## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv

### Updated
- Update Python dependencies to latest versions

## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around the `is_normalized()`
function, which is only available in Python >= 3.8
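
One common way to do without `unicodedata.is_normalized()` is to normalize the string and compare it with the original. The sketch below assumes the NFC form and uses a hypothetical helper name, `is_nfc`; neither is confirmed by this changelog:

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if value is already in NFC form.

    Hypothetical helper: gives the same answer as
    unicodedata.is_normalized("NFC", value), but also runs on
    Python versions older than 3.8.
    """
    return unicodedata.normalize("NFC", value) == value


# "e" followed by a combining acute accent (U+0301) is not in NFC form
print(is_nfc("Oue\u0301draogo"))
```

The comparison builds a normalized copy of every string, so it may be slower than the built-in check on large inputs.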

## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)

### Updated
- Update Python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt

### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds

## [0.3.1] - 2019-10-01
### Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them
- Harmonize language of script output when fixing various issues
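
To illustrate the non-breaking space change: removing U+00A0 outright would join the words on either side, while replacing it with a regular space keeps them apart. A minimal sketch, with the hypothetical helper name `replace_nbsp` (not taken from the project's code):

```python
def replace_nbsp(value: str) -> str:
    """Replace each non-breaking space (U+00A0) with a regular space.

    Hypothetical helper for illustration only.
    """
    return value.replace("\u00a0", " ")


print(replace_nbsp("Alan\u00a0Orth"))
```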

## [0.3.0] - 2019-09-26
### Updated
- Update Python dependencies to latest versions, including numpy 1.17.2, pandas
0.25.1, pytest 5.1.3, and requests-cache 0.5.2

### Added
- csvkit to dev requirements (csvcut etc. are useful during development)
- Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)

### Changed
- Re-formatted code with black and isort

## [0.2.2] - 2019-08-27
### Changed
- Output of date checks to include column names (helps debugging in case there are multiple date fields)

### Added
- Ability to exclude certain fields using `--exclude-fields`
- Fix for missing space after a comma, e.g. "Orth,Alan S."

### Improved
- AGROVOC lookup code

## [0.2.1] - 2019-08-11
### Added
- Check for uncommon filename extensions
- Replacement of unnecessary Unicode characters like soft hyphens (U+00AD)

## [0.2.0] - 2019-08-09
### Added
- Handle Ctrl-C interrupt gracefully
- Make output in suspicious character check more user friendly
- Add pytest-clarity to dev packages for more user friendly pytest output

## [0.1.0] - 2019-08-01
### Changed
- AGROVOC validation is now turned off by default

### Added
- Ability to enable AGROVOC validation on a field-by-field basis using the `--agrovoc-fields` option
- Option to print the version (`--version` or `-V`)