# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
### Fixed
- Fixed regex so we don't run the invalid multi-value separator fix on
`dcterms.bibliographicCitation` fields
- Fixed regex so we run the comma space fix on `dcterms.bibliographicCitation`
fields
- Don't crash the country/region checker/fixer when a title field is missing
### Changed
- Don't run newline fix on description fields
- Install requests-cache in main run() function instead of check.agrovoc() function so we only incur the overhead once
- Use py3langid instead of langid, see: [How to make language detection with langid.py faster](https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html)
### Updated
- Python dependencies, including Pandas 2.0.0 and [Arrow-backed dtypes](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)
- SPDX license list
## [0.6.1] - 2023-02-23
### Fixed
- Missing region check should ignore subregion field, if it exists
### Changed
- Use SPDX license data from SPDX themselves instead of spdx-license-list
because it is deprecated and outdated
- Require Python 3.9+
- Don't run `fix.separators()` on title or abstract fields
- Don't run whitespace or newline fixes on abstract fields
- Ignore some common non-SPDX licenses
- Ignore `__description` suffix in filenames meant for SAFBuilder when checking
for uncommon file extensions
### Updated
- Python dependencies
## [0.6.0] - 2022-09-02
### Changed
- Perform fix for "unnecessary" Unicode characters after we try to fix encoding
issues with ftfy
- ftfy heuristics to use `is_bad()` instead of `sequence_weirdness()`
- ftfy `fix_text()` to *not* change “smart quotes” to "ASCII quotes"
### Updated
- Python dependencies
- Metadata field exclude logic
### Added
- Ability to drop invalid AGROVOC values with `-d` when checking AGROVOC values
with `-a <field.name>`
- Ability to add missing UN M.49 regions when both country and region columns
are present. Enable with `-u` (unsafe fixes) for now.
### Removed
- Support for reading Excel files (both `.xls` and `.xlsx`) as it was completely
untested
## [0.5.0] - 2021-12-08
### Added
- Ability to check for, and fix, "mojibake" characters using [ftfy](https://github.com/LuminosoInsight/python-ftfy)
- Ability to check if the item's title exists in the citation
- Ability to check if an item has countries, but no matching regions (only
suggests missing regions if there is a region field in the CSV)
### Updated
- Python dependencies
### Fixed
- Regular expression to match all citation fields (dc.identifier.citation as
well as dcterms.bibliographicCitation) in `experimental.correct_language()`
- Regular expression to match dc.title and dcterms.title, but
ignore dc.title.alternative in `check.duplicate_items()`
- Missing field name in `fix.newlines()` output
## [0.4.7] - 2021-03-17
### Changed
- Fixing invalid multi-value separators like `|` and `|||` is no longer
classified as "unsafe" as I have yet to see a case where this was intentional
- Not user visible, but now checks only print a warning to the screen instead
of returning a value and re-writing the DataFrame, which should be faster and
use less memory
### Added
- Configurable directory for AGROVOC requests cache (to allow running the web
version from Google App Engine where we can only write to /tmp)
- Ability to check for duplicate items in the data set (uses a combination of
the title, type, and date issued to determine uniqueness)
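As a rough illustration of the duplicate check described above, here is a minimal sketch that treats the combination of title, type, and date issued as the uniqueness key. The function name `find_duplicates` and the column names `dc.title`, `dcterms.type`, and `dcterms.issued` are assumptions for the example, not necessarily what the project uses.

```python
def find_duplicates(rows, fields=("dc.title", "dcterms.type", "dcterms.issued")):
    """Return rows whose (title, type, date issued) combination was already seen.

    `rows` is an iterable of dicts, e.g. from csv.DictReader. The field names
    are hypothetical defaults chosen for this sketch.
    """
    seen = set()
    duplicates = []
    for row in rows:
        # Build the uniqueness key from the selected fields
        key = tuple(row.get(field, "") for field in fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates
```

Only the second and later occurrences of a key are reported, so the first item with a given title, type, and date is treated as the original.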
### Removed
- Checks for invalid and unnecessary multi-value separators because now I fix
them whenever I see them, so there is no need to have checks for them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.6] - 2021-03-11
### Added
- Validation of dcterms.license field against SPDX license identifiers
### Changed
- Use DCTERMS fields where possible in `data/test.csv`
### Updated
- Run `poetry update` to update project dependencies
### Fixed
- Output for all fixes should be green, because it is good
## [0.4.5] - 2021-03-04
### Added
- Check dates in dcterms.issued field as well, not just fields that have the
word "date" in them
### Updated
- Run `poetry update` to update project dependencies
## [0.4.4] - 2021-02-21
### Added
- Accept dates formatted in ISO 8601 extended with combined date and time, for
example: 2020-08-31T11:04:56Z
- Colorized output: red for errors, yellow for warnings and information, green
for changes
### Updated
- Run `poetry update` to update project dependencies
## [0.4.3] - 2021-01-26
### Changed
- Reformat with black
- Requires Python 3.7+ for pandas 1.2.0
### Updated
- Run `poetry update`
- Expand check/fix for multi-value separators to include metadata with invalid
separators at the end, for example "Kenya||Tanzania||"
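For illustration, the trailing-separator case above could be handled along these lines. This is a minimal sketch under the assumption that `||` is the multi-value separator; the function name `strip_dangling_separators` is hypothetical and is not the project's own fix.

```python
def strip_dangling_separators(value: str) -> str:
    """Remove separator characters left dangling at either end of a value."""
    # str.strip("|") drops every leading and trailing pipe character while
    # leaving separators between values untouched.
    return value.strip("|")
```

For example, `strip_dangling_separators("Kenya||Tanzania||")` returns `"Kenya||Tanzania"`.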
## [0.4.2] - 2020-07-06
### Changed
- Add field name to the output for more fixes and checks to help identify where
the error is
- Minor optimizations to AGROVOC subject lookup
- Use Poetry instead of Pipenv
### Updated
- Update Python dependencies to latest versions
## [0.4.1] - 2020-01-15
### Changed
- Reduce minimum Python version to 3.6 by working around
`unicodedata.is_normalized()`, which is only available in Python >= 3.8
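One common way to do without `is_normalized()` on older interpreters is to normalize the string and compare it with the original. The following is a minimal sketch of that idea, not necessarily how this project implements the workaround; the helper name `is_nfc` and the choice of the NFC form are assumptions for the example.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if `text` is already in Unicode NFC form."""
    # unicodedata.is_normalized() was added in Python 3.8
    if hasattr(unicodedata, "is_normalized"):
        return unicodedata.is_normalized("NFC", text)
    # Fallback for older versions: normalize and compare with the input
    return unicodedata.normalize("NFC", text) == text
```

The fallback does more work than `is_normalized()` because it always builds the normalized string, but it gives the same answer.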
## [0.4.0] - 2020-01-15
### Added
- Unicode normalization (enable with `--unsafe-fixes`, see README.md)
### Updated
- Update Python dependencies to latest versions, including numpy 1.18.1, pandas
1.0.0rc0, flake8 3.7.9, pytest 5.3.2, and black 19.10b0
- Regenerate requirements.txt and requirements-dev.txt
### Changed
- Use Python 3.8.0 for pipenv
- Use Ubuntu 18.04 "Bionic" for TravisCI builds
- Test Python 3.8 in TravisCI builds
## [0.3.1] - 2019-10-01
### Changed
- Replace non-breaking spaces (U+00A0) with space instead of removing them
- Harmonize language of script output when fixing various issues
## [0.3.0] - 2019-09-26
### Updated
- Update Python dependencies to latest versions, including numpy 1.17.2, pandas
0.25.1, pytest 5.1.3, and requests-cache 0.5.2
### Added
- csvkit to dev requirements (csvcut, etc. are useful during development)
- Experimental language validation using the Python `langid` library (enable with `-e`, see README.md)
### Changed
- Re-formatted code with black and isort
## [0.2.2] - 2019-08-27
### Changed
- Output of date checks to include column names (helps debugging in case there are multiple date fields)
### Added
- Ability to exclude certain fields using `--exclude-fields`
- Fix for missing space after a comma, i.e. "Orth,Alan S."
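The comma fix above can be sketched with a single regular expression. This is an illustrative approximation only: the function name `fix_comma_space` and the pattern are assumptions, and the project's actual regex may differ.

```python
import re


def fix_comma_space(value: str) -> str:
    """Insert a space after any comma that is directly followed by a letter."""
    # [^\W\d_] matches a Unicode letter; the lookahead leaves digits alone so
    # that numbers such as "1,000" are not changed.
    return re.sub(r",(?=[^\W\d_])", ", ", value)
```

With this sketch, `fix_comma_space("Orth,Alan S.")` returns `"Orth, Alan S."`, while values that already have a space after the comma are left unchanged.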
### Improved
- AGROVOC lookup code
## [0.2.1] - 2019-08-11
### Added
- Check for uncommon filename extensions
- Replacement of unnecessary Unicode characters like soft hyphens (U+00AD)
## [0.2.0] - 2019-08-09
### Added
- Handle Ctrl-C interrupt gracefully
- Make output in suspicious character check more user friendly
- Add pytest-clarity to dev packages for more user friendly pytest output
## [0.1.0] - 2019-08-01
### Changed
- AGROVOC validation is now turned off by default
### Added
- Ability to enable AGROVOC validation on a field-by-field basis using the `--agrovoc-fields` option
- Option to print the version (`--version` or `-V`)