mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-12-23 04:32:21 +01:00

README.md: Add note about duplicate checking

Alan Orth 2021-03-17 10:12:03 +02:00
parent f816e17fe7
commit e92ec5d371
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9

@@ -21,6 +21,7 @@ If you use the DSpace CSV metadata quality checker please cite:
 - Remove unnecessary Unicode like [non-breaking spaces](https://en.wikipedia.org/wiki/Non-breaking_space), [replacement characters](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character), etc
 - Check for "suspicious" characters that indicate encoding or copy/paste issues, for example "foreˆt" should be "forêt"
 - Remove duplicate metadata values
+- Check for duplicate items, using the title, type, and date issued as an indicator
 ## Installation
 The easiest way to install CSV Metadata Quality is with [poetry](https://python-poetry.org):
@@ -116,10 +117,6 @@ This currently uses the [Python langid](https://github.com/saffsd/langid.py) lib
 - Warn if item is Open Access, but missing a license
 - Warn if item has an ISSN but no journal title
 - Update journal titles from ISSN
-- Check for duplicates
-- If I check titles only, then I might miss if one is a Report and another is a Presentation
-- I could just check each item against each other item, but that sounds slow...
-- Perhaps I could check for the number of unique values in a few rows, like title and doi, and see if it is the same as the total number of items
 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
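
The duplicate check described in the new README bullet (title, type, and date issued as an indicator) can be illustrated with a minimal sketch. This is not necessarily how csv-metadata-quality implements it. The function name `find_duplicate_items` and the column names `dc.title`, `dcterms.type`, and `dcterms.issued` are assumptions chosen for illustration.

```python
import csv
import io


def find_duplicate_items(rows, key_fields=("dc.title", "dcterms.type", "dcterms.issued")):
    """Return rows whose (title, type, date issued) key was already seen.

    The column names in key_fields are hypothetical; adjust them to the
    metadata fields present in your CSV.
    """
    seen = set()
    duplicates = []
    for row in rows:
        # Normalize lightly so whitespace and case differences still match.
        key = tuple((row.get(field) or "").strip().lower() for field in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


SAMPLE_CSV = """dc.title,dcterms.type,dcterms.issued
Forest cover,Report,2020
Forest cover,Presentation,2020
Forest cover,Report,2020
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
duplicates = find_duplicate_items(rows)
for row in duplicates:
    print(f"Possible duplicate: {row['dc.title']} ({row['dcterms.type']}, {row['dcterms.issued']})")
```

Keying on all three fields addresses the concern in the removed notes that a title-only check would conflate a Report with a Presentation of the same name. Tracking seen keys in a set also avoids comparing each item against every other item, so the check runs in a single pass over the rows.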