mirror of https://github.com/ilri/csv-metadata-quality.git synced 2024-11-14 10:07:03 +01:00
Commit Graph

74 Commits

Author SHA1 Message Date
5be2195325
Add fix for normalizing DOIs 2024-04-25 12:49:19 +03:00
47b03c49ba
README.md: Update TODOs
Some checks failed
continuous-integration/drone/push Build is failing
2023-03-07 10:45:04 +03:00
2dfb073b6b
Update minimum Python version to 3.9
Due to importlib.resources.files. It's a very minor thing and there
are ways to use back-ported third-party modules with this
functionality, but I'm the only one using this so...

See: https://docs.python.org/3/library/importlib.resources.html#importlib.resources.files
2022-12-13 10:41:32 +03:00
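For context, a minimal sketch of the kind of call that needs Python 3.9. The helper name `read_packaged_text` is hypothetical, and the standard library `json` package stands in for this project's own package data, which the commit message does not name.

```python
from importlib.resources import files  # available since Python 3.9


def read_packaged_text(package: str, resource: str) -> str:
    """Read a text resource that ships inside an importable package.

    Hypothetical helper for illustration only; it is not taken from
    the csv-metadata-quality code base.
    """
    return files(package).joinpath(resource).read_text(encoding="utf-8")


if __name__ == "__main__":
    # The standard library "json" package is used purely as a stand-in.
    source = read_packaged_text("json", "__init__.py")
    print(len(source) > 0)
```

On older Python versions the back-ported third-party modules mentioned above would be needed for the same call.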
566c2b45cf
Remove Excel support
I never used this and it seems xlrd doesn't even support .xlsx
anymore anyway. If this was needed I could theoretically use openpyxl
but I'd rather just stick to CSV.
2022-09-02 16:14:24 +03:00
032a1db392
README.md: Add note about missing regions
All checks were successful
continuous-integration/drone/push Build is passing
2022-07-28 16:58:01 +03:00
e7ea8ef9f0
README.md: add note about spdx-license-list
All checks were successful
continuous-integration/drone/push Build is passing
This Python module was deprecated in favor of using the SPDX license
data directly.

See: https://github.com/spdx/license-list-data
2022-01-30 13:27:20 +03:00
d126304534
README.md: update note about Python version 2022-01-30 13:05:36 +03:00
ad33195ba3
README.md: adjust intro
All checks were successful
continuous-integration/drone/push Build is passing
Makes the badges not wrap and looks better in my opinion.
2021-12-08 11:36:34 +02:00
28f9026286
README.md: Minor edit
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-19 16:26:31 +02:00
a04dbc50db
Add notes about checking and fixing mojibake 2021-03-19 11:48:27 +02:00
e92ec5d371
README.md: Add note about duplicate checking
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-17 10:12:03 +02:00
9a5e3fd6ef
README.md: Add TODO about detecting duplicates 2021-03-16 14:03:26 +02:00
1008acf35e
Always fix invalid multi-value separators
All checks were successful
continuous-integration/drone/push Build is passing
This is no longer classified as "unsafe" as I have yet to see a
case where this was intentional, and it always causes issues when
you import the data into a DSpace repository.
2021-03-13 12:59:45 +02:00
f00a07e2cd
README.md: Reorganize unsafe functionality
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-13 11:56:52 +02:00
6cc1401f88
pyproject.toml: Minimum Python is technically 3.7.1
All checks were successful
continuous-integration/drone/push Build is passing
See: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html
2021-03-11 13:41:58 +02:00
ad2cda8a41
README.md: Add note about SPDX license identifiers
All checks were successful
continuous-integration/drone/push Build is passing
2021-03-11 12:21:34 +02:00
6ca449d8ed
README.md: Update note about Python 3.8 to 3.8+
Currently the lower bound on Python version support is 3.7 because
of Pandas 1.2.0 requiring it, but I use 3.9 on my development box.
2021-03-11 12:16:07 +02:00
6e4b0e5c1b
Add validation of SPDX license identifiers
Currently this only checks the dcterms.license field and the result
will only be a warning.
2021-03-11 10:33:16 +02:00
4a7000e975
README.md: Add more ideas to do 2021-03-04 21:26:53 +02:00
91ebd0f606
README.md: Update TODOs
A few of these date things have been addressed.
2021-02-28 15:13:36 +02:00
dbbbc0944a
README.md: Add handle to citation
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-27 10:33:37 +02:00
d17bf3033c
README.md: Add citation 2021-01-27 10:32:26 +02:00
2ec52f1b73
README.md: Update description
All checks were successful
continuous-integration/drone/push Build is passing
2021-01-26 15:43:41 +02:00
aa1abf15a7
README.md: Adjust title 2021-01-26 15:35:21 +02:00
df670e81b9
README.md: Use badge from my Drone CI
All checks were successful
continuous-integration/drone/push Build is passing
I'm not using SourceHut anymore.
2021-01-26 14:38:50 +02:00
d3880a9dfa
Remove Python 3.6 support
All checks were successful
continuous-integration/drone/push Build is passing
Pandas 1.2.0 apparently requires Python 3.7.1+.
2021-01-03 15:51:53 +02:00
0dc66c5c4e
Expand check/fix for multi-value separators
I just came across some metadata that had unnecessary multi-value
separators at the end of a field, causing a blank value to be used.

For example: "Kenya||Tanzania||"
2021-01-03 15:30:03 +02:00
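The trailing-separator case described in this commit can be sketched roughly as follows. The function name `drop_empty_values` is hypothetical and the real check in the tool may differ. The sketch assumes `||` is the multi-value separator, as in the example above.

```python
def drop_empty_values(field: str, separator: str = "||") -> str:
    """Remove blank values left behind by stray multi-value separators.

    Hypothetical sketch: split on the separator, discard parts that are
    empty or whitespace-only, and join the remaining parts back together.
    """
    parts = field.split(separator)
    kept = [part for part in parts if part.strip()]
    return separator.join(kept)


if __name__ == "__main__":
    print(drop_empty_values("Kenya||Tanzania||"))  # Kenya||Tanzania
```

Filtering empty parts rather than only trimming the end of the string also covers a separator at the start of a field or two separators in a row.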
fc0367bfc8
README.md: Update note about Python version 2020-12-08 10:52:24 +02:00
e33b285034
README.md: Add GitHub Actions badge 2020-12-08 10:48:31 +02:00
db474a802f
README.md: Use badge from travis-ci.com 2020-08-04 11:12:28 +03:00
f4c5c5781e
README.md: Switch to poetry 2020-07-06 13:59:11 +03:00
40ba9bae6c
README.md: Adjust heading size 2020-01-15 12:26:11 +02:00
49e3543878
Add Unicode normalization
This will check all strings for un-normalized Unicode characters.
Normalization is done using NFC. This includes tests and updated
sample data (data/test.csv).

See: https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
2020-01-15 11:37:54 +02:00
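A minimal sketch of an NFC check and fix using only the standard library. The function names are hypothetical and are not taken from the tool itself. `unicodedata.is_normalized` requires Python 3.8 or newer.

```python
import unicodedata


def is_nfc(value: str) -> bool:
    """Return True if the string is already in NFC form."""
    return unicodedata.is_normalized("NFC", value)


def to_nfc(value: str) -> str:
    """Return the NFC-normalized form of the string."""
    return unicodedata.normalize("NFC", value)


if __name__ == "__main__":
    # "e" followed by a combining acute accent (U+0301): two code points
    # that render the same as the single precomposed character U+00E9.
    decomposed = "Cafe\u0301"
    print(is_nfc(decomposed))           # False
    print(to_nfc(decomposed) == "Caf\u00e9")  # True
```

Two strings that look identical on screen can compare as unequal until both are normalized, which is why the check matters for metadata.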
1c75608d54
README.md: Update introduction text
We should mention that this is not DSpace specific. Rather, it is
much more realistically Dublin Core specific.
2019-09-26 14:19:13 +03:00
0b15a8ed3b
README.md: Remove TODO about lack of space after comma
This was added as an automatic global fix a few weeks ago.
2019-09-26 14:16:33 +03:00
e7c220039b
README.md: Add note about experimental language validation 2019-09-26 13:59:50 +03:00
7ac1c6f554
README.md: Update comment about ISO 639-3
The pycountry library is actually using ISO 639-3 apparently.

See: https://pypi.org/project/pycountry/
2019-09-26 07:51:41 +03:00
d9fc09f121
Fix references to ISO 639
It turns out that ISO 639-1 is the two-letter codes, and ISO 639-2
is the three-letter codes, aka alpha2 and alpha3.

See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
2019-09-11 16:36:53 +03:00
2af714fb05
README.md: Add a handful of TODOs 2019-08-27 00:12:41 +03:00
bd984f3db5
README.md: Update TravisCI badge 2019-08-22 15:07:03 +03:00
3f4e84a638
README.md: Use ILRI GitHub remote 2019-08-22 14:54:12 +03:00
a00d3d7ea5
README.md: Simplify installation instructions
Pipenv has captured the local dependency with `-e .` so now it gets
installed by the Pipfile or requirements.txt.
2019-08-02 11:02:50 +03:00
0ed390dbd5
README.md: Update AGROVOC information
Now details the new `--agrovoc-fields` option.
2019-08-01 23:54:40 +03:00
fd3861e7cd
README.md: Update installation and usage instructions
It is much easier now that I have created a proper package.
2019-07-31 17:41:18 +03:00
4c4f4a3ba2
README.md: Update todos 2019-07-31 16:33:49 +03:00
22cc7bc793
README.md: Improve section on unsafe fixes 2019-07-31 16:00:05 +03:00
40d5f7d81b
Add support for removing newlines
This was tricky because of the nature of newlines. In actuality we
are removing Unix line feeds here (U+000A) because Windows carriage
returns are actually already removed by the string stripping in the
whitespace fix.

Creating the test case in Vim was difficult because I couldn't
figure out how to manually enter a line feed character. In the end I
used a search and replace on a known pattern like "ALAN", replacing
it with \r. Neither entering the Unicode code point (U+000A)
directly nor typing an "Enter" character after ^V worked. Grrr.
2019-07-30 20:05:12 +03:00
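As a rough illustration of the newline removal described in this commit, here is a sketch with a hypothetical function name. It assumes that line feeds are deleted outright rather than replaced with a space, which the commit message does not specify.

```python
def remove_line_feeds(value: str) -> str:
    """Delete Unix line feeds (U+000A) from a field value.

    Hypothetical sketch: carriage returns are not handled here because,
    per the commit message, the whitespace fix already strips them.
    """
    return value.replace("\n", "")


if __name__ == "__main__":
    # A test value can be built in code with "\n", which avoids having
    # to type a literal line feed in an editor.
    print(remove_line_feeds("Alan\nOrth"))  # AlanOrth
```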
346e66ca98
README.md: Add more information to introduction 2019-07-30 17:44:30 +03:00
a85b410ab9
README.md: Improve introduction and functionality 2019-07-30 16:09:15 +03:00
1f65a28307
Add support for validating subjects against AGROVOC
Checks values in the dc.subject or dcterms.subject field against the
AGROVOC REST API hosted by FAO. Code borrowed from agrovoc-lookup.py.

See: http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/
See: https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py
2019-07-30 00:30:31 +03:00