Add notes for 2022-06-30

2022-06-30 16:48:03 +03:00
parent 53f60284e2
commit 05cf7a26ec
28 changed files with 98 additions and 38 deletions

@@ -246,13 +246,43 @@ for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
## 2022-06-29
- Continue working on the list of non-AGROVOC subjects to report to FAO
- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts:
- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):
```console
$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
> /tmp/2022-06-28-cgspace-non-agrovoc.csv
```
## 2022-06-30
- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:
```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
```
- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
```
- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched...
- Ah, I was using `csvgrep -m 0` to find subjects with zero matches, but `-m` does a substring match, so it also matched counts that merely contain a zero, like 10, 50, 100, etc...
- We need to use a regex:
```console
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
> /tmp/2022-06-30-cgspace-non-agrovoc.csv
```
- Then I took all the terms with fifty or more occurrences and put them on a Google Sheet
- There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
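- For reference, a minimal Python sketch of why the substring match over-selected while the anchored regex does not. The column names follow the commands above, but the sample rows and the `subjects_where` helper are made up for illustration, and this only approximates what `csvgrep -m` and `csvgrep -r` do rather than reproducing csvkit itself:

```python
import csv
import io
import re

# Hypothetical sample standing in for the results file; the column names
# follow the commands above, the rows are invented for illustration.
RESULTS_CSV = """subject,number of matches
nigeria,10
cowpeas,0
policy,1
livestock systems,0
maize,100
"""


def subjects_where(rows, predicate):
    """Return the subjects whose match count satisfies the predicate."""
    return [row["subject"] for row in rows if predicate(row["number of matches"])]


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# Substring test (roughly what `-m 0` does): any count containing "0" passes.
substring_hits = subjects_where(rows, lambda value: "0" in value)

# Anchored regex (roughly what `-r '^0$'` does): only a count of exactly 0 passes.
regex_hits = subjects_where(rows, lambda value: re.search(r"^0$", value) is not None)

print(substring_hits)
print(regex_hits)
```

- With these sample rows the substring test wrongly keeps nigeria (10) and maize (100), while the anchored regex keeps only the two subjects that really have zero matches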
<!-- vim: set sw=2 ts=2: -->