mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-06-30
@ -246,13 +246,43 @@ for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
## 2022-06-29

- Continue working on the list of non-AGROVOC subjects to report to FAO
- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts:
- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):

```console
$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
  | csvcut -c subject \
  | csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
  > /tmp/2022-06-28-cgspace-non-agrovoc.csv
```

## 2022-06-30

- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:

```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
```

- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:

```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
```
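
To illustrate why lowercasing before grouping merges the case variants, here is a minimal Python sketch. The sample values and the `count_subjects` helper are hypothetical and only stand in for `metadatavalue.text_value`; Python's `str.lower()` may also treat some non-ASCII characters differently from PostgreSQL's `lower()`, depending on the database locale.

```python
from collections import Counter


def count_subjects(values):
    """Count subjects case-insensitively, similar to GROUP BY lower(text_value)."""
    return Counter(value.lower() for value in values)


# Hypothetical sample values, not real CGSpace data
sample = ["NIGERIA", "Nigeria", "nigeria", "COWPEAS", "cowpeas"]
counts = count_subjects(sample)

# Print subjects ordered by descending count, like ORDER BY count DESC
for subject, n in counts.most_common():
    print(subject, n)
```

Without the `lower()` step each spelling would be counted as its own subject, which is how the uppercase and lowercase duplicates ended up in the earlier list.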

- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched...
- Ah, I was using `csvgrep -m 0` to find rows that didn't match, but `-m` is a substring match, so it also matched rows with counts like 10, 50, 100, etc...
- We need to use a regex that is anchored to the whole value:

```console
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
  | csvcut -c subject \
  | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
  > /tmp/2022-06-30-cgspace-non-agrovoc.csv
```

- Then I took all the terms with fifty or more occurrences and put them on a Google Sheet
- There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
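
The substring pitfall and the follow-up steps above can be reproduced with a small Python sketch using only the standard library. The CSV contents and helper names here are made up for illustration; only the column names (`subject`, `number of matches`, `count`) follow the commands above, and the filters are meant to be analogous to `csvgrep -m` and `csvgrep -r`, not a reimplementation of csvkit.

```python
import csv
import io
import re

# Hypothetical stand-ins for the two CSV files used in the pipeline above
RESULTS = """subject,number of matches
nigeria,10
cowpea,0
livestock,100
agri-food systems,0
"""
COUNTS = """subject,count
nigeria,500
cowpea,75
livestock,900
agri-food systems,42
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def substring_filter(rows, column, needle):
    """Keep rows whose column contains needle anywhere (like csvgrep -m)."""
    return [row for row in rows if needle in row[column]]


def regex_filter(rows, column, pattern):
    """Keep rows whose column matches the regex (like csvgrep -r)."""
    rx = re.compile(pattern)
    return [row for row in rows if rx.search(row[column])]


def join_counts(unmatched, counts):
    """Inner join on subject, keeping the rows from the counts file."""
    wanted = {row["subject"] for row in unmatched}
    return [row for row in counts if row["subject"] in wanted]


results = read_rows(RESULTS)
loose = substring_filter(results, "number of matches", "0")  # also catches 10 and 100
exact = regex_filter(results, "number of matches", r"^0$")   # only a literal 0

non_agrovoc = join_counts(exact, read_rows(COUNTS))
frequent = [row for row in non_agrovoc if int(row["count"]) >= 50]

print(len(loose), len(exact))
print([row["subject"] for row in frequent])
```

In this toy data the substring filter keeps all four rows, including `nigeria` with 10 matches, while the anchored regex keeps only the two subjects with zero matches. That mirrors why `nigeria` showed up as unmatched in the earlier list.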

<!-- vim: set sw=2 ts=2: -->