Add notes for 2022-06-30

2025-01-27 05:49:12 +01:00 · 2022-06-30 16:48:03 +03:00
parent 53f60284e2
commit 05cf7a26ec
28 changed files with 98 additions and 38 deletions
--- a/content/posts/2022-06.md
+++ b/content/posts/2022-06.md
@@ -246,13 +246,43 @@ for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
 ## 2022-06-29

 - Continue working on the list of non-AGROVOC subjects to report to FAO
-  - I got a one liner to get the list of non-AGROVOC subjects and join them with their counts:
+  - I got a one-liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):

 ```console
-$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
  | csvcut -c subject \
  | csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
  > /tmp/2022-06-28-cgspace-non-agrovoc.csv
 ```

+## 2022-06-30
+
+- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:
+
+```console
+$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
+$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
+$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
+```
+
+- Looking at the non-AGROVOC subjects again, I see some terms in our list that appear in both uppercase and lowercase, so I will export the subjects again with everything lowercased:
+
+```console
+localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
+```
+
+- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched...
+  - Ah, I was using `csvgrep -m 0` to find rows that didn't match, but `-m` does a substring match, so it also matched rows with counts of 10, 100, 50, etc.
+  - We need to use a regex:
+
+```console
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
+  | csvcut -c subject \
+  | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
+  > /tmp/2022-06-30-cgspace-non-agrovoc.csv
+```
+
+- Then I took all the terms with fifty or more occurrences and put them in a Google Sheet
+  - There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
+
 <!-- vim: set sw=2 ts=2: -->
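
The `csvgrep -m 0` pitfall described in the 2022-06-30 notes can be reproduced without csvkit. The following is a minimal sketch, assuming a results file with `subject` and `number of matches` columns as in the commands above; the sample rows and the `rows_where` helper are hypothetical and only serve to contrast a substring match (comparable to `-m 0`) with an anchored regex (comparable to `-r '^0$'`).

```python
import csv
import io
import re

# Hypothetical sample data; the real results file is not shown in the notes.
SAMPLE = """subject,number of matches
nigeria,10
cowpea,0
policy,100
abalones,50
maize,7
policies,0
"""


def rows_where(column, predicate, text=SAMPLE):
    """Return the subjects whose value in `column` satisfies `predicate`."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["subject"] for row in reader if predicate(row[column])]


# Substring match: any count containing the character "0" is selected,
# so 10, 100 and 50 are picked up along with the real zeros.
substring_hits = rows_where("number of matches", lambda v: "0" in v)

# Anchored regex: only a count that is exactly "0" is selected.
regex_hits = rows_where(
    "number of matches", lambda v: re.search(r"^0$", v) is not None
)

print("substring:", substring_hits)
print("regex:    ", regex_hits)
```

With the substring predicate, `nigeria` is reported as unmatched even though its count is 10, which is the symptom noted above; the anchored pattern keeps only the rows whose count is exactly zero.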