diff --git a/content/posts/2022-06.md b/content/posts/2022-06.md index f0343b1e6..d09c70f5e 100644 --- a/content/posts/2022-06.md +++ b/content/posts/2022-06.md @@ -246,13 +246,43 @@ for triple in agrovoc.search_triples(None, None, '"abalones"@en'): ## 2022-06-29 - Continue working on the list of non-AGROVOC subject to report to FAO - - I got a one liner to get the list of non-AGROVOC subjects and join them with their counts: + - I got a one liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep): ```console -$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \ +$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \ | csvcut -c subject \ | csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \ > /tmp/2022-06-28-cgspace-non-agrovoc.csv ``` +## 2022-06-30 + +- Check some AfricaRice records for potential duplicates on CGSpace for Abenet: + +```console +$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv +$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u +$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv +``` + +- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase: + +```console +localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER; +``` + +- Also, I see there might be something wrong with my csvjoin because nigeria shows 
up in the final list as having not matched... + - Ah, I was using `csvgrep -m 0` to find rows that didn't match, but that also matched items that had 10, 100, 50, etc... + - We need to use a regex: + +```console +$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \ + | csvcut -c subject \ + | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \ + > /tmp/2022-06-30-cgspace-non-agrovoc.csv +``` + +- Then I took all the terms with fifty or more occurrences and put them on a Google Sheet + - There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept + diff --git a/docs/2022-06/index.html b/docs/2022-06/index.html index c63208405..751e58e93 100644 --- a/docs/2022-06/index.html +++ b/docs/2022-06/index.html @@ -26,7 +26,7 @@ There seem to be many more of these: - + @@ -58,9 +58,9 @@ There seem to be many more of these: "@type": "BlogPosting", "headline": "June, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-06/", - "wordCount": "1520", + "wordCount": "1761", "datePublished": "2022-06-06T09:01:36+03:00", - "dateModified": "2022-06-26T18:11:33+03:00", + "dateModified": "2022-06-30T09:41:54+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -394,15 +394,45 @@ There seem to be many more of these: -
$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
+
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
   | csvcut -c subject \
   | csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
   > /tmp/2022-06-28-cgspace-non-agrovoc.csv
-
+

2022-06-30

+ +
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
+$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
+$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
+
+
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
+
+
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
+  | csvcut -c subject \
+  | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
+  > /tmp/2022-06-30-cgspace-non-agrovoc.csv
+
+ diff --git a/docs/categories/index.html b/docs/categories/index.html index 361ceb8d5..ff4e3799c 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index a7f21ae4a..5429adf40 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index aafc6d6d5..94e1ee230 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 43f684302..7e5a26d0b 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 737f4eb76..6dc6256ff 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index b15a331c8..7884bb4fc 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index 9ae458335..9916a1858 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 8e8a2ec84..7d66864d2 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 9949fcc24..563b8ccec 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html 
b/docs/page/3/index.html index 285fe7780..1622b723a 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 0e7b00aed..e059782af 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 7a052cd29..48e59c0fc 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 2c111dcc5..798a0c5d9 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index c9cb37fc7..489df27d3 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index afe2d7a22..43d16abbe 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index 0e33dd4cb..31ecd6e95 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index 46f761653..93aec380f 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index baf0bb255..10436fba6 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 6ee14fd33..5eff8af89 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index b14900c29..6246cc380 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html 
b/docs/posts/page/5/index.html index 1aab32af4..add425bbc 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 868432f48..af94539a1 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index c6a9ab217..237ede810 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index 0b327d406..37d4c929e 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index d65f1cd39..49980a74e 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 5271ebf4c..238d46329 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2022-06-26T18:11:33+03:00 + 2022-06-30T09:41:54+03:00 https://alanorth.github.io/cgspace-notes/ - 2022-06-26T18:11:33+03:00 + 2022-06-30T09:41:54+03:00 https://alanorth.github.io/cgspace-notes/2022-06/ - 2022-06-26T18:11:33+03:00 + 2022-06-30T09:41:54+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2022-06-26T18:11:33+03:00 + 2022-06-30T09:41:54+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2022-06-26T18:11:33+03:00 + 2022-06-30T09:41:54+03:00 https://alanorth.github.io/cgspace-notes/2022-05/ 2022-05-30T16:00:02+03:00