diff --git a/content/posts/2022-06.md b/content/posts/2022-06.md
index f0343b1e6..d09c70f5e 100644
--- a/content/posts/2022-06.md
+++ b/content/posts/2022-06.md
@@ -246,13 +246,43 @@ for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
## 2022-06-29
- Continue working on the list of non-AGROVOC subjects to report to FAO
- - I got a one liner to get the list of non-AGROVOC subjects and join them with their counts:
+ - I got a one-liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):
```console
-$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
> /tmp/2022-06-28-cgspace-non-agrovoc.csv
```
+## 2022-06-30
+
+- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:
+
+```console
+$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
+$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
+$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
+```
+
+- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:
+
+```console
+localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
+```
+
+- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched...
+ - Ah, I was using `csvgrep -m 0` to find rows that didn't match, but that also matched items that had 10, 100, 50, etc...
+ - We need to use a regex:
+
+```console
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
+ | csvcut -c subject \
+ | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
+ > /tmp/2022-06-30-cgspace-non-agrovoc.csv
+```
+
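The pitfall above can be reproduced with standard tools. This is a minimal sketch, assuming plain `grep` as a stand-in for `csvgrep` (whose `-m` does substring matching and whose `-r` takes a regex); the sample counts are made up for illustration and are not from the real results file:

```shell
# Hypothetical "number of matches" values, one per line
counts=$(printf '0\n10\n100\n50\n7\n')

# Substring match on "0" (analogous to `csvgrep -m 0`): also hits 10, 100 and 50
substring_hits=$(printf '%s\n' "$counts" | grep -c '0')

# Anchored regex (analogous to `csvgrep -r '^0$'`): only hits the exact value 0
exact_hits=$(printf '%s\n' "$counts" | grep -c '^0$')

echo "substring: $substring_hits, anchored: $exact_hits"
```

This prints `substring: 4, anchored: 1`, since only the anchored pattern restricts the match to rows whose count is exactly zero.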
+- Then I took all the terms with fifty or more occurrences and put them in a Google Sheet
+ - There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
+
diff --git a/docs/2022-06/index.html b/docs/2022-06/index.html
index c63208405..751e58e93 100644
--- a/docs/2022-06/index.html
+++ b/docs/2022-06/index.html
@@ -26,7 +26,7 @@ There seem to be many more of these:
-
+
@@ -58,9 +58,9 @@ There seem to be many more of these:
"@type": "BlogPosting",
"headline": "June, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-06/",
- "wordCount": "1520",
+ "wordCount": "1761",
"datePublished": "2022-06-06T09:01:36+03:00",
- "dateModified": "2022-06-26T18:11:33+03:00",
+ "dateModified": "2022-06-30T09:41:54+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -394,15 +394,45 @@ There seem to be many more of these:
- Continue working on the list of non-AGROVOC subjects to report to FAO
-- I got a one liner to get the list of non-AGROVOC subjects and join them with their counts:
+- I got a one-liner to get the list of non-AGROVOC subjects and join them with their counts (updated to use regex in csvgrep):
-$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-28-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
> /tmp/2022-06-28-cgspace-non-agrovoc.csv
-
+
2022-06-30
+
+- Check some AfricaRice records for potential duplicates on CGSpace for Abenet:
+
+$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/Africarice_2ndBatch_ay.csv | sed '1s/line_number/id/' > /tmp/africarice.csv
+$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u
+$ ./ilri/check-duplicates.py -i /tmp/africarice-cleaned.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/africarice-duplicates.csv
+
+- Looking at the non-AGROVOC subjects again, I see some in our list that are duplicated in uppercase and lowercase, so I will run it again with all lowercase:
+
+localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-30-cgspace-subjects.csv WITH CSV HEADER;
+
+- Also, I see there might be something wrong with my csvjoin because nigeria shows up in the final list as having not matched…
+
+- Ah, I was using csvgrep -m 0 to find rows that didn’t match, but that also matched items that had 10, 100, 50, etc…
+- We need to use a regex:
+
+
+
+$ csvgrep -c 'number of matches' -r '^0$' /tmp/2022-06-30-cgspace-subjects-results.csv \
+ | csvcut -c subject \
+ | csvjoin -c subject /tmp/2022-06-30-cgspace-subjects.csv - \
+ > /tmp/2022-06-30-cgspace-non-agrovoc.csv
+
+- Then I took all the terms with fifty or more occurrences and put them in a Google Sheet
+
+- There I started removing any term that was a variation of an existing AGROVOC term (like cowpea/cowpeas, policy/policies) or a compound concept
+
+
+
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 361ceb8d5..ff4e3799c 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index a7f21ae4a..5429adf40 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index aafc6d6d5..94e1ee230 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 43f684302..7e5a26d0b 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 737f4eb76..6dc6256ff 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index b15a331c8..7884bb4fc 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 9ae458335..9916a1858 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index 8e8a2ec84..7d66864d2 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 9949fcc24..563b8ccec 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 285fe7780..1622b723a 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 0e7b00aed..e059782af 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 7a052cd29..48e59c0fc 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 2c111dcc5..798a0c5d9 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index c9cb37fc7..489df27d3 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index afe2d7a22..43d16abbe 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/page/9/index.html b/docs/page/9/index.html
index 0e33dd4cb..31ecd6e95 100644
--- a/docs/page/9/index.html
+++ b/docs/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 46f761653..93aec380f 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index baf0bb255..10436fba6 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 6ee14fd33..5eff8af89 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index b14900c29..6246cc380 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 1aab32af4..add425bbc 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 868432f48..af94539a1 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index c6a9ab217..237ede810 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 0b327d406..37d4c929e 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html
index d65f1cd39..49980a74e 100644
--- a/docs/posts/page/9/index.html
+++ b/docs/posts/page/9/index.html
@@ -10,7 +10,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 5271ebf4c..238d46329 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2022-06-26T18:11:33+03:00
+ 2022-06-30T09:41:54+03:00
https://alanorth.github.io/cgspace-notes/
- 2022-06-26T18:11:33+03:00
+ 2022-06-30T09:41:54+03:00
https://alanorth.github.io/cgspace-notes/2022-06/
- 2022-06-26T18:11:33+03:00
+ 2022-06-30T09:41:54+03:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2022-06-26T18:11:33+03:00
+ 2022-06-30T09:41:54+03:00
https://alanorth.github.io/cgspace-notes/posts/
- 2022-06-26T18:11:33+03:00
+ 2022-06-30T09:41:54+03:00
https://alanorth.github.io/cgspace-notes/2022-05/
2022-05-30T16:00:02+03:00