From ad8516bbb356737ceaa91c7f1585e53b01962f63 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Thu, 27 Apr 2023 13:10:13 -0700 Subject: [PATCH] Add notes for 2023-04-27 --- content/posts/2022-06.md | 4 +- content/posts/2023-04.md | 43 +++++++++++++++++++++ docs/2022-06/index.html | 6 +-- docs/2023-04/index.html | 51 +++++++++++++++++++++++-- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/categories/notes/page/6/index.html | 2 +- docs/categories/notes/page/7/index.html | 2 +- docs/index.html | 2 +- docs/page/10/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/page/9/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/10/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/posts/page/9/index.html | 2 +- docs/sitemap.xml | 10 ++--- 33 files changed, 129 insertions(+), 41 deletions(-) diff --git a/content/posts/2022-06.md b/content/posts/2022-06.md index 3a9c436ef..6df39a9e6 100644 --- a/content/posts/2022-06.md +++ b/content/posts/2022-06.md @@ -202,7 +202,7 @@ $ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-co ## 2022-06-28 -- Start working on the CGSpace subject export for FAO +- Start working on the CGSpace subject export for FAO / AGROVOC - First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts: ```console @@ -220,7 +220,7 @@ $ 
./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022- - I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects! - I think I will have to write some custom script to use the AGROVOC RDF file - Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient -- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limiting and I'm not sure how to search yet +- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet - I had to try in different Python versions because 3.10.x is apparently too new - For future reference I was able to search with lightrdf: diff --git a/content/posts/2023-04.md b/content/posts/2023-04.md index 019f32777..c577229f0 100644 --- a/content/posts/2023-04.md +++ b/content/posts/2023-04.md @@ -481,6 +481,10 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b - As the quality settings are not comparable between formats, we need to compare the formats at matching perceptual scores (ssimulacra2 in this case) - I used a ssimulacra2 score of 80 because that's about the highest score I see with WebP using my samples, though JPEG and AVIF do go higher - Also, according to current ssimulacra2 (v2.1), a score of 70 is "high quality" and a score of 90 is "very high quality", so 80 should be reasonably high enough... 
+- Here is a plot of the qualities and ssimulacra2 scores: + +![Quality vs Score](/cgspace-notes/2023/04/quality-vs-score-ssimulacra-v2.1.png) + - Export CGSpace to check for missing Initiatives mappings ## 2023-04-22 @@ -491,4 +495,43 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b - Also, I found a few items submitted by MEL that had dates in DD/MM/YYYY format, so I sent them to Salem for him to investigate - Start a harvest on AReS +## 2023-04-26 + +- Begin working on the list of non-AGROVOC CGSpace subjects for FAO + - The last time I did this was in 2022-06 + - I used the following SQL query to dump values from all subject fields, lower case them, and group by counts: + +```console +localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER; +COPY 26315 +Time: 2761.981 ms (00:02.762) +``` + +- Then I extracted the subjects and looked them up against AGROVOC: + +```console +$ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt +$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv +``` + +## 2023-04-27 + +- The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I can see the counts: + - (I also note that the `agrovoc_lookup.py` script didn't seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB!) 
+ ```console +csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \ + | csvcut -c subject \ + | csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \ + > /tmp/2023-04-26-cgspace-non-agrovoc.csv +``` + +- I filtered for only those terms that had counts larger than fifty + - I also removed terms like "forages", "policy", "pests and diseases" because those exist as singular or separate terms in AGROVOC + - I also removed ambiguous terms like "cocoa", "diversity", "resistance" etc because there are various other preferred terms for those in AGROVOC + - I also removed spelling mistakes like "modeling" and "savanas" because those exist in their correct form in AGROVOC + - I also removed internal CGIAR terms like "tac", "crp", "internal review" etc (note: these are mostly from CGIAR System Office's subjects... perhaps I exclude those next time?) +- I note that many of *our* terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms +- I couldn't finish the work locally yet so I uploaded my list to Google Docs to continue later + diff --git a/docs/2022-06/index.html b/docs/2022-06/index.html index e24322005..ab7bd6884 100644 --- a/docs/2022-06/index.html +++ b/docs/2022-06/index.html @@ -58,7 +58,7 @@ There seem to be many more of these: "@type": "BlogPosting", "headline": "June, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-06/", - "wordCount": "1786", + "wordCount": "1788", "datePublished": "2022-06-06T09:01:36+03:00", "dateModified": "2022-08-03T21:01:39+03:00", "author": { @@ -349,7 +349,7 @@ There seem to be many more of these:

2022-06-28

localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-28-cgspace-subjects.csv WITH CSV HEADER;
@@ -366,7 +366,7 @@ There seem to be many more of these:
 
  • Using rdflib to open the 1.2GB agrovoc_lod.rdf file takes several minutes and doesn’t seem very efficient
  • -
  • I tried using lightrdf and it’s much quicker, but the documentation is limiting and I’m not sure how to search yet +
  • I tried using lightrdf and it’s much quicker, but the documentation is limited and I’m not sure how to search yet
    • I had to try in different Python versions because 3.10.x is apparently too new
    diff --git a/docs/2023-04/index.html b/docs/2023-04/index.html index d9509a485..f67dfe3d5 100644 --- a/docs/2023-04/index.html +++ b/docs/2023-04/index.html @@ -20,7 +20,7 @@ Start a harvest on AReS - + @@ -46,9 +46,9 @@ Start a harvest on AReS "@type": "BlogPosting", "headline": "April, 2023", "url": "https://alanorth.github.io/cgspace-notes/2023-04/", - "wordCount": "2033", + "wordCount": "2400", "datePublished": "2023-04-02T08:19:36+03:00", - "dateModified": "2023-04-20T22:44:18-07:00", + "dateModified": "2023-04-22T16:37:19-07:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -661,6 +661,10 @@ Start a harvest on AReS
  • Also, according to current ssimulacra2 (v2.1), a score of 70 is “high quality” and a score of 90 is “very high quality”, so 80 should be reasonably high enough…
  • +
  • Here is a plot of the qualities and ssimulacra2 scores:
  • + +

    Quality vs Score

    +
    • Export CGSpace to check for missing Initiatives mappings

    2023-04-22

    @@ -674,6 +678,47 @@ Start a harvest on AReS
  • Start a harvest on AReS
  • +

    2023-04-26

    +
      +
    • Begin working on the list of non-AGROVOC CGSpace subjects for FAO +
        +
      • The last time I did this was in 2022-06
      • +
      • I used the following SQL query to dump values from all subject fields, lower case them, and group by counts:
      • +
      +
    • +
    +
    localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER;
    +COPY 26315
    +Time: 2761.981 ms (00:02.762)
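As a rough illustration of what the query above does (lower-case every subject value, then count and sort by frequency), here is a minimal Python sketch. The function name `count_subjects` and the sample values are made up for illustration and are not real CGSpace metadata; the actual export was done with the SQL shown above.

```python
from collections import Counter


def count_subjects(values):
    """Lower-case each subject value and count occurrences, most frequent first."""
    counts = Counter(value.lower() for value in values)
    # most_common() sorts by descending count, similar to ORDER BY count DESC
    return counts.most_common()


# Made-up sample values (assumption), not taken from the real database
sample = ["Maize", "maize", "LIVESTOCK", "Livestock", "livestock", "gender"]
subject_counts = count_subjects(sample)
print(subject_counts)
```

Lower-casing before counting is what merges "Maize" and "maize" into a single row, which is presumably why the 2023 query wraps `text_value` in `lower()` while the 2022-06 query did not.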
    +
      +
    • Then I extracted the subjects and looked them up against AGROVOC:
    • +
    +
    $ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt
    +$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv
    +

    2023-04-27

    +
      +
    • The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I can see the counts: +
        +
      • (I also note that the agrovoc_lookup.py script didn’t seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB!)
      • +
      +
    • +
    +
    csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \
    +  | csvcut -c subject \
    +  | csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \
    +  > /tmp/2023-04-26-cgspace-non-agrovoc.csv
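For reference, the csvkit pipeline above could be approximated with Python's standard `csv` module. This is only a sketch: the column names `subject` and `number of matches` come from the commands above, the `count` column name is assumed from the header PostgreSQL writes for `count(*)`, and the function name and sample data are invented for illustration.

```python
import csv
import io


def non_matching_with_counts(results_csv, counts_csv):
    """Return (subject, count) rows for subjects with zero AGROVOC matches."""
    # Subjects whose "number of matches" column is exactly "0"
    unmatched = {
        row["subject"]
        for row in csv.DictReader(results_csv)
        if row["number of matches"] == "0"
    }
    # Keep rows from the counts file whose subject is unmatched (an inner join)
    return [
        (row["subject"], int(row["count"]))
        for row in csv.DictReader(counts_csv)
        if row["subject"] in unmatched
    ]


# Made-up sample data (assumption) standing in for the two CSV files
results = io.StringIO("subject,number of matches\nmaize,1\nforages,0\nsavanas,0\n")
counts = io.StringIO("subject,count\nmaize,300\nforages,120\nsavanas,7\n")
rows = non_matching_with_counts(results, counts)
print(rows)
```

Unlike `csvjoin`, this sketch collects the unmatched subjects into a set first, so a subject repeated in the results file would not produce duplicate output rows.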
    +
      +
    • I filtered for only those terms that had counts larger than fifty +
        +
      • I also removed terms like “forages”, “policy”, “pests and diseases” because those exist as singular or separate terms in AGROVOC
      • +
      • I also removed ambiguous terms like “cocoa”, “diversity”, “resistance” etc because there are various other preferred terms for those in AGROVOC
      • +
      • I also removed spelling mistakes like “modeling” and “savanas” because those exist in their correct form in AGROVOC
      • +
      • I also removed internal CGIAR terms like “tac”, “crp”, “internal review” etc (note: these are mostly from CGIAR System Office’s subjects… perhaps I exclude those next time?)
      • +
      +
    • +
    • I note that many of our terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms
    • +
    • I couldn’t finish the work locally yet so I uploaded my list to Google Docs to continue later
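The manual filtering described above (keep counts larger than fifty, then set aside terms that would match in singular or plural form) could be partly automated. The following is a hypothetical sketch, not the method actually used: `naive_variants`, `review_candidates`, the `known_labels` set, and the sample rows are all assumptions for illustration, and the singular/plural handling only adds or strips a trailing "s".

```python
def naive_variants(term):
    """Return crude singular/plural variants of a term (trailing "s" only)."""
    variants = {term + "s"}
    if term.endswith("s") and len(term) > 1:
        variants.add(term[:-1])
    return variants


def review_candidates(rows, known_labels, min_count=50):
    """Split frequent unmatched terms into near-matches and remaining candidates."""
    near_matches, candidates = [], []
    for subject, count in rows:
        if count <= min_count:
            continue  # keep only terms with counts larger than the threshold
        if naive_variants(subject) & known_labels:
            near_matches.append(subject)  # would match as singular or plural
        else:
            candidates.append(subject)
    return near_matches, candidates


# Made-up sample data (assumption); known_labels stands in for AGROVOC labels
rows = [("forages", 120), ("goat", 75), ("hypothetical term", 60), ("savanas", 7)]
known_labels = {"forage", "goats"}
near_matches, candidates = review_candidates(rows, known_labels)
print(near_matches, candidates)
```

A real review would still need a human pass, since ambiguous terms, spelling mistakes, and terms that should be split into separate subjects are not caught by a trailing-"s" check.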
    • +
    diff --git a/docs/categories/index.html b/docs/categories/index.html index 93a5b8caa..b460ac457 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 32091bb5f..18d6a1c31 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index fea80197a..72c0040ce 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 91f4c7af2..3df5d4d40 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 9345865a5..f120e5486 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 181304a78..226316b9c 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index bd6903dd8..22a35cbc6 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index 3ed7f5987..68d7a7338 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index ba5d800fa..035f80ee7 100644 --- a/docs/index.html +++ b/docs/index.html @@ 
-10,7 +10,7 @@ - + diff --git a/docs/page/10/index.html b/docs/page/10/index.html index ebd33e53c..883e30ebe 100644 --- a/docs/page/10/index.html +++ b/docs/page/10/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index d0dd6d645..27ff52469 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 399ac02c3..03b66c6c0 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index f6946505f..23c97e9f2 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index cdf107ee3..cb80335b3 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index b4cabbba3..8ac2c86bd 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 52948314c..400567ced 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 21719ecd1..751ce38ab 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index 914bd8fea..3514c96f5 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index d26e0d3b5..88e7b473e 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/10/index.html b/docs/posts/page/10/index.html index d595ceb5d..2e9f511dc 100644 --- a/docs/posts/page/10/index.html +++ b/docs/posts/page/10/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 1177827e0..a9ea4cb27 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index b236c1509..32f34f976 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 5b2ac8555..bd7c5523d 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 7ac40a009..85bf5959b 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index c76bd2f7a..0c9026c58 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 9a24fcb87..0ef61fd9e 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index be641f3d2..2c98e6752 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index b8ae87223..ab17d320d 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 9a8b6315c..dd2df7ef7 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/2023-04/ - 2023-04-20T22:44:18-07:00 + 2023-04-22T16:37:19-07:00 https://alanorth.github.io/cgspace-notes/categories/ - 2023-04-20T22:44:18-07:00 + 2023-04-22T16:37:19-07:00 
https://alanorth.github.io/cgspace-notes/ - 2023-04-20T22:44:18-07:00 + 2023-04-22T16:37:19-07:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2023-04-20T22:44:18-07:00 + 2023-04-22T16:37:19-07:00 https://alanorth.github.io/cgspace-notes/posts/ - 2023-04-20T22:44:18-07:00 + 2023-04-22T16:37:19-07:00 https://alanorth.github.io/cgspace-notes/2023-03/ 2023-04-02T09:16:25+03:00