Add notes for 2023-04-27

2023-04-27 13:10:13 -07:00
parent 0ca3cadbef
commit ad8516bbb3
33 changed files with 129 additions and 41 deletions


@@ -202,7 +202,7 @@ $ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-co
## 2022-06-28
- Start working on the CGSpace subject export for FAO / AGROVOC
- First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts:
```console
@@ -220,7 +220,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-
- I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!
- I think I will have to write some custom script to use the AGROVOC RDF file
- Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet
- I had to try it in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:


@@ -481,6 +481,10 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b
- As the quality settings are not comparable between formats, we need to compare the formats at matching perceptual scores (ssimulacra2 in this case)
- I used a ssimulacra2 score of 80 because that's about the highest score I see with WebP using my samples, though JPEG and AVIF do go higher
- Also, according to the current ssimulacra2 (v2.1), a score of 70 is "high quality" and a score of 90 is "very high quality", so 80 should be high enough...
- Here is a plot of the qualities and ssimulacra2 scores:
![Quality vs Score](/cgspace-notes/2023/04/quality-vs-score-ssimulacra-v2.1.png)
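- Roughly, the comparison for each sample and quality setting went something like the following sketch (the tool names, flags, and filenames here are assumptions from memory, not the exact commands I ran):

```console
$ # Hypothetical sketch: encode one sample at one quality, decode it back to PNG,
$ # then score the result against the original with ssimulacra2
$ cwebp -q 80 sample.png -o sample-q80.webp
$ magick sample-q80.webp sample-q80-decoded.png
$ ssimulacra2 sample.png sample-q80-decoded.png
```

- Repeating that across qualities for JPEG, WebP, and AVIF is what produces the curves in the plot above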
- Export CGSpace to check for missing Initiatives mappings
## 2023-04-22
@@ -491,4 +495,43 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b
- Also, I found a few items submitted by MEL that had dates in DD/MM/YYYY format, so I sent them to Salem for him to investigate
- Start a harvest on AReS
## 2023-04-26
- Begin working on the list of non-AGROVOC CGSpace subjects for FAO
- The last time I did this was in 2022-06
- I used the following SQL query to dump the values from all subject fields, lowercased and grouped with their counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER;
COPY 26315
Time: 2761.981 ms (00:02.762)
```
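- (For reference, those `metadata_field_id`s can be double checked against DSpace's metadata field registry with something like this sketch:)
```console
localhost/dspacetest= ☘ SELECT metadata_field_id, element, qualifier FROM metadatafieldregistry WHERE metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119);
```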
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv
```
## 2023-04-27
- The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I could see the counts:
- (I also note that the `agrovoc_lookup.py` script didn't seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB!)
```console
csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \
> /tmp/2023-04-26-cgspace-non-agrovoc.csv
```
- I filtered for only those terms that had counts larger than fifty (see the sketch after this list for one way to do that)
- I also removed terms like "forages", "policy", "pests and diseases" because those exist as singular or separate terms in AGROVOC
- I also removed ambiguous terms like "cocoa", "diversity", "resistance" etc because there are various other preferred terms for those in AGROVOC
- I also removed spelling mistakes like "modeling" and "savanas" because those exist in their correct form in AGROVOC
- I also removed internal CGIAR terms like "tac", "crp", "internal review", etc. (note: these are mostly from the CGIAR System Office's subjects... perhaps I should exclude those next time?)
- I note that many of *our* terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms
- I couldn't finish the work locally yet so I uploaded my list to Google Docs to continue later
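- For reference, one way to do that count filter with csvkit's `csvsql` (a sketch, not necessarily what I ran; the output filename is made up and I'm assuming the joined CSV kept the `subject` and `count` columns):

```console
$ # Hypothetical: keep only rows where the count column is greater than fifty
$ csvsql --tables subjects \
    --query 'SELECT * FROM subjects WHERE "count" > 50' \
    /tmp/2023-04-26-cgspace-non-agrovoc.csv \
    > /tmp/2023-04-26-cgspace-non-agrovoc-gt50.csv
```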
<!-- vim: set sw=2 ts=2: -->