Add notes for 2023-04-27
@@ -202,7 +202,7 @@ $ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-co
## 2022-06-28
- Start working on the CGSpace subject export for FAO / AGROVOC
- First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts:
```console
@@ -220,7 +220,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-
- I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!
- I think I will have to write some custom script to use the AGROVOC RDF file
- Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet
- I had to try it in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:
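
- (A minimal sketch of that kind of search, assuming lightrdf's `RDFDocument`/`search_triples()` API; the file path and the term are illustrative:)

```python
# Sketch only: assumes lightrdf exposes RDFDocument and search_triples(),
# and that the AGROVOC dump is in the current directory.
import lightrdf

doc = lightrdf.RDFDocument("agrovoc_lod.rdf")

# None acts as a wildcard in any position; here we scan all triples and
# filter for skos:prefLabel objects containing the term. For repeated
# lookups it would be better to build a dict of labels first.
for s, p, o in doc.search_triples(None, None, None):
    if p.endswith("core#prefLabel") and "maize" in o.lower():
        print(s, o)
```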
@@ -481,6 +481,10 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b
- As the quality settings are not comparable between formats, we need to compare the formats at matching perceptual scores (ssimulacra2 in this case)
- I used a ssimulacra2 score of 80 because that's about the highest score I see with WebP using my samples, though JPEG and AVIF do go higher
- Also, according to current ssimulacra2 (v2.1), a score of 70 is "high quality" and a score of 90 is "very high quality", so 80 should be reasonably high...
- Here is a plot of the qualities and ssimulacra2 scores (a sketch of the sweep follows below the plot):

- Export CGSpace to check for missing Initiatives mappings
## 2023-04-22
@@ -491,4 +495,43 @@ $ psql -d dspace -c "update bundle set primary_bitstream_id=NULL where primary_b
- Also, I found a few items submitted by MEL that had dates in DD/MM/YYYY format, so I sent them to Salem for him to investigate (a quick check for these is sketched below)
- Start a harvest on AReS
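
- (A quick check for such dates might look like the sketch below; the export file name and column name are illustrative:)

```python
# Sketch only: flag DD/MM/YYYY-style values in a metadata CSV export.
# The file name and column name are illustrative.
import csv
import re

DDMMYYYY = re.compile(r"^\d{2}/\d{2}/\d{4}$")

with open("2023-04-22-cgspace-export.csv", newline="") as f:
    for row in csv.DictReader(f):
        value = row.get("dcterms.issued", "")
        if DDMMYYYY.match(value):
            print(row.get("id"), value)
```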
## 2023-04-26
- Begin working on the list of non-AGROVOC CGSpace subjects for FAO
- The last time I did this was in 2022-06
- I used the following SQL query to dump values from all subject fields, lower case them, and group by counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2023-04-26-cgspace-subjects.csv WITH CSV HEADER;
COPY 26315
Time: 2761.981 ms (00:02.762)
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2023-04-26-cgspace-subjects.csv | sed '1d' > /tmp/2023-04-26-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2023-04-26-cgspace-subjects.txt -o /tmp/2023-04-26-cgspace-subjects-results.csv
```
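
- (For reference, a minimal sketch of a single-term lookup like the ones `agrovoc_lookup.py` performs; the Skosmos-style endpoint URL and response fields here are assumptions, and the real script also handles caching and retries:)

```python
# Sketch only: the endpoint URL and JSON fields are assumptions about
# AGROVOC's Skosmos REST API; agrovoc_lookup.py also caches and retries.
import requests

def agrovoc_match(term: str, lang: str = "en") -> bool:
    """Return True if AGROVOC has a matching preferred label for the term."""
    r = requests.get(
        "https://agrovoc.fao.org/browse/rest/v1/agrovoc/search",
        params={"query": term, "lang": lang},
        timeout=30,
    )
    r.raise_for_status()
    results = r.json().get("results", [])
    return any(res.get("prefLabel", "").lower() == term.lower() for res in results)

print(agrovoc_match("maize"))
```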
## 2023-04-27
- The AGROVOC lookup from yesterday finished, so I extracted all terms that did not match and joined them with the original CSV so I can see the counts:
- (I also note that the `agrovoc_lookup.py` script didn't seem to be caching properly, as it had to look up everything again the next time I ran it despite the requests cache being 174MB! See the caching sketch below the next code block.)
```console
$ csvgrep -c 'number of matches' -r '^0$' /tmp/2023-04-26-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2023-04-26-cgspace-subjects.csv - \
> /tmp/2023-04-26-cgspace-non-agrovoc.csv
```
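
- (On the caching issue noted above: a minimal sketch of persistent caching with `requests_cache`; how `agrovoc_lookup.py` actually configures its cache is not shown here, and an expiring cache or varying request parameters would both explain the repeated lookups:)

```python
# Sketch only: a SQLite-backed requests_cache session; with the default
# settings entries never expire, so repeating the same lookup should be
# served from disk instead of hitting the API again.
import requests_cache

session = requests_cache.CachedSession("agrovoc-cache")

r = session.get(
    "https://agrovoc.fao.org/browse/rest/v1/agrovoc/search",
    params={"query": "maize", "lang": "en"},
    timeout=30,
)
print(r.from_cache)  # True on the second identical request
```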
- I filtered for only those terms that had counts larger than fifty (see the sketch after this list)
- I also removed terms like "forages", "policy", "pests and diseases" because those exist as singular or separate terms in AGROVOC
- I also removed ambiguous terms like "cocoa", "diversity", "resistance", etc. because there are various other preferred terms for those in AGROVOC
- I also removed spelling mistakes like "modeling" and "savanas" because those exist in their correct form in AGROVOC
- I also removed internal CGIAR terms like "tac", "crp", "internal review", etc. (note: these are mostly from CGIAR System Office's subjects... perhaps I should exclude those next time?)
- I note that many of *our* terms would match if they were singular, plural, or split up into separate terms, so perhaps we should pair this with an exercise to review our own terms
- I couldn't finish the work locally yet so I uploaded my list to Google Docs to continue later
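
- (The count filter itself is simple; a sketch with the csv module, assuming the joined CSV has `subject` and `count` columns and an illustrative output path:)

```python
# Sketch only: keep non-AGROVOC subjects that occur more than fifty times.
# Column names and the output path are assumptions.
import csv

with open("/tmp/2023-04-26-cgspace-non-agrovoc.csv", newline="") as infile, \
        open("/tmp/2023-04-26-cgspace-non-agrovoc-top.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=["subject", "count"])
    writer.writeheader()
    for row in reader:
        if int(row["count"]) > 50:
            writer.writerow({"subject": row["subject"], "count": row["count"]})
```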
<!-- vim: set sw=2 ts=2: -->