mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2019-02-21
@ -1002,4 +1002,38 @@ $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X P
- See this [issue on the VIVO tracker](https://jira.duraspace.org/browse/VIVO-1655) for more information about this endpoint
- The old-school AGROVOC SOAP WSDL works with the [Zeep Python library](https://python-zeep.readthedocs.io/en/master/), but in my tests the results are way too broad despite trying to use an "exact match" search

## 2019-02-21

- I wrote a script [agrovoc-lookup.py](https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py) to resolve subject terms against the public AGROVOC REST API
- It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
- I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:

```
$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
```
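The lookup step above can be sketched roughly as follows. This is not the code of `agrovoc-lookup.py`; it is a minimal illustration that assumes a Skosmos-style search endpoint returning JSON with a `results` list, and assumes that exactly one result counts as a match. `API_URL`, `build_url`, `fetch_json`, `is_match`, and `classify` are hypothetical names, and the URL should be checked against the current AGROVOC documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Skosmos-style AGROVOC search endpoint; verify the URL before use.
API_URL = "https://agrovoc.fao.org/browse/rest/v1/search"


def build_url(term, lang="en"):
    """Build the search URL for one term in one language."""
    return API_URL + "?" + urlencode({"query": term, "lang": lang})


def fetch_json(url):
    """Fetch a URL and decode the JSON body (network access required)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def is_match(data):
    """Assumption: a term matches when the search returns exactly one result."""
    return len(data.get("results", [])) == 1


def classify(terms, lang="en", fetch=fetch_json):
    """Split terms into (matched, rejected) lists, preserving input order."""
    matched, rejected = [], []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip blank lines in the input file
        data = fetch(build_url(term, lang))
        (matched if is_match(data) else rejected).append(term)
    return matched, rejected
```

Passing a stub as `fetch` makes it possible to try the classification logic without touching the network.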

- Then I generated a list of all the unique matched terms:

```
$ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
```

- And then a list of all the unique *unmatched* terms using `comm`, a utility I had never heard of before (`-13` suppresses the lines unique to the first file and the lines common to both, leaving only the lines unique to the second file):

```
$ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
```
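For comparison, the `sort | uniq` and `comm -13` steps above amount to a set union followed by a set difference. A minimal Python sketch, with hypothetical function names and small in-memory lists standing in for the files (the sample terms are made up for illustration):

```python
def unique_sorted(*term_lists):
    """Roughly `cat files | sort | uniq`: merge, deduplicate, sort."""
    merged = set()
    for terms in term_lists:
        merged.update(t.strip() for t in terms if t.strip())
    return sorted(merged)


def only_in_second(first, second):
    """Roughly `comm -13 first second`: terms found only in the second list.

    Unlike `comm`, this also deduplicates the second list.
    """
    seen = set(first)
    return sorted(t for t in set(second) if t not in seen)


# Made-up sample data, not taken from CGSpace
matched_en = ["MAIZE", "RICE"]
matched_fr = ["RICE", "SORGHUM"]
all_subjects = ["MAIZE", "RICE", "SORGHUM", "LIVESTOCK", "GENDER"]

matched = unique_sorted(matched_en, matched_fr)
unmatched = only_in_second(matched, all_subjects)
```

`comm` requires both inputs to be sorted in the same collation order, which is why the subject list is sorted first; the set-based version has no such requirement.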

- Generate a list of countries and regions from CGSpace for Sisay to look through:

```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
```

<!-- vim: set sw=2 ts=2: -->