diff --git a/content/posts/2019-02.md b/content/posts/2019-02.md
index 5e327e729..485057536 100644
--- a/content/posts/2019-02.md
+++ b/content/posts/2019-02.md
@@ -1002,4 +1002,38 @@ $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X P
 - See this [issue on the VIVO tracker](https://jira.duraspace.org/browse/VIVO-1655) for more information about this endpoint
 - The old-school AGROVOC SOAP WSDL works with the [Zeep Python library](https://python-zeep.readthedocs.io/en/master/), but in my tests the results are way too broad despite trying to use an "exact match" search

+## 2019-02-21
+
+- I wrote a script [agrovoc-lookup.py](https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py) to resolve subject terms against the public AGROVOC REST API
+- It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
+- I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:
+
+```
+$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
+$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
+$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
+```
+
+- Then I generated a list of all the unique matched terms:
+
+```
+$ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
+```
+
+- And then a list of all the unique *unmatched* terms using some utility I've never heard of before called `comm`:
+
+```
+$ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
+$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
+```
+
+- Generate a list of countries and regions from CGSpace for Sisay to look through:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
+COPY 202
+dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
+COPY 33
+```
+
diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
index 02167748c..fa7b56b33 100644
--- a/docs/2019-02/index.html
+++ b/docs/2019-02/index.html
@@ -42,7 +42,7 @@ sys 0m1.979s
-
+
@@ -89,9 +89,9 @@ sys 0m1.979s
   "@type": "BlogPosting",
   "headline": "February, 2019",
   "url": "https://alanorth.github.io/cgspace-notes/2019-02/",
-  "wordCount": "5730",
+  "wordCount": "5947",
   "datePublished": "2019-02-01T21:37:30+02:00",
-  "dateModified": "2019-02-20T17:36:22-08:00",
+  "dateModified": "2019-02-20T18:20:09-08:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -1294,6 +1294,44 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
  • The old-school AGROVOC SOAP WSDL works with the Zeep Python library, but in my tests the results are way too broad despite trying to use an “exact match” search
  • +

    2019-02-21

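    + • I wrote a script agrovoc-lookup.py to resolve subject terms against the public AGROVOC REST API
    + • It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
    + • I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:
    +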
    $ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
    +$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
    +$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
    +
    + • Then I generated a list of all the unique matched terms:
    +
    $ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
    +
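    + • And then a list of all the unique unmatched terms using some utility I've never heard of before called comm:
    +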
    $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
    +$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
    +
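The `comm -13` step above suppresses the lines unique to the first file and the lines common to both, so only the subjects that never matched in any language remain. As a rough illustration only (this is not part of the original workflow), the same set difference could be sketched in Python. The function name `unmatched_terms` and the sample terms are hypothetical, and unlike `comm` this version also de-duplicates its input and drops blank lines.

```python
def unmatched_terms(all_terms, matched_terms):
    """Return the sorted terms present in all_terms but absent from matched_terms.

    Roughly equivalent to `comm -13 matched.txt all-sorted.txt`, except that
    duplicates and blank lines are discarded (an assumption made for this sketch).
    """
    matched = {term.strip() for term in matched_terms if term.strip()}
    candidates = {term.strip() for term in all_terms if term.strip()}
    # Set difference keeps only the terms that were never matched
    return sorted(candidates - matched)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real subject lists
    subjects = ["MAIZE", "LIVESTOCK", "FOO BAR", "MAIZE"]
    matched = ["LIVESTOCK", "MAIZE"]
    print(unmatched_terms(subjects, matched))
```

Working with sets avoids the requirement that both inputs be pre-sorted, which `comm` relies on.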
    + • Generate a list of countries and regions from CGSpace for Sisay to look through:
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
    +COPY 202
    +dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
    +COPY 33
    +
    +
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index fca467499..f10a2b06a 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2019-02/
-2019-02-20T17:36:22-08:00
+2019-02-20T18:20:09-08:00
@@ -209,7 +209,7 @@
 https://alanorth.github.io/cgspace-notes/
-2019-02-20T17:36:22-08:00
+2019-02-20T18:20:09-08:00
 0
@@ -220,7 +220,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
-2019-02-20T17:36:22-08:00
+2019-02-20T18:20:09-08:00
 0
@@ -232,13 +232,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
-2019-02-20T17:36:22-08:00
+2019-02-20T18:20:09-08:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
-2019-02-20T17:36:22-08:00
+2019-02-20T18:20:09-08:00
 0