Add notes for 2019-02-21

2025-01-27 05:49:12 +01:00 · 2019-02-21 10:08:18 -08:00
parent 57e52e01ef
commit bcdf2a1e26
3 changed files with 80 additions and 8 deletions
--- a/content/posts/2019-02.md
+++ b/content/posts/2019-02.md
@@ -1002,4 +1002,38 @@ $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X P
 - See this [issue on the VIVO tracker](https://jira.duraspace.org/browse/VIVO-1655) for more information about this endpoint
 - The old-school AGROVOC SOAP WSDL works with the [Zeep Python library](https://python-zeep.readthedocs.io/en/master/), but in my tests the results are way too broad despite trying to use an "exact match" search

+## 2019-02-21
+
+- I wrote a script [agrovoc-lookup.py](https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py) to resolve subject terms against the public AGROVOC REST API
+- It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
+- I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:
+
+```
+$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
+$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
+$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
+```
+
+- Then I generated a list of all the unique matched terms:
+
+```
+$ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
+```
+
+- And then a list of all the unique *unmatched* terms using some utility I've never heard of before called `comm`:
+
+```
+$ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
+$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
+```
+
+- Generate a list of countries and regions from CGSpace for Sisay to look through:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
+COPY 202
+dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
+COPY 33
+```
+
 <!-- vim: set sw=2 ts=2: -->
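
Reviewer note on the `sort | uniq` and `comm -13` steps in the notes above: a minimal Python sketch of what those two commands compute, for readers who have not used `comm`. The function names and the sample terms are hypothetical and are not taken from the commit. Python sorts strings by code point, which matches `LC_ALL=C sort` and may differ from the collation of the locale the shell commands ran in.

```python
def unique_sorted(*term_lists):
    """Merge several term lists, drop duplicates, and sort (like `cat ... | sort | uniq`)."""
    return sorted({term for terms in term_lists for term in terms})


def only_in_second(first_sorted, second_sorted):
    """Return lines that appear only in the second sorted list (like `comm -13 first second`)."""
    result = []
    i = 0
    for term in second_sorted:
        # Skip entries of the first list that sort before the current term
        while i < len(first_sorted) and first_sorted[i] < term:
            i += 1
        if i < len(first_sorted) and first_sorted[i] == term:
            i += 1  # consume the match; comm pairs equal lines one-to-one
            continue
        result.append(term)
    return result


# Hypothetical sample data standing in for the per-language matched files
matched_en = ["MAIZE", "SOIL"]
matched_es = ["SUELO", "MAIZE"]
all_subjects = ["MAIZE", "SOIL", "SUELO", "LIVESTOCK FEED", "GENDR"]

matched = unique_sorted(matched_en, matched_es)
unmatched = only_in_second(matched, sorted(all_subjects))
print(unmatched)  # ['GENDR', 'LIVESTOCK FEED']
```

Both inputs must be sorted the same way, which is why the notes sort `/tmp/top-1500-subjects.txt` before passing it to `comm`.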
--- a/docs/2019-02/index.html
+++ b/docs/2019-02/index.html
@@ -42,7 +42,7 @@ sys     0m1.979s
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" />
 <meta property="article:published_time" content="2019-02-01T21:37:30&#43;02:00"/>
-<meta property="article:modified_time" content="2019-02-20T17:36:22-08:00"/>
+<meta property="article:modified_time" content="2019-02-20T18:20:09-08:00"/>

 <meta name="twitter:card" content="summary"/>
 <meta name="twitter:title" content="February, 2019"/>
@@ -89,9 +89,9 @@ sys     0m1.979s
  "@type": "BlogPosting",
  "headline": "February, 2019",
  "url": "https://alanorth.github.io/cgspace-notes/2019-02/",
-  "wordCount": "5730",
+  "wordCount": "5947",
  "datePublished": "2019-02-01T21:37:30&#43;02:00",
-  "dateModified": "2019-02-20T17:36:22-08:00",
+  "dateModified": "2019-02-20T18:20:09-08:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -1294,6 +1294,44 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
 <li>The old-school AGROVOC SOAP WSDL works with the <a href="https://python-zeep.readthedocs.io/en/master/">Zeep Python library</a>, but in my tests the results are way too broad despite trying to use an &ldquo;exact match&rdquo; search</li>
 </ul>

+<h2 id="2019-02-21">2019-02-21</h2>
+
+<ul>
+<li>I wrote a script <a href="https://github.com/ilri/DSpace/blob/5_x-prod/agrovoc-lookup.py">agrovoc-lookup.py</a> to resolve subject terms against the public AGROVOC REST API</li>
+<li>It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to</li>
+<li>I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files:</li>
+</ul>
+
+<pre><code>$ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
+$ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
+$ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
+</code></pre>
+
+<ul>
+<li>Then I generated a list of all the unique matched terms:</li>
+</ul>
+
+<pre><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
+</code></pre>
+
+<ul>
+<li>And then a list of all the unique <em>unmatched</em> terms using some utility I&rsquo;ve never heard of before called <code>comm</code>:</li>
+</ul>
+
+<pre><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
+$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
+</code></pre>
+
+<ul>
+<li>Generate a list of countries and regions from CGSpace for Sisay to look through:</li>
+</ul>
+
+<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
+COPY 202
+dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
+COPY 33
+</code></pre>
+
 <!-- vim: set sw=2 ts=2: -->

  
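
Reviewer note on the two `\COPY` queries in this commit: a small sketch of the same per-value count, run against an in-memory SQLite database instead of PostgreSQL. The three-column table is a simplified assumption (the real DSpace `metadatavalue` table has more columns), the sample rows are invented, and `count_values` is a hypothetical helper. The field IDs 228 and 227 are the ones used in the notes. Because the query groups by `text_value`, each value already appears once in the result, so the sketch leaves out `DISTINCT`.

```python
import sqlite3


def count_values(rows, field_id):
    """Return (text_value, total) pairs for one metadata field, most frequent first."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the DSpace metadatavalue table (assumption)
    conn.execute(
        "CREATE TABLE metadatavalue ("
        "metadata_field_id INTEGER, resource_type_id INTEGER, text_value TEXT)"
    )
    conn.executemany("INSERT INTO metadatavalue VALUES (?, ?, ?)", rows)
    query = """
        SELECT text_value, count(*) AS total
        FROM metadatavalue
        WHERE metadata_field_id = ? AND resource_type_id = 2
        GROUP BY text_value
        ORDER BY total DESC, text_value
    """
    result = conn.execute(query, (field_id,)).fetchall()
    conn.close()
    return result


# Invented sample rows: (metadata_field_id, resource_type_id, text_value)
sample_rows = [
    (228, 2, "KENYA"),
    (228, 2, "KENYA"),
    (228, 2, "ETHIOPIA"),
    (227, 2, "EAST AFRICA"),
    (228, 3, "KENYA"),  # different resource type, so it is not counted
]

print(count_values(sample_rows, 228))  # [('KENYA', 2), ('ETHIOPIA', 1)]
```

The secondary sort on `text_value` is an addition in this sketch to make the order of ties deterministic; the queries in the notes order by count only.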
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/2019-02/</loc>
-    <lastmod>2019-02-20T17:36:22-08:00</lastmod>
+    <lastmod>2019-02-20T18:20:09-08:00</lastmod>
  </url>
  
  <url>
@@ -209,7 +209,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/</loc>
-    <lastmod>2019-02-20T17:36:22-08:00</lastmod>
+    <lastmod>2019-02-20T18:20:09-08:00</lastmod>
    <priority>0</priority>
  </url>
  
@@ -220,7 +220,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-    <lastmod>2019-02-20T17:36:22-08:00</lastmod>
+    <lastmod>2019-02-20T18:20:09-08:00</lastmod>
    <priority>0</priority>
  </url>
  
@@ -232,13 +232,13 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
-    <lastmod>2019-02-20T17:36:22-08:00</lastmod>
+    <lastmod>2019-02-20T18:20:09-08:00</lastmod>
    <priority>0</priority>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-    <lastmod>2019-02-20T17:36:22-08:00</lastmod>
+    <lastmod>2019-02-20T18:20:09-08:00</lastmod>
    <priority>0</priority>
  </url>