Add notes for 2022-08-20

2025-01-27 05:49:12 +01:00 · 2022-08-20 22:37:35 -07:00
parent daf4a646ed
commit 8e6c83a5e1
29 changed files with 147 additions and 34 deletions
--- a/docs/2022-08/index.html
+++ b/docs/2022-08/index.html
@@ -14,7 +14,7 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-08/" />
 <meta property="article:published_time" content="2022-08-01T10:22:36+03:00" />
-<meta property="article:modified_time" content="2022-08-18T22:43:37-07:00" />
+<meta property="article:modified_time" content="2022-08-19T21:55:36-07:00" />



@@ -34,9 +34,9 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
  "@type": "BlogPosting",
  "headline": "August, 2022",
  "url": "https://alanorth.github.io/cgspace-notes/2022-08/",
-  "wordCount": "1446",
+  "wordCount": "1862",
  "datePublished": "2022-08-01T10:22:36+03:00",
-  "dateModified": "2022-08-18T22:43:37-07:00",
+  "dateModified": "2022-08-19T21:55:36-07:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -294,6 +294,66 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
 <li>I created a SAF bundle and imported the 749 MELIAs to DSpace Test</li>
 <li>I found thirteen items on CGSpace with dates in format &ldquo;DD/MM/YYYY&rdquo; so I fixed those</li>
 </ul>
+<h2 id="2022-08-20">2022-08-20</h2>
+<ul>
+<li>Peter sent me back the results of the duplicate checking on the Gender presentations
+<ul>
+<li>There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine</li>
+</ul>
+</li>
+<li>I had a new idea about matching AGROVOC subjects and countries in OpenRefine
+<ul>
+<li>I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:</li>
+</ul>
+</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>with open(r&#34;/tmp/cgspace-countries.txt&#34;, &#39;r&#39;) as f:
+</span></span><span style="display:flex;"><span>    countries = [name.rstrip().lower() for name in f]
+</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join([x for x in value.split(&#39; &#39;) if x.lower() in countries])
+</span></span></code></pre></div><ul>
+<li>But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>import re
+</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>with open(r&#34;/tmp/agrovoc-subjects.txt&#34;, &#39;r&#39;) as f:
+</span></span><span style="display:flex;"><span>    terms = [name.rstrip().lower() for name in f]
+</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join([term for term in terms if re.search(r&#34;\b&#34; + re.escape(term) + r&#34;\b&#34;, value.lower())])
+</span></span></code></pre></div><ul>
+<li>Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all <code>dcterms.subject</code> values and am looking them up against AGROVOC&rsquo;s API right now:</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
+</span></span><span style="display:flex;"><span>COPY 21685
+</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2022-08-20-agrovoc.csv | sed 1d &gt; /tmp/all-subjects.txt
+</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
+</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#ae81ff">0</span> -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/agrovoc-subjects.txt
+</span></span><span style="display:flex;"><span>$ wc -l /tmp/agrovoc-subjects.txt
+</span></span><span style="display:flex;"><span>11834 /tmp/agrovoc-subjects.txt
+</span></span></code></pre></div><ul>
+<li>Then I created a new column joining the title and abstract, and ran the regex-matching Jython expression above against this new file of 11,834 AGROVOC terms
+<ul>
+<li>I joined that column with Peter&rsquo;s <code>dcterms.subject</code> column and deduplicated the result with this Jython:</li>
+</ul>
+</li>
+</ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>res = []
+</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>[res.append(x) for x in value.split(&#34;||&#34;) if x not in res]
+</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
+</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join(res)
+</span></span></code></pre></div><ul>
+<li>This is way better, but a bunch of countries, regions, and short words like &ldquo;gates&rdquo; end up matching in AGROVOC that are either inappropriate (we typically don&rsquo;t tag countries and regions as AGROVOC subjects) or incorrect (&ldquo;gates&rdquo; as in windows or doors, not the funding agency)
+<ul>
+<li>I did a text facet in OpenRefine and removed a bunch of these by eye</li>
+</ul>
+</li>
+<li>Then I finished adding the <code>dcterms.relation</code> and CRP metadata flagged by Peter on the Gender presentations
+<ul>
+<li>I&rsquo;m waiting for him to send me the PDFs and then I will upload them to DSpace Test</li>
+</ul>
+</li>
+</ul>
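+<p>The term matching and deduplication steps above can be sketched outside OpenRefine as plain Python. This is a minimal sketch, not the exact Jython that was run: the function names <code>match_terms</code> and <code>dedupe</code> and the sample data are hypothetical, and escaping each term with <code>re.escape</code> is an assumption about how terms containing punctuation should be handled.</p>

```python
import re


def match_terms(text, terms):
    """Return the terms that occur in text as whole words or phrases."""
    haystack = text.lower()
    return [
        term
        for term in terms
        # re.escape keeps punctuation in a term from being read as regex syntax
        if re.search(r"\b" + re.escape(term) + r"\b", haystack)
    ]


def dedupe(values):
    """Drop repeated values while keeping first-seen order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


# Hypothetical sample data, not taken from CGSpace
terms = ["gender", "south africa", "maize"]
text = "Gender and maize farming in South Africa"

# Terms found in the combined title/abstract text
matched = match_terms(text, terms)

# Merge with an existing "||"-separated subject value and deduplicate
existing = "maize||livestock".split("||")
combined = "||".join(dedupe(existing + matched))
print(combined)
```

+<p>Searching for each term in the text, rather than each word of the text in the term list, is what lets multi-word entries like &ldquo;south africa&rdquo; match.</p>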
 <!-- raw HTML omitted -->