Add notes for 2022-08-20

Alan Orth 2022-08-20 22:37:35 -07:00
parent daf4a646ed
commit 8e6c83a5e1
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
29 changed files with 147 additions and 34 deletions

View File

@ -156,4 +156,57 @@ $ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Dow
- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
- I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those
## 2022-08-20
- Peter sent me back the results of the duplicate checking on the Gender presentations
- There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine
- I had a new idea about matching AGROVOC subjects and countries in OpenRefine
- I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:
```console
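# Load the country names, one per line, lowercased for case-insensitive matching
with open(r"/tmp/cgspace-countries.txt", 'r') as f:
    countries = [name.rstrip().lower() for name in f]

# Keep each space-separated word of the text that is a known country
return "||".join([x for x in value.split(' ') if x.lower() in countries])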
```
- But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:
```console
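import re

# Load the AGROVOC terms, one per line, lowercased for case-insensitive matching
with open(r"/tmp/agrovoc-subjects.txt", 'r') as f:
    terms = [name.rstrip().lower() for name in f]

# Keep each term that appears in the text as a whole word or phrase
return "||".join([term for term in terms if re.search(r"\b" + re.escape(term) + r"\b", value.lower())])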
```
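- A minimal standalone sketch of the same reverse lookup in plain Python, for trying the matching outside OpenRefine; `match_terms` and the sample title and terms are hypothetical, and it uses `re.search` with `re.escape` so that terms containing regex metacharacters are matched literally:

```python
import re


def match_terms(text, terms):
    """Return the terms that occur in text as whole words or phrases."""
    lowered = text.lower()
    return [
        term
        for term in terms
        # \b on both sides keeps "change" from matching inside "changes"
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


# Hypothetical sample data; in OpenRefine the text would come from `value`
title = "Gender and climate change adaptation in South Africa"
terms = ["climate change", "south africa", "maize", "gender"]
print("||".join(match_terms(title, terms)))
```

- The result follows the order of the term list, not the order in which the terms appear in the text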
- Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all `dcterms.subject` values and am looking them up against AGROVOC's API right now:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
COPY 21685
$ csvcut -c 1 /tmp/2022-08-20-agrovoc.csv | sed 1d > /tmp/all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
$ csvgrep -c 'number of matches' -m 0 -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c 1 | sed 1d > /tmp/agrovoc-subjects.txt
$ wc -l /tmp/agrovoc-subjects.txt
11834 /tmp/agrovoc-subjects.txt
```
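- The `csvgrep`/`csvcut`/`sed` filtering step can also be sketched in plain Python with the standard `csv` module; this assumes the results file has the term in its first column and a `number of matches` column (as in the command above), the function name and sample rows are made up for illustration, and it compares the count as an integer:

```python
import csv
import io


def matched_subjects(csv_file, count_column="number of matches"):
    """Return first-column values of rows whose match count is above zero."""
    reader = csv.DictReader(csv_file)
    first_column = reader.fieldnames[0]
    return [row[first_column] for row in reader if int(row[count_column]) > 0]


# Hypothetical sample standing in for the agrovoc-lookup.py results file
sample = io.StringIO(
    "subject,number of matches\n"
    "maize,1\n"
    "climate change,2\n"
    "not a real term,0\n"
)
print("\n".join(matched_subjects(sample)))
```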
- Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,834 AGROVOC terms
- Then I joined that column with Peter's `dcterms.subject` column and then deduplicated it with this Jython:
```console
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
```
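- In plain Python 3.7 or newer (not the Jython 2.7 that OpenRefine embeds, as far as I know, where dict ordering is not guaranteed), the same order-preserving deduplication can be sketched with `dict.fromkeys`; `dedupe` is a hypothetical helper name:

```python
def dedupe(value, sep="||"):
    """Drop repeated entries from a separator-joined string, keeping first occurrences."""
    # dict keys are unique and keep insertion order in Python 3.7+
    return sep.join(dict.fromkeys(value.split(sep)))


print(dedupe("gender||maize||gender||climate change||maize"))
```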
- This is way better, but you end up getting a bunch of countries, regions, and short words like "gates" matching in AGROVOC that are inappropriate (we typically don't tag these in AGROVOC) or incorrect (gates, as in windows or doors, not the funding agency)
- I did a text facet in OpenRefine and removed a bunch of these by eye
- Then I finished adding the `dcterms.relation` and CRP metadata flagged by Peter on the Gender presentations
- I'm waiting for him to send me the PDFs and then I will upload them to DSpace Test
<!-- vim: set sw=2 ts=2: -->

View File

@ -14,7 +14,7 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-08/" />
<meta property="article:published_time" content="2022-08-01T10:22:36+03:00" />
<meta property="article:modified_time" content="2022-08-18T22:43:37-07:00" />
<meta property="article:modified_time" content="2022-08-19T21:55:36-07:00" />
@ -34,9 +34,9 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
"@type": "BlogPosting",
"headline": "August, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-08/",
"wordCount": "1446",
"wordCount": "1862",
"datePublished": "2022-08-01T10:22:36+03:00",
"dateModified": "2022-08-18T22:43:37-07:00",
"dateModified": "2022-08-19T21:55:36-07:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -294,6 +294,66 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
<li>I created a SAF bundle and imported the 749 MELIAs to DSpace Test</li>
<li>I found thirteen items on CGSpace with dates in format &ldquo;DD/MM/YYYY&rdquo; so I fixed those</li>
</ul>
<h2 id="2022-08-20">2022-08-20</h2>
<ul>
<li>Peter sent me back the results of the duplicate checking on the Gender presentations
<ul>
<li>There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine</li>
</ul>
</li>
<li>I had a new idea about matching AGROVOC subjects and countries in OpenRefine
<ul>
<li>I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>with open(r&#34;/tmp/cgspace-countries.txt&#34;,&#39;r&#39;) as f :
</span></span><span style="display:flex;"><span> countries = [name.rstrip().lower() for name in f]
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join([x for x in value.split(&#39; &#39;) if x.lower() in countries])
</span></span></code></pre></div><ul>
<li>But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>import re
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>with open(r&#34;/tmp/agrovoc-subjects.txt&#34;,&#39;r&#39;) as f :
</span></span><span style="display:flex;"><span> terms = [name.rstrip().lower() for name in f]
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join([term for term in terms if re.search(r&#34;\b&#34; + re.escape(term) + r&#34;\b&#34;, value.lower())])
</span></span></code></pre></div><ul>
<li>Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all <code>dcterms.subject</code> values and am looking them up against AGROVOC&rsquo;s API right now:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 21685
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2022-08-20-agrovoc.csv | sed 1d &gt; /tmp/all-subjects.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#ae81ff">0</span> -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/agrovoc-subjects.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/agrovoc-subjects.txt
</span></span><span style="display:flex;"><span>11834 /tmp/agrovoc-subjects.txt
</span></span></code></pre></div><ul>
<li>Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,834 AGROVOC terms
<ul>
<li>Then I joined that column with Peter&rsquo;s <code>dcterms.subject</code> column and then deduplicated it with this Jython:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>res = []
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>[res.append(x) for x in value.split(&#34;||&#34;) if x not in res]
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>return &#34;||&#34;.join(res)
</span></span></code></pre></div><ul>
<li>This is way better, but you end up getting a bunch of countries, regions, and short words like &ldquo;gates&rdquo; matching in AGROVOC that are inappropriate (we typically don&rsquo;t tag these in AGROVOC) or incorrect (gates, as in windows or doors, not the funding agency)
<ul>
<li>I did a text facet in OpenRefine and removed a bunch of these by eye</li>
</ul>
</li>
<li>Then I finished adding the <code>dcterms.relation</code> and CRP metadata flagged by Peter on the Gender presentations
<ul>
<li>I&rsquo;m waiting for him to send me the PDFs and then I will upload them to DSpace Test</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-08-18T22:43:37-07:00" />
<meta property="og:updated_time" content="2022-08-19T21:55:36-07:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-08-18T22:43:37-07:00" />
<meta property="og:updated_time" content="2022-08-19T21:55:36-07:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-08-18T22:43:37-07:00" />
<meta property="og:updated_time" content="2022-08-19T21:55:36-07:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-08-18T22:43:37-07:00" />
<meta property="og:updated_time" content="2022-08-19T21:55:36-07:00" />

View File

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/2022-08/</loc>
<lastmod>2022-08-18T22:43:37-07:00</lastmod>
<lastmod>2022-08-19T21:55:36-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-08-18T22:43:37-07:00</lastmod>
<lastmod>2022-08-19T21:55:36-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-08-18T22:43:37-07:00</lastmod>
<lastmod>2022-08-19T21:55:36-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-08-18T22:43:37-07:00</lastmod>
<lastmod>2022-08-19T21:55:36-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-08-18T22:43:37-07:00</lastmod>
<lastmod>2022-08-19T21:55:36-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-07/</loc>
<lastmod>2022-07-31T15:49:35+03:00</lastmod>