Add notes for 2022-08-24

Alan Orth 2022-08-24 21:24:07 -07:00
parent 64d5b998f9
commit 5084b5ca5e
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
29 changed files with 131 additions and 34 deletions

@ -248,4 +248,53 @@ $ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --
- I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)
## 2022-08-24
- Start working on the MARLO OICRs
  - First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:
```console
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
```
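- A rough sketch of the same select and left join in plain Python, in case `xsv` is not available. The `left_join` helper and the sample rows are made up for illustration, and it is not a drop-in replacement for `xsv join`: it keeps only the first matching row per ID and does not repeat the key column:

```python
import csv
import io

# Join column used in the xsv commands above
KEY = "cg.number (series/report No.)"


def left_join(left_rows, right_rows, key):
    """Keep every left row and add the columns of the first right row with the same key."""
    right_rows = list(right_rows)
    right_cols = list(right_rows[0].keys()) if right_rows else []
    lookup = {}
    for row in right_rows:
        lookup.setdefault(row[key], row)  # first match wins in this sketch
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        out = dict(row)
        for col in right_cols:
            if col != key:
                out[col] = match.get(col, "")  # empty string when there is no match
        joined.append(out)
    return joined


# Made-up sample data standing in for the two CSV files
metadata = list(csv.DictReader(io.StringIO(
    "cg.number (series/report No.),dc.title\n"
    "OICR-1,First title\n"
    "OICR-2,Second title\n"
)))
files = list(csv.DictReader(io.StringIO(
    "cg.number (series/report No.),File\n"
    "OICR-1,oicr-1.pdf\n"
)))

joined = left_join(metadata, files, KEY)
for row in joined:
    print(row)
```

- Rows without a match keep an empty `File` value, which is the point of using a left join here: no metadata row is lost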
- After that I imported it into OpenRefine for data cleaning
  - To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
  - First, create a new column with this GREL:
```console
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
```
- Then use this Jython:
```python
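import re
# Load the AGROVOC terms, one per line, lower-cased, skipping blank lines
with open(r"/tmp/agrovoc-subjects.txt", "r") as f:
    terms = [name.strip().lower() for name in f if name.strip()]
# Whole-word search; re.escape() protects terms containing regex metacharacters
text = value.lower()
return "||".join([term for term in terms if re.search(r"\b" + re.escape(term) + r"\b", text)])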
```
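- For testing the matching outside OpenRefine, a minimal standalone sketch. The function names `compile_terms` and `match_terms` and the sample terms are hypothetical; the idea is only to compile each whole-word pattern once instead of rebuilding it for every row:

```python
import re


def compile_terms(terms):
    """Lower-case the terms and compile one whole-word pattern per term."""
    compiled = []
    for term in terms:
        term = term.strip().lower()
        if term:  # skip blank lines
            compiled.append((term, re.compile(r"\b" + re.escape(term) + r"\b")))
    return compiled


def match_terms(text, compiled, separator="||"):
    """Return the terms found in text, joined with the multi-value separator."""
    text = text.lower()
    return separator.join(term for term, pattern in compiled if pattern.search(text))


# Made-up terms and text; the real list would come from /tmp/agrovoc-subjects.txt
compiled = compile_terms(["maize", "food security", "rice"])
result = match_terms("Maize yields and food security\nin East Africa", compiled)
print(result)
```

- Using `search()` rather than `match()` with a leading `.*` also means that text after a line break in the abstract is still checked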
- After that I de-duplicated the terms using this Jython:
```python
res = []
for x in value.split("||"):
    if x not in res:
        res.append(x)
return "||".join(res)
```
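- A standalone sketch of the same order-preserving de-duplication that can be run outside OpenRefine. The `dedupe_terms` name and its `drop` parameter are assumptions for illustration; `drop` is one possible way to discard known noise terms in the same pass:

```python
def dedupe_terms(value, separator="||", drop=()):
    """Remove repeated and unwanted terms while keeping first-seen order."""
    seen = set(drop)  # pre-seeding with unwanted terms filters them out too
    kept = []
    for term in value.split(separator):
        if term and term not in seen:
            seen.add(term)
            kept.append(term)
    return separator.join(kept)


# Made-up multi-value string with repeats and one noise term
deduped = dedupe_terms("maize||rice||maize||al||rice", drop={"al"})
print(deduped)
```

- Tracking seen terms in a set keeps each lookup cheap, which matters more as the number of matched terms per record grows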
- Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates" and "al" and "s"
  - Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
```
- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that
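- To illustrate the general idea of a duplicate search, here is a minimal in-memory sketch that compares titles with `difflib` from the standard library. This is not how `check-duplicates.py` works (its flags suggest it queries the DSpace database); the function name, the threshold, and the sample titles are all assumptions:

```python
import difflib


def find_similar_titles(new_titles, existing_titles, threshold=0.9):
    """Return (new, existing, ratio) tuples for title pairs at or above the threshold."""
    matches = []
    for new in new_titles:
        for old in existing_titles:
            ratio = difflib.SequenceMatcher(None, new.lower(), old.lower()).ratio()
            if ratio >= threshold:
                matches.append((new, old, round(ratio, 2)))
    return matches


# Made-up titles standing in for the exported CSV and the repository contents
matches = find_similar_titles(
    ["Improved maize varieties adopted in Kenya"],
    ["Improved maize varieties adopted in Kenya.", "Rice value chains in Vietnam"],
)
for new, old, ratio in matches:
    print(new, "~", old, ratio)
```

- Comparing every pair is quadratic, so this only suits small batches; a database-side similarity search scales much better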
<!-- vim: set sw=2 ts=2: -->

@ -14,7 +14,7 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-08/" />
<meta property="article:published_time" content="2022-08-01T10:22:36+03:00" />
<meta property="article:modified_time" content="2022-08-20T22:37:35-07:00" />
<meta property="article:modified_time" content="2022-08-23T12:14:14-07:00" />
@ -34,9 +34,9 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
"@type": "BlogPosting",
"headline": "August, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-08/",
"wordCount": "2068",
"wordCount": "2309",
"datePublished": "2022-08-01T10:22:36+03:00",
"dateModified": "2022-08-20T22:37:35-07:00",
"dateModified": "2022-08-23T12:14:14-07:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -395,6 +395,54 @@ Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago
</span></span></code></pre></div><ul>
<li>I created a <a href="https://github.com/ilri/OpenRXV/issues/133">GitHub issue for OpenRXV compatibility issues with DSpace 7</a></li>
</ul>
<h2 id="2022-08-24">2022-08-24</h2>
<ul>
<li>Start working on the MARLO OICRs
<ul>
<li>First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ xsv <span style="color:#66d9ef">select</span> <span style="color:#e6db74">&#39;cg.number (series/report No.),File&#39;</span> OICRS<span style="color:#ae81ff">\ </span>Metadata<span style="color:#ae81ff">\ </span>v2.csv &gt; /tmp/OICR-files.csv
</span></span><span style="display:flex;"><span>$ xsv join --left <span style="color:#e6db74">&#39;cg.number (series/report No.)&#39;</span> OICRS<span style="color:#ae81ff">\ </span>metadata<span style="color:#ae81ff">\ </span>utf8<span style="color:#ae81ff">\ </span>20220816_JM.csv <span style="color:#e6db74">&#39;cg.number (series/report No.)&#39;</span> /tmp/OICR-files.csv &gt; OICRs-UTF-8-with-files.csv
</span></span></code></pre></div><ul>
<li>After that I imported it into OpenRefine for data cleaning
<ul>
<li>To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it</li>
<li>First, create a new column with this GREL:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>cells[&#34;dc.title&#34;].value + &#34; &#34; + cells[&#34;dcterms.abstract&#34;].value
</span></span></code></pre></div><ul>
<li>Then use this Jython:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;/tmp/agrovoc-subjects.txt&#34;</span>,<span style="color:#e6db74">&#39;r&#39;</span>) <span style="color:#66d9ef">as</span> f :
</span></span><span style="display:flex;"><span>    terms <span style="color:#f92672">=</span> [name<span style="color:#f92672">.</span>rstrip()<span style="color:#f92672">.</span>lower() <span style="color:#66d9ef">for</span> name <span style="color:#f92672">in</span> f]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;||&#34;</span><span style="color:#f92672">.</span>join([term <span style="color:#66d9ef">for</span> term <span style="color:#f92672">in</span> terms <span style="color:#66d9ef">if</span> re<span style="color:#f92672">.</span>match(<span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;.*\b&#34;</span> <span style="color:#f92672">+</span> term <span style="color:#f92672">+</span> <span style="color:#e6db74">r</span><span style="color:#e6db74">&#34;\b.*&#34;</span>, value<span style="color:#f92672">.</span>lower())])
</span></span></code></pre></div><ul>
<li>After that I de-duplicated the terms using this Jython:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>res <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>[res<span style="color:#f92672">.</span>append(x) <span style="color:#66d9ef">for</span> x <span style="color:#f92672">in</span> value<span style="color:#f92672">.</span>split(<span style="color:#e6db74">&#34;||&#34;</span>) <span style="color:#66d9ef">if</span> x <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> res]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;||&#34;</span><span style="color:#f92672">.</span>join(res)
</span></span></code></pre></div><ul>
<li>Then I split the multi-values on &ldquo;||&rdquo; and used a text facet to remove some countries and other nonsense terms that matched, like &ldquo;gates&rdquo; and &ldquo;al&rdquo; and &ldquo;s&rdquo;
<ul>
<li>Then I did the same for countries</li>
</ul>
</li>
<li>Then I exported the CSV and started searching for duplicates so that I can add them as relations:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p <span style="color:#e6db74">&#39;omg&#39;</span> -o /tmp/oicrs-matches.csv
</span></span></code></pre></div><ul>
<li>Oh wow, I actually found one OICR already uploaded to CGSpace&hellip; I have to ask Jose about that</li>
</ul>
<!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-08-20T22:37:35-07:00" />
<meta property="og:updated_time" content="2022-08-23T12:14:14-07:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-08-20T22:37:35-07:00" />
<meta property="og:updated_time" content="2022-08-23T12:14:14-07:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-08-20T22:37:35-07:00" />
<meta property="og:updated_time" content="2022-08-23T12:14:14-07:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-08-20T22:37:35-07:00" />
<meta property="og:updated_time" content="2022-08-23T12:14:14-07:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/2022-08/</loc>
<lastmod>2022-08-20T22:37:35-07:00</lastmod>
<lastmod>2022-08-23T12:14:14-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-08-20T22:37:35-07:00</lastmod>
<lastmod>2022-08-23T12:14:14-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-08-20T22:37:35-07:00</lastmod>
<lastmod>2022-08-23T12:14:14-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-08-20T22:37:35-07:00</lastmod>
<lastmod>2022-08-23T12:14:14-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-08-20T22:37:35-07:00</lastmod>
<lastmod>2022-08-23T12:14:14-07:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-07/</loc>
<lastmod>2022-07-31T15:49:35+03:00</lastmod>