Add notes for 2022-06-29

Alan Orth 2022-06-30 09:41:54 +03:00
parent 368d60df56
commit 53f60284e2
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
28 changed files with 144 additions and 34 deletions

@ -200,4 +200,59 @@ $ xsv join --full alpha2 /tmp/clarisa-un-cgspace-xsv-full.csv alpha2 /tmp/mel-co
- Start a harvest on AReS
## 2022-06-28
- Start working on the CGSpace subject export for FAO
- First I exported a list of all metadata in our `dcterms.subject` and other center-specific subject fields with their counts:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY "subject" ORDER BY count DESC) to /tmp/2022-06-28-cgspace-subjects.csv WITH CSV HEADER;
COPY 27010
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2022-06-28-cgspace-subjects.csv | sed '1d' > /tmp/2022-06-28-cgspace-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-06-28-cgspace-subjects-results.csv
```
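For reference, the `csvcut | sed` step above could also be done with Python's standard `csv` module. This is only a minimal sketch: the `extract_subjects` helper and the in-memory sample rows are made up for illustration, and only the `subject` and `count` column names come from the export above.

```python
import csv
import io


def extract_subjects(csv_file):
    """Yield the "subject" column of a CSV that has a header row."""
    # DictReader consumes the header itself, so no separate "sed 1d" step is needed
    for row in csv.DictReader(csv_file):
        yield row["subject"]


# Hypothetical rows standing in for /tmp/2022-06-28-cgspace-subjects.csv
sample = io.StringIO('subject,count\nabalones,3\n"food security, rural",2\n')
subjects = list(extract_subjects(sample))
print("\n".join(subjects))
```

Unlike `csvcut`, this writes the raw values, so a subject containing a comma is not re-quoted in the output.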
- I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!
- I think I will have to write some custom script to use the AGROVOC RDF file
  - Using rdflib to open the 1.2GB `agrovoc_lod.rdf` file takes several minutes and doesn't seem very efficient
- I tried using [lightrdf](https://github.com/ozekik/lightrdf) and it's much quicker, but the documentation is limited and I'm not sure how to search yet
  - I had to try in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:
```console
import lightrdf

parser = lightrdf.Parser()
# Stream every triple in the file (prints millions of lines)
for triple in parser.parse("./agrovoc_lod.rdf", base_iri=None):
    print(triple)

agrovoc = lightrdf.RDFDocument("agrovoc_lod.rdf")
# All triples for the subject http://aims.fao.org/aos/agrovoc/c_5
for triple in agrovoc.search_triples("http://aims.fao.org/aos/agrovoc/c_5", None, None):
    print(triple)
# Sample of the output:
# ('http://aims.fao.org/aos/agrovoc/c_5', 'http://www.w3.org/2004/02/skos/core#altLabel', '"Abalone"@de')
# ('http://aims.fao.org/aos/agrovoc/c_5', 'http://www.w3.org/2004/02/skos/core#prefLabel', '"abalones"@en')

# All triples whose object is the English label "abalones"
for triple in agrovoc.search_triples(None, None, '"abalones"@en'):
    print(triple)
```
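A rough sketch of how triples like the ones above might be turned into a lookup set, so subjects can be checked offline instead of over HTTP. The `english_labels` helper is hypothetical, the case-insensitive comparison is my own assumption, and the sample triples are just the two printed above; in practice the triples would presumably come from `parser.parse("./agrovoc_lod.rdf", base_iri=None)`.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
LABEL_PREDICATES = {SKOS + "prefLabel", SKOS + "altLabel"}
EN_SUFFIX = '"@en'


def english_labels(triples):
    """Collect lowercased English prefLabel/altLabel values from (s, p, o) triples."""
    labels = set()
    for _subject, predicate, obj in triples:
        if predicate in LABEL_PREDICATES and obj.startswith('"') and obj.endswith(EN_SUFFIX):
            # Strip the leading quote and the trailing "@en language tag
            labels.add(obj[1:-len(EN_SUFFIX)].lower())
    return labels


# The two triples shown in the lightrdf output above
sample_triples = [
    ("http://aims.fao.org/aos/agrovoc/c_5", SKOS + "altLabel", '"Abalone"@de'),
    ("http://aims.fao.org/aos/agrovoc/c_5", SKOS + "prefLabel", '"abalones"@en'),
]
labels = english_labels(sample_triples)

# Hypothetical subjects to check against the label set
for subject in ["ABALONES", "not a real term"]:
    print(subject, subject.lower() in labels)
```

Building the set means one pass over the file, after which each of the 27,000 subjects is a constant-time membership test.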
- I ran `agrovoc-lookup.py` from a Linode server and it completed without issues... hmmm
## 2022-06-29
- Continue working on the list of non-AGROVOC subjects to report to FAO
  - I got a one-liner to get the list of non-AGROVOC subjects and join them with their counts:
```console
$ csvgrep -c 'number of matches' -m 0 /tmp/2022-06-28-cgspace-subjects-results.csv \
| csvcut -c subject \
| csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
> /tmp/2022-06-28-cgspace-non-agrovoc.csv
```
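The same filter and join could be sketched in Python with the standard `csv` module. The `non_agrovoc_with_counts` helper and the sample rows are hypothetical; the column names `subject`, `count` and `number of matches` are taken from the commands above. One difference, as far as I know: `csvgrep -m` does a substring match, while this sketch compares the match count to `0` exactly.

```python
import csv
import io


def non_agrovoc_with_counts(results_file, counts_file):
    """Return (subject, count) pairs for subjects with zero AGROVOC matches."""
    # Subjects the lookup script reported as having no matches
    unmatched = {
        row["subject"]
        for row in csv.DictReader(results_file)
        if row["number of matches"].strip() == "0"
    }
    # Keep the counts file order, which is already sorted by count
    return [
        (row["subject"], row["count"])
        for row in csv.DictReader(counts_file)
        if row["subject"] in unmatched
    ]


# Hypothetical rows standing in for the two CSV files in /tmp
results = io.StringIO("subject,number of matches\nabalones,1\nmade-up term,0\nten hits,10\n")
counts = io.StringIO("subject,count\nabalones,12\nten hits,5\nmade-up term,3\n")
rows = non_agrovoc_with_counts(results, counts)
print(rows)
```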
<!-- vim: set sw=2 ts=2: -->

@ -26,7 +26,7 @@ There seem to be many more of these:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-06/" />
<meta property="article:published_time" content="2022-06-06T09:01:36+03:00" />
<meta property="article:modified_time" content="2022-06-24T14:49:37+03:00" />
<meta property="article:modified_time" content="2022-06-26T18:11:33+03:00" />
@ -58,9 +58,9 @@ There seem to be many more of these:
"@type": "BlogPosting",
"headline": "June, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-06/",
"wordCount": "1196",
"wordCount": "1520",
"datePublished": "2022-06-06T09:01:36+03:00",
"dateModified": "2022-06-24T14:49:37+03:00",
"dateModified": "2022-06-26T18:11:33+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -347,7 +347,62 @@ There seem to be many more of these:
<ul>
<li>Start a harvest on AReS</li>
</ul>
<!-- raw HTML omitted -->
<h2 id="2022-06-28">2022-06-28</h2>
<ul>
<li>Start working on the CGSpace subject export for FAO</li>
<li>First I exported a list of all metadata in our <code>dcterms.subject</code> and other center-specific subject fields with their counts:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS &#34;subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119) GROUP BY &#34;subject&#34; ORDER BY count DESC) to /tmp/2022-06-28-cgspace-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 27010
</span></span></code></pre></div><ul>
<li>Then I extracted the subjects and looked them up against AGROVOC:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c subject /tmp/2022-06-28-cgspace-subjects.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> &gt; /tmp/2022-06-28-cgspace-subjects.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-06-28-cgspace-subjects-results.csv
</span></span></code></pre></div><ul>
<li>I keep getting timeouts after every five or ten requests, so this will not be feasible for 27,000 subjects!</li>
<li>I think I will have to write some custom script to use the AGROVOC RDF file
<ul>
<li>Using rdflib to open the 1.2GB <code>agrovoc_lod.rdf</code> file takes several minutes and doesn&rsquo;t seem very efficient</li>
</ul>
</li>
<li>I tried using <a href="https://github.com/ozekik/lightrdf">lightrdf</a> and it&rsquo;s much quicker, but the documentation is limited and I&rsquo;m not sure how to search yet
<ul>
<li>I had to try in different Python versions because 3.10.x is apparently too new</li>
</ul>
</li>
<li>For future reference I was able to search with lightrdf:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>import lightrdf
</span></span><span style="display:flex;"><span>parser = lightrdf.Parser()
</span></span><span style="display:flex;"><span># prints millions of lines
</span></span><span style="display:flex;"><span>for triple in parser.parse(&#34;./agrovoc_lod.rdf&#34;, base_iri=None):
</span></span><span style="display:flex;"><span>    print(triple)
</span></span><span style="display:flex;"><span>agrovoc = lightrdf.RDFDocument(&#39;agrovoc_lod.rdf&#39;);
</span></span><span style="display:flex;"><span># all results <span style="color:#66d9ef">for</span> prefix http://aims.fao.org/aos/agrovoc/c_5
</span></span><span style="display:flex;"><span>for triple in agrovoc.search_triples(&#39;http://aims.fao.org/aos/agrovoc/c_5&#39;, None, None):
</span></span><span style="display:flex;"><span>    print(triple)
</span></span><span style="display:flex;"><span>(&#39;http://aims.fao.org/aos/agrovoc/c_5&#39;, &#39;http://www.w3.org/2004/02/skos/core#altLabel&#39;, &#39;&#34;Abalone&#34;@de&#39;)
</span></span><span style="display:flex;"><span>(&#39;http://aims.fao.org/aos/agrovoc/c_5&#39;, &#39;http://www.w3.org/2004/02/skos/core#prefLabel&#39;, &#39;&#34;abalones&#34;@en&#39;)
</span></span><span style="display:flex;"><span># all stuff <span style="color:#66d9ef">for</span> abalones in English
</span></span><span style="display:flex;"><span>for triple in agrovoc.search_triples(None, None, &#39;&#34;abalones&#34;@en&#39;):
</span></span><span style="display:flex;"><span>    print(triple)
</span></span></code></pre></div><ul>
<li>I ran <code>agrovoc-lookup.py</code> from a Linode server and it completed without issues&hellip; hmmm</li>
</ul>
<h2 id="2022-06-29">2022-06-29</h2>
<ul>
<li>Continue working on the list of non-AGROVOC subjects to report to FAO
<ul>
<li>I got a one-liner to get the list of non-AGROVOC subjects and join them with their counts:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#ae81ff">0</span> /tmp/2022-06-28-cgspace-subjects-results.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvcut -c subject \
</span></span><span style="display:flex;"><span> | csvjoin -c subject /tmp/2022-06-28-cgspace-subjects.csv - \
</span></span><span style="display:flex;"><span> &gt; /tmp/2022-06-28-cgspace-non-agrovoc.csv
</span></span></code></pre></div><!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-06-24T14:49:37+03:00" />
<meta property="og:updated_time" content="2022-06-26T18:11:33+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-06-24T14:49:37+03:00" />
<meta property="og:updated_time" content="2022-06-26T18:11:33+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-06-24T14:49:37+03:00" />
<meta property="og:updated_time" content="2022-06-26T18:11:33+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-06-24T14:49:37+03:00" />
<meta property="og:updated_time" content="2022-06-26T18:11:33+03:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-06-24T14:49:37+03:00</lastmod>
<lastmod>2022-06-26T18:11:33+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-06-24T14:49:37+03:00</lastmod>
<lastmod>2022-06-26T18:11:33+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-06/</loc>
<lastmod>2022-06-24T14:49:37+03:00</lastmod>
<lastmod>2022-06-26T18:11:33+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-06-24T14:49:37+03:00</lastmod>
<lastmod>2022-06-26T18:11:33+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-06-24T14:49:37+03:00</lastmod>
<lastmod>2022-06-26T18:11:33+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-05/</loc>
<lastmod>2022-05-30T16:00:02+03:00</lastmod>