mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 22:55:04 +01:00
Update notes
parent
cf48f20cbe
commit
2c0b9ce100
@ -1046,4 +1046,68 @@ COPY 33
|
|||||||
- PLANT PRODUCTION & HEALTH research theme to items with PLANT HEALTH subject
|
- PLANT PRODUCTION & HEALTH research theme to items with PLANT HEALTH subject
|
||||||
- NUTRITION & HUMAN HEALTH research theme to items with NUTRITION subject
|
- NUTRITION & HUMAN HEALTH research theme to items with NUTRITION subject
|
||||||
|
|
||||||
|
## 2019-02-22
|
||||||
|
|
||||||
|
- Help Udana from WLE with some issues related to CGSpace items on their [Publications website](https://www.wle.cgiar.org/publications)
|
||||||
|
- He wanted some IWMI items to show up in their publications website
|
||||||
|
- The items were mapped into WLE collections, but still weren't showing up on the publications website
|
||||||
|
- I told him that he needs to add the `cg.identifier.wletheme` to the items so that the website indexer finds them
|
||||||
|
- A few days ago he added the metadata to [10568/93011](https://cgspace.cgiar.org/handle/10568/93011) and now I see that the item is present on the [WLE publications website](https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income)
|
||||||
|
- Start looking at IITA's latest round of batch uploads called ["IITA_Feb_14" on DSpace Test](https://dspacetest.cgiar.org/handle/10568/108684)
|
||||||
|
- One misspelled authorship type
|
||||||
|
- A few dozen incorrect or inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)
|
||||||
|
- One issue with smart quotes in countries
|
||||||
|
- A few IITA subjects with syntax errors
|
||||||
|
- Some whitespace and consistency issues in sponsorships
|
||||||
|
- Eight items with invalid ISBN: 0-471-98560-3
|
||||||
|
- Two incorrectly formatted ISSNs
|
||||||
|
- Lots of incorrect values in subjects, but that's a difficult problem to solve in an automated way
|
||||||
|
|
||||||
|
- I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:
|
||||||
|
|
||||||
|
```
# Jython (Python 2) expression body for an OpenRefine custom text facet
import json
import re
import urllib
import urllib2

# Only query values that start with "S" and contain only upper-case letters and spaces
pattern = re.compile('^S[A-Z ]+$')
if pattern.match(value):
    # quote_plus() URL encodes the term so spaces and symbols are safe in the query string
    url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en'
    get = urllib2.urlopen(url)
    data = json.load(get)
    # Exactly one search result counts as a match
    if len(data['results']) == 1:
        return "matched"

return "unmatched"
```
|
||||||
|
|
||||||
|
- You have to make sure to URL encode the value with `quote_plus()`, and then it works, but OpenRefine seems to refresh the facets (and therefore re-query everything) whenever you select a facet, which makes it basically unusable
|
||||||
|
- There is a [good resource discussing OpenRefine, Jython, and web scraping](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json)
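The re-querying problem described above could be softened by caching each term's result so that repeated evaluations do not hit the network again. The following is only a standalone sketch in Python 3, not the facet code itself: OpenRefine's Jython is Python 2, and whether state survives between facet refreshes there is not something these notes establish. The names `fetch_results` and `make_cached_matcher` are hypothetical; the URL is the one used in the facet above.

```python
import json
import urllib.parse
import urllib.request

# Search endpoint taken from the facet expression in these notes.
AGROVOC_SEARCH = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/search"


def fetch_results(term, lang="en"):
    """Query the search endpoint and return the list under "results"."""
    # urlencode() applies quote_plus() to each value by default.
    query = urllib.parse.urlencode({"query": term, "lang": lang})
    with urllib.request.urlopen(AGROVOC_SEARCH + "?" + query) as response:
        return json.load(response)["results"]


def make_cached_matcher(fetch=fetch_results):
    """Return a function that remembers the verdict for each term.

    `fetch` is injectable so the matcher can be tried without network access.
    """
    cache = {}

    def match(term):
        if term not in cache:
            # Same rule as the facet: exactly one result means "matched".
            results = fetch(term)
            cache[term] = "matched" if len(results) == 1 else "unmatched"
        return cache[term]

    return match
```

With this shape, calling `match("SOIL")` twice would only trigger one request; the second call is answered from the dictionary.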
|
||||||
|
|
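The notes do not say how the ISBN 0-471-98560-3 was found to be invalid, but one plausible check is the standard ISBN-10 checksum: weight the ten characters from 10 down to 1 and require the sum to be divisible by 11. A minimal sketch, with `is_valid_isbn10` as a hypothetical helper name:

```python
def is_valid_isbn10(isbn):
    """Return True if the string is a well-formed ISBN-10 with a valid check digit."""
    # Ignore hyphens and spaces used as separators.
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            digit = 10  # "X" stands for 10 and is only allowed as the check digit
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * digit
    return total % 11 == 0
```

For 0-471-98560-3 the weighted sum is 234, and 234 mod 11 is 3, so `is_valid_isbn10("0-471-98560-3")` returns `False`.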
||||||
|
## 2019-02-24
|
||||||
|
|
||||||
|
- I decided to try to validate the AGROVOC subjects in IITA's recent batch upload by dumping all their terms, checking them in en/es/fr with `agrovoc-lookup.py`, then reconciling against the final list using reconcile-csv with OpenRefine
|
||||||
|
- I'm not sure how to deal with terms like "CORN" that are alternative labels (`altLabel`) in AGROVOC where the preferred label (`prefLabel`) would be "MAIZE"
|
||||||
|
- For example, [a query](http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&lang=en) for `CORN*` returns:
|
||||||
|
|
||||||
|
```
    "results": [
        {
            "altLabel": "corn (maize)",
            "lang": "en",
            "prefLabel": "maize",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
            "vocab": "agrovoc"
        },
```
|
||||||
|
|
||||||
|
- There are dozens of other entries like "corn (soft wheat)", "corn (zea)", "corn bran", "Cornales", etc. that could potentially match, and determining programmatically whether they are related is difficult
|
||||||
|
- Shit, and then there are terms like "GENETIC DIVERSITY" that should [technically be](http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/page/c_33952) "genetic diversity (as resource)"
|
||||||
|
- I applied all changes to the IITA Feb 14 batch data except the affiliations and sponsorships because I think I made some mistakes with the copying of reconciled values so I will try to look at those again separately
|
||||||
|
- I went back and re-did the affiliations and sponsorships and then applied them on the IITA Feb 14 collection on DSpace Test
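One way to make the `altLabel` versus `prefLabel` problem above more tractable is to classify each term against the search results instead of expecting a single yes/no answer. This is only a sketch under assumptions: it treats a trailing parenthetical such as "(maize)" or "(as resource)" as a qualifier that can be ignored when comparing, which is a heuristic of mine rather than anything AGROVOC defines, and the names `base_label` and `classify_term` are hypothetical.

```python
import re

# Matches a trailing parenthetical qualifier, e.g. " (maize)".
_QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")


def base_label(label):
    """Lower-case a label and drop a trailing parenthetical qualifier."""
    return _QUALIFIER.sub("", label).strip().lower()


def classify_term(term, results):
    """Classify a term against a list of search result dictionaries.

    Returns a (status, preferred_labels) tuple where status is one of
    "preferred", "single-candidate", "ambiguous", or "unmatched".
    """
    needle = term.strip().lower()

    # An exact (case-insensitive) preferred label wins outright.
    for result in results:
        pref = result.get("prefLabel") or ""
        if pref.lower() == needle:
            return "preferred", [pref]

    # Otherwise collect preferred labels whose alternative or preferred
    # label equals the term once the qualifier is stripped.
    candidates = []
    for result in results:
        pref = result.get("prefLabel") or ""
        alt = result.get("altLabel") or ""
        labels = [label for label in (alt, pref) if label]
        if any(base_label(label) == needle for label in labels):
            if pref and pref not in candidates:
                candidates.append(pref)

    if len(candidates) == 1:
        return "single-candidate", candidates
    if candidates:
        return "ambiguous", candidates
    return "unmatched", []
```

Given only the single result shown above, `classify_term("CORN", results)` returns `("single-candidate", ["maize"])`. With the full response, entries like "corn (soft wheat)" and "corn (zea)" would also reduce to "corn", so the term would probably come back as "ambiguous" and still need a human decision.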
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
@ -42,7 +42,7 @@ sys 0m1.979s
|
|||||||
<meta property="og:type" content="article" />
|
<meta property="og:type" content="article" />
|
||||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" />
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-02/" />
|
||||||
<meta property="article:published_time" content="2019-02-01T21:37:30+02:00"/>
|
<meta property="article:published_time" content="2019-02-01T21:37:30+02:00"/>
|
||||||
<meta property="article:modified_time" content="2019-02-21T17:21:37-08:00"/>
|
<meta property="article:modified_time" content="2019-02-21T18:16:33-08:00"/>
|
||||||
|
|
||||||
<meta name="twitter:card" content="summary"/>
|
<meta name="twitter:card" content="summary"/>
|
||||||
<meta name="twitter:title" content="February, 2019"/>
|
<meta name="twitter:title" content="February, 2019"/>
|
||||||
@ -89,9 +89,9 @@ sys 0m1.979s
|
|||||||
"@type": "BlogPosting",
|
"@type": "BlogPosting",
|
||||||
"headline": "February, 2019",
|
"headline": "February, 2019",
|
||||||
"url": "https://alanorth.github.io/cgspace-notes/2019-02/",
|
"url": "https://alanorth.github.io/cgspace-notes/2019-02/",
|
||||||
"wordCount": "6074",
|
"wordCount": "6551",
|
||||||
"datePublished": "2019-02-01T21:37:30+02:00",
|
"datePublished": "2019-02-01T21:37:30+02:00",
|
||||||
"dateModified": "2019-02-21T17:21:37-08:00",
|
"dateModified": "2019-02-21T18:16:33-08:00",
|
||||||
"author": {
|
"author": {
|
||||||
"@type": "Person",
|
"@type": "Person",
|
||||||
"name": "Alan Orth"
|
"name": "Alan Orth"
|
||||||
@ -1347,6 +1347,83 @@ COPY 33
|
|||||||
</ul></li>
|
</ul></li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
|
<h2 id="2019-02-22">2019-02-22</h2>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>Help Udana from WLE with some issues related to CGSpace items on their <a href="https://www.wle.cgiar.org/publications">Publications website</a>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>He wanted some IWMI items to show up in their publications website</li>
|
||||||
|
<li>The items were mapped into WLE collections, but still weren’t showing up on the publications website</li>
|
||||||
|
<li>I told him that he needs to add the <code>cg.identifier.wletheme</code> to the items so that the website indexer finds them</li>
|
||||||
|
<li>A few days ago he added the metadata to <a href="https://cgspace.cgiar.org/handle/10568/93011"><sup>10568</sup>⁄<sub>93011</sub></a> and now I see that the item is present on the <a href="https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income">WLE publications website</a></li>
|
||||||
|
</ul></li>
|
||||||
|
|
||||||
|
<li><p>Start looking at IITA’s latest round of batch uploads called <a href="https://dspacetest.cgiar.org/handle/10568/108684">“IITA_Feb_14” on DSpace Test</a></p>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>One misspelled authorship type</li>
|
||||||
|
<li>A few dozen incorrect or inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)</li>
|
||||||
|
<li>One issue with smart quotes in countries</li>
|
||||||
|
<li>A few IITA subjects with syntax errors</li>
|
||||||
|
<li>Some whitespace and consistency issues in sponsorships</li>
|
||||||
|
<li>Eight items with invalid ISBN: 0-471-98560-3</li>
|
||||||
|
<li>Two incorrectly formatted ISSNs</li>
|
||||||
|
<li>Lots of incorrect values in subjects, but that’s a difficult problem to solve in an automated way</li>
|
||||||
|
</ul></li>
|
||||||
|
|
||||||
|
<li><p>I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:</p></li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code># Jython (Python 2) expression body for an OpenRefine custom text facet
import json
import re
import urllib
import urllib2

# Only query values that start with "S" and contain only upper-case letters and spaces
pattern = re.compile('^S[A-Z ]+$')
if pattern.match(value):
    # quote_plus() URL encodes the term so spaces and symbols are safe in the query string
    url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en'
    get = urllib2.urlopen(url)
    data = json.load(get)
    # Exactly one search result counts as a match
    if len(data['results']) == 1:
        return "matched"

return "unmatched"
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>You have to make sure to URL encode the value with <code>quote_plus()</code>, and then it works, but OpenRefine seems to refresh the facets (and therefore re-query everything) whenever you select a facet, which makes it basically unusable</li>
|
||||||
|
<li>There is a <a href="https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json">good resource discussing OpenRefine, Jython, and web scraping</a></li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<h2 id="2019-02-24">2019-02-24</h2>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>I decided to try to validate the AGROVOC subjects in IITA’s recent batch upload by dumping all their terms, checking them in en/es/fr with <code>agrovoc-lookup.py</code>, then reconciling against the final list using reconcile-csv with OpenRefine</li>
|
||||||
|
<li>I’m not sure how to deal with terms like “CORN” that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be “MAIZE”</li>
|
||||||
|
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&lang=en">a query</a> for <code>CORN*</code> returns:</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<pre><code>    "results": [
        {
            "altLabel": "corn (maize)",
            "lang": "en",
            "prefLabel": "maize",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
            "vocab": "agrovoc"
        },
</code></pre>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>There are dozens of other entries like “corn (soft wheat)”, “corn (zea)”, “corn bran”, “Cornales”, etc. that could potentially match, and determining programmatically whether they are related is difficult</li>
|
||||||
|
<li>Shit, and then there are terms like “GENETIC DIVERSITY” that should <a href="http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/page/c_33952">technically be</a> “genetic diversity (as resource)”</li>
|
||||||
|
<li>I applied all changes to the IITA Feb 14 batch data except the affiliations and sponsorships because I think I made some mistakes with the copying of reconciled values so I will try to look at those again separately</li>
|
||||||
|
<li>I went back and re-did the affiliations and sponsorships and then applied them on the IITA Feb 14 collection on DSpace Test</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
<!-- vim: set sw=2 ts=2: -->
|
<!-- vim: set sw=2 ts=2: -->
|
||||||
|
|
||||||
|
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2019-02/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2019-02/</loc>
|
||||||
<lastmod>2019-02-21T17:21:37-08:00</lastmod>
|
<lastmod>2019-02-21T18:16:33-08:00</lastmod>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
@ -209,7 +209,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||||
<lastmod>2019-02-21T17:21:37-08:00</lastmod>
|
<lastmod>2019-02-21T18:16:33-08:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@ -220,7 +220,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||||
<lastmod>2019-02-21T17:21:37-08:00</lastmod>
|
<lastmod>2019-02-21T18:16:33-08:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@ -232,13 +232,13 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||||
<lastmod>2019-02-21T17:21:37-08:00</lastmod>
|
<lastmod>2019-02-21T18:16:33-08:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
||||||
<lastmod>2019-02-21T17:21:37-08:00</lastmod>
|
<lastmod>2019-02-21T18:16:33-08:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
|