mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes
@ -1046,4 +1046,68 @@ COPY 33
- PLANT PRODUCTION & HEALTH research theme to items with PLANT HEALTH subject
- NUTRITION & HUMAN HEALTH research theme to items with NUTRITION subject

## 2019-02-22

- Help Udana from WLE with some issues related to CGSpace items on their [Publications website](https://www.wle.cgiar.org/publications)
  - He wanted some IWMI items to show up on their publications website
  - The items were mapped into WLE collections, but still weren't showing up on the publications website
  - I told him that he needs to add the `cg.identifier.wletheme` field to the items so that the website indexer finds them
  - A few days ago he added the metadata to [10568/93011](https://cgspace.cgiar.org/handle/10568/93011) and now I see that the item is present on the [WLE publications website](https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income)
- Start looking at IITA's latest round of batch uploads called ["IITA_Feb_14" on DSpace Test](https://dspacetest.cgiar.org/handle/10568/108684)
  - One misspelled authorship type
  - A few dozen incorrect or inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)
  - One issue with smart quotes in countries
  - A few IITA subjects with syntax errors
  - Some whitespace and consistency issues in sponsorships
  - Eight items with invalid ISBN: 0-471-98560-3
  - Two incorrectly formatted ISSNs
  - Lots of incorrect values in subjects, but that's difficult to fix in an automated way
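
- The ISBN and ISSN problems listed above can be caught mechanically, because both identifiers end in a check digit. A minimal sketch in Python 3, not the method used for this batch; the function names are made up for illustration:

```python
import re


def is_valid_isbn10(isbn):
    """Return True if the string is an ISBN-10 with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if not re.match(r"^\d{9}[\dX]$", digits):
        return False
    # Weights run 10..1; a final "X" stands for the value 10
    total = sum(
        (10 if char == "X" else int(char)) * weight
        for char, weight in zip(digits, range(10, 0, -1))
    )
    return total % 11 == 0


def is_valid_issn(issn):
    """Return True if the string looks like NNNN-NNNC with a correct check digit."""
    if not re.match(r"^\d{4}-\d{3}[\dX]$", issn):
        return False
    digits = issn.replace("-", "")
    # Weights run 8..2 over the first seven digits
    total = sum(int(char) * weight for char, weight in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


# The ISBN from the batch: its weighted sum is 234, which is not a multiple of 11
print(is_valid_isbn10("0-471-98560-3"))  # False
```

- The ISSN check is deliberately strict about the hyphen and the upper-case "X", so it flags formatting problems as well as bad check digits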

- I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:

```
import json
import re
import urllib
import urllib2

# Only look up values that start with "S" and otherwise contain just capitals and spaces
pattern = re.compile('^S[A-Z ]+$')
if pattern.match(value):
  # URL encode the value before adding it to the query string
  url = 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=' + urllib.quote_plus(value) + '&lang=en'
  get = urllib2.urlopen(url)
  data = json.load(get)
  # Count the term as matched only if AGROVOC returns exactly one result
  if len(data['results']) == 1:
    return "matched"

return "unmatched"
```

- You have to make sure to URL encode the value with `quote_plus()`. It works, but OpenRefine seems to refresh the facets (and therefore re-query everything) whenever you select a facet, which makes it basically unusable
- There is a [good resource discussing OpenRefine, Jython, and web scraping](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine#example-2-url-queries-and-parsing-json)
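
- Since faceting re-queries every value, one possible workaround is to do the lookups outside OpenRefine and remember the answer for each distinct term. A rough sketch in Python 3, so it will not run as a Jython facet; it reuses the endpoint and the "exactly one result" rule from the facet above, while `build_url`, `fetch_json`, `make_lookup`, and the injectable `fetch` argument are hypothetical names:

```python
import json
import urllib.parse
import urllib.request

AGROVOC_SEARCH = "http://agrovoc.uniroma2.it/agrovoc/rest/v1/search"


def build_url(term, lang="en"):
    """Build the search URL, URL-encoding the term with quote_plus()."""
    return AGROVOC_SEARCH + "?query=" + urllib.parse.quote_plus(term) + "&lang=" + lang


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def make_lookup(fetch=fetch_json):
    """Return a lookup function that remembers answers for terms already seen."""
    cache = {}

    def lookup(term, lang="en"):
        key = (term, lang)
        if key not in cache:
            data = fetch(build_url(term, lang))
            # Same rule as the facet: exactly one result counts as a match
            cache[key] = "matched" if len(data["results"]) == 1 else "unmatched"
        return cache[key]

    return lookup
```

- Creating `lookup = make_lookup()` once and calling `lookup(term)` for each value means the number of requests equals the number of distinct terms, no matter how often a term repeats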

## 2019-02-24

- I decided to try to validate the AGROVOC subjects in IITA's recent batch upload by dumping all their terms, checking them in en/es/fr with `agrovoc-lookup.py`, then reconciling against the final list using reconcile-csv with OpenRefine
- I'm not sure how to deal with terms like "CORN" that are alternative labels (`altLabel`) in AGROVOC where the preferred label (`prefLabel`) would be "MAIZE"
- For example, [a query](http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&lang=en) for `CORN*` returns:

```
    "results": [
        {
            "altLabel": "corn (maize)",
            "lang": "en",
            "prefLabel": "maize",
            "type": [
                "skos:Concept"
            ],
            "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
            "vocab": "agrovoc"
        },
```

- There are dozens of other entries like "corn (soft wheat)", "corn (zea)", "corn bran", "Cornales", etc. that could potentially match, and determining programmatically whether they are related is difficult
- Shit, and then there are terms like "GENETIC DIVERSITY" that should [technically be](http://agrovoc.uniroma2.it/agrovoc/agrovoc/en/page/c_33952) "genetic diversity (as resource)"
- I applied all changes to the IITA Feb 14 batch data except the affiliations and sponsorships, because I think I made some mistakes when copying the reconciled values, so I will try to look at those again separately
- I went back and re-did the affiliations and sponsorships and then applied them on the IITA Feb 14 collection on DSpace Test
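
- For the `prefLabel` versus `altLabel` question, a first pass could at least separate the clear cases from the ones that need a human. A minimal sketch in Python 3, assuming results shaped like the response above; `classify` is a hypothetical helper, and it does not solve the "CORN" case because the alternative label there is "corn (maize)" rather than "corn":

```python
def classify(term, results):
    """Classify a term against search results shaped like the response above.

    Returns a (status, preferred_label) tuple where status is one of
    "preferred", "alternative" or "unclear".
    """
    wanted = term.strip().lower()
    # An exact preferred label wins over any alternative label
    for result in results:
        if result.get("prefLabel", "").lower() == wanted:
            return ("preferred", result.get("prefLabel"))
    for result in results:
        if result.get("altLabel", "").lower() == wanted:
            return ("alternative", result.get("prefLabel"))
    return ("unclear", None)


sample = [
    {
        "altLabel": "corn (maize)",
        "lang": "en",
        "prefLabel": "maize",
        "type": ["skos:Concept"],
        "uri": "http://aims.fao.org/aos/agrovoc/c_12332",
        "vocab": "agrovoc",
    }
]

print(classify("MAIZE", sample))         # ('preferred', 'maize')
print(classify("corn (maize)", sample))  # ('alternative', 'maize')
print(classify("CORN", sample))          # ('unclear', None)
```

- Terms that come back as "unclear" would still need manual review, which matches the difficulty described above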
<!-- vim: set sw=2 ts=2: -->