mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-06-30
@@ -437,5 +437,167 @@ COPY 3917
- Email GRID.ac to ask them where old names for institutes are stored, as I see them in the "Disambiguate" search function online, but not in the standalone data
  - For example, both "International Laboratory for Research on Animal Diseases" (ILRAD) and "International Livestock Centre for Africa" (ILCA) correctly return a hit for "International Livestock Research Institute", but it's nowhere in the data
- I discovered two interesting OpenRefine reconciliation services:
  - [OpenRefine reconciler for the Research Organization Registry](https://github.com/ror-community/ror-reconciler)
  - [Getty Vocabularies OpenRefine Reconciliation](https://www.getty.edu/research/tools/vocabularies/obtain/openrefine.html) (see the Getty Thesaurus of Geographic Names ® (TGN))

## 2020-06-29

- I stumbled upon a sort of [standard for rights statements](https://rightsstatements.org/page/1.0/) that we might want to use for `dc.rights` eventually
- I'm trying to understand the difference between `dcterms.coverage`, `dcterms.spatial`, and `dcterms.temporal`
- According to the [Dublin Core specification for coverage](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/coverage/), the more specific spatial and temporal subproperties are preferred:

> Because coverage is so broadly defined, it is preferable to use the more specific subproperties Temporal Coverage and Spatial Coverage.

- So I guess we should be using `dcterms.spatial` for countries... but then all regions, countries, etc get merged together into this one field when you use DCTERMS
- Perhaps better to use `cg.coverage.country` and crosswalk to `dcterms.spatial`
- Another thing is that these values are not literals; you are supposed to embed classes...
- I also notice that there is a [CrossRef funders registry](https://www.crossref.org/services/funder-registry/) with 23,000+ funders that you can [download as RDF](https://gitlab.com/crossref/open_funder_registry) or [access via an API](https://www.crossref.org/education/funder-registry/accessing-the-funder-registry/)

```
$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
```

- Searching for "Bill and Melinda Gates" we can see the `name` literal and a list of `alt-names` literals
- This could be good for checking our funders
- The API currently returns pages for each funder in the vocabulary, but they are giving HTTP 404 right now: https://data.crossref.org/fundingdata/vocabulary/Label-599174
- I sent an email to the CrossRef Funders Registry team
- See the [CrossRef API docs](https://github.com/CrossRef/rest-api-doc) (specifically the parameters and filters)
- I made a pull request on CG Core v2 to recommend using persistent identifiers for DOIs and ORCID iDs ([#26](https://github.com/AgriculturalSemantics/cg-core/pull/26))
- I exported sponsors/funders from CGSpace and wrote a script to query the CrossRef API for matches:

```
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
COPY 682
```
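The kind of lookup described above could be sketched roughly as follows. This is not the actual `crossref-funders-lookup.py`; it is a minimal illustration that assumes the funders endpoint returns JSON with the results under `message` → `items`, each carrying the `name` and `alt-names` fields mentioned above. The function names and the choice to compare case-insensitively against both `name` and `alt-names` are assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.crossref.org/funders"


def find_match(funder, response):
    """Return the registry `name` of the first item matching `funder`, or None.

    The comparison is case-insensitive and checks both the primary `name`
    and every entry in `alt-names` (an assumption for this sketch).
    """
    wanted = funder.strip().lower()
    for item in response.get("message", {}).get("items", []):
        candidates = [item.get("name", "")] + item.get("alt-names", [])
        if any(wanted == candidate.strip().lower() for candidate in candidates):
            return item.get("name")
    return None


def lookup(funder, mailto=None):
    """Query the live API for one funder (needs network access)."""
    params = {"query": funder}
    if mailto:
        params["mailto"] = mailto
    # urlencode() percent-encodes the values, including characters like "&"
    with urlopen(f"{API_URL}?{urlencode(params)}") as resp:
        return find_match(funder, json.load(resp))
```

Keeping the matching logic in `find_match` separate from the HTTP call means it can be exercised on a saved response without hitting the API for each of the 682 sponsors.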

- The script is `crossref-funders-lookup.py` and it is based on `agrovoc-lookup.py`
- On that note, I realized I need to URL encode the funder before making the search request because, while the requests library *does* do URL encoding, it seems to interpret characters like `&` as query parameter delimiters, which causes searches for funders like `Bill & Melinda Gates Foundation` to get misinterpreted
- So then I noticed that I had worked around this in `agrovoc-lookup.py` a few years ago by just ignoring subjects with special characters like apostrophes and accents!
- I tested the script on our funders:

```
$ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
$ wc -l /tmp/2020-06-29-sponsors.csv
682 /tmp/2020-06-29-sponsors.csv
$ wc -l /tmp/sponsors-*
180 /tmp/sponsors-matched.txt
502 /tmp/sponsors-rejected.txt
682 total
```
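The URL encoding problem noted above can be illustrated with the standard library alone. This sketch only shows why a raw `&` in a funder name breaks a query string that was built by string concatenation, and how encoding the value first avoids it; it makes no claim about how the lookup script builds its requests.

```python
from urllib.parse import parse_qs, urlencode

funder = "Bill & Melinda Gates Foundation"

# Naive concatenation: the raw "&" is read as a parameter separator, so
# only the text before it is treated as the value of `query`.
naive_query = "query=" + funder
print(parse_qs(naive_query))  # {'query': ['Bill ']}

# Encoding the value first turns "&" into "%26", so the name survives intact.
safe_query = urlencode({"query": funder})
print(safe_query)  # query=Bill+%26+Melinda+Gates+Foundation
print(parse_qs(safe_query))  # {'query': ['Bill & Melinda Gates Foundation']}
```

With the requests library, passing the search terms as a dictionary via the `params=` argument should have the same effect, since the values are then encoded individually rather than being parsed out of a pre-built URL string.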

- It seems that about 26% of our funders (180 of 682) already match... I bet a few more will match if I check for simple errors
- Interesting, I found a few funders that we have correct, but can't figure out how to match them in the API:
  - `Claussen-Simon-Stiftung`
  - `H2020 Marie Skłodowska-Curie Actions`

## 2020-06-30

- GRID responded to my question about historical names
  - They said the information is not part of the public GRID or ROR lists, but you can access it with a license to the Dimensions API
- Gabriela from CIP sent me a list of erroneously added CIP subjects to remove from CGSpace:

```
$ cat /tmp/2020-06-30-remove-cip-subjects.csv
cg.subject.cip
INTEGRATED PEST MANAGEMENT
ORANGE FLESH SWEET POTATOES
AEROPONICS
FOOD SUPPLY
SASHA
SPHI
INSECT LIFE CYCLE MODELLING
SUSTAIN
AGRICULTURAL INNOVATIONS
NATIVE VARIETIES
PHYTOPHTHORA INFESTANS
$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
```

- She also wants to change their `SWEET POTATOES` term to `SWEETPOTATOES`, both in the CIP subject list and existing items, so I updated those too:

```
$ cat /tmp/2020-06-30-fix-cip-subjects.csv
cg.subject.cip,correct
SWEET POTATOES,SWEETPOTATOES
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
```

- She also finished doing all the corrections to authors that I had sent her last week, but many of the changes are removing Spanish accents from authors' names so I asked if she's really sure she wants to do that
- I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs
- I spent about two hours manually checking our sponsors that were rejected from CrossRef and found about fifty-five corrections that I ran on CGSpace:

```
$ cat 2020-06-29-fix-sponsors.csv
dc.description.sponsorship,correct
"Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
"Claussen Simon Stiftung","Claussen-Simon-Stiftung"
"Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium","Fonds pour la Formation à la Recherche dans l’Industrie et dans l’Agriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Schlumberger Foundation Faculty for the Future","Schlumberger Foundation"
"Wildlife Conservation Society, United States","Wildlife Conservation Society"
"Portuguese Foundation for Science and Technology","Portuguese Science and Technology Foundation"
"Wageningen University and Research","Wageningen University and Research Centre"
"Leverhulme Centre for Integrative Research in Agriculture and Health","Leverhulme Centre for Integrative Research on Agriculture and Health"
"Natural Science and Engineering Research Council of Canada","Natural Sciences and Engineering Research Council of Canada"
"Biotechnology and Biological Sciences Research Council, United Kingdom","Biotechnology and Biological Sciences Research Council"
"Home Grown Ceraels Authority United Kingdom","Home-Grown Cereals Authority"
"Fiat Panis Foundation","Foundation fiat panis"
"Defence Science and Technology Laboratory, United Kingdom","Defence Science and Technology Laboratory"
"African Development Bank","African Development Bank Group"
"Ministry of Health, Labour, and Welfare, Japan","Ministry of Health, Labour and Welfare"
"World Academy of Sciences","The World Academy of Sciences"
"Agricultural Research Council, South Africa","Agricultural Research Council"
"Department of Homeland Security, USA","U.S. Department of Homeland Security"
"Quadram Institute","Quadram Institute Bioscience"
"Google.org","Google"
"Department for Environment, Food and Rural Affairs, United Kingdom","Department for Environment, Food and Rural Affairs, UK Government"
"National Commission for Science, Technology and Innovation, Kenya","National Commission for Science, Technology and Innovation"
"Hainan Province Natural Science Foundation of China","Natural Science Foundation of Hainan Province"
"German Society for International Cooperation (GIZ)","GIZ"
"German Federal Ministry of Food and Agriculture","Federal Ministry of Food and Agriculture"
"State Key Laboratory of Environmental Geochemistry, China","State Key Laboratory of Environmental Geochemistry"
"QUT student scholarship","Queensland University of Technology"
"Australia Centre for International Agricultural Research","Australian Centre for International Agricultural Research"
"Belgian Science Policy","Belgian Federal Science Policy Office"
"U.S. Department of Agriculture USDA","U.S. Department of Agriculture"
"U.S.. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil","Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul"
"Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil","Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro"
"Swedish University of Agricultural Sciences (SLU)","Swedish University of Agricultural Sciences"
"U.S. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Swedish International Development Cooperation Agency (Sida)","Sida"
"Swedish International Development Agency","Sida"
"Federal Ministry for Economic Cooperation and Development, Germany","Federal Ministry for Economic Cooperation and Development"
"Natural Environment Research Council, United Kingdom","Natural Environment Research Council"
"Economic and Social Research Council, United Kingdom","Economic and Social Research Council"
"Medical Research Council, United Kingdom","Medical Research Council"
"Federal Ministry for Education and Research, Germany","Federal Ministry for Education, Science, Research and Technology"
"UK Government’s Department for International Development","Department for International Development, UK Government"
"Department for International Development, United Kingdom","Department for International Development, UK Government"
"United Nations Children's Fund","United Nations Children's Emergency Fund"
"Swedish Research Council for Environment, Agricultural Science and Spatial Planning","Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning"
"Agence Nationale de la Recherche, France","French National Research Agency"
"Fondation pour la recherche sur la biodiversité","Foundation for Research on Biodiversity"
"Programa Nacional de Innovacion Agraria, Peru","Programa Nacional de Innovación Agraria, Peru"
"United States Agency for International Development (USAID)","United States Agency for International Development"
"West Africa Agricultural Productivity Programme","West Africa Agricultural Productivity Program"
"West African Agricultural Productivity Project","West Africa Agricultural Productivity Program"
"Rural Development Administration, Republic of Korea","Rural Development Administration"
"UK’s Biotechnology and Biological Sciences Research Council (BBSRC)","Biotechnology and Biological Sciences Research Council"
$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
```
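Many of the corrections above follow two simple patterns: a trailing `, Country` or a trailing parenthetical acronym. The corrections themselves were found by hand; the sketch below only illustrates how such variants might be generated automatically before retrying a lookup. The function name and the country list are hypothetical, and the list contains only suffixes that appear in the CSV above.

```python
import re

# Hypothetical list: only country suffixes seen in the corrections above
COUNTRY_SUFFIXES = (
    "Brazil", "Belgium", "United States", "United Kingdom", "Japan",
    "South Africa", "USA", "Kenya", "China", "Germany", "France", "Peru",
    "Republic of Korea",
)


def candidate_names(sponsor):
    """Return the sponsor name plus simplified variants, without duplicates."""
    # Collapse repeated whitespace and trim the ends
    name = " ".join(sponsor.split())
    # Drop a trailing parenthetical such as " (USDA)" or " ( FAPESP)"
    no_acronym = re.sub(r"\s*\([^()]*\)$", "", name)
    # Drop a trailing ", Country" if it is in the list above
    no_country = no_acronym
    for country in COUNTRY_SUFFIXES:
        suffix = ", " + country
        if no_acronym.endswith(suffix):
            no_country = no_acronym[: -len(suffix)]
            break
    variants = []
    for variant in (name, no_acronym, no_country):
        if variant not in variants:
            variants.append(variant)
    return variants


print(candidate_names("Wildlife Conservation Society, United States"))
print(candidate_names("U.S. Department of Agriculture (USDA)"))
```

This would not catch cases like `German Society for International Cooperation (GIZ)`, where the registry name is the acronym itself, or spelling fixes like `Ceraels`, so manual checking would still be needed.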

- Peter wants me to add "CORONAVIRUS DISEASE" to all ILRI items that have ILRI subject "COVID19"
- I exported the ILRI community and cut the columns I needed, then opened the file in OpenRefine:

```
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ dspace metadata-export -i 10568/1 -f /tmp/ilri.csv
$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
```

- I see that all items with "COVID19" already have "CORONAVIRUS DISEASE" so I don't need to do anything
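
The same check could also be done outside OpenRefine. The sketch below is one possible approach, not what was actually run: it assumes that multiple values in a DSpace CSV cell are separated by `||` and that "CORONAVIRUS DISEASE" is expected in the `dc.subject[en_US]` column. The function name and the sample rows are made up for illustration.

```python
import csv
import io

SEPARATOR = "||"  # DSpace separates multiple values in one CSV cell with "||"


def split_values(cell):
    """Split a multi-value cell into a set of stripped, non-empty values."""
    return {value.strip() for value in (cell or "").split(SEPARATOR) if value.strip()}


def items_missing_term(rows, source_cols, source_term, target_col, target_term):
    """Return ids of rows with source_term in any source column but no target_term."""
    missing = []
    for row in rows:
        source_values = set()
        for col in source_cols:
            source_values |= split_values(row.get(col))
        if source_term in source_values and target_term not in split_values(row.get(target_col)):
            missing.append(row["id"])
    return missing


# Made-up sample rows using the column layout of /tmp/ilri-covid19.csv
sample = io.StringIO(
    "id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]\n"
    "1,,COVID19||ZOONOTIC DISEASES,CORONAVIRUS DISEASE||HEALTH\n"
    "2,COVID19,,HEALTH\n"
    "3,,LIVESTOCK,HEALTH\n"
)
missing = items_missing_term(
    list(csv.DictReader(sample)),
    ["cg.subject.ilri[]", "cg.subject.ilri[en_US]"],
    "COVID19",
    "dc.subject[en_US]",
    "CORONAVIRUS DISEASE",
)
print(missing)  # ['2']
```

An empty list for the real export would confirm that no items need updating.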
<!-- vim: set sw=2 ts=2: -->