Add notes for 2020-06-30

2020-06-30 15:47:18 +03:00
parent 48159ebf8e
commit a23b442c77
83 changed files with 484 additions and 725 deletions


@ -437,5 +437,167 @@ COPY 3917
- Email GRID.ac to ask them where old names for institutes are stored, as I see them in the "Disambiguate" search function online, but not in the standalone data
- For example, both "International Laboratory for Research on Animal Diseases" (ILRAD) and "International Livestock Centre for Africa" (ILCA) correctly return a hit for "International Livestock Research Institute", but it's nowhere in the data
- I discovered two interesting OpenRefine reconciliation services:
- [OpenRefine reconciler for the Research Organization Registry](https://github.com/ror-community/ror-reconciler)
- [Getty Vocabularies OpenRefine Reconciliation](https://www.getty.edu/research/tools/vocabularies/obtain/openrefine.html) (see the Getty Thesaurus of Geographic Names ® (TGN))
## 2020-06-29
- I stumbled upon a sort of [standard for rights statements](https://rightsstatements.org/page/1.0/) that we might want to use for `dc.rights` eventually
- I'm trying to understand the difference between `dcterms.coverage`, `dcterms.spatial`, and `dcterms.temporal`
- According to the [Dublin Core specification for coverage](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/coverage/) we should prefer the more specific spatial and temporal subproperties:
> Because coverage is so broadly defined, it is preferable to use the more specific subproperties Temporal Coverage and Spatial Coverage.
- So I guess we should be using this for countries... but then all regions, countries, etc get merged together into this when you use DCTERMS
- Perhaps better to use `cg.coverage.country` and crosswalk to `dcterms.spatial`
- Another thing is that these values are not literals—you are supposed to embed classes...
- I also notice that there is a [CrossRef funders registry](https://www.crossref.org/services/funder-registry/) with 23,000+ funders that you can [download as RDF](https://gitlab.com/crossref/open_funder_registry) or [access via an API](https://www.crossref.org/education/funder-registry/accessing-the-funder-registry/)
```
$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
```
- Searching for "Bill and Melinda Gates" we can see the `name` literal and a list of `alt-names` literals
- This could be good for checking our funders
- The API currently returns pages for each funder in the vocabulary, but they are giving HTTP 404 right now: https://data.crossref.org/fundingdata/vocabulary/Label-599174
- I sent an email to the CrossRef Funders Registry team
- See the [CrossRef API docs](https://github.com/CrossRef/rest-api-doc) (specifically the parameters and filters)
- I made a pull request on CG Core v2 to recommend using persistent identifiers for DOIs and ORCID iDs ([#26](https://github.com/AgriculturalSemantics/cg-core/pull/26))
- I exported sponsors/funders from CGSpace and wrote a script to query the CrossRef API for matches:
```
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
COPY 682
```
- The script is `crossref-funders-lookup.py` and it is based on `agrovoc-lookup.py`
- On that note, I realized I need to URL encode the funder myself before making the search request. The requests library *does* do URL encoding, but it seems to interpret characters like `&` as query parameter separators, which causes searches for funders like `Bill & Melinda Gates Foundation` to get misinterpreted
- So then I noticed that I had worked around this in `agrovoc-lookup.py` a few years ago by just ignoring subjects with special characters like apostrophes and accents!
- I tested the script on our funders:
```
$ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
$ wc -l /tmp/2020-06-29-sponsors.csv
682 /tmp/2020-06-29-sponsors.csv
$ wc -l /tmp/sponsors-*
180 /tmp/sponsors-matched.txt
502 /tmp/sponsors-rejected.txt
682 total
```
- It seems that about 26% of our funders (180 of 682) already match... I bet a few more will match if I check for simple errors
- Interesting, I found a few funders that we have correct, but can't figure out how to match them in the API:
- `Claussen-Simon-Stiftung`
- `H2020 Marie Skłodowska-Curie Actions`
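- The URL encoding issue above can be illustrated with a minimal sketch. This is not the actual `crossref-funders-lookup.py`: the helper names `build_query_url` and `is_match` are hypothetical, the case-insensitive comparison is my own assumption, and the response layout (`message` → `items`, each with `name` and `alt-names`) is assumed from what the funders endpoint returned in the query above:

```python
from urllib.parse import quote

API_URL = "https://api.crossref.org/funders"


def build_query_url(funder, email):
    """Build a funders search URL with the funder name percent-encoded.

    quote() with safe="" encodes "&", spaces, and accented characters, so
    "Bill & Melinda Gates Foundation" stays a single query value.
    """
    return f"{API_URL}?query={quote(funder, safe='')}&mailto={quote(email, safe='@')}"


def is_match(funder, response):
    """Return True if the funder equals a name or alt-name in a parsed response.

    The comparison ignores case and surrounding whitespace (an assumption,
    the real script may be stricter).
    """
    wanted = funder.strip().lower()
    for item in response.get("message", {}).get("items", []):
        names = [item.get("name", "")] + item.get("alt-names", [])
        if any(wanted == name.strip().lower() for name in names):
            return True
    return False


print(build_query_url("Bill & Melinda Gates Foundation", "blah@blah.com"))
```

- The resulting URL could then be fetched with `requests.get(url).json()` and the parsed result passed to `is_match()`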
## 2020-06-30
- GRID responded to my question about historical names
- They said the information is not part of the public GRID or ROR lists, but you can access it with a license to the Dimensions API
- Gabriela from CIP sent me a list of erroneously added CIP subjects to remove from CGSpace:
```
$ cat /tmp/2020-06-30-remove-cip-subjects.csv
cg.subject.cip
INTEGRATED PEST MANAGEMENT
ORANGE FLESH SWEET POTATOES
AEROPONICS
FOOD SUPPLY
SASHA
SPHI
INSECT LIFE CYCLE MODELLING
SUSTAIN
AGRICULTURAL INNOVATIONS
NATIVE VARIETIES
PHYTOPHTHORA INFESTANS
$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
```
- She also wants to change their `SWEET POTATOES` term to `SWEETPOTATOES`, both in the CIP subject list and existing items so I updated those too:
```
$ cat /tmp/2020-06-30-fix-cip-subjects.csv
cg.subject.cip,correct
SWEET POTATOES,SWEETPOTATOES
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
```
- She also finished doing all the corrections to authors that I had sent her last week, but many of the changes remove Spanish accents from authors' names, so I asked if she's really sure she wants to do that
- I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs
- I spent about two hours manually checking the sponsors that had no match in CrossRef and found about fifty-five corrections that I ran on CGSpace:
```
$ cat /tmp/2020-06-29-fix-sponsors.csv
dc.description.sponsorship,correct
"Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
"Claussen Simon Stiftung","Claussen-Simon-Stiftung"
"Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium","Fonds pour la Formation à la Recherche dans lIndustrie et dans lAgriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Schlumberger Foundation Faculty for the Future","Schlumberger Foundation"
"Wildlife Conservation Society, United States","Wildlife Conservation Society"
"Portuguese Foundation for Science and Technology","Portuguese Science and Technology Foundation"
"Wageningen University and Research","Wageningen University and Research Centre"
"Leverhulme Centre for Integrative Research in Agriculture and Health","Leverhulme Centre for Integrative Research on Agriculture and Health"
"Natural Science and Engineering Research Council of Canada","Natural Sciences and Engineering Research Council of Canada"
"Biotechnology and Biological Sciences Research Council, United Kingdom","Biotechnology and Biological Sciences Research Council"
"Home Grown Ceraels Authority United Kingdom","Home-Grown Cereals Authority"
"Fiat Panis Foundation","Foundation fiat panis"
"Defence Science and Technology Laboratory, United Kingdom","Defence Science and Technology Laboratory"
"African Development Bank","African Development Bank Group"
"Ministry of Health, Labour, and Welfare, Japan","Ministry of Health, Labour and Welfare"
"World Academy of Sciences","The World Academy of Sciences"
"Agricultural Research Council, South Africa","Agricultural Research Council"
"Department of Homeland Security, USA","U.S. Department of Homeland Security"
"Quadram Institute","Quadram Institute Bioscience"
"Google.org","Google"
"Department for Environment, Food and Rural Affairs, United Kingdom","Department for Environment, Food and Rural Affairs, UK Government"
"National Commission for Science, Technology and Innovation, Kenya","National Commission for Science, Technology and Innovation"
"Hainan Province Natural Science Foundation of China","Natural Science Foundation of Hainan Province"
"German Society for International Cooperation (GIZ)","GIZ"
"German Federal Ministry of Food and Agriculture","Federal Ministry of Food and Agriculture"
"State Key Laboratory of Environmental Geochemistry, China","State Key Laboratory of Environmental Geochemistry"
"QUT student scholarship","Queensland University of Technology"
"Australia Centre for International Agricultural Research","Australian Centre for International Agricultural Research"
"Belgian Science Policy","Belgian Federal Science Policy Office"
"U.S. Department of Agriculture USDA","U.S. Department of Agriculture"
"U.S.. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil","Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul"
"Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil","Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro"
"Swedish University of Agricultural Sciences (SLU)","Swedish University of Agricultural Sciences"
"U.S. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Swedish International Development Cooperation Agency (Sida)","Sida"
"Swedish International Development Agency","Sida"
"Federal Ministry for Economic Cooperation and Development, Germany","Federal Ministry for Economic Cooperation and Development"
"Natural Environment Research Council, United Kingdom","Natural Environment Research Council"
"Economic and Social Research Council, United Kingdom","Economic and Social Research Council"
"Medical Research Council, United Kingdom","Medical Research Council"
"Federal Ministry for Education and Research, Germany","Federal Ministry for Education, Science, Research and Technology"
"UK Governments Department for International Development","Department for International Development, UK Government"
"Department for International Development, United Kingdom","Department for International Development, UK Government"
"United Nations Children's Fund","United Nations Children's Emergency Fund"
"Swedish Research Council for Environment, Agricultural Science and Spatial Planning","Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning"
"Agence Nationale de la Recherche, France","French National Research Agency"
"Fondation pour la recherche sur la biodiversité","Foundation for Research on Biodiversity"
"Programa Nacional de Innovacion Agraria, Peru","Programa Nacional de Innovación Agraria, Peru"
"United States Agency for International Development (USAID)","United States Agency for International Development"
"West Africa Agricultural Productivity Programme","West Africa Agricultural Productivity Program"
"West African Agricultural Productivity Project","West Africa Agricultural Productivity Program"
"Rural Development Administration, Republic of Korea","Rural Development Administration"
"UKs Biotechnology and Biological Sciences Research Council (BBSRC)","Biotechnology and Biological Sciences Research Council"
$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
```
- Peter wants me to add "CORONAVIRUS DISEASE" to all ILRI items that have ILRI subject "COVID19"
- I exported the ILRI community and cut the columns I needed, then opened the file in OpenRefine:
```
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ dspace metadata-export -i 10568/1 -f /tmp/ilri.csv
$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
```
- I see that all items with "COVID19" already have "CORONAVIRUS DISEASE" so I don't need to do anything
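- The same check could be done without OpenRefine. A minimal sketch, assuming that "CORONAVIRUS DISEASE" belongs in `dc.subject[en_US]` and that multiple values are separated by `||` (the DSpace CSV default); the function names and the sample rows are made up for illustration:

```python
import csv
import io

SEPARATOR = "||"  # multi-value separator in DSpace metadata CSV exports


def split_values(cell):
    """Split a multi-value CSV cell into a list of stripped, non-empty values."""
    return [v.strip() for v in (cell or "").split(SEPARATOR) if v.strip()]


def items_missing_term(rows, trigger="COVID19", required="CORONAVIRUS DISEASE"):
    """Return ids of rows that have the ILRI subject but lack the required term."""
    missing = []
    for row in rows:
        ilri = split_values(row.get("cg.subject.ilri[]")) + split_values(
            row.get("cg.subject.ilri[en_US]")
        )
        subjects = split_values(row.get("dc.subject[en_US]"))
        if trigger in ilri and required not in subjects:
            missing.append(row["id"])
    return missing


# Hypothetical sample data with the same columns as /tmp/ilri-covid19.csv
sample = io.StringIO(
    "id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]\n"
    "1,,COVID19||LIVESTOCK,CORONAVIRUS DISEASE||ZOONOSES\n"
    "2,,COVID19,ZOONOSES\n"
    "3,,LIVESTOCK,CATTLE\n"
)
print(items_missing_term(csv.DictReader(sample)))  # prints ['2']
```

- For the real export, replace the sample with `open("/tmp/ilri-covid19.csv", newline="")`; an empty list would confirm that nothing needs to change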
<!-- vim: set sw=2 ts=2: -->