mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-12-03
@@ -25,4 +25,42 @@ categories: ["Notes"]
- [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
- [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)

## 2022-12-03
- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many match:

```console
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
```
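The match rate can also be computed directly from the lookup output rather than from two `wc -l` counts. This is a minimal sketch, assuming the CSV has a `matched` column whose value is `true` for hits (which is what the `csvgrep` filter above implies); the sample rows and the `organization` column name are made up for illustration:

```python
import csv
import io


def match_rate(csv_file):
    """Return (matched, total) for a CSV that has a `matched` column."""
    matched = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # Assumption: hits are recorded as the string "true"
        if row["matched"].strip().lower() == "true":
            matched += 1
    return matched, total


# Hypothetical sample standing in for /tmp/clarisa-ror-matches.csv
sample = io.StringIO(
    "organization,matched\n"
    "Org A,true\n"
    "Org B,false\n"
    "Org C,true\n"
    "Org D,false\n"
)
matched, total = match_rate(sample)
print(f"{matched}/{total} = {matched / total:.1%}")
```

Note that `csvgrep` normally writes the CSV header row to its output, so the `wc -l` figures probably include one extra line; counting data rows as above sidesteps that.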
- Out of the box they match 26.4%, but many institutions have names in multiple languages in the text value, as well as countries in parentheses, so I think the rate could be higher
- If I replace the slashes with newlines and remove the countries at the end there are slightly more matches, around 29%:

```console
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```
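The same cleanup idea can be sketched in Python, which makes the two steps (split on slashes, drop a trailing parenthetical) easier to see. This is not a byte-for-byte port of the `sed` expression: it splits on every slash rather than only the first ones, and the function name and sample institution names are hypothetical:

```python
import re


def split_variants(name):
    """Split a multi-language name on slashes and drop a trailing "(Country)"."""
    parts = re.split(r"\s*/\s*", name)
    # Remove one trailing parenthetical, e.g. " (Peru)", from each variant
    cleaned = [re.sub(r"\s*\([^()]*\)\s*$", "", part).strip() for part in parts]
    return [part for part in cleaned if part]


# Hypothetical examples, not taken from the CLARISA list
print(split_variants("Universidad Ejemplo / Example University (Peru)"))
print(split_variants("Example Institute (Kenya)"))
```

Splitting on every slash will also break up names that legitimately contain one, so the output is only suitable for a rough matching experiment like this one.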
- I checked CGSpace's top 1,000 institutions too, first exporting from PostgreSQL:

```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
```
- Then cutting out the first column (tab is the default delimiter):

```console
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```
- So that's a 54% match for our top institutions
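For reference, the `cut -f 1` step on the tab-delimited export can be sketched in Python as well. This assumes each exported line is `name<TAB>count` with no header row (the `\COPY` above does not request one); the sample lines are invented:

```python
def first_column(lines):
    """Yield the first tab-separated field of each line, like `cut -f 1`."""
    for line in lines:
        yield line.rstrip("\n").split("\t", 1)[0]


# Hypothetical lines standing in for /tmp/2022-11-22-affiliations.csv
sample = ["Example Institute A\t120\n", "Example University B\t95\n"]
names = list(first_column(sample))
print(names)
```

As with `cut`, a line without any tab is passed through whole.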
<!-- vim: set sw=2 ts=2: -->