cgspace-notes/2022-12.md at 12b4f1660d2d34c1c8bb33979f426009778f4375



2022-12-01

Fix some incorrect regions on CGSpace
- I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
Replace "East Asia" with "Eastern Asia" region on CGSpace (UN M.49 region)

CGSpace and PRMS information session with Enrico and a bunch of researchers
I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP, so I sent a message to Daniel from Alliance
I started a harvest on AReS since we've updated so much metadata recently

2022-12-02

File some issues related to metadata on the MEL issue tracker

2022-12-03

I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:

$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt

Out of the box, about 26.4% of them match (1,864 of 7,060), but many institutions have names in multiple languages in the text value, as well as countries in parentheses, so I think the rate could be higher
If I replace the slashes with newlines and remove the countries at the end there are slightly more matches, around 29%:

$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
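The same clean-up idea can be sketched in Python. This is only a rough approximation of the sed command above, not an exact port: it splits on every slash rather than the first two, and it strips only a single trailing parenthesised suffix from each piece. The function name and the example value are hypothetical.

```python
import re

# A trailing parenthesised suffix such as " (Kenya)", with optional whitespace.
TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def split_institution_name(raw: str) -> list[str]:
    """Split a multi-language name on slashes and drop a trailing "(Country)"."""
    variants = []
    for part in raw.split("/"):
        cleaned = TRAILING_PARENS.sub("", part).strip()
        if cleaned:
            variants.append(cleaned)
    return variants


if __name__ == "__main__":
    # Hypothetical value for illustration, not taken from the CLARISA list.
    for name in split_institution_name(
        "Universidad de Ejemplo / Example University (Peru)"
    ):
        print(name)
```

Each returned variant would go on its own line of the input file for the lookup, which is what the newline substitution in the sed command achieves.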

I checked CGSpace's top 1,000 institutions too, first exporting from PostgreSQL:

localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;

Then cutting (tab is the default delimiter):

$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542

So that's a 54% match for our top institutions
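The match rate can also be computed directly from the lookup output instead of counting lines. This is a minimal sketch, assuming the CSV has a `matched` column whose value is the string `true` for matches (as the csvgrep filter above suggests). The `organization` column in the sample data is hypothetical. Note that csvgrep prints the header row, so piping its output to `wc -l` gives a count one higher than the number of matching rows.

```python
import csv
import io


def match_rate(csv_file) -> tuple[int, int, float]:
    """Return (matched rows, total rows, percentage matched) for a lookup CSV."""
    matched = 0
    total = 0
    # DictReader consumes the header row, so it is never counted as data.
    for row in csv.DictReader(csv_file):
        total += 1
        # Assumption: matches are recorded as the literal string "true".
        if row["matched"].strip().lower() == "true":
            matched += 1
    percent = 100.0 * matched / total if total else 0.0
    return matched, total, percent


if __name__ == "__main__":
    # Hypothetical sample data; the real file would be opened with open(path, newline="").
    sample = io.StringIO(
        "organization,matched\n"
        "Example University,true\n"
        "Example Institute,false\n"
    )
    matched, total, percent = match_rate(sample)
    print(f"{matched}/{total} matched ({percent:.1f}%)")
```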
