---
title: "December, 2022"
date: 2022-12-01T08:52:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-12-01

- Fix some incorrect regions on CGSpace
  - I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
- Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
- Replace "East Asia" with "Eastern Asia" region on CGSpace (UN M.49 region)
- CGSpace and PRMS information session with Enrico and a bunch of researchers
- I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
- I started a harvest on AReS since we've updated so much metadata recently

## 2022-12-02

- File some issues related to metadata on the MEL issue tracker
  - [Only use "Open Access" or "Limited Access" access rights when publishing items on CGSpace](https://github.com/CodeObia/MEL/issues/11066)
  - [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
  - [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)

## 2022-12-03

- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:

```console
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
1864
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
```

- Out of the box they match 26.4%, but many institutions have names in multiple languages in the text value, as well as countries in parentheses, so I think the real rate could be higher
- If I split the names on the slashes and remove the countries at the end there are slightly more matches, around 29%:

```console
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```

- I checked CGSpace's top 1,000 institutions too, first exporting from PostgreSQL:

```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
```

- Then cutting out the first column (tab is the default delimiter):

```console
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```

- So that's a 54% match for our top institutions
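
- The name cleanup and the match-rate arithmetic above can be sketched in Python. This is only an illustration, not part of `ror-lookup.py`: the function names and the sample institution name are hypothetical. Unlike the `sed` command, which replaces only the first ` / ` and the first `/` on each line, this version splits on every slash. It also assumes the `csvgrep` output includes one header row, so each `wc -l` count would be one higher than the number of matches (this does not change the rounded percentages):

```python
import re


def split_institution_name(raw: str) -> list[str]:
    """Split a multilingual name on slashes and drop a trailing "(Country)".

    Hypothetical helper; roughly mirrors the sed cleanup, but splits on
    every slash instead of only the first one or two.
    """
    # Remove a trailing parenthetical, like sed's 's/ \?(.*)$//'
    name = re.sub(r" ?\(.*\)$", "", raw.strip())
    # Split on "/" with optional surrounding whitespace
    parts = re.split(r"\s*/\s*", name)
    # Discard empty pieces left by stray slashes
    return [part for part in parts if part]


def match_rate(wc_lines: int, total: int, header_rows: int = 1) -> float:
    """Percentage of matched names, given a `wc -l` count of the csvgrep output.

    Assumes the counted output contains `header_rows` CSV header lines.
    """
    matched = wc_lines - header_rows
    return round(100 * matched / total, 1)


if __name__ == "__main__":
    # Made-up example name, not taken from the CLARISA list
    print(split_institution_name("Institut Exemple / Example Institute (Kenya)"))
    # Counts taken from the console output above
    print(match_rate(1864, 7060))
    print(match_rate(542, 1000))
```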