mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-06-26 16:13:48 +02:00
128 lines
6.5 KiB
Markdown
128 lines
6.5 KiB
Markdown
---
|
|
title: "December, 2022"
|
|
date: 2022-12-01T08:52:36+03:00
|
|
author: "Alan Orth"
|
|
categories: ["Notes"]
|
|
---
|
|
|
|
## 2022-12-01
|
|
|
|
- Fix some incorrect regions on CGSpace
|
|
- I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
|
|
- Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
|
|
- Replace "East Asia" with "Eastern Asia" region on CGSpace (UN M.49 region)
|
|
|
|
<!--more-->
|
|
|
|
- CGSpace and PRMS information session with Enrico and a bunch of researchers
|
|
- I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance
|
|
- I startd a harvest on AReS since we've updated so much metadata recently
|
|
|
|
## 2022-12-02
|
|
|
|
- File some issues related to metadata on the MEL issue tracker
|
|
- [Only use "Open Access" or "Limited Access" access rights when publishing items on CGSpace](https://github.com/CodeObia/MEL/issues/11066)
|
|
- [Set the description when submitting bitstreams to CGSpace](https://github.com/CodeObia/MEL/issues/11067)
|
|
- [Some items have a Creative Commons license, but are Limited Access and bitstreams are locked](https://github.com/CodeObia/MEL/issues/11068)
|
|
|
|
## 2022-12-03
|
|
|
|
- I downloaded a fresh copy of CLARISA's institutions list as well as ROR's latest dump from 2022-12-01 to check how many are matching:
|
|
|
|
```console
|
|
$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp > ~/Downloads/2022-12-03-CLARISA-institutions.json
|
|
$ jq -r '.[] | .name' ~/Downloads/2022-12-03-CLARISA-institutions.json > ~/Downloads/2022-12-03-CLARISA-institutions.txt
|
|
$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
|
|
$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
|
|
1864
|
|
$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
|
|
7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
|
|
```
|
|
|
|
- Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher
|
|
- If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:
|
|
|
|
```console
|
|
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
|
|
```
|
|
|
|
- I checked CGSpace's top 1,000 affiliations too, first exporting from PostgreSQL:
|
|
|
|
```console
|
|
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
|
|
```
|
|
|
|
- Then cutting (tab is the default delimeter):
|
|
|
|
```console
|
|
$ cut -f 1 /tmp/2022-11-22-affiliations.csv > 2022-11-22-affiliations.txt
|
|
$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
|
|
$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
|
|
542
|
|
```
|
|
|
|
- So that's a 54% match for our top affiliations
|
|
- I realized we should actually check affiliations and sponsors, since those are stored in separate fields
|
|
- When I add those the matches go down a bit to 45%
|
|
- Oh man, I realized institutions like `Université d'Abomey Calavi` don't match in ROR because they are like this in the JSON:
|
|
|
|
```console
|
|
"name": "Universit\u00e9 d'Abomey-Calavi"
|
|
```
|
|
|
|
- So we likely match a bunch more than 50%...
|
|
- I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections
|
|
|
|
## 2022-12-05
|
|
|
|
- First day of PRMS technical workshop in Rome
|
|
- Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn't completed after twenty-four hours so I canceled it
|
|
- Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
|
|
- I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
|
|
- It completed successfully...
|
|
|
|
## 2022-12-07
|
|
|
|
- I found a bug in my csv-metadata-quality script regarding the regions
|
|
- I was accidentally checking `cg.coverage.subregion` due to a sloppy regex
|
|
- This means I've added a few thousand UN M.49 regions to the `cg.coverage.subregion` field in the last few days
|
|
- I had to extract them from CGSpace and delete them using `delete-metadata-values.py`
|
|
- My [DSpace 7.x pull request to tell ImageMagick about the PDF CropBox](https://github.com/DSpace/DSpace/pull/8550) was merged
|
|
- Start a harvest on AReS
|
|
|
|
## 2022-12-08
|
|
|
|
- While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
|
|
- I couldn't remember the XPath syntax so this was kinda ghetto:
|
|
|
|
```console
|
|
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE 'label=".*"' | sed -e 's/label="//' -e 's/"$//' > /tmp/orcid-names.txt
|
|
$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p 'fuuu' -m 247
|
|
```
|
|
|
|
- After that there were still some poorly formatted ones that my script didn't fix, so perhaps these are new ones not in our list
|
|
- I dumped them and combined with the existing ones to resolve later:
|
|
|
|
```console
|
|
localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE '%http%') to /tmp/orcid-formatting.txt;
|
|
COPY 36
|
|
```
|
|
|
|
- I think there are really just some new ones...
|
|
|
|
```console
|
|
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2022-12-08-orcids.txt
|
|
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u | wc -l
|
|
1907
|
|
$ wc -l /tmp/2022-12-08-orcids.txt
|
|
1939 /tmp/2022-12-08-orcids.txt
|
|
```
|
|
|
|
- Then I applied these updates on CGSpace
|
|
- Maria mentioned that she was getting a lot more items in her daily subscription emails
|
|
- I had a hunch it was related to me updating the `last_modified` timestamp after updating a bunch of countries, regions, etc in items
|
|
- Then today I noticed this option in `dspace.cfg`: `eperson.subscription.onlynew`
|
|
- By default DSpace sends notifications for modified items too! I've disabled it now...
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|