Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2025-01-27 05:49:12 +01:00)
Add notes for 2022-12-07
@@ -46,7 +46,7 @@ $ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
$ sed -e 's_ / _\n_' -e 's_/_\n_' -e 's/ \?(.*)$//' ~/Downloads/2022-12-03-CLARISA-institutions.txt > ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
```
- I checked CGSpace's top 1,000 institutions too, first exporting from PostgreSQL:
- I checked CGSpace's top 1,000 affiliations too, first exporting from PostgreSQL:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
@@ -61,6 +61,32 @@ $ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
542
```
- So that's a 54% match for our top institutions
- So that's a 54% match for our top affiliations
- I realized we should actually check affiliations and sponsors, since those are stored in separate fields
- When I add those the matches go down a bit to 45%
- Oh man, I realized institutions like `Université d'Abomey Calavi` don't match in ROR because their names look like this in the ROR JSON:
```console
"name": "Universit\u00e9 d'Abomey-Calavi"
```
- So we likely match a bunch more than 50%...
- I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections
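
- A minimal sketch of how such names could be compared more leniently, assuming the ROR data is parsed with a JSON library first. Decoding turns the `\u00e9` escape back into `é`, so in this example the remaining difference is the hyphen in `Abomey-Calavi`. The `normalize_name` helper is hypothetical and is not part of any existing script mentioned in these notes:

```python
import json
import unicodedata


def normalize_name(name: str) -> str:
    """Return a comparison key: lowercase, accents stripped, punctuation as spaces."""
    # Decompose accented characters and drop the combining marks (é -> e)
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Treat hyphens, apostrophes, and other punctuation as spaces
    cleaned = "".join(c if c.isalnum() else " " for c in without_marks)
    # Lowercase and collapse runs of whitespace
    return " ".join(cleaned.lower().split())


# json.loads() decodes the \u00e9 escape into a real "é"
ror_record = json.loads(r'''{"name": "Universit\u00e9 d'Abomey-Calavi"}''')

cgspace_name = "Université d'Abomey Calavi"
names_match = normalize_name(cgspace_name) == normalize_name(ror_record["name"])
```

- Stripping accents and punctuation this aggressively could also produce false positives, so matches found this way would still need a manual check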
## 2022-12-05
- First day of PRMS technical workshop in Rome
- Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn't completed after twenty-four hours, so I canceled it
- Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again
- I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2
- It completed successfully...
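
- Importing a subset at a time could be sketched like this, assuming the changes live in a single CSV with a header row. The `split_rows` and `write_batch` helpers, the batch size, and the sample data are all hypothetical:

```python
import csv
import io


def split_rows(reader, batch_size):
    """Yield lists of at most batch_size rows from a csv.DictReader."""
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch


def write_batch(fieldnames, rows):
    """Serialize one batch as CSV text, repeating the header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data standing in for the real export
sample = "id,cg.coverage.region\n1,Africa\n2,Asia\n3,Europe\n"
reader = csv.DictReader(io.StringIO(sample))
fieldnames = reader.fieldnames
batches = [write_batch(fieldnames, rows) for rows in split_rows(reader, 2)]
```

- Each batch repeats the header, so every chunk can be submitted as a separate import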
## 2022-12-07
- I found a bug in my csv-metadata-quality script regarding the regions
- I was accidentally checking `cg.coverage.subregion` due to a sloppy regex
- This means I've added a few thousand UN M.49 regions to the `cg.coverage.subregion` field in the last few days
- I had to extract them from CGSpace and delete them using `delete-metadata-values.py`
- My [DSpace 7.x pull request to tell ImageMagick about the PDF CropBox](https://github.com/DSpace/DSpace/pull/8550) was merged
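
- To illustrate the kind of regex bug described above for the regions check: an unanchored pattern matches `cg.coverage.subregion` as well as `cg.coverage.region`. This is only a sketch; the patterns and column names are assumptions for illustration and are not the actual code in csv-metadata-quality:

```python
import re

# Hypothetical column names as they might appear in a CSV header
columns = [
    "cg.coverage.region",
    "cg.coverage.subregion",
    "cg.coverage.region[en_US]",
]

# Sloppy: search() finds "region" anywhere, so the subregion column is caught too
sloppy = re.compile(r"region")

# Stricter: anchor the full field name, allowing an optional language suffix
strict = re.compile(r"^cg\.coverage\.region(\[[A-Za-z_]*\])?$")

sloppy_matches = [c for c in columns if sloppy.search(c)]
strict_matches = [c for c in columns if strict.match(c)]
```

- With the anchored pattern only the two region columns are selected, and `cg.coverage.subregion` is left alone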
<!-- vim: set sw=2 ts=2: -->