cgspace-notes/content/posts/2024-03.md

---
title: "March, 2024"
date: 2024-03-01T09:55:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2024-03-01

- Last week Bizu reported an issue with the "browse by issue date" drop down
  - I verified it, and suspect it could be due to missing issue dates...
  - It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808

<!--more-->

- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
  - I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846

## 2024-03-03

- I did some cleanups on abstracts, licenses, and dates from CrossRef
- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list

## 2024-03-05

- I tried a new technique to get some affiliations from Crossref using OpenRefine
  - First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
  - Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
  - Then I joined them with our affiliations, paying no attention to duplicates
  - Then I deduped them using the Jython technique I learned in 2023-02

## 2024-03-06

- Peter sent me some more corrections for the authors that I had sent him in 2023-12

## 2024-03-08

- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
  - I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:

```python
import re

with open(r"/tmp/cg-creator-identifier.txt",'r') as f :
    orcid_ids = [orcid_id.strip() for orcid_id in f]

matched = False
for orcid_id in orcid_ids:
    if re.search(r'.+: {}'.format(value), orcid_id):
        matched = True
        break

if matched:
    return orcid_id
else:
    return value
```


- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:

```sql
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
```

- Note the use of two single quotes to escape the one in the name

## 2024-03-11

- Experimenting with moving some of my Python scripts to the DSpace 7 REST API
  - I need a way to get UUIDs for Handles...
  - Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864
  - Then just take the first result...?
- I spent some time working on the script get abstracts from CGSpace, and found a bug in my logic
  - I also noticed that one item had two abstracts, but the first one was blank!
  - Looking deeper, I found 113 blank metadata values so I deleted those:

```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';
COMMIT;
```

- I also found a few dozen items with "N/A" for their citation, so I deleted those too:

```sql
BEGIN;
DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;
COMMIT;
```

- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:

```console
$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \
  | csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv
```

## 2024-03-12

- Run the duplicate checker for IFPRI 2023 batch upload

## 2024-03-13

- I found about 428 duplicates in the IFPRI 2023 batch records
  - Alarmingly, I found about 18 that are duplicated on CGSpace as well!
  - I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
- Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable
  - I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:

```sql
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
```

## 2024-03-14

- Looking in to reports of rate limiting of Altmetric's bot on CGSpace
  - I don't see any HTTP 429 responses for their user agents in any of our logs...
  - I tried myself on an item page and never hit a limit...

```console
$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
...
Request 60: 200
```

- All responses were HTTP 200...
- In any case, I whitelisted their production IPs and told them to try again
- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace
  - I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace
  - This was a bit of dirty work using csvkit, xsv, and OpenRefine

## 2024-03-17

- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace
  - These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration
  - I looked closer and whittled this down to 14 actual records, and spent some time working on them
  - I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links
  - Now there only remain two confusing records about the Inkomati catchment

## 2024-03-18

- Checking to see how many IFPRI records we have migrated so far:

```console
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \
  | csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \
  | tee /tmp/ifpri-records.csv \
  | csvstat --count
898
```

- I finalized the remaining two on Inkomati catchment and now we are at 900!

# 2024-03-19

- IWMI sent me some new author ORCID identifiers so I updated our list
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
  - First extracting all unique subjects on CGSpace:

```
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
COPY 28024
```

- Then I extracted the subjects and looked them up against AGROVOC:

```console
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt
$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv
```

## 2024-03-20

- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace

## 2024-03-21

- Look more closely at duplicates on CGSpace based on a fresh export
  - Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates
  - I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original

## 2024-03-22

- Look at duplicate DOIs on CGSpace and address a dozen or so

## 2024-03-23

- Look at duplicate DOIs on CGSpace and address a dozen or so
- Update Tomcat and Solr to latest versions
  - I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked

## 2024-03-24

- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...

## 2024-03-26

- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880
- Merge metadata and withdraw more duplicates on CGSpace

<!-- vim: set sw=2 ts=2: -->
Add notes for 2024-03 2024-03-04 08:02:14 +01:00			`---`
			`title: "March, 2024"`
			`date: 2024-03-01T09:55:00+03:00`
			`author: "Alan Orth"`
			`categories: ["Notes"]`
			`---`

			`## 2024-03-01`

			`- Last week Bizu reported an issue with the "browse by issue date" drop down`
			`- I verified it, and suspect it could be due to missing issue dates...`
			`- It might be this issue: https://github.com/DSpace/dspace-angular/issues/2808`

			`<!--more-->`

			- I spent some time trying to reproduce the bug affecting `onebox` fields that are configured to use external vocabularies and are not repeatable
			`- I filed an issue: https://github.com/DSpace/dspace-angular/issues/2846`

			`## 2024-03-03`

			`- I did some cleanups on abstracts, licenses, and dates from CrossRef`
			`- I also did some minor cleanups to affiliations because I saw some incorrect and duplicate ones in our list`

Add notes for 2024-03-08 2024-03-08 15:31:19 +01:00			`## 2024-03-05`

			`- I tried a new technique to get some affiliations from Crossref using OpenRefine`
			`- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)`
			`- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work`
			`- Then I joined them with our affiliations, paying no attention to duplicates`
			`- Then I deduped them using the Jython technique I learned in 2023-02`

			`## 2024-03-06`

			`- Peter sent me some more corrections for the authors that I had sent him in 2023-12`

			`## 2024-03-08`

			`- IFPRI sent me their 2023 records from CONTENTdm so I started working on those`
			`- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:`

			```python
			`import re`

			`with open(r"/tmp/cg-creator-identifier.txt",'r') as f :`
			`orcid_ids = [orcid_id.strip() for orcid_id in f]`

			`matched = False`
			`for orcid_id in orcid_ids:`
			`if re.search(r'.+: {}'.format(value), orcid_id):`
			`matched = True`
			`break`

			`if matched:`
			`return orcid_id`
			`else:`
			`return value`
			```


			`- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:`

			```sql
			`UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');`
			```

			`- Note the use of two single quotes to escape the one in the name`

Add notes for 2024-03-11 2024-03-11 16:04:40 +01:00			`## 2024-03-11`

			`- Experimenting with moving some of my Python scripts to the DSpace 7 REST API`
			`- I need a way to get UUIDs for Handles...`
			`- Seems that I can use a Discovery query like: https://dspace7test.ilri.org/server/api/discover/search/objects?dsoType=item&query=handle:10568/130864`
			`- Then just take the first result...?`
			`- I spent some time working on the script get abstracts from CGSpace, and found a bug in my logic`
			`- I also noticed that one item had two abstracts, but the first one was blank!`
			`- Looking deeper, I found 113 blank metadata values so I deleted those:`

			```sql
			`BEGIN;`
			`DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='';`
			`COMMIT;`
			```

			`- I also found a few dozen items with "N/A" for their citation, so I deleted those too:`

			```sql
			`BEGIN;`
			`DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value='N/A' AND metadata_field_id=146;`
			`COMMIT;`
			```

Update notes for 2024-03-11 2024-03-11 19:58:15 +01:00			- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
			`- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:`

			```console
			`$ csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],cg.link.permalink[en_US]' ~/Downloads/2024-03-05-cgspace.csv \`
			`\| csvgrep -c 'cg.link.permalink[en_US]' -r '^.+$' > /tmp/2024-03-11-Bioversity-Permalinks.csv`
			```

Add notes for 2024-03-13 2024-03-14 07:29:05 +01:00			`## 2024-03-12`

			`- Run the duplicate checker for IFPRI 2023 batch upload`

			`## 2024-03-13`

			`- I found about 428 duplicates in the IFPRI 2023 batch records`
			`- Alarmingly, I found about 18 that are duplicated on CGSpace as well!`
			`- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones`
			`- Alliance asked me to get him the Handles for items submitted by TIP that are not discoverable`
			- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:

			```sql
			`SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';`
			```

Add notes 2024-03-19 07:01:13 +01:00			`## 2024-03-14`

			`- Looking in to reports of rate limiting of Altmetric's bot on CGSpace`
			`- I don't see any HTTP 429 responses for their user agents in any of our logs...`
			`- I tried myself on an item page and never hit a limit...`

			```console
			`$ for num in {1..60}; do echo -n "Request ${num}: "; curl -s -o /dev/null -w "%{http_code}" https://dspace7test.ilri.org/items/c9b8999d-3001-42ba-a267-14f4bfa90b53 && echo; done`
			`Request 1: 200`
			`Request 2: 200`
			`Request 3: 200`
			`Request 4: 200`
			`...`
			`Request 60: 200`
			```

			`- All responses were HTTP 200...`
			`- In any case, I whitelisted their production IPs and told them to try again`
			`- I imported 468 of IFPRI's 2023 records that were confirmed to not be duplicates to CGSpace`
			`- I also spent some time merging metadata from 415 of the remaining 432 duplicates with the metadata for the existing items on CGSpace`
			`- This was a bit of dirty work using csvkit, xsv, and OpenRefine`

			`## 2024-03-17`

			`- There are 17 records from IFPRI's 2023 batch that are remaining from the 432 that I identified as already being on CGSpace`
			`- These are different in that they are duplicates on CGSpace as well, so the csvjoin failed and the metadata got messed up in my migration`
			`- I looked closer and whittled this down to 14 actual records, and spent some time working on them`
			`- I isolated 12 of these items that existed on CGSpace and added publication ranks, project identifiers, and provenance links`
			`- Now there only remain two confusing records about the Inkomati catchment`

			`## 2024-03-18`

			`- Checking to see how many IFPRI records we have migrated so far:`

			```console
			`$ csvgrep -c 'dc.description.provenance[en_US]' -m 'Original URL from IFPRI CONTENTdm' cgspace.csv \`
			`\| csvcut -c 'id,dc.title[en_US],dc.identifier.uri[en_US],dc.description.provenance[en_US],dcterms.type[en_US]' \`
			`\| tee /tmp/ifpri-records.csv \`
			`\| csvstat --count`
			`898`
			```

			`- I finalized the remaining two on Inkomati catchment and now we are at 900!`

Add notes for 2024-03-19 2024-03-19 14:24:20 +01:00			`# 2024-03-19`

			`- IWMI sent me some new author ORCID identifiers so I updated our list`
			`- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC`
			`- First extracting all unique subjects on CGSpace:`

			```
			`localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;`
			`COPY 28024`
			```

			`- Then I extracted the subjects and looked them up against AGROVOC:`

			```console
			`$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv \| sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt`
			`$ ./ilri/agrovoc_lookup.py -i /tmp/2024-03-19-cgspace-subjects.txt -o /tmp/2024-03-19-cgspace-subjects-results.csv`
			```

Add notes 2024-03-25 16:53:18 +01:00			`## 2024-03-20`

			`- Identify seven duplicates on CGSpace from the PRMS results and withdraw them from CGSpace`

			`## 2024-03-21`

			`- Look more closely at duplicates on CGSpace based on a fresh export`
			`- Using DOIs I found ~842 that occur more than once for journal articles alone, so probably around 400 duplicates`
			- I did a handful of them, merging the metadata and withdrawing the duplicate, and decided to add `dcterms.replaces` with the handle in the original

			`## 2024-03-22`

			`- Look at duplicate DOIs on CGSpace and address a dozen or so`

			`## 2024-03-23`

			`- Look at duplicate DOIs on CGSpace and address a dozen or so`
			`- Update Tomcat and Solr to latest versions`
			`- I had done some tests with these last week, and did a last minute test on DSpace 7 Test to make sure submission and searching worked`

			`## 2024-03-24`

			`- Slowly process several dozen more duplicate DOIs on CGSpace, sigh...`

Add notes 2024-04-04 09:23:49 +02:00			`## 2024-03-26`

			`- File an issue on dspace-angular about improving withdrawn item tombstones: https://github.com/DSpace/dspace-angular/issues/2880`
			`- Merge metadata and withdraw more duplicates on CGSpace`

Add notes for 2024-03 2024-03-04 08:02:14 +01:00			`<!-- vim: set sw=2 ts=2: -->`