- I tried a new technique to get some affiliations from Crossref using OpenRefine
- First I split them and clustered, resolving a few hundred clusters out of 1500 (!)
- Then I used a custom text facet with a few dozen CGIAR and other large affiliations to reduce the work
- Then I joined them with our affiliations, paying no attention to duplicates
- Then I deduped them using the Jython technique I learned in 2023-02
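- A minimal sketch of this kind of dedupe, written as plain Python so it can be tried outside OpenRefine. The `||` separator, the sample values, and the function name `dedupe_multivalue` are assumptions for illustration, not necessarily what was used here:

```python
def dedupe_multivalue(value, separator="||"):
    """Remove repeated entries from a multi-value string, keeping first-seen order."""
    seen = set()
    unique = []
    for part in value.split(separator):
        part = part.strip()
        # Skip empty parts and anything we have already kept
        if part and part not in seen:
            seen.add(part)
            unique.append(part)
    return separator.join(unique)


# Hypothetical cell value with one repeated affiliation
print(dedupe_multivalue("CIAT||IFPRI||CIAT||ILRI"))
```

- In an OpenRefine Jython transform the body of such a function would operate on `value` directly and end with a `return`.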
## 2024-03-06
- Peter sent me some more corrections for the authors that I had sent him in 2023-12
## 2024-03-08
- IFPRI sent me their 2023 records from CONTENTdm so I started working on those
- I found a way to match their ORCID identifiers in our list using Jython in OpenRefine:
```python
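import re

# Load our list of ORCID identifiers, one entry per line
with open(r"/tmp/cg-creator-identifier.txt", "r") as f:
    orcid_ids = [orcid_id.strip() for orcid_id in f]

matched = False

# Find the entry that looks like "<anything>: <this cell's value>"
for orcid_id in orcid_ids:
    if re.search(r".+: {}".format(value), orcid_id):
        matched = True
        break

# OpenRefine wraps the expression in a function, so a top-level return works
if matched:
    return orcid_id
else:
    return value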
```
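- For testing the same lookup outside OpenRefine, a standalone sketch might look like the following. The function name `match_orcid` and the sample entries are hypothetical, and the `Name: ORCID` line format is an assumption inferred from the regex above:

```python
import re


def match_orcid(value, orcid_lines):
    """Return the first line matching '<anything>: <value>', else value unchanged."""
    # re.escape guards against regex metacharacters in the cell value
    pattern = re.compile(r".+: {}".format(re.escape(value)))
    for line in orcid_lines:
        if pattern.search(line):
            return line
    return value


# Hypothetical entries, not real people or identifiers
sample = [
    "Jane Doe: 0000-0000-0000-0001",
    "John Roe: 0000-0000-0000-0002",
]

print(match_orcid("0000-0000-0000-0002", sample))
print(match_orcid("0000-0000-0000-0009", sample))
```

- Returning from inside the loop removes the need for the `matched` flag, and unmatched values pass through unchanged so they can be reviewed later.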
- I realized that [UNICEF was renamed to its current name in 1953](https://www.unicef.org/about-unicef/frequently-asked-questions#3) so I replaced all other variations in our vocabularies and metadata:
```sql
UPDATE metadatavalue SET text_value='United Nations Children''s Fund' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value IN ('United Nations International Children''s Emergency Fund', 'United Nations International Children''s Emergency Fund', 'UNICEF');
```
- Note the use of two single quotes to escape the one in the name
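- A small sketch of the quoting rule, using Python's built-in `sqlite3` as a stand-in for PostgreSQL (doubling a single quote inside a string literal is standard SQL). The one-column table here is a toy, not the real DSpace schema, and the bound-parameter variant is shown only as an alternative that avoids manual escaping:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?)",
    [
        ("UNICEF",),
        ("United Nations International Children's Emergency Fund",),
        ("FAO",),
    ],
)

# Option 1: a string literal with the single quote doubled
conn.execute(
    "UPDATE metadatavalue SET text_value='United Nations Children''s Fund' "
    "WHERE text_value = 'UNICEF'"
)

# Option 2: bound parameters, so no manual escaping is needed
conn.execute(
    "UPDATE metadatavalue SET text_value = ? WHERE text_value = ?",
    (
        "United Nations Children's Fund",
        "United Nations International Children's Emergency Fund",
    ),
)

values = sorted(row[0] for row in conn.execute("SELECT text_value FROM metadatavalue"))
print(values)
```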
- I deployed the change to disable Angular SSR's `inlineCriticalCss` on production because we had heavy load on the frontend and I've been meaning to do this permanently for some time
- Maria asked me for a CSV with all the broken Bioversity permalinks so I exported them for her:
- I ran the duplicate checker on the IFPRI 2023 batch upload
## 2024-03-13
- I found about 428 duplicates in the IFPRI 2023 batch records
- Alarmingly, I found about 18 that are duplicated on CGSpace as well!
- I looked closer and decided that 11 were duplicates, so I merged the metadata and withdrew the later ones
- Alliance asked me to get them the Handles for items submitted by TIP that are not discoverable
- I found it easiest to use the `ds6_item2itemhandle` [DSpace SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) with a nested query on the provenance:
```sql
SELECT ds6_item2itemhandle(dspace_object_id) AS handle FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item WHERE NOT discoverable) AND metadata_field_id=28 AND text_value LIKE 'Submitted by Alliance TIP Submit%';
```
- IWMI sent me some new author ORCID identifiers so I updated our list
- Started working on updating my data for the Ontology CoP webinar on CGIAR and AGROVOC
- First extracting all unique subjects on CGSpace:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2024-03-19-cgspace-subjects.csv WITH CSV HEADER;
COPY 28024
```
- Then I extracted the subjects and looked them up against AGROVOC:
```console
$ csvcut -c subject /tmp/2024-03-19-cgspace-subjects.csv | sed '1d' > /tmp/2024-03-19-cgspace-subjects.txt