cgspace-notes/content/posts/2024-04.md

170 lines
6.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-04-04
- Work on CGSpace duplicate DOIs more
<!--more-->
## 2024-04-08
- Start working on IFPRI's 2022 batch import
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
## 2024-04-09
- Continue working on IFPRI's 2022 batch import
- I started validating the potential duplicates in OpenRefine
## 2024-04-12
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
- I need to merge the metadata for the remaining 212 that are already on CGSpace
- Spend some time looking at duplicate DOIs again...
## 2024-04-13
- Spend some time looking at duplicate DOIs again...
## 2024-04-14
- Spend some time looking at duplicate DOIs again...
## 2024-04-15
- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
- I merged metadata with the existing items
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
## 2024-04-16
- Spend some time looking at duplicate DOIs again...
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
## 2024-04-18
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```
- Then I can work that list into an nginx map with redirect, for example:
```console
server {
...
if ($new_uri) {
return 301 $new_uri;
}
}
map $request_uri $new_uri {
/handle/10568/112821 /handle/10568/97605;
}
```
## 2024-04-19
- Spend some time looking at duplicate DOIs again...
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
## 2024-04-20
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
- I was curious about the DOIs in our database so I checked before and after lower casing:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
COPY 25675
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
COPY 25666
```
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
## 2024-04-23
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
- The workflow curation tasks are not documented very well but I got a basic configuration working
- I found a bug in DSpace curation tasks and discussed on Slack
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
## 2024-04-24
- A bit more testing of the curation tasks
- I tested a patch by Mark Wood
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
## 2024-04-25
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
- Spend some time looking at duplicate DOIs again...
## 2024-04-26
- Spend some time looking at duplicate DOIs again...
## 2024-04-29
- Start working on the IFPRI 20202021 batch migration
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
- I noticed something in the Tomcat log:
```console
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Womens Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
```
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
- Then I replaced the curly quote with a regular quote in all bistreams
<!-- vim: set sw=2 ts=2: -->