cgspace-notes/content/posts/2024-04.md

170 lines
6.9 KiB
Markdown
Raw Normal View History

2024-04-04 09:23:49 +02:00
---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-04-04
2024-04-09 15:50:56 +02:00
- Work on CGSpace duplicate DOIs more
2024-04-04 09:23:49 +02:00
<!--more-->
2024-04-09 15:50:56 +02:00
## 2024-04-08
- Start working on IFPRI's 2022 batch import
- I ran the duplicate checker against CGSpace and started downloading all linked PDFs
## 2024-04-09
- Continue working on IFPRI's 2022 batch import
- I started validating the potential duplicates in OpenRefine
2024-04-12 19:40:52 +02:00
## 2024-04-12
- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
- I need to merge the metadata for the remaining 212 that are already on CGSpace
2024-04-16 08:35:30 +02:00
- Spend some time looking at duplicate DOIs again...
## 2024-04-13
- Spend some time looking at duplicate DOIs again...
## 2024-04-14
- Spend some time looking at duplicate DOIs again...
## 2024-04-15
- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:
```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
2024-04-18 08:38:02 +02:00
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
2024-04-16 08:35:30 +02:00
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
- I merged metadata with the existing items
- There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself
## 2024-04-16
- Spend some time looking at duplicate DOIs again...
2024-04-18 08:38:02 +02:00
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:
```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
- I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
- It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!
2024-04-12 19:40:52 +02:00
2024-04-18 16:00:25 +02:00
## 2024-04-18
- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:
```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```
- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:
```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h on m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```
- Then I can work that list into an nginx map with redirect, for example:
```console
server {
...
if ($new_uri) {
return 301 $new_uri;
}
}
map $request_uri $new_uri {
/handle/10568/112821 /handle/10568/97605;
}
```
2024-04-25 14:28:20 +02:00
## 2024-04-19
- Spend some time looking at duplicate DOIs again...
- Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary
## 2024-04-20
- I read an [interesting thread about DOI casing](https://github.com/greenelab/scihub/issues/9)
- Apparently the DOI specification says ASCII characters in DOIs are case insensitive
- Indeed, [Crossref recommends lower case](https://www.crossref.org/documentation/member-setup/constructing-your-dois/) for all DOIs
- I was curious about the DOIs in our database so I checked before and after lower casing:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
COPY 25675
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
COPY 25666
```
- I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata...
## 2024-04-23
- Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
- The workflow curation tasks are not documented very well but I got a basic configuration working
- I found a bug in DSpace curation tasks and discussed on Slack
- I finalized the `NormalizeDOIs` curation task and released v7.6.1.1 of the [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) project
## 2024-04-24
- A bit more testing of the curation tasks
- I tested a patch by Mark Wood
- I added support for normalizing DOIs to this same format to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) project
## 2024-04-25
- I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters
2024-04-27 10:22:58 +02:00
- Spend some time looking at duplicate DOIs again...
## 2024-04-26
- Spend some time looking at duplicate DOIs again...
2024-04-25 14:28:20 +02:00
2024-04-29 16:21:28 +02:00
## 2024-04-29
- Start working on the IFPRI 20202021 batch migration
- I modified my `check_duplicates.py` script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact
- I noticed something in the Tomcat log:
```console
tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Womens Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
```
- I found the bitstream's ID and then used the `ds6_bitstream2itemhandle` [SQL helper function](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the item's handle
- Then I replaced the curly quote with a regular quote in all bistreams
2024-04-04 09:23:49 +02:00
<!-- vim: set sw=2 ts=2: -->