mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-10-06 09:04:17 +02:00
113 lines
3.9 KiB
Markdown
---
title: "April, 2024"
date: 2024-04-04T10:23:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2024-04-04

- Work more on CGSpace duplicate DOIs

<!--more-->

## 2024-04-08

- Start working on IFPRI's 2022 batch import
  - I ran the duplicate checker against CGSpace and started downloading all linked PDFs

## 2024-04-09

- Continue working on IFPRI's 2022 batch import
  - I started validating the potential duplicates in OpenRefine

## 2024-04-12

- Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then upload them
  - I need to merge the metadata for the remaining 212 that are already on CGSpace
- Spend some time looking at duplicate DOIs again...

## 2024-04-13

- Spend some time looking at duplicate DOIs again...

## 2024-04-14

- Spend some time looking at duplicate DOIs again...

## 2024-04-15

- Spend some time looking at duplicate DOIs again...
- Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418
- Tony noticed that the DSpace 7 REST API is very slow when using embeds, so I profiled it a bit and found that the same search takes about ten times longer with `embed` (47.5 vs 4.8 seconds):

```
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
```

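To repeat this comparison from a script instead of by hand, something like the following might work. It is only a sketch: `build_search_url` and `time_request` are hypothetical helper names, not part of DSpace, and the parameters are copied from the two curl commands in these notes. The `fetch` argument exists so the timing logic can be exercised without hitting the live server.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl commands in these notes
BASE = "https://cgspace.cgiar.org/server/api/discover/search/objects"


def build_search_url(embed=None):
    """Rebuild the search URL from the notes, optionally adding an embed parameter."""
    params = [
        ("query", "cg.identifier.project:IFPRI*"),
        ("scope", "8f1e9650-fe87-4e6e-889a-1cacfb747408"),
        ("page", "0"),
        ("size", "100"),
    ]
    if embed:
        params.append(("embed", embed))
    params.append(("sort", "dcterms.issued,desc"))
    # Keep "*", "," and "/" unescaped so the URL matches the curl version
    return BASE + "?" + urlencode(params, safe="*,/")


def time_request(url, fetch=None):
    """Return the wall-clock seconds needed to fetch the URL and read the body."""
    if fetch is None:
        # Default: a real HTTP GET (assumption: no authentication needed)
        fetch = lambda u: urlopen(u).read()
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start
```

Calling `time_request(build_search_url("thumbnail,bundles/bitstreams"))` and then `time_request(build_search_url())` should give numbers comparable to the `time curl` output, though the absolute values will depend on server load and caching.
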
- Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
  - I merged their metadata with the existing items
  - There are still six remaining items that I identified as being duplicates (three pairs) in the IFPRI set itself

## 2024-04-16

- Spend some time looking at duplicate DOIs again...
- Assist Deborah with an advanced query on CGSpace for biodiversity and health:

```
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
```

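If similar queries come up for other topics, the string could be generated rather than typed by hand. This is a rough sketch: `build_query` is a hypothetical helper, and its defaults simply mirror the fields, year range, and type used in the query above.

```python
def build_query(terms, fields=("dc.title", "dcterms.subject"),
                years=(2010, 2024), item_type="Journal Article"):
    """Build an advanced search query matching any term in any of the fields."""
    # One quoted clause per (term, field) pair, grouped by term
    clauses = ['{}:"{}"'.format(field, term) for term in terms for field in fields]
    return (
        "dcterms.issued:[{} TO {}]".format(years[0], years[1])
        + ' AND dcterms.type:"{}"'.format(item_type)
        + " AND (" + " OR ".join(clauses) + ")"
    )
```

With `build_query(["biodiversity", "health"])` this reproduces the query above. Note that the terms are joined with `OR`, so an item matching either term is returned.
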
- Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since they are a bit tacky
  - I used this Jython expression in OpenRefine with [Crossref's content negotiation](https://citation.crosscite.org/docs.html) to get citations for all DOIs:

```python
import urllib2

# Build the Crossref request URL from this row's DOI
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
# Identify ourselves to Crossref in the User-Agent header
useragent = "Python (mailto:a.o@cgiar.org)"

# Jython is Python 2, so encode the URL to a byte string first
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)

# OpenRefine expects the expression to return a value; decode it to Unicode
return get.read().decode('utf-8')
```

- It took ten minutes or so to finish, and it worked well (note that Jython in OpenRefine is Python 2, so I had to be careful with Unicode)

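For reference, the same lookup outside OpenRefine could look roughly like this in Python 3, where the Unicode handling is simpler. This is a sketch under assumptions: `citation_request` and `fetch_citation` are hypothetical names, the DOI is assumed to be in bare form such as `10.1000/182`, and the contact address in the User-Agent is a placeholder. The endpoint is the one used in the Jython expression above.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Placeholder contact address; replace with a real one before use
USER_AGENT = "Python (mailto:you@example.org)"


def citation_request(doi):
    """Build the request for a formatted citation of a bare DOI."""
    # Percent-encode unusual characters but keep the slash in the DOI
    url = ("https://api.crossref.org/works/" + quote(doi, safe="/")
           + "/transform/text/x-bibliography")
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_citation(doi, opener=urlopen):
    """Fetch the citation text; opener can be swapped out for testing."""
    with opener(citation_request(doi)) as response:
        # Strip surrounding whitespace, such as a trailing newline
        return response.read().decode("utf-8").strip()
```

The `opener` parameter is there so the function can be tried with a stub before sending hundreds of real requests.
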
## 2024-04-18

- Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:

```sql
SELECT m.text_value, h.handle FROM metadatavalue m JOIN handle h ON m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=28 AND m.text_value LIKE 'Original URL%' AND h.resource_type_id=2;
```

- Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for `dcterms.replaces`:

```sql
SELECT m.text_value AS handle_from, h.handle AS handle_to FROM metadatavalue m JOIN handle h ON m.dspace_object_id = h.resource_id WHERE m.metadata_field_id=181 AND h.resource_type_id=2;
```

- Then I can work that list into an nginx map with a redirect, for example:

```nginx
server {
    ...

    if ($new_uri) {
        return 301 $new_uri;
    }
}

# The map must be defined in the http context, outside any server block
map $request_uri $new_uri {
    /handle/10568/112821 /handle/10568/97605;
}
```

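Turning the `handle_from` and `handle_to` rows from the `dcterms.replaces` query into map entries could be scripted along these lines. This is a sketch, not a tested workflow: `bare_handle` and `nginx_map` are hypothetical helpers, and it assumes the `dcterms.replaces` values are either bare Handles like `10568/112821` or `hdl.handle.net` URLs, which I have not verified against the data.

```python
def bare_handle(value):
    """Strip whitespace and a hdl.handle.net URL prefix if present (assumed formats)."""
    value = value.strip()
    for prefix in ("https://hdl.handle.net/", "http://hdl.handle.net/"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def nginx_map(rows, path_prefix="/handle/"):
    """Render (handle_from, handle_to) rows as an nginx map block."""
    lines = ["map $request_uri $new_uri {"]
    for handle_from, handle_to in rows:
        lines.append("    {0}{1} {0}{2};".format(
            path_prefix, bare_handle(handle_from), bare_handle(handle_to)))
    lines.append("}")
    return "\n".join(lines)
```

Feeding it the single pair from the example, `nginx_map([("10568/112821", "10568/97605")])`, yields the same `map` block as above. The rows themselves would come from exporting the SQL result, for example as CSV.
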
<!-- vim: set sw=2 ts=2: -->