---
title: "May, 2024"
date: 2024-05-01T10:39:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-05-01
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
<!--more-->
## 2024-05-05
- Spend some time looking at duplicate DOIs again...
## 2024-05-06
- Spend some time looking at duplicate DOIs again...
## 2024-05-07
- Discuss RSS feeds and OpenSearch with IWMI
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
- I hadn't realized it, but we have lots of those errors:
```console
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
```
- Spend some time looking at duplicate DOIs again...
## 2024-05-08
- Spend some time looking at duplicate DOIs again...
- I finally finished looking at the duplicate DOIs for journal articles
- I updated the list of handle redirects and there are 386 of them!
## 2024-05-09
- Spend some time working on the IFPRI 2020–2021 batch
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
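- A minimal sketch of that kind of exact-match check, assuming each record is a dict with hypothetical `id`, `doi`, `type`, and `issued` keys (the real column names in the batch will differ, and the sample records are made up):

```python
from collections import defaultdict


def find_exact_duplicates(records):
    """Group record ids that share the same DOI, type, and issue date."""
    groups = defaultdict(list)
    for record in records:
        # DOIs are case-insensitive, so compare them in lower case
        doi = record.get("doi", "").strip().lower()
        if not doi:
            # Records without a DOI cannot be matched this way
            continue
        key = (doi, record.get("type", "").strip(), record.get("issued", "").strip())
        groups[key].append(record["id"])
    # Only keys seen more than once are duplicates
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample records for illustration
records = [
    {"id": "a", "doi": "https://doi.org/10.1234/ABC", "type": "Journal Article", "issued": "2020-05"},
    {"id": "b", "doi": "https://doi.org/10.1234/abc", "type": "Journal Article", "issued": "2020-05"},
    {"id": "c", "doi": "https://doi.org/10.1234/abc", "type": "Report", "issued": "2020-05"},
    {"id": "d", "doi": "", "type": "Report", "issued": "2021-01"},
]
duplicates = find_exact_duplicates(records)
```

- Here only `a` and `b` are grouped, because `c` has a different type and `d` has no DOI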
## 2024-05-12
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields (handles, titles, and provenance) separately:
```psql
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
```
- Then joined them:
```console
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
```
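- For reference, a rough Python equivalent of that inner join on `uuid`, as a sketch only: it assumes at most one row per `uuid` in each file (the first row wins), and the in-memory sample data stands in for the real exports:

```python
import csv
import io


def index_by_uuid(handle):
    """Read a CSV file object into a dict keyed by uuid (first row wins)."""
    rows = {}
    for row in csv.DictReader(handle):
        rows.setdefault(row["uuid"], row)
    return rows


def inner_join(*handles):
    """Merge rows whose uuid appears in every input file."""
    tables = [index_by_uuid(handle) for handle in handles]
    joined = []
    for uuid, first in tables[0].items():
        if all(uuid in table for table in tables[1:]):
            merged = dict(first)
            for table in tables[1:]:
                merged.update(table[uuid])
            joined.append(merged)
    return joined


# Made-up sample data standing in for the three exported files
titles = io.StringIO("uuid,title\n1,First title\n2,Second title\n")
handles = io.StringIO("uuid,uri\n1,https://hdl.handle.net/10568/1\n2,https://hdl.handle.net/10568/2\n")
submitters = io.StringIO("uuid,submitted_by\n1,Submitted by Example User\n")

withdrawn = inner_join(titles, handles, submitters)
```

- Only the first item survives the join here, because the second has no "Submitted by" provenance row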
- This gives me insight into who submitted 334 of the duplicates over the past few years...
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ
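- A small sketch of that kind of title cleanup, assuming the ligatures are the Unicode presentation forms U+FB00 to U+FB04; note that it also collapses internal runs of whitespace, which goes slightly beyond trimming the ends:

```python
import re

# Unicode ligature characters and their plain-letter replacements
LIGATURE_TABLE = str.maketrans(
    {
        "\ufb00": "ff",
        "\ufb01": "fi",
        "\ufb02": "fl",
        "\ufb03": "ffi",
        "\ufb04": "ffl",
    }
)


def clean_title(title):
    """Replace ligatures, collapse whitespace and newlines, and trim the ends."""
    title = title.translate(LIGATURE_TABLE)
    return re.sub(r"\s+", " ", title).strip()


# Made-up title with leading/trailing spaces, a newline, and three ligatures
cleaned = clean_title("  Co\ufb00ee \npro\ufb01ts and e\ufb03ciency ")
```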
## 2024-05-13
- Export a list of IFPRI information products with handle links and CONTENTdm links:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
```
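- The same filter-then-cut step can be sketched with Python's `csv` module; the column names come from the command above, but the helper name and the sample rows are made up for illustration:

```python
import csv
import io

PROVENANCE = "dc.description.provenance[en_US]"
URI = "dc.identifier.uri[en_US]"


def rows_mentioning(handle, column, needle, keep):
    """Return the `keep` columns of rows whose `column` contains `needle`."""
    matches = []
    for row in csv.DictReader(handle):
        # Case-sensitive substring match on the chosen column
        if needle in (row.get(column) or ""):
            matches.append({name: row.get(name, "") for name in keep})
    return matches


# Made-up sample export standing in for cgspace.csv
sample = io.StringIO(
    f"id,{PROVENANCE},{URI},dc.title[en_US]\n"
    "1,Original URL from IFPRI CONTENTdm: (example),https://hdl.handle.net/10568/1,One\n"
    "2,Submitted by Example User,https://hdl.handle.net/10568/2,Two\n"
)
redirects = rows_mentioning(sample, PROVENANCE, "CONTENTdm", ["id", PROVENANCE, URI])
```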
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
- The "pid" apparently stands for persistent identifier, and we can use it like this:
```
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
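- A minimal sketch for building that lookup URL from a Handle; the helper name is hypothetical, and it only constructs the URL rather than making the request, since I have not documented here how the server responds:

```python
from urllib.parse import urlencode


def pid_find_url(server, handle):
    """Build the /api/pid/find lookup URL for a Handle like 10568/118424."""
    # Keep the slash in the Handle readable instead of encoding it as %2F
    query = urlencode({"id": handle}, safe="/")
    return f"{server.rstrip('/')}/api/pid/find?{query}"


url = pid_find_url("https://dspace7test.ilri.org/server", "10568/118424")
print(url)
```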
<!-- vim: set sw=2 ts=2: -->