cgspace-notes/content/posts/2024-05.md

198 lines
8.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "May, 2024"
date: 2024-05-01T10:39:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-05-01
- I dumped all the CGSpace DOIs and resolved them with my `crossref_doi_lookup.py` script
- Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, and types, etc
<!--more-->
## 2024-05-05
- Spend some time looking at duplicate DOIs again...
## 2024-05-06
- Spend some time looking at duplicate DOIs again...
## 2024-05-07
- Discuss RSS feeds and OpenSearch with IWMI
- It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch
- I saw a patch for an interesting issue on DSpace GitHub: [Error submitting or deleting items - URI too long when user is in a large number of groups](https://github.com/DSpace/DSpace/issues/9544)
- I hadn't realized it, but we have lots of those errors:
```console
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
```
- Spend some time looking at duplicate DOIs again...
## 2024-05-08
- Spend some time looking at duplicate DOIs again...
- I finally finished looking at the duplicate DOIs for journal articles
- I updated the list of handle redirects and there are 386 of them!
## 2024-05-09
- Spend some time working on the IFPRI 20202021 batch
- I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
## 2024-05-12
- I couldn't figure out how to do a complex join on withdrawn items along with their metadata, so I pull out a few like titles, handles, and provenance separately:
```psql
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
```
- Then joined them:
```console
$ csvjoin -c uuid /tmp/withdrawn-title.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
```
- This gives me an insight into who submitted at 334 of the duplicates over the past few years...
- I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ff, fi, fl, ffi, and ffl
## 2024-05-13
- Export a list of IFPRI information products with handle links and CONTENTdm links:
```
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
2645
```
- I discovered the `/server/api/pid/find` endpoint today, which is much more direct and manageable than the `/server/api/discover/search/objects?query=` endpoint when trying to get metadata for a Handle (item, collection, or community)
- The "pid" stands for permanent identifiers apparently, and we can use it like this:
```
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
## 2024-05-15
- I got journal titles for 2,900 journal articles that were missing them from Crossref
## 2024-05-16
Helping IFPRI with some DSpace 7 API support, these are two queries for items issued in 2024:
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D — note the Lucene search syntax is URL encoded version of `:[2024-01-01T00:00:00Z TO *]`
Both of them return the same number of results and seem identitical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
## 2024-05-20
Continue working through alternative duplicate matching for IFPRI
- Their item types are sometimes different than ours...
- One thing I think I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates with such similarity so I might increase this to 0.7 to reduce the number of items I have to check
- Also, the difference in issue dates is currently 365, but I should reduce that a bit, perhaps to 270 days (9 months)
## 2024-05-22
- Finalize and upload the IFPRI 20202021 batch set
- I used a new technique to get missing licenses via Crossref (it's Python 2 because of OpenRefine's Jython):
```python
import urllib2
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi
useragent = "Python (mailto:a.o@cgiar.org)"
request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)
return get.read().decode('utf-8')
```
## 2024-05-23
- Finalize last of the duplicates I found for the IFPRI 20202021 batch set (those that we missed initially due to mismatched types)
- Export a new list of IFPRI redirects from CONTENTdm:
```console
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
| csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
| tee /tmp/ifpri-redirects.csv \
| csvstat --count
4004
```
I found a way to get abstracts from PLOS
- They offer an API that returns XML including the JATS-formatted abstracts
- I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
```console
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
```
Then used `value.parseXml()` on the resulting text to extract the abstract's text:
```console
value.parseXml().select("abstract")[0].xmlText()
```
This doesn't preserve `<p>` tags though...
- Oh, nice, this does!
```console
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
```
For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines...
- Ah, some articles have multiple abstracts, for example: https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript
- I need to select the abstract that does **not** have any attributes (using [Jsoup selector syntax](https://jsoup.org/apidocs/org/jsoup/select/Selector.html))
```console
forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
```
Testing `xsv` (Rust) versus `csvkit` (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:
```console
$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
27339
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
xsv count 0.01s user 0.00s system 9% cpu 0.090 total
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
27339
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
```
## 2024-05-27
- Working on IFPRI datasets batch migration
- 732 items total
- 6 duplicates on CGSpace
- 6 duplicates within set that need investigation
## 2024-05-28
- I'm thinking of increasing the frequency of thumbnail generation on CGSpace
- Currently the `dspace filter-media` script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items...
- I think I will make the thumbnailer run explicitly more often using `-p "ImageMagick PDF Thumbnail"`
<!-- vim: set sw=2 ts=2: -->