CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

May, 2024

2024-05-01

  • I dumped all the CGSpace DOIs and resolved them with my crossref_doi_lookup.py script
    • Then I did some work to add missing abstracts (about 900!), volumes, issues, licenses, publishers, types, etc.
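    • For reference, the core of that DOI resolution step looks roughly like this (a minimal sketch, not the actual crossref_doi_lookup.py; it assumes the requests library and a hypothetical dois.txt input file):
import csv
import sys

import requests

# Read one DOI per line from a hypothetical dois.txt, look each one up on the
# Crossref REST API, and write a few fields of interest as CSV on stdout
writer = csv.writer(sys.stdout)
writer.writerow(['doi', 'type', 'publisher', 'volume', 'issue'])

with open('dois.txt') as f:
    for doi in (line.strip() for line in f if line.strip()):
        # Identify ourselves so Crossref routes the request to the polite pool
        headers = {'User-Agent': 'Python (mailto:a.o@cgiar.org)'}
        response = requests.get(f'https://api.crossref.org/works/{doi}', headers=headers)
        if response.status_code != 200:
            continue

        message = response.json()['message']
        writer.writerow([
            doi,
            message.get('type', ''),
            message.get('publisher', ''),
            message.get('volume', ''),
            message.get('issue', ''),
        ])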

2024-05-05

  • Spend some time looking at duplicate DOIs again…

2024-05-06

  • Spend some time looking at duplicate DOIs again…

2024-05-07

  • Check how many “URI Too Long” errors we had in April’s DSpace logs:
$ zstdgrep -a 'URI Too Long' log/dspace.log-2024-04-* | wc -l
1423
  • Spend some time looking at duplicate DOIs again…

2024-05-08

  • Spend some time looking at duplicate DOIs again…
    • I finally finished looking at the duplicate DOIs for journal articles
    • I updated the list of handle redirects and there are 386 of them!

2024-05-09

  • Spend some time working on the IFPRI 2020–2021 batch
    • I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date
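    • A stripped-down sketch of that exact-match pass (the export file and column names here are hypothetical; the real check works against the full CGSpace CSV):
import csv
from collections import defaultdict

# Group rows of a hypothetical export (ifpri.csv) by DOI, type, and issue date
# and report any group containing more than one item, i.e. an exact duplicate
seen = defaultdict(list)

with open('ifpri.csv', newline='') as f:
    for row in csv.DictReader(f):
        key = (row['cg.identifier.doi'].lower().strip(), row['dcterms.type'], row['dcterms.issued'])
        if key[0]:
            seen[key].append(row['id'])

for key, ids in seen.items():
    if len(ids) > 1:
        print(f'Possible exact duplicates {ids}: {key}')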

2024-05-12

  • I couldn’t figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields (titles, handles, and provenance) separately:
dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
  • Then joined them:
$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
  • This gives me an insight into who submitted 334 of the duplicates over the past few years…
  • I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ
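  • Most of those title fixes boil down to Unicode normalization plus whitespace cleanup, roughly like this (a sketch, not the exact transform I ran):
import re
import unicodedata

def clean_title(title):
    # NFKC normalization decomposes typographic ligatures such as ﬁ and ﬂ
    # into their plain letter sequences ("fi", "fl")
    title = unicodedata.normalize('NFKC', title)
    # Collapse newlines and runs of whitespace, then strip the ends
    return re.sub(r'\s+', ' ', title).strip()

print(clean_title('Efﬁciency of\nirrigation '))  # Efficiency of irrigation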

2024-05-13

  • Export a list of IFPRI information products with handle links and CONTENTdm links:
$ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
  | csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
  | tee /tmp/ifpri-redirects.csv \
  | csvstat --count
2645
  • I discovered the /server/api/pid/find endpoint today, which is much more direct and manageable than the /server/api/discover/search/objects?query= endpoint when trying to get metadata for a Handle (item, collection, or community)
    • The “pid” apparently stands for persistent identifier, and we can use it like this:
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
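    • For example, resolving a Handle to its metadata with requests (dcterms.issued is just an example field):
import requests

# Resolve a Handle to its DSpace object (item, collection, or community) via
# the /server/api/pid/find endpoint; requests follows the redirect to the
# object's canonical REST endpoint
response = requests.get(
    'https://dspace7test.ilri.org/server/api/pid/find',
    params={'id': '10568/118424'},
)
response.raise_for_status()

obj = response.json()
print(obj['type'], obj['name'])
print(obj['metadata'].get('dcterms.issued', []))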

2024-05-15

  • I used Crossref to get journal titles for 2,900 journal articles that were missing them

2024-05-16

Helping IFPRI with some DSpace 7 API support by comparing two queries for items issued in 2024

Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
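The queries themselves aren’t preserved in these notes, but the two styles look something like the following (illustrative only; the field names are assumptions and the scope parameter is omitted):

https://cgspace.cgiar.org/server/api/discover/search/objects?query=dcterms.issued:2024
https://cgspace.cgiar.org/server/api/discover/search/objects?query=dcterms.issued_dt:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]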

I wrote a new version of the check_duplicates.py script to help identify duplicates with different types

  • Initially I called it check_duplicates_fast.py but it’s actually not faster
  • I need to find a way to deal with duplicates from IFPRI’s repository because there are some mismatched types…

2024-05-20

Continue working through alternative duplicate matching for IFPRI

  • Their item types are sometimes different than ours…
  • One thing I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates at that level, so I might increase it to 0.7 to reduce the number of items I have to check
  • Also, the maximum difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months); there’s a sketch of this matching logic below
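  • With those adjusted thresholds, the core of the matching could look like this (using difflib for the title similarity; the real check_duplicates.py works differently, so this is only the idea):
from datetime import date
from difflib import SequenceMatcher

# Hypothetical thresholds matching the notes above: titles at least 70%
# similar and issue dates within 270 days of each other
SIMILARITY_THRESHOLD = 0.7
MAX_DATE_DIFFERENCE_DAYS = 270

def is_possible_duplicate(title_a, title_b, issued_a, issued_b):
    similarity = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    date_difference = abs((issued_a - issued_b).days)

    return similarity >= SIMILARITY_THRESHOLD and date_difference <= MAX_DATE_DIFFERENCE_DAYS

# Example: similar titles issued six months apart are flagged for review
print(is_possible_duplicate(
    'Climate change and food security in Africa',
    'Climate change and food security in sub-Saharan Africa',
    date(2021, 3, 1),
    date(2021, 9, 1),
))  # True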

2024-05-22

  • Finalize and upload the IFPRI 2020–2021 batch set
    • I used a new technique to get missing licenses via Crossref (it’s Python 2 because of OpenRefine’s Jython):
import urllib2

# Look up this row's DOI on the Crossref REST API and return the raw JSON
# response; the license is extracted from it in a later transform
doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi
# Identify ourselves so Crossref routes the request to the polite pool
useragent = "Python (mailto:a.o@cgiar.org)"

request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)

return get.read().decode('utf-8')
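    • Then, in a second OpenRefine transform on that new column, the license URL can be pulled out of the JSON, for example (the Crossref works response puts licenses in a list under message.license):
import json

# "value" holds the raw Crossref JSON fetched in the previous step; take the
# URL of the first license entry, if there is one
record = json.loads(value)
licenses = record['message'].get('license', [])

if licenses:
    return licenses[0]['URL']

return ''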

2024-05-23

  • Finalize last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)
  • Export a new list of IFPRI redirects from CONTENTdm:
$ csvgrep -c 'dc.description.provenance[en_US]' -r 'Original URLs? from IFPRI CONTENTdm' cgspace.csv \
  | csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
  | tee /tmp/ifpri-redirects.csv \
  | csvstat --count
4004

I found a way to get abstracts from PLOS

  • They offer an API that returns XML including the JATS-formatted abstracts
  • I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:
"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'

Then I used value.parseXml() on the resulting text to extract the abstract’s text:

value.parseXml().select("abstract")[0].xmlText()

This doesn’t preserve <p> tags though…

  • Oh, nice, this does!
forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")

For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines…

The same thing works with value.parseXml(), and the abstract:not([*]) selector restricts it to the top-level abstract, since the JATS section abstracts (like the author summary) carry attributes:

forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")

Testing xsv (Rust) versus csvkit (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:

$ time xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv | xsv select doi | xsv count
27339
xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv  0.06s user 0.03s system 98% cpu 0.091 total
xsv select doi  0.02s user 0.02s system 40% cpu 0.091 total
xsv count  0.01s user 0.00s system 9% cpu 0.090 total
$ time csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
27339
csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv  1.15s user 0.06s system 95% cpu 1.273 total
csvcut -c doi  0.42s user 0.05s system 36% cpu 1.283 total
csvstat --count  0.20s user 0.03s system 18% cpu 1.298 total

2024-05-27

  • Working on IFPRI datasets batch migration
    • 732 items total
    • 6 duplicates on CGSpace
    • 6 duplicates within the set that need investigation

2024-05-28

  • I’m thinking of increasing the frequency of thumbnail generation on CGSpace
    • Currently the dspace filter-media script runs once at 3AM for all media types and seems to take ~10 minutes to run for all 118,000 items…
    • I think I will make the thumbnailer run explicitly more often using -p "ImageMagick PDF Thumbnail"
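    • For example, something along these lines in the dspace user’s crontab, in addition to the nightly full run (the path and schedule are illustrative):
# Hypothetical: regenerate PDF thumbnails every four hours
0 */4 * * * /home/dspace/bin/dspace filter-media -p "ImageMagick PDF Thumbnail"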