CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2024

2024-04-04

  • Work on CGSpace duplicate DOIs more

2024-04-08

  • Start working on IFPRI’s 2022 batch import
    • I ran the duplicate checker against CGSpace and started downloading all linked PDFs

2024-04-09

  • Continue working on IFPRI’s 2022 batch import
    • I started validating the potential duplicates in OpenRefine

2024-04-12

  • Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then uploaded them
    • I need to merge the metadata for the remaining 212 that are already on CGSpace
  • Spend some time looking at duplicate DOIs again…

2024-04-13

  • Spend some time looking at duplicate DOIs again…

2024-04-14

  • Spend some time looking at duplicate DOIs again…

2024-04-15

$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null   0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null   0.01s user 0.01s system 0% cpu 4.764 total
  • Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
    • I merged metadata with the existing items
    • There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself

2024-04-16

  • Spend some time looking at duplicate DOIs again…
  • Assist Deborah with an advanced query on CGSpace for biodiversity and health:
dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
  • Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
import urllib2

doi = cells['cg.identifier.doi[en_US]'].value
url = "https://api.crossref.org/works/" + doi + "/transform/text/x-bibliography"
useragent = "Python (mailto:a.o@cgiar.org)"

request = urllib2.Request(url.encode("utf-8"), headers={"User-Agent" : useragent})
get = urllib2.urlopen(request)

return get.read().decode('utf-8')
  • It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!