<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "April, 2024" / >
< meta property = "og:description" content = "2024-04-04
Work on CGSpace duplicate DOIs more
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2024-04/" / >
< meta property = "article:published_time" content = "2024-04-04T10:23:00+03:00" / >
< meta property = "article:modified_time" content = "2024-04-18T09:38:02+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "April, 2024" / >
< meta name = "twitter:description" content = "2024-04-04
Work on CGSpace duplicate DOIs more
"/>
< meta name = "generator" content = "Hugo 0.125.0" >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "April, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-04/",
"wordCount": "456",
"datePublished": "2024-04-04T10:23:00+03:00",
"dateModified": "2024-04-18T09:38:02+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2024-04/" >
< title > April, 2024 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel = "stylesheet" integrity = "sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2024-04/" > April, 2024< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2024-04-04T10:23:00+03:00" > Thu Apr 04, 2024< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2024-04-04" > 2024-04-04< / h2 >
< ul >
< li > Work on CGSpace duplicate DOIs more< / li >
< / ul >
< h2 id = "2024-04-08" > 2024-04-08< / h2 >
< ul >
< li > Start working on IFPRI’s 2022 batch import
< ul >
< li > I ran the duplicate checker against CGSpace and started downloading all linked PDFs< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2024-04-09" > 2024-04-09< / h2 >
< ul >
< li > Continue working on IFPRI’s 2022 batch import
< ul >
< li > I started validating the potential duplicates in OpenRefine< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2024-04-12" > 2024-04-12< / h2 >
< ul >
< li > Finish working on the 650 IFPRI 2022 records that were not already on CGSpace, then upload them
< ul >
< li > I need to merge the metadata for the remaining 212 that are already on CGSpace< / li >
< / ul >
< / li >
< li > Spend some time looking at duplicate DOIs again… < / li >
< / ul >
< h2 id = "2024-04-13" > 2024-04-13< / h2 >
< ul >
< li > Spend some time looking at duplicate DOIs again… < / li >
< / ul >
< h2 id = "2024-04-14" > 2024-04-14< / h2 >
< ul >
< li > Spend some time looking at duplicate DOIs again… < / li >
< / ul >
< h2 id = "2024-04-15" > 2024-04-15< / h2 >
< ul >
< li > Spend some time looking at duplicate DOIs again… < / li >
< li > Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: < a href = "https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418" > https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418< / a > < / li >
< li > Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:< / li >
< / ul >
< pre tabindex = "0" > < code > $ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 4.764 total
< / code > < / pre > < ul >
< li > Finalize processing the remaining 206 items from the IFPRI 2022 batch set that already existed on CGSpace
< ul >
< li > I merged metadata with the existing items< / li >
< li > There are still six remaining items that I identified as being duplicates (3x2) in the IFPRI set itself< / li >
< / ul >
< / li >
< / ul >
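< p > The embed comparison from the two curl timings above can also be scripted. This is a rough Python 3 sketch, not what was actually run: it rebuilds the same search URL with and without the embed parameter and measures wall-clock time. The helper names < code > build_search_url< / code > and < code > time_request< / code > are made up for illustration, and < code > urlencode< / code > percent-encodes a few characters (such as the asterisk and commas) that the curl commands sent literally, which I am assuming the server treats the same way. < / p >

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint, query, scope, and sort values are copied from the curl commands in the notes.
BASE = "https://cgspace.cgiar.org/server/api/discover/search/objects"


def build_search_url(embed=None):
    """Build the search URL, adding the embed parameter only when one is given."""
    params = [
        ("query", "cg.identifier.project:IFPRI*"),
        ("scope", "8f1e9650-fe87-4e6e-889a-1cacfb747408"),
        ("page", "0"),
        ("size", "100"),
    ]
    if embed:
        params.append(("embed", embed))
    params.append(("sort", "dcterms.issued,desc"))
    return BASE + "?" + urlencode(params)


def time_request(url, opener=urlopen):
    """Return the wall-clock seconds needed to download the whole response body."""
    start = time.perf_counter()
    with opener(url) as response:
        response.read()
    return time.perf_counter() - start
```

< p > Calling < code > time_request(build_search_url("thumbnail,bundles/bitstreams"))< / code > and then < code > time_request(build_search_url())< / code > gives the two numbers to compare. The < code > opener< / code > argument only exists so the timing logic can be exercised without touching the network. < / p >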
< h2 id = "2024-04-16" > 2024-04-16< / h2 >
< ul >
< li > Spend some time looking at duplicate DOIs again… < / li >
< li > Assist Deborah with an advanced query on CGSpace for biodiversity and health:< / li >
< / ul >
< pre tabindex = "0" > < code > dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
< / code > < / pre > < ul >
< li > Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
< ul >
< li > I used this Jython expression in OpenRefine with < a href = "https://citation.crosscite.org/docs.html" > Crossref’s content negotiation< / a > to get citations for all DOIs:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-python" data-lang = "python" > < span style = "display:flex;" > < span > < span style = "color:#f92672" > import< / span > urllib2
< / span > < / span > < span style = "display:flex;" > < span >
< / span > < / span > < span style = "display:flex;" > < span > doi < span style = "color:#f92672" > =< / span > cells[< span style = "color:#e6db74" > 'cg.identifier.doi[en_US]'< / span > ]< span style = "color:#f92672" > .< / span > value
< / span > < / span > < span style = "display:flex;" > < span > url < span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > "https://api.crossref.org/works/"< / span > < span style = "color:#f92672" > +< / span > doi < span style = "color:#f92672" > +< / span > < span style = "color:#e6db74" > "/transform/text/x-bibliography"< / span >
< / span > < / span > < span style = "display:flex;" > < span > useragent < span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > "Python (mailto:a.o@cgiar.org)"< / span >
< / span > < / span > < span style = "display:flex;" > < span >
< / span > < / span > < span style = "display:flex;" > < span > request < span style = "color:#f92672" > =< / span > urllib2< span style = "color:#f92672" > .< / span > Request(url< span style = "color:#f92672" > .< / span > encode(< span style = "color:#e6db74" > "utf-8"< / span > ), headers< span style = "color:#f92672" > =< / span > {< span style = "color:#e6db74" > "User-Agent"< / span > : useragent})
< / span > < / span > < span style = "display:flex;" > < span > get < span style = "color:#f92672" > =< / span > urllib2< span style = "color:#f92672" > .< / span > urlopen(request)
< / span > < / span > < span style = "display:flex;" > < span >
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#66d9ef" > return< / span > get< span style = "color:#f92672" > .< / span > read()< span style = "color:#f92672" > .< / span > decode(< span style = "color:#e6db74" > 'utf-8'< / span > )
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > It took ten or so minutes to finish (and note this is Python 2 inside OpenRefine, so I had to be careful with Unicode), but it worked well!< / li >
< / ul >
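< p > Outside OpenRefine the same lookup is simpler in Python 3, where strings are Unicode by default. This is only a sketch of the Jython expression above, not something I ran on the 277 records: the function names are made up, the contact address in the User-Agent is a placeholder, and percent-encoding the DOI with < code > quote< / code > is an extra step the original expression did not take. < / p >

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Placeholder contact address, standing in for the real one used in the notes.
USER_AGENT = "Python (mailto:you@example.org)"


def build_citation_request(doi):
    """Build the request for a formatted citation, using the URL pattern from the notes."""
    # quote() leaves "/" alone by default, so the DOI prefix/suffix separator survives.
    url = "https://api.crossref.org/works/" + quote(doi) + "/transform/text/x-bibliography"
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_citation(doi, opener=urlopen):
    """Fetch the citation text for one DOI and decode the response as UTF-8."""
    with opener(build_citation_request(doi)) as response:
        return response.read().decode("utf-8")
```

< p > The < code > opener< / code > argument defaults to < code > urlopen< / code > and is only there so the function can be tried with a stub instead of the live API. < / p >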
< h2 id = "2024-04-18" > 2024-04-18< / h2 >
< ul >
< li > Write a SQL query to build the IFPRI CONTENTdm redirects to Handles:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-sql" data-lang = "sql" > < span style = "display:flex;" > < span > < span style = "color:#66d9ef" > SELECT< / span > m.text_value, h.handle < span style = "color:#66d9ef" > FROM< / span > metadatavalue m < span style = "color:#66d9ef" > JOIN< / span > handle h < span style = "color:#66d9ef" > on< / span > m.dspace_object_id < span style = "color:#f92672" > =< / span > h.resource_id < span style = "color:#66d9ef" > WHERE< / span > m.metadata_field_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 28< / span > < span style = "color:#66d9ef" > AND< / span > m.text_value < span style = "color:#66d9ef" > LIKE< / span > < span style = "color:#e6db74" > 'Original URL%'< / span > < span style = "color:#66d9ef" > AND< / span > h.resource_type_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 2< / span > ;
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Similarly, I need a SQL query to get the redirects for duplicate Handles, querying for < code > dcterms.replaces< / code > :< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-sql" data-lang = "sql" > < span style = "display:flex;" > < span > < span style = "color:#66d9ef" > SELECT< / span > m.text_value < span style = "color:#66d9ef" > AS< / span > handle_from, h.handle < span style = "color:#66d9ef" > AS< / span > handle_to < span style = "color:#66d9ef" > FROM< / span > metadatavalue m < span style = "color:#66d9ef" > JOIN< / span > handle h < span style = "color:#66d9ef" > on< / span > m.dspace_object_id < span style = "color:#f92672" > =< / span > h.resource_id < span style = "color:#66d9ef" > WHERE< / span > m.metadata_field_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 181< / span > < span style = "color:#66d9ef" > AND< / span > h.resource_type_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 2< / span > ;
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then I can work that list into an nginx map with redirect, for example:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > server {
< / span > < / span > < span style = "display:flex;" > < span > ...
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > if ($new_uri) {
< / span > < / span > < span style = "display:flex;" > < span > return 301 $new_uri;
< / span > < / span > < span style = "display:flex;" > < span > }
< / span > < / span > < span style = "display:flex;" > < span > }
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > map $request_uri $new_uri {
< / span > < / span > < span style = "display:flex;" > < span > /handle/10568/112821 /handle/10568/97605;
< / span > < / span > < span style = "display:flex;" > < span > }
< / span > < / span > < / code > < / pre > < / div > <!-- raw HTML omitted -->
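< p > Turning the rows from the < code > dcterms.replaces< / code > query into map entries is mechanical, so it could be scripted. A minimal Python sketch, assuming each row is a (handle_from, handle_to) pair of bare handles like 10568/112821 (if the metadata stores full URLs they would need trimming first); < code > build_nginx_map< / code > is a made-up helper name: < / p >

```python
def build_nginx_map(rows):
    """Render (handle_from, handle_to) pairs as an nginx map block."""
    lines = ["map $request_uri $new_uri {"]
    for handle_from, handle_to in rows:
        # Assumes bare handles; each entry maps the old /handle/ path to the new one.
        lines.append("    /handle/{} /handle/{};".format(handle_from.strip(), handle_to.strip()))
    lines.append("}")
    return "\n".join(lines)


# The single pair from the example above.
print(build_nginx_map([("10568/112821", "10568/97605")]))
```

< p > For that one pair it prints the same map block as the example, which can then be included in the nginx configuration alongside the < code > if ($new_uri)< / code > redirect. < / p >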
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2024-04/" > April, 2024< / a > < / li >
< li > < a href = "/cgspace-notes/2024-03/" > March, 2024< / a > < / li >
< li > < a href = "/cgspace-notes/2024-02/" > February, 2024< / a > < / li >
< li > < a href = "/cgspace-notes/2024-01/" > January, 2024< / a > < / li >
< li > < a href = "/cgspace-notes/2023-12/" > December, 2023< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >