<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="CGIAR Library Migration" /> <meta property="og:description" content="Notes on the migration of the CGIAR Library to CGSpace" /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/cgiar-library-migration/" /> <meta property="article:published_time" content="2017-09-18T16:38:35+03:00" /> <meta property="article:modified_time" content="2019-10-28T13:40:20+02:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="CGIAR Library Migration"/> <meta name="twitter:description" content="Notes on the migration of the CGIAR Library to CGSpace"/> <meta name="generator" content="Hugo 0.119.0"> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "CGIAR Library Migration", "url": "https://alanorth.github.io/cgspace-notes/cgiar-library-migration/", "wordCount": "1278", "datePublished": "2017-09-18T16:38:35+03:00", "dateModified": "2019-10-28T13:40:20+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes, Migration", "description": "Notes on the migration of the CGIAR Library to CGSpace" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/cgiar-library-migration/"> <title>CGIAR Library Migration | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" 
integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/cgiar-library-migration/">CGIAR Library Migration</a></h2> <p class="blog-post-meta"> <time datetime="2017-09-18T16:38:35+03:00">Mon Sep 18, 2017</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> <span class="fas fa-tag" aria-hidden="true"></span> <a href="/tags/migration/" rel="tag">Migration</a> </p> </header> <p>Rough notes for importing the CGIAR Library content. 
It was decided that this content would go to a new top-level community called <em>CGIAR System Organization</em>.</p>
<h2 id="pre-migration-technical-todos">Pre-migration Technical TODOs</h2>
<p>Things that need to happen before the migration:</p>
<ul>
<li><input checked="" disabled="" type="checkbox"> Create top-level community on CGSpace to hold the CGIAR Library content: <code>10568/83389</code>
<ul>
<li><input checked="" disabled="" type="checkbox"> Update nginx redirects in ansible templates</li>
<li><input checked="" disabled="" type="checkbox"> Update handle in DSpace XMLUI config</li>
</ul>
</li>
<li>Set up nginx redirects for URLs like:
<ul>
<li><input checked="" disabled="" type="checkbox"> <a href="https://library.cgiar.org/bitstream/handle/10947/2699/CGIAR_Branding_Guidelines_and_Toolkit.pdf">https://library.cgiar.org/bitstream/handle/10947/2699/CGIAR_Branding_Guidelines_and_Toolkit.pdf</a></li>
<li><input checked="" disabled="" type="checkbox"> <a href="https://library.cgiar.org/handle/10947/4258">https://library.cgiar.org/handle/10947/4258</a></li>
</ul>
</li>
<li><input checked="" disabled="" type="checkbox"> Merge <a href="https://github.com/ilri/DSpace/pull/339">#339</a> to <code>5_x-prod</code> branch and rebuild DSpace</li>
<li><input checked="" disabled="" type="checkbox"> Increase <code>max_connections</code> in <code>/etc/postgresql/9.5/main/postgresql.conf</code> by ~10
<ul>
<li><code>SELECT * FROM pg_stat_activity;</code> seems to show ~6 extra connections used by the command line tools during import</li>
</ul>
</li>
<li><input checked="" disabled="" type="checkbox"> Temporarily disable nightly <code>index-discovery</code> cron job because the import will be running during part of that time and I don’t want the two competing to update the Solr index</li>
<li><input checked="" disabled="" type="checkbox"> Copy HTTPS certificate key pair from CGIAR Library server’s Tomcat keystore:</li>
</ul>
<pre tabindex="0"><code>$ keytool -list -keystore tomcat.keystore
$ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
$ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
$ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem
$ wget https://certs.godaddy.com/repository/gdroot-g2.crt https://certs.godaddy.com/repository/gdig2.crt.pem
$ cat library.cgiar.org.crt.pem gdig2.crt.pem > library.cgiar.org-chained.pem
</code></pre><h2 id="migration-process">Migration Process</h2>
<p><strong>Export all top-level communities and collections from DSpace Test:</strong></p>
<pre tabindex="0"><code>$ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2515 10947-2515/10947-2515.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2516 10947-2516/10947-2516.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2517 10947-2517/10947-2517.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2518 10947-2518/10947-2518.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2519 10947-2519/10947-2519.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2708 10947-2708/10947-2708.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2526 10947-2526/10947-2526.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2871 10947-2871/10947-2871.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2527 10947-2527/10947-2527.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10568/93759 10568-93759/10568-93759.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10568/93760 10568-93760/10568-93760.zip
$ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/1 10947-1/10947-1.zip
</code></pre><p><strong>Import to CGSpace (also see <a href="http://alanorth.github.io/cgspace-notes/2017-05/#2017-05-10">notes from 2017-05-10</a>):</strong></p>
<ul>
<li><input checked="" disabled="" type="checkbox"> Copy all exports from DSpace Test</li>
<li><input checked="" disabled="" type="checkbox"> Add ingestion overrides to <code>dspace.cfg</code> before import:</li>
</ul>
<pre tabindex="0"><code>mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
</code></pre><ul>
<li><input checked="" disabled="" type="checkbox"> Import communities and collections, paying attention to options to skip missing parents and ignore handles:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
$ export PATH=$PATH:/home/cgspace.cgiar.org/bin
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2516/10947-2516.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2517/10947-2517.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2518/10947-2518.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2519/10947-2519.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2708/10947-2708.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2526/10947-2526.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2871/10947-2871.zip
$ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-4467/10947-4467.zip
$ dspace packager -s -u -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-2527/10947-2527.zip
$ for item in 10947-2527/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
$ dspace packager -s -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83389 10947-1/10947-1.zip
$ for collection in 10947-1/COLLECTION@10947-*; do dspace packager -s -o ignoreHandle=false -t AIP -e aorth@mjanja.ch -p 10947/1 $collection; done
$ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
</code></pre><p>The first batch of commands restores each AIP hierarchy (<code>-r</code>) recursively (<code>-a</code>), and <code>skipIfParentMissing=true</code> suppresses errors when an item’s parent collection hasn’t been created yet, for example if the item is mapped. The large historic archive (<code>10947/1</code>) is created in several steps because importing it in one pass requires a lot of memory and often crashes.</p>
<p><strong>Create new subcommunities and collections for content we reorganized into new hierarchies from the original:</strong></p>
<ul>
<li><input checked="" disabled="" type="checkbox"> Create <em>CGIAR System Management Board</em> sub-community: <code>10568/83536</code>
<ul>
<li><input checked="" disabled="" type="checkbox"> Content from <em>CGIAR System Management Board documents</em> collection (<code>10947/4651</code>) goes here</li>
<li>Import collection hierarchy first and then the items:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
$ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
</code></pre><ul>
<li><input checked="" disabled="" type="checkbox"> Create <em>CGIAR System Management Office</em> sub-community: <code>10568/83537</code>
<ul>
<li><input checked="" disabled="" type="checkbox"> Create <em>CGIAR System Management Office documents</em> collection: <code>10568/83538</code></li>
<li>Import items to collection individually in restore mode (<code>-r</code>) while explicitly preserving handles and ignoring parents:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
</code></pre><p><strong>Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:</strong></p>
<pre tabindex="0"><code>dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
</code></pre><ul>
<li>Export them from the CGIAR Library:</li>
</ul>
<pre tabindex="0"><code># for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
</code></pre><ul>
<li>Import on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
</code></pre><h2 id="post-migration">Post Migration</h2>
<ul>
<li><input checked="" disabled="" type="checkbox"> Shut down Tomcat and run <code>update-sequences.sql</code> as the system’s <code>postgres</code> user</li>
<li><input checked="" disabled="" type="checkbox"> Remove ingestion overrides from <code>dspace.cfg</code></li>
<li><input checked="" disabled="" type="checkbox"> Reset PostgreSQL <code>max_connections</code> to 183</li>
<li><input checked="" disabled="" type="checkbox"> Enable nightly <code>index-discovery</code> cron job</li>
<li><input checked="" disabled="" type="checkbox"> Adjust CGSpace’s <code>handle-server/config.dct</code> to add the new prefix alongside our existing 10568, i.e.:</li>
</ul>
<pre tabindex="0"><code>"server_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)

"replication_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)

"backup_admins" = (
"300:0.NA/10568"
"300:0.NA/10947"
)
</code></pre><p>I had regenerated the
<code>sitebndl.zip</code> file on the CGIAR Library server and sent it to the Handle.net admins, but they said that there were mismatches between the public and private keys, which I suspect is due to <code>make-handle-config</code> not being very flexible. After we discussed our scenario, the Handle.net admins said we actually don’t need to send an updated <code>sitebndl.zip</code> for this type of change, and the above <code>config.dct</code> edits are all that is required. I guess they just did something on their end by setting the authoritative IP address for the 10947 prefix to be the same as ours…</p>
<ul>
<li><input checked="" disabled="" type="checkbox"> Update DNS records:
<ul>
<li>CNAME: cgspace.cgiar.org</li>
</ul>
</li>
<li><input checked="" disabled="" type="checkbox"> Re-deploy DSpace from freshly built <code>5_x-prod</code> branch</li>
<li><input checked="" disabled="" type="checkbox"> Merge <code>cgiar-library</code> branch to <code>master</code> and re-run ansible nginx templates</li>
<li><input checked="" disabled="" type="checkbox"> Run system updates and reboot server</li>
<li><input checked="" disabled="" type="checkbox"> Switch to Let’s Encrypt HTTPS certificates (after DNS is updated and server isn’t busy):</li>
</ul>
<pre tabindex="0"><code>$ sudo systemctl stop nginx
$ /opt/certbot-auto certonly --standalone -d library.cgiar.org
$ sudo systemctl start nginx
</code></pre><h2 id="troubleshooting">Troubleshooting</h2>
<h3 id="foreign-key-error-in-dspace-cleanup">Foreign Key Error in <code>dspace cleanup</code></h3>
<p>The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports.
If you see the following error with <code>dspace cleanup -v</code>:</p>
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".
</code></pre><p>The solution is to set the <code>primary_bitstream_id</code> to NULL in PostgreSQL:</p>
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
</code></pre><h3 id="psqlexception-during-aip-ingest">PSQLException During AIP Ingest</h3>
<p>After a few rounds of ingesting, possibly with failures, you might end up with inconsistent IDs in the database. In this case, the error appeared during AIP ingest of a single collection in submit mode (<code>-s</code>):</p>
<pre tabindex="0"><code>org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
  Detail: Key (handle_id)=(86227) already exists.
</code></pre><p>The normal solution is to run the <code>update-sequences.sql</code> script (with Tomcat shut down), but it doesn’t seem to work in this case.
Finding the maximum <code>handle_id</code> and manually updating the sequence seems to work:</p>
<pre tabindex="0"><code>dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
dspace=# select setval('handle_seq',86873);
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-10/">October, 2023</a></li>
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p>
<p> <a href="#">Back to top</a> </p>
</footer>
</body>
</html>