<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="February, 2021" /> <meta property="og:description" content="2021-02-01 Abenet said that CIP found more duplicate records in their export from AReS I re-opened the issue on OpenRXV where we had previously noticed this The shared link where the duplicates are is here: https://cgspace.cgiar.org/explorer/shared/heEOz3YBnXdK69bR2ra6 I had a call with CodeObia to discuss the work on OpenRXV Check the results of the AReS harvesting from last night: $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 100875, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-02/" /> <meta property="article:published_time" content="2021-02-01T10:13:54+02:00" /> <meta property="article:modified_time" content="2021-03-04T22:46:05+02:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="February, 2021"/> <meta name="twitter:description" content="2021-02-01 Abenet said that CIP found more duplicate records in their export from AReS I re-opened the issue on OpenRXV where we had previously noticed this The shared link where the duplicates are is here: https://cgspace.cgiar.org/explorer/shared/heEOz3YBnXdK69bR2ra6 I had a call with CodeObia to discuss the work on OpenRXV Check the results of the AReS harvesting from last night: $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 100875, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } "/> <meta name="generator" content="Hugo 0.82.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "February, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-02/", "wordCount": "4170", "datePublished": "2021-02-01T10:13:54+02:00", "dateModified": "2021-03-04T22:46:05+02:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-02/"> <title>February, 2021 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk+S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-02/">February, 2021</a></h2> <p class="blog-post-meta"> <time datetime="2021-02-01T10:13:54+02:00">Mon Feb 01, 2021</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2021-02-01">2021-02-01</h2> <ul> <li>Abenet said that CIP found more duplicate records in their export from AReS <ul> <li>I re-opened <a href="https://github.com/ilri/OpenRXV/issues/67">the issue</a> on OpenRXV where we had previously noticed this</li> <li>The shared link where the duplicates are is here: <a href="https://cgspace.cgiar.org/explorer/shared/heEOz3YBnXdK69bR2ra6">https://cgspace.cgiar.org/explorer/shared/heEOz3YBnXdK69bR2ra6</a></li> </ul> </li> <li>I had a call with CodeObia to discuss the work on OpenRXV</li> <li>Check the results of the AReS harvesting from last night:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 100875, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } </code></pre><ul> <li>Set the current items index to read only and make a backup:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}' $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01 </code></pre><ul> <li>Delete the current items index and clone the temp one to it:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items' $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items </code></pre><ul> <li>Then delete the temp and backup:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' {"acknowledged":true}% $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01' </code></pre><ul> <li>Meeting with Peter and Abenet about CGSpace goals and progress</li> <li>Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)</li> <li>Get Peter a list of users who have submitted or approved on DSpace everrrrrrr, so he can remove some</li> <li>Ask MEL for a dump of their types to reconcile with ours and CG Core</li> <li>Need to tag ILRI collection with license!! For pre-2010 use “Other” unless a license is already there; 2010-2020 do the ilri content in batches (2010-2015: CC-BY-NC-SA; 2016-onwards: CC-BY); <ul> <li>ONLY if ILRI / International Livestock Research Institute is the publisher, no journal articles, no book chapters…</li> </ul> </li> <li>I tried to export the ILRI community from CGSpace but I got an error:</li> </ul> <pre><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv Loading @mire database changes for module MQM Changes have been processed Exporting community 'International Livestock Research Institute (ILRI)' (10568/1) Exception: null java.lang.NullPointerException at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212) at com.google.common.collect.Iterators.concat(Iterators.java:464) at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136) at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125) at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77) at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81) </code></pre><ul> <li>I imported the production database to my local development environment and I get the same error… WTF is this? <ul> <li>I was able to export another smaller community</li> <li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=919">an issue</a> with Atmire to see if it is likely something of theirs, or if I need to ask on the dspace-tech mailing list</li> </ul> </li> <li>CodeObia sent a <a href="https://github.com/ilri/OpenRXV/pull/71">pull request</a> with fixes for several issues we highlighted in OpenRXV <ul> <li>I deployed the fixes on production, as they only affect minor parts of the frontend, and two of the four are working</li> <li>I sent feedback to CodeObia</li> </ul> </li> </ul> <h2 id="2021-02-02">2021-02-02</h2> <ul> <li>Communicate more with CodeObia about some fixes for OpenRXV</li> <li>Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD</li> <li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li> <li>Then for the rest, I saved them to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li> </ul> <pre><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt $ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt </code></pre><ul> <li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li> </ul> <pre><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml </code></pre><ul> <li>Then I added all the changed names plus Stefan’s incorrect ones to a CSV and processed them with <code>fix-metadata-values.py</code>:</li> </ul> <pre><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv cg.creator.id,correct Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184 Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184 Stefan Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184 Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184 Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064 Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743 Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284 Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801 saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853 $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240 </code></pre><ul> <li>I also looked up which of these new authors might have existing items that are missing ORCID iDs</li> <li>I had to port my <code>add-orcid-identifiers-csv.py</code> to DSpace 6 UUIDs and I think it’s working but I want to do a few more tests because it uses a sequence for the metadata_value_id</li> </ul> <h2 id="2021-02-03">2021-02-03</h2> <ul> <li>Tag forty-three items from Bioversity’s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li> </ul> <pre><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv dc.contributor.author,cg.creator.id "Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962 "Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962 "Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962 "Machida, Lewis",Lewis Machida: 0000-0002-0012-3997 "Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657" "Aubert, C.",Celine Aubert: 0000-0001-6284-4821 "Aubert, Céline",Celine Aubert: 0000-0001-6284-4821 "Devare, M.",Medha Devare: 0000-0003-0041-4812 "Devare, Medha",Medha Devare: 0000-0003-0041-4812 "Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598 "Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598 "Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X "Lesueur, Didier",didier lesueur: 0000-0002-6694-0869 $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d </code></pre><ul> <li>I’m working on the CGSpace accession for Karl Rich’s <a href="https://github.com/ilri/vietnam-pig-model-2018">Viet Nam Pig Model 2018</a> and I noticed his ORCID iD is missing from CGSpace <ul> <li>I added it and tagged 141 items of his with the iD</li> </ul> </li> <li>I <a href="https://hdl.handle.net/10568/111126">uploaded a metadata-only accession</a> for the impact of ILRI book by John McIntire and Delia Grace to CGSpace <ul> <li>The source code itself is here: <a href="https://github.com/ilri/impact-book">https://github.com/ilri/impact-book</a></li> </ul> </li> <li>A little bit more work on CG Core v2</li> </ul> <h2 id="2021-02-04">2021-02-04</h2> <ul> <li>Re-sync CGSpace database and Solr to DSpace Test to start a public test of CG Core v2 <ul> <li>Afterwards I updated Discovery and OAI:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ time chrt -b 0 dspace index-discovery -b $ dspace oai import -c </code></pre><ul> <li>Attend Accenture meeting for repository managers <ul> <li>Not clear what the SMO wants to get out of us</li> </ul> </li> <li>Enrico asked for some notes about our work on AReS in 2020 for CRP Livestock reporting <ul> <li>Abenet and I came up with the following:</li> </ul> </li> </ul> <blockquote> <p>In 2020 we funded the third phase of development on the OpenRXV platform that powers AReS. This phase focused mainly on improving the search filtering, graphical visualizations, and reporting capabilities. It is now possible to create custom reports in Excel, Word, and PDF formats using a templating system. We also concentrated on making the vanilla OpenRXV platform easier to deploy and administer in hopes that other organizations would begin using it. Lastly, we identified and fixed a handful of bugs in the system. All development takes place publicly on GitHub: <a href="https://github.com/ilri/OpenRXV">https://github.com/ilri/OpenRXV</a>.</p> </blockquote> <blockquote> <p>In the last quarter of 2020, ILRI conducted a briefing for nearly 100 scientists and communications staff on how to use ARes as a visualization tool for repository outputs and as a reporting tool (<a href="https://hdl.handle.net/10568/110527)">https://hdl.handle.net/10568/110527)</a>. Staff will begin using AReS to generate lists of their outputs to upload in the performance evaluation system to assist in their performance evaluation. The list of publications they will upload from AReS to Performax will indicate the open access status of each publication to help start discussion why some outputs are not open access given the open access policies of the CGIAR.</p> </blockquote> <ul> <li>Call Moayad to discuss OpenRXV development <ul> <li>We talked about the “reporting period” (date-based statistics) and some of the issues Abdullah is working on on GitHub</li> <li>I suggested that we offer the date-range statistics in a modal dialog with other sorting and grouping options during report generation</li> </ul> </li> <li>Peter sent me the cleaned up series that I had originally sent him in 2020-10 <ul> <li>I quickly applied all the deletions on CGSpace:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43 </code></pre><ul> <li>The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them: <ul> <li>CIAT Publicaçao</li> <li>CIAT Publicación</li> <li>CIAT Série</li> <li>CIAT Séries</li> <li>Colección investigación y desarrollo</li> <li>CTA Guias práticos</li> <li>CTA Guias técnicas</li> <li>Curso de adiestramiento en producción y utilización de pastos tropicales</li> <li>Folheto Técnico</li> <li>ILRI Nota Informativa de Investigação</li> <li>Influencia de los actores sociales en América Central</li> <li>Institutionalization of quality assurance mechanism and dissemination of top quality commercial products to increase crop yields and improve food security of smallholder farmers in sub-Saharan Africa – COMPRO-II</li> <li>Manuel pour les Banques de Gènes;1</li> <li>Sistematización de experiencias Proyecto ACORDAR</li> <li>Strüngmann Forum</li> <li>Unité de Recherche</li> </ul> </li> <li>I ended up using <a href="https://github.com/LuminosoInsight/python-ftfy">python-ftfy</a> to fix those very easily, then replaced them in the CSV</li> <li>Then I trimmed whitespace at the beginning, end, and around the “;”, and applied the 1,600 fixes using <code>fix-metadata-values.py</code>:</li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43 </code></pre><ul> <li>Help Peter debug an issue with one of Alan Duncan’s new FEAST Data reports on CGSpace <ul> <li>For some reason the default policy for the item was “COLLECTION_492_DEFAULT_READ” group, which had zero members</li> <li>I changed them all to Anonymous and the item was accessible</li> </ul> </li> </ul> <h2 id="2021-02-07">2021-02-07</h2> <ul> <li>Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server</li> <li>After the server came back up I started a full Discovery re-indexing:</li> </ul> <pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b real 247m30.850s user 160m36.657s sys 2m26.050s </code></pre><ul> <li>Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN <ul> <li>He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform</li> </ul> </li> <li>Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' # start indexing in AReS </code></pre><h2 id="2021-02-08">2021-02-08</h2> <ul> <li>Finish rotating the AReS indexes after the harvesting last night:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 100983, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}' $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08 $ curl -XDELETE 'http://localhost:9200/openrxv-items' $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08' </code></pre><h2 id="2021-02-10">2021-02-10</h2> <ul> <li>Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV <ul> <li>Verify a fix he made for the issue with spaces in template file names</li> <li>He says that the <a href="https://github.com/ilri/OpenRXV/issues/49">Angular expressions support should be enabled</a>, but I tried it and couldn’t get a few simple examples working</li> </ul> </li> <li>Atmire responded to a few issues today: <ul> <li>First, the one about a crash while exporting a community CSV, which appears to be a <a href="https://jira.lyrasis.org/browse/DS-4211">vanilla DSpace issue with a patch in DSpace 6.4</a></li> <li>Second, the MQM batch consumer issue, which appears to be harmless log spam in <em>most</em> cases and they have sent a patch that adjusts the logging as such</li> <li>Third, a version bump for CUA to fix the <code>java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp</code> error</li> </ul> </li> <li>I cherry-picked the patches for DS-4111 and was able to export the ILRI community finally, but the results are almost twice as many items as in the community! <ul> <li>Investigating with csvcut I see there are some ids that appear up to five, six, or seven times!</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l 30354 $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l 18555 $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail 5 c21a79e5-e24e-4861-aa07-e06703d1deb7 5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38 5 d73fb3ae-9fac-4f7e-990f-e394f344246c 5 dc0e24fa-b7f5-437e-ac09-e15c0704be00 5 dc50bcca-0abf-473f-8770-69d5ab95cc33 5 e714bdf9-cc0f-4d9a-a808-d572e25c9238 6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d 6 fb76888c-03ae-4d53-b27d-87d7ca91371a 6 ff42d1e6-c489-492c-a40a-803cabd901ed 7 094e9e1d-09ff-40ca-a6b9-eca580936147 </code></pre><ul> <li>I added a comment to that bug to ask if this is a side effect of the patch</li> <li>I started working on tagging pre-2010 ILRI items with license information, like we talked about with Peter and Abenet last week <ul> <li>Due to the export bug I had to sort and remove duplicates first, then use csvgrep to filter out books and journal articles:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]' </code></pre><ul> <li>I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:</li> </ul> <pre><code class="language-console" data-lang="console">if(diff(value,"01/01/2010".toDate(),"days")<0, true, false) </code></pre><ul> <li>Then I filtered by publisher to make sure they were only ours:</li> </ul> <pre><code class="language-console" data-lang="console">or( value.contains("International Livestock Research Institute"), value.contains("ILRI"), value.contains("International Livestock Centre for Africa"), value.contains("ILCA"), value.contains("ILRAD"), value.contains("International Laboratory for Research on Animal Diseases") ) </code></pre><ul> <li>I tagged these pre-2010 items with “Other” if they didn’t already have a license</li> <li>I checked 2010 to 2015, and 2016 to date, but they were all tagged already!</li> <li>In the end I added the “Other” license to 1,523 items from before 2010</li> </ul> <h2 id="2021-02-11">2021-02-11</h2> <ul> <li>CodeObia keeps working on a few more small issues on OpenRXV <ul> <li>Abdullah sent fixes for two issues but I couldn’t verify them myself so I asked him to check again</li> <li>Call with Abdullah and Yousef to discuss some issues</li> <li>We got the Angular expressions parser working…</li> </ul> </li> </ul> <h2 id="2021-02-13">2021-02-13</h2> <ul> <li>Run system updates, deploy latest <code>6_x-prod</code> branch, and reboot CGSpace (linode18)</li> <li>Normalize <code>text_lang</code> of DSpace item metadata on CGSpace:</li> </ul> <pre><code>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC; text_lang | count -----------+--------- en_US | 2567413 | 8050 en | 7601 | 0 (4 rows) dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item); </code></pre><ul> <li>Start a full Discovery re-indexing on CGSpace</li> </ul> <h2 id="2021-02-14">2021-02-14</h2> <ul> <li>Clear the OpenRXV temp items index:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' </code></pre><ul> <li>Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard</li> <li>Peter asked me about a few other recently submitted FEAST items that are restricted <ul> <li>I checked the collection and there was an empty group there for the “default read” authorization</li> <li>I deleted the group and fixed the authorization policies for two new items manually</li> </ul> </li> <li>Upload fifteen items to CGSpace for Peter Ballantyne</li> <li>Move 313 journals from series, which Peter had indicated when we were cleaning up the series last week <ul> <li>I re-purposed one of my Python metadata scripts to create <code>move-metadata-values.py</code></li> <li>The script reads a text file with one metadata value per line and moves them from one metadata field id to another</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55 </code></pre><h2 id="2021-02-15">2021-02-15</h2> <ul> <li>Check the results of the AReS Harvesting from last night:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 101126, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } </code></pre><ul> <li>Set the current items index to read only and make a backup:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}' $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15 </code></pre><ul> <li>Delete the current items index and clone the temp one:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items' $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15' </code></pre><ul> <li>Call with Abdullah from CodeObia to discuss community and collection statistics reporting</li> </ul> <h2 id="2021-02-16">2021-02-16</h2> <ul> <li>Linode emailed me to say that CGSpace (linode18) had a high CPU usage this afternoon</li> <li>I looked in the nginx logs and found a few heavy users: <ul> <li>45.146.165.203 in Russia with user agent <code>Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00</code></li> <li>130.255.161.231 in Sweden with user agent <code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0</code></li> </ul> </li> <li>They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:</li> </ul> <pre><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l 4007 $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l 2128 </code></pre><ul> <li>Ah, actually 45.146.165.203 is making requests like this:</li> </ul> <pre><code class="language-console" data-lang="console">"http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO" </code></pre><ul> <li>I purged the hits from these two using my <code>check-spider-ip-hits.sh</code>:</li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p Purging 4005 hits from 45.146.165.203 in statistics Purging 3493 hits from 130.255.161.231 in statistics Total number of bot hits purged: 7498 </code></pre><ul> <li>Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:</li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p Purging 27163 hits from 45.146.164.176 in statistics Purging 19556 hits from 45.146.165.105 in statistics Purging 15927 hits from 45.146.165.83 in statistics Purging 8085 hits from 45.146.165.104 in statistics Total number of bot hits purged: 70731 </code></pre><ul> <li>My god, and 64.39.99.15 is from Qualys, the domain scanning security people, who are making queries trying to see if we are vulnerable or something (wtf?) <ul> <li>Looking in Solr I see a few different IPs with DNS like <code>sn003.s02.iad01.qualys.com.</code> so I will purge their requests too:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p Purging 3 hits from 130.255.161.231 in statistics Purging 16773 hits from 64.39.99.15 in statistics Purging 6976 hits from 64.39.99.13 in statistics Purging 13 hits from 64.39.99.63 in statistics Purging 12 hits from 64.39.99.65 in statistics Purging 12 hits from 64.39.99.94 in statistics Total number of bot hits purged: 23789 </code></pre><h2 id="2021-02-17">2021-02-17</h2> <ul> <li>I tested Node.js 10 vs 12 on CGSpace (linode18) and DSpace Test (linode26) and the build times were surprising <ul> <li>Node.js 10 <ul> <li>linode26: [INFO] Total time: 17:07 min</li> <li>linode18: [INFO] Total time: 19:26 min</li> </ul> </li> <li>Node.js 12 <ul> <li>linode26: [INFO] Total time: 17:14 min</li> <li>linode18: [INFO] Total time: 19:43 min</li> </ul> </li> </ul> </li> <li>So I guess there is no need to use Node.js 12 any time soon, unless 10 becomes end of life</li> <li>Abenet asked me to add Tom Randolph’s ORCID identifier to CGSpace</li> <li>I also tagged all his 247 existing items on CGSpace:</li> </ul> <pre><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv dc.contributor.author,cg.creator.id "Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877" $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu' </code></pre><h2 id="2021-02-20">2021-02-20</h2> <ul> <li>Test the CG Core v2 migration on DSpace Test (linode26) one last time</li> </ul> <h2 id="2021-02-21">2021-02-21</h2> <ul> <li>Start the CG Core v2 migration on CGSpace (linode18)</li> <li>After deploying the latest <code>6_x-prod</code> branch and running <code>migrate-fields.sh</code> I started a full Discovery reindex:</li> </ul> <pre><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b real 311m12.617s user 217m3.102s sys 2m37.363s </code></pre><ul> <li>Then update OAI:</li> </ul> <pre><code class="language-console" data-lang="console">$ dspace oai import -c $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m" </code></pre><ul> <li>Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet <ul> <li>I told him he can try to use something like this if it’s just something like the ILRI articles in journals collection:</li> </ul> </li> </ul> <p><a href="https://cgspace.cgiar.org/rest/collections/8ea4b611-1f59-4d4e-b78d-a9921a72cfe7/items?limit=100&offset=0">https://cgspace.cgiar.org/rest/collections/8ea4b611-1f59-4d4e-b78d-a9921a72cfe7/items?limit=100&offset=0</a></p> <ul> <li>But I don’t know if he wants the entire ILRI community, in which case he needs to get the collections recursively and iterate over them, or if his software can manage the iteration over the pages of item results using limit and offset</li> <li>Help proof and upload 1095 CIFOR items to DSpace Test for Abenet <ul> <li>There were a few dozen issues with author affiliations, but the metadata was otherwise very good quality</li> <li>I ran the data through the csv-metadata-quality tool nevertheless to fix some minor formatting issues</li> <li>I uploaded it to DSpace Test to check for duplicates</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m' $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv </code></pre><ul> <li>The process took an hour or so!</li> <li>I added colorized output to the csv-metadata-quality tool and tagged <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.4">version 0.4.4 on GitHub</a></li> <li>I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' # start indexing in AReS </code></pre><h2 id="2021-02-22">2021-02-22</h2> <ul> <li>Start looking at splitting the series name and number in <code>dcterms.isPartOf</code> now that we have migrated to CG Core v2 <ul> <li>The numbers will go to <code>cg.number</code></li> <li>I notice there are about 100 series without a number, but they still have a semicolon, for example <code>Esporo 72;</code></li> <li>I think I will replace those like this:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$'; UPDATE 104 </code></pre><ul> <li>As for splitting the other values, I think I can export the <code>dspace_object_id</code> and <code>text_value</code> and then upload it as a CSV rather than writing a Python script to create the new metadata values</li> </ul> <h2 id="2021-02-22-1">2021-02-22</h2> <ul> <li>Check the results of the AReS harvesting from last night:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 101380, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } </code></pre><ul> <li>Set the current items index to read only and make a backup:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}' $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22 </code></pre><ul> <li>Delete the current items index and clone the temp one to it:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items' $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items </code></pre><ul> <li>Then delete the temp and backup:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' {"acknowledged":true}% $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22' </code></pre><h2 id="2021-02-23">2021-02-23</h2> <ul> <li>CodeObia sent a <a href="https://github.com/ilri/OpenRXV/pull/75">pull request for clickable countries on AReS</a> <ul> <li>I deployed it and it seems to work, so I asked Abenet and Peter to test it so we can get feedback</li> </ul> </li> <li>Remove semicolons from series names without numbers:</li> </ul> <pre><code class="language-console" data-lang="console">dspace=# BEGIN; dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$'; UPDATE 104 dspace=# COMMIT; </code></pre><ul> <li>Set all <code>text_lang</code> values on CGSpace to <code>en_US</code> to make the series replacements easier (this didn’t work, read below):</li> </ul> <pre><code class="language-console" data-lang="console">dspace=# BEGIN; dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item); UPDATE 911 cgspace=# COMMIT; </code></pre><ul> <li>Then export all series with their IDs to CSV:</li> </ul> <pre><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER; </code></pre><ul> <li>In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check <ul> <li>For example many Spore items are like “Spore, Spore 23”</li> <li>Also, “Agritrade, August 2002”</li> </ul> </li> <li>Then I copied the column to a new one called <code>cg.number[en_US]</code> and split the values for each on the semicolon using <code>value.split(';')[0]</code> and <code>value.split(';')[1]</code></li> <li>I tried to upload some of the series data to DSpace Test but I’m having an issue where some fields change that shouldn’t <ul> <li>It seems not all fields get updated when I set the text_lang globally, but if I updated it manually like this it works:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845; UPDATE 1 </code></pre><ul> <li>This also seems to work, using the id for just that one item:</li> </ul> <pre><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322'; UPDATE 37 </code></pre><ul> <li>This seems to work better for some reason:</li> </ul> <pre><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item); UPDATE 18659 </code></pre><ul> <li>I split the CSV file in batches of 5,000 using xsv, then imported them one by one in CGSpace:</li> </ul> <pre><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv </code></pre><ul> <li>It took FOREVER to import each file… like several hours <em>each</em>. MY GOD DSpace 6 is slow.</li> <li>Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros <ul> <li>She is not seeing the community list for CGSpace, and I see weird requests like this in the logs:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT" 104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT" </code></pre><ul> <li>The first request is OK, but the second one is malformed for sure</li> </ul> <h2 id="2021-02-24">2021-02-24</h2> <ul> <li>Export a list of journals for Peter to look through:</li> </ul> <pre><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER; COPY 3345 </code></pre><ul> <li>Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' # start indexing in AReS </code></pre><ul> <li>Also, I want to include the new series name/number cleanups so it’s not a total waste of time</li> </ul> <h2 id="2021-02-25">2021-02-25</h2> <ul> <li>Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:</li> </ul> <pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' { "count" : 99546, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 } } </code></pre><ul> <li>The current items index has 101380 items… I wonder what happened <ul> <li>I started a new indexing</li> </ul> </li> </ul> <h2 id="2021-02-26">2021-02-26</h2> <ul> <li>Last night’s indexing was more successful, there are now 101479 items in the index</li> <li>Yesterday Yousef sent a <a href="https://github.com/ilri/OpenRXV/pull/77/">pull request</a> for the next/previous buttons on OpenRXV <ul> <li>I tested it this morning and it seems to be working</li> </ul> </li> </ul> <h2 id="2021-02-28">2021-02-28</h2> <ul> <li>Abenet asked me to import seventy-three records for CRP Forests, Trees and Agroforestry <ul> <li>I checked them briefly and found that there were thirty+ journal articles, and none of them had <code>cg.journal</code>, <code>cg.volume</code>, <code>cg.issue</code>, or <code>dcterms.license</code> so I spent a little time adding them</li> <li>I used a GREL expression to extract the journal volume and issue from the citation into new columns:</li> </ul> </li> </ul> <pre><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"") value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") </code></pre><ul> <li>This <code>value.partition</code> was new to me… and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with <code>value.replace</code></li> <li>I tried to check the 1095 CIFOR records from last week for duplicates on DSpace Test, but the page says “Processing” and never loads <ul> <li>I don’t see any errors in the logs, but there are two jQuery errors in the browser console</li> <li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">an issue</a> with Atmire</li> </ul> </li> <li>Upload twelve items to CGSpace for Peter</li> <li>Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow</li> <li>It seems the BatchEditConsumer log spam is gone since I applied <a href="https://github.com/ilri/DSpace/pull/462">Atmire’s patch</a></li> </ul> <pre><code class="language-console" data-lang="console">$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]* dspace.log.2021-02-10:5067 dspace.log.2021-02-11:2647 dspace.log.2021-02-12:4231 dspace.log.2021-02-13:221 dspace.log.2021-02-14:0 dspace.log.2021-02-15:0 dspace.log.2021-02-16:0 dspace.log.2021-02-17:0 dspace.log.2021-02-18:0 dspace.log.2021-02-19:0 dspace.log.2021-02-20:0 dspace.log.2021-02-21:0 dspace.log.2021-02-22:0 dspace.log.2021-02-23:0 dspace.log.2021-02-24:0 dspace.log.2021-02-25:0 dspace.log.2021-02-26:0 dspace.log.2021-02-27:0 dspace.log.2021-02-28:0 </code></pre><!-- raw HTML omitted --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2021-03/">March, 2021</a></li> <li><a href="/cgspace-notes/2021-02/">February, 2021</a></li> <li><a href="/cgspace-notes/2021-01/">January, 2021</a></li> <li><a href="/cgspace-notes/2020-12/">December, 2020</a></li> <li><a href="/cgspace-notes/cgspace-dspace6-upgrade/">CGSpace DSpace 6 Upgrade</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>