<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="September, 2020" /> <meta property="og:description" content="2020-09-02 Replace Marissa van Epp for Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS The AReS Explorer hasn’t updated its index since 2020-08-22 when I last forced it I restarted it again now and told Moayad that the automatic indexing isn’t working Add Alliance of Bioversity International and CIAT to affiliations on CGSpace Abenet told me that the general search text on AReS doesn’t get reset when you use the “Reset Filters” button I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39 I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40 " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-09/" /> <meta property="article:published_time" content="2020-09-02T15:35:54+03:00" /> <meta property="article:modified_time" content="2020-09-10T12:18:03+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="September, 2020"/> <meta name="twitter:description" content="2020-09-02 Replace Marissa van Epp for Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS The AReS Explorer hasn’t updated its index since 2020-08-22 when I last forced it I restarted it again now and told Moayad that the automatic indexing isn’t working Add Alliance of Bioversity International and CIAT to affiliations on CGSpace Abenet told me that the general search text on AReS doesn’t get reset when you use the “Reset Filters” button I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39 I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40 
"/> <meta name="generator" content="Hugo 0.74.3" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "September, 2020", "url": "https://alanorth.github.io/cgspace-notes/2020-09/", "wordCount": "1398", "datePublished": "2020-09-02T15:35:54+03:00", "dateModified": "2020-09-10T12:18:03+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-09/"> <title>September, 2020 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-09/">September, 2020</a></h2> <p class="blog-post-meta"><time datetime="2020-09-02T15:35:54+03:00">Wed Sep 02, 
2020</time> by Alan Orth in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2020-09-02">2020-09-02</h2> <ul> <li>Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS</li> <li>The AReS Explorer hasn’t updated its index since 2020-08-22 when I last forced it <ul> <li>I restarted it again now and told Moayad that the automatic indexing isn’t working</li> </ul> </li> <li>Add <code>Alliance of Bioversity International and CIAT</code> to affiliations on CGSpace</li> <li>Abenet told me that the general search text on AReS doesn’t get reset when you use the “Reset Filters” button <ul> <li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/39">https://github.com/ilri/OpenRXV/issues/39</a></li> </ul> </li> <li>I filed an issue on OpenRXV to make some minor edits to the admin UI: <a href="https://github.com/ilri/OpenRXV/issues/40">https://github.com/ilri/OpenRXV/issues/40</a></li> </ul> <ul> <li>I ran the country code tagger on CGSpace:</li> </ul> <pre><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
...
real 2m10.516s
user 1m43.953s
sys 0m15.192s
$ grep -c added /tmp/2020-09-02-countrycodetagger.log
39
</code></pre><ul> <li>I still need to create a cron job for this…</li> <li>Sisay and Abenet said they can’t log in with LDAP on DSpace Test (DSpace 6) <ul> <li>I tried and I can’t either… but it is working on CGSpace</li> <li>The error on DSpace 6 is:</li> </ul> </li> </ul> <pre><code>2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
</code></pre><ul> <li>I tried to query LDAP directly using the application credentials with ldapsearch and it works:</li> </ul> <pre><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
</code></pre><ul> <li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication">DSpace 6 docs</a> we need to escape commas in our LDAP parameters due to the new configuration system <ul> <li>I escaped the commas and restarted DSpace (though technically we shouldn’t need to restart due to the new config system hot reloading configs)</li> <li>Run all system updates on DSpace Test (linode26) and reboot it</li> <li>After the restart LDAP login works…</li> </ul> </li> </ul> <h2 id="2020-09-03">2020-09-03</h2> <ul> <li>Fix some erroneous “review status” fields that Abenet noticed on AReS <ul> <li>I used my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts with the following input files:</li> </ul> </li> </ul> <pre><code>$ cat 2020-09-03-fix-review-status.csv
dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
Peer review,Peer Review
Peer reviewed,Peer Review
Peer-Reviewed,Peer Review
Peer-reviewed,Peer Review
peer Review,Peer Review
$ cat 2020-09-03-delete-review-status.csv
dc.description.version
Report
Formally Published
Poster
Unrefereed reprint
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
</code></pre><ul> <li>Start reviewing 95 items for IITA (20201stbatch) <ul> <li>I used my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool to check and fix some low-hanging fruit first</li> <li>This fixed a few unnecessary Unicode characters, excessive whitespace, invalid multi-value separators, and duplicate metadata values</li> <li>Then I looked at the data in OpenRefine and noticed some things: <ul> <li>All issue dates use year only, but some have months in the citation so they could be more specific</li> <li>I normalized all the DOIs to use the “<a href="https://doi.org">https://doi.org</a>” format</li> <li>I fixed a few AGROVOC subjects with a simple GREL: <code>value.replace("GRAINS","GRAIN").replace("SOILS","SOIL").replace("CORN","MAIZE")</code></li> <li>But there are a few more that are invalid that she will have to look at</li> <li>I uploaded the items to <a href="https://dspacetest.cgiar.org/handle/10568/108357">DSpace Test</a> and it was apparently successful but I get these errors on the console:</li> </ul> </li> </ul> </li> </ul> <pre><code>Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
    at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
    at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
    at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
    at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
    at org.dspace.core.Context.dispatchEvents(Context.java:455)
    at org.dspace.core.Context.commit(Context.java:424)
    at org.dspace.core.Context.complete(Context.java:380)
    at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul> <li>There are more in the DSpace log so I will raise it with Atmire immediately</li> </ul> <h2 id="2020-09-04">2020-09-04</h2> <ul> <li>I was checking the recent IITA data for duplicates when I noticed one in CIFOR’s Archive and saw that CIFOR has updated a bunch of their website URLs, for example: <ul> <li><a href="http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html">http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html</a> → <a href="https://www.cifor.org/knowledge/publication/151">https://www.cifor.org/knowledge/publication/151</a></li> <li><a href="https://www.cifor.org/library/4033">https://www.cifor.org/library/4033</a> → <a href="https://www.cifor.org/knowledge/publication/4033">https://www.cifor.org/knowledge/publication/4033</a></li> <li><a href="https://www.cifor.org/pid/5087">https://www.cifor.org/pid/5087</a> → <a href="https://www.cifor.org/knowledge/publication/5087">https://www.cifor.org/knowledge/publication/5087</a></li> </ul> </li> <li>I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:</li> </ul> <pre><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
</code></pre><ul> <li>I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine: <ul> <li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li> <li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li> </ul> </li> <li>I mapped one duplicate from the CIFOR Archives and re-uploaded the 94 IITA items to a new collection on <a href="https://dspacetest.cgiar.org/handle/10568/108453">DSpace Test</a></li> </ul> <h2 id="2020-09-08">2020-09-08</h2> <ul> <li>I noticed that the “share” link in AReS wasn’t working properly because it excludes the “explorer” part of the URI</li> </ul> <p><img src="/cgspace-notes/2020/09/ares-share-link.png" alt="AReS share link broken"></p> <ul> <li>I filed an issue on GitHub: <a href="https://github.com/ilri/OpenRXV/issues/41">https://github.com/ilri/OpenRXV/issues/41</a></li> <li>I uploaded the 94 IITA items that I had been working on
last week to CGSpace</li> <li>RTB emailed to ask why they are getting HTTP 503 errors during harvesting to the RTB WordPress website <ul> <li>From the screenshot I can see they are requesting URLs like this:</li> </ul> </li> </ul> <pre><code>https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
</code></pre><ul> <li>So they end up getting rate limited due to the XMLUI rate limits <ul> <li>I told them to use the REST API bitstream retrieve links, because we don’t have any rate limits there</li> </ul> </li> </ul> <h2 id="2020-09-09">2020-09-09</h2> <ul> <li>Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> <ul> <li><del>For now it won’t work on DSpace 6 because the curation task invocation needs to be slightly different (minus the <code>-l</code> parameter) and for some reason the task isn’t working on DSpace Test (version 6) right now</del></li> <li>I added DSpace 6 support to the playbook templates…</li> </ul> </li> <li>Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server <ul> <li>After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked</li> <li>To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:</li> </ul> </li> </ul> <pre><code>$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
</code></pre><h2 id="2020-09-10">2020-09-10</h2> <ul> <li>I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night… w00t</li> <li>I started looking at Peter’s changes to the CGSpace regions that were proposed in 2020-07 <ul> <li>The changes will be:</li> </ul> </li> </ul> <pre><code>$ cat 2020-09-10-fix-cgspace-regions.csv
cg.coverage.region,correct
EAST AFRICA,EASTERN AFRICA
WEST AFRICA,WESTERN AFRICA
SOUTHEAST ASIA,SOUTHEASTERN ASIA
SOUTH ASIA,SOUTHERN ASIA
AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA
NORTH AFRICA,NORTHERN AFRICA
WEST ASIA,WESTERN ASIA
SOUTHWEST ASIA,SOUTHWESTERN ASIA
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
Would fix 3515 occurences of: SOUTHEAST ASIA
Would fix 3443 occurences of: SOUTH ASIA
Would fix 1134 occurences of: AFRICA SOUTH OF SAHARA
Would fix 357 occurences of: NORTH AFRICA
Would fix 81 occurences of: WEST ASIA
Would fix 3 occurences of: SOUTHWEST ASIA
</code></pre><ul> <li>I think we need to wait for the web team, though, as they need to update their mappings <ul> <li>Not to mention that we’ll need to give WLE and CCAFS time to update their harvesters as well… hmmm</li> </ul> </li> <li>Looking at the top user agents active on CGSpace in 2020-08, I see: <ul> <li><code>Delphi 2009</code>: 235353 (this is the GARDIAN harvester I guess, as the IP is in Greece)</li> <li><code>Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)</code>: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA’s content)</li> <li><code>RTB website BOT</code>: 12282</li> <li><code>ILRI Livestock Website Publications importer BOT</code>: 9393</li> </ul> </li> <li>Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn’t commit the change</li> <li>HTTrack is in the agents list so I’m not sure why DSpace registers a hit from that request</li> <li>Also, I am surprised to see the RTB and ILRI bots here because they have “BOT” in the name and that should also be dropped</li> <li>I also see hits from <code>curl</code> and <code>Java/1.8.0_66</code> and <code>Apache-HttpClient</code> so WTF… those are supposed to be dropped by the default agents list</li> <li>Some IP
<code>2607:f298:5:101d:f816:3eff:fed9:a484</code> made 9,000 requests with the <code>RI/1.0</code> user agent this year… <ul> <li>That’s on DreamHost…?</li> </ul> </li> <li>I purged 448658 hits from these agents and added <code>Delphi</code> to our local agents overload for Solr as well as Tomcat’s Crawler Session Manager Valve so that it forces them to re-use a single session</li> <li>I made a pull request on the COUNTER-Robots project for the Daum robot: <a href="https://github.com/atmire/COUNTER-Robots/pull/38">https://github.com/atmire/COUNTER-Robots/pull/38</a> <ul> <li>This bot made 8,000 requests to CGSpace this year</li> <li>I purged about 20,000 total requests from this bot from our Solr stats for the last few years</li> </ul> </li> </ul> <!-- raw HTML omitted --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2020-09/">September, 2020</a></li> <li><a href="/cgspace-notes/2020-08/">August, 2020</a></li> <li><a href="/cgspace-notes/2020-07/">July, 2020</a></li> <li><a href="/cgspace-notes/2020-06/">June, 2020</a></li> <li><a href="/cgspace-notes/2020-05/">May, 2020</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>