<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="October, 2018" /> <meta property="og:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1} ' | sort | uniq -c | sort -n | tail -n 10 933 40." /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-10/" /><meta property="article:published_time" content="2018-10-01T22:31:54+03:00"/> <meta property="article:modified_time" content="2018-10-15T08:14:22+03:00"/> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="October, 2018"/> <meta name="twitter:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1} ' | sort | uniq -c | sort -n | tail -n 10 933 40."/> <meta name="generator" content="Hugo 0.49.2" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "October, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-10/", "wordCount": "1938", "datePublished": "2018-10-01T22:31:54+03:00", "dateModified": "2018-10-15T08:14:22+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": 
"Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2018-10/"> <title>October, 2018 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-Upm5uY/SXdvbjuIGH6fBjF5vOYUr9DguqBskM+EQpLBzO9U+9fMVmWEt+TTlGrWQ" crossorigin="anonymous"> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2018-10/">October, 2018</a></h2> <p class="blog-post-meta"><time datetime="2018-10-01T22:31:54+03:00">Mon Oct 01, 2018</time> by Alan Orth in <i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a> </p> </header> <h2 id="2018-10-01">2018-10-01</h2> <ul> <li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li> <li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li> </ul> <h2 id="2018-10-03">2018-10-03</h2> <ul> <li>I see Moayad was busy collecting item views and downloads from CGSpace yesterday:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1} ' | sort | uniq -c | sort -n | tail -n 10 933 40.77.167.90 971 95.108.181.88 1043 41.204.190.40 1454 
157.55.39.54 1538 207.46.13.69 1719 66.249.64.61 2048 50.116.102.77 4639 66.249.64.59 4736 35.237.175.180 150362 34.218.226.147 </code></pre> <ul> <li>Of those, about 20% were HTTP 500 responses (!):</li> </ul> <pre><code>$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c 118927 200 31435 500 </code></pre> <ul> <li>I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li> </ul> <pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d </code></pre> <ul> <li>I found a new corner case error that I need to check, given <em>and</em> family names deactivated:</li> </ul> <pre><code>Looking up the names associated with ORCID iD: 0000-0001-7930-5752 Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752 </code></pre> <ul> <li>It appears to be Jim Lorenzen… I need to check that later!</li> <li>I merged the changes to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/390">#390</a>)</li> <li>Linode sent another alert about CPU usage on CGSpace (linode18) this evening</li> <li>It seems that Moayad is making quite a lot of requests today:</li> </ul> <pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 1594 157.55.39.160 1627 157.55.39.173 1774 136.243.6.84 4228 35.237.175.180 4497 70.32.83.92 4856 66.249.64.59 7120 50.116.102.77 12518 138.201.49.199 87646 34.218.226.147 111729 213.139.53.62 </code></pre> <ul> <li>But in super positive 
news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it’s MUCH faster than using Atmire CUA’s internal “restlet” API</li> <li>I don’t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li> </ul> <pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c 8324 GET /bitstream 4193 GET /handle </code></pre> <ul> <li>Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):</li> </ul> <pre><code># grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c 7 GET /handle/10568 4186 GET /handle/10947 </code></pre> <ul> <li>The user agent is suspicious too:</li> </ul> <pre><code>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36 </code></pre> <ul> <li>It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list</li> <li>I looked in Solr’s statistics core and these hits were actually all counted as <code>isBot:false</code> (of course)… hmmm</li> <li>I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li> </ul> <pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu' </code></pre> <ul> <li>Where <code>2018-10-03-add-orcids.csv</code> contained:</li> </ul> <pre><code>dc.contributor.author,cg.creator.id "Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462 "Henson, S.",Sonal Henson: 0000-0002-2002-5462 "Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182 "Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182 "Thornton, Phil",Philip Thornton: 0000-0002-1854-0182 "Thornton, Philip K.",Philip 
Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
</code></pre> <h2 id="2018-10-04">2018-10-04</h2> <ul> <li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li> <li>Every item has at least a <code>LICENSE</code> bundle, and some have a <code>THUMBNAIL</code> bundle, but the indexing code is specifically checking for downloads from the <code>ORIGINAL</code> bundle <ul> <li><a href="https://cgspace.cgiar.org/handle/10568/97460"><sup>10568</sup>⁄<sub>97460</sub></a> (100550): has a thumbnail bitstream</li> <li><a href="https://cgspace.cgiar.org/handle/10568/96112"><sup>10568</sup>⁄<sub>96112</sub></a> (96736): has only a LICENSE bitstream</li> </ul></li> <li>I see there are other bundles we might need to pay attention to: <code>TEXT</code>, <code>@_LOGO-COLLECTION_@</code>, <code>@_LOGO-COMMUNITY_@</code>, etc…</li> <li>On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads</li> <li>So it’s fixed, but I’m not sure why!</li> <li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li> </ul> <pre><code># zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
</code></pre> <ul> <li>I found a logic error in the dspace-statistics-api <code>indexer.py</code> script that was causing item views to be inserted into downloads</li> <li>I tagged version 0.4.2 of the tool and redeployed it on CGSpace</li> </ul> <h2 id="2018-10-05">2018-10-05</h2> <ul> <li>Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay’s work plan</li> <li>We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms</li> <li>Add a link to <a
href="https://cgspace.cgiar.org/explorer/">AReS explorer</a> to the CGSpace homepage introduction text</li> </ul> <h2 id="2018-10-06">2018-10-06</h2> <ul> <li>Follow up with AgriKnowledge about including Handle links (<code>dc.identifier.uri</code>) on their item pages</li> <li>In July, 2018 they had said their programmers would include the field in the next update of their website software</li> <li><a href="https://repository.cimmyt.org/">CIMMYT’s DSpace repository</a> is now running DSpace 5.x!</li> <li>It’s running OAI, but not REST, so I need to talk to Richard about that!</li> </ul> <h2 id="2018-10-08">2018-10-08</h2> <ul> <li>AgriKnowledge says they’re going to add the <code>dc.identifier.uri</code> to their item view in November when they update their website software</li> </ul> <h2 id="2018-10-10">2018-10-10</h2> <ul> <li>Peter noticed that some recently added PDFs don’t have thumbnails</li> <li>When I tried to force them to be generated I got an error that I’ve never seen before:</li> </ul> <pre><code>$ dspace filter-media -v -f -i 10568/97613 org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412. 
</code></pre> <ul> <li>I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?</li> <li>I get the same error when forcing <code>filter-media</code> to run on DSpace Test too, so it’s gotta be an ImageMagick bug</li> <li>The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an <a href="https://usn.ubuntu.com/3785-1/">Ubuntu Security Notice from 2018-10-04</a></li> <li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick account!)</li> <li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li> </ul> <pre><code>  &lt;!--&lt;policy domain="coder" rights="none" pattern="PDF" /&gt;--&gt;
</code></pre> <ul> <li>This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…</li> <li>I suppose I need to enable a workaround for this in Ansible?</li> </ul> <h2 id="2018-10-11">2018-10-11</h2> <ul> <li>I emailed DuraSpace to update <a href="https://duraspace.org/registry/entry/4188/?gvid=178">our entry in their DSpace registry</a> (the data was still on DSpace 3, JSPUI, etc)</li> <li>Generate a list of the top 1500 values for <code>dc.subject</code> so Sisay can start making a controlled vocabulary for it:</li> </ul> <pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
</code></pre> <ul> <li>Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!</li> <li>Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format “handle:<sup>10568</sup>⁄<sub>80775</sub>” because I noticed
that the <a href="https://landportal.org/library/resources/handle1056880775/unlocking-farming-potential-bangladesh%E2%80%99-polders">Land Portal does this</a></li> <li>Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <code>&lt;meta&gt;</code> tags in their page header, and using “dct:identifier” property instead of “dc:identifier”</li> <li>I re-created my local DSpace database container using <a href="https://github.com/containers/libpod">podman</a> instead of Docker:</li> </ul> <pre><code>$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre> <ul> <li>I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository</li> <li>I can pull the <code>docker.bintray.io/jfrog/artifactory-oss:latest</code> image, but not start it</li> <li>I decided to use a Sonatype Nexus repository instead:</li> </ul> <pre><code>$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
</code></pre> <ul> <li>With a few changes to my local Maven <code>settings.xml</code> it is working
well</li> <li>Generate a list of the top 10,000 authors for Peter Ballantyne to look through:</li> </ul> <pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER; COPY 10000 </code></pre> <ul> <li>CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections</li> <li>I decided to constrain the max height of these to 200px using CSS (<a href="https://github.com/ilri/DSpace/pull/392">#392</a>)</li> </ul> <h2 id="2018-10-13">2018-10-13</h2> <ul> <li>Run all system updates on DSpace Test (linode19) and reboot it</li> <li>Look through Peter’s list of 746 author corrections in OpenRefine</li> <li>I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:</li> </ul> <pre><code>or( isNotNull(value.match(/.*\uFFFD.*/)), isNotNull(value.match(/.*\u00A0.*/)), isNotNull(value.match(/.*\u200A.*/)), isNotNull(value.match(/.*\u2019.*/)), isNotNull(value.match(/.*\u00b4.*/)) ) </code></pre> <ul> <li>Then I exported and applied them on my local test server:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3 </code></pre> <ul> <li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary</li> </ul> <h2 id="2018-10-14">2018-10-14</h2> <ul> <li>Merge the authors controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/393">#393</a>), usage rights (<a href="https://github.com/ilri/DSpace/pull/394">#394</a>), and the upstream DSpace 5.x cherry-picks (<a 
href="https://github.com/ilri/DSpace/pull/395">#395</a>) into our <code>5_x-prod</code> branch</li> <li>Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)</li> <li>Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li> </ul> <pre><code>$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
</code></pre> <ul> <li>Run all system updates on CGSpace (linode18) and reboot the server</li> <li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li> <li>Restarting the service with systemd works for a few seconds, then the java process quits</li> <li>I suspect that the systemd service type needs to be <code>forking</code> rather than <code>simple</code>, because the service calls the default DSpace <code>start-handle-server</code> shell script, which uses <code>nohup</code> and <code>&amp;</code> to background the java process</li> <li>It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting</li> <li>Email the Landportal.org people to ask if they would consider using Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body</li> <li>Peter pointed out that some thumbnails were still not getting generated <ul> <li>When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week… WTF?</li> <li>Looks like I can use <code>/usr/share/ghostscript/current</code> instead of
<code>/usr/share/ghostscript/9.25</code>…</li> </ul></li> <li>I limited the tall thumbnails even further to 170px because Peter said CTA’s were still too tall at 200px (<a href="https://github.com/ilri/DSpace/pull/396">#396</a>)</li> </ul> <h2 id="2018-10-15">2018-10-15</h2> <ul> <li>Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications</li> <li>I don’t see anything in the Catalina logs or <code>dmesg</code>, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”</li> <li>Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing</li> <li>I fixed it so that I could reindex, but I guess the rest of DSpace actually didn’t start up…</li> <li>Create an account on DSpace Test for Felix from Earlham so he can test COPO submission <ul> <li>I created a new collection and added him as the administrator so he can test submission</li> </ul></li> </ul> <!-- vim: set sw=2 ts=2: --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2018-10/">October, 2018</a></li> <li><a href="/cgspace-notes/2018-09/">September, 2018</a></li> <li><a href="/cgspace-notes/2018-08/">August, 2018</a></li> <li><a href="/cgspace-notes/2018-07/">July, 2018</a></li> <li><a href="/cgspace-notes/2018-06/">June, 2018</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to 
Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>