<meta property="og:description" content="2018-10-01 Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now 2018-10-03 I see Moayad was busy collecting item views and downloads from CGSpace yesterday: # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Oct/2018&quot; | awk '{print $1} ' | sort | uniq -c | sort -n | tail -n 10 933 40." />
<ul>
<li>Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items</li>
<li>I created a GitHub issue to track this <a href="https://github.com/ilri/DSpace/issues/389">#389</a>, because I’m super busy in Nairobi right now</li>
<li>I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for <code>cg.creator.orcid</code> and then re-generated the names using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</li>
<li>But in super positive news, he says they are using my new <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> and it’s MUCH faster than using Atmire CUA’s internal “restlet” API</li>
<li>I don’t recognize the <code>138.201.49.199</code> IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:</li>
<li>I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers.py</a> script:</li>
<li>Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)</li>
<li>Every item has at least a <code>LICENSE</code> bundle, and some have a <code>THUMBNAIL</code> bundle, but the indexing code is specifically checking for downloads from the <code>ORIGINAL</code> bundle
<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/97460">10568/97460</a> (100550): has a thumbnail bitstream</li>
<li><a href="https://cgspace.cgiar.org/handle/10568/96112">10568/96112</a> (96736): has only a LICENSE bitstream</li>
</ul></li>
<li>I see there are other bundles we might need to pay attention to: <code>TEXT</code>, <code>@_LOGO-COLLECTION_@</code>, <code>@_LOGO-COMMUNITY_@</code>, etc…</li>
<li>On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads</li>
<li>So it’s fixed, but I’m not sure why!</li>
<li>Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):</li>
<li>AgriKnowledge says they’re going to add the <code>dc.identifier.uri</code> to their item view in November when they update their website software</li>
</ul>
<h2 id="2018-10-10">2018-10-10</h2>
<ul>
<li>Peter noticed that some recently added PDFs don’t have thumbnails</li>
<li>When I tried to force them to be generated I got an error that I’ve never seen before:</li>
</ul>
<pre><code>org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
</code></pre>
<ul>
<li>I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?</li>
<li>I get the same error when forcing <code>filter-media</code> to run on DSpace Test too, so it’s gotta be an ImageMagick bug</li>
<li>The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an <a href="https://usn.ubuntu.com/3785-1/">Ubuntu Security Notice from 2018-10-04</a></li>
<li>Wow, someone on <a href="https://twitter.com/rosscampbell/status/1048268966819319808">Twitter posted about this breaking his web application</a> (and it was retweeted by the ImageMagick account!)</li>
<li>I commented out the line that disables PDF thumbnails in <code>/etc/ImageMagick-6/policy.xml</code>:</li>
<li>I emailed DuraSpace to update <a href="https://duraspace.org/registry/entry/4188/?gvid=178">our entry in their DSpace registry</a> (the data was still on DSpace 3, JSPUI, etc)</li>
<li>Generate a list of the top 1500 values for <code>dc.subject</code> so Sisay can start making a controlled vocabulary for it:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
</code></pre>
<ul>
<li>Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!</li>
<li>Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format “handle:10568/80775” because I noticed that the <a href="https://landportal.org/library/resources/handle1056880775/unlocking-farming-potential-bangladesh%E2%80%99-polders">Land Portal does this</a></li>
<li>Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <code>&lt;meta&gt;</code> tags in their page header, and using the “dct:identifier” property instead of “dc:identifier”</li>
<li>I re-created my local DSpace database container using <a href="https://github.com/containers/libpod">podman</a> instead of Docker:</li>
<li>With a few changes to my local Maven <code>settings.xml</code> it is working well</li>
<li>Generate a list of the top 10,000 authors for Peter Ballantyne to look through:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
</code></pre>
<ul>
<li>CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections</li>
<li>I decided to constrain the max height of these to 200px using CSS (<a href="https://github.com/ilri/DSpace/pull/392">#392</a>)</li>
<li>I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary</li>
<li>Merge the authors controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/393">#393</a>), usage rights (<a href="https://github.com/ilri/DSpace/pull/394">#394</a>), and the upstream DSpace 5.x cherry-picks (<a href="https://github.com/ilri/DSpace/pull/395">#395</a>) into our <code>5_x-prod</code> branch</li>
<li>Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)</li>
<li>Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
<li>After rebooting the server I noticed that Handles are not resolving, and the <code>dspace-handle-server</code> systemd service is not running (or rather, it exited with success)</li>
<li>Restarting the service with systemd works for a few seconds, then the java process quits</li>
<li>I suspect that the systemd service type needs to be <code>forking</code> rather than <code>simple</code>, because the service calls the default DSpace <code>start-handle-server</code> shell script, which uses <code>nohup</code> and <code>&amp;</code> to background the java process</li>
<li>It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting</li>
<li>Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body</li>
<li>Peter pointed out that some thumbnails were still not getting generated
<ul>
<li>When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week… WTF?</li>
<li>Looks like I can use <code>/usr/share/ghostscript/current</code> instead of <code>/usr/share/ghostscript/9.25</code>…</li>
</ul></li>
<li>I limited the tall thumbnails even further to 170px because Peter said CTA’s were still too tall at 200px (<a href="https://github.com/ilri/DSpace/pull/396">#396</a>)</li>
<li>Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications</li>
<li>I don’t see anything in the Catalina logs or <code>dmesg</code>, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”</li>
<li>Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing</li>
<li>I fixed it so that I could reindex, but I guess the rest of DSpace actually didn’t start up…</li>
<li>Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:</li>
</ul>
<pre><code>dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
</code></pre>
<ul>
<li>Talking to the CodeObia guys about the REST API I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it</li>
<li>Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!</li>
</ul>
<pre><code>$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
0.27s user 0.06s system 1% cpu 27.858 total
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
0.23s user 0.04s system 1% cpu 16.460 total
0.24s user 0.04s system 1% cpu 21.043 total
0.22s user 0.04s system 1% cpu 17.132 total
</code></pre>
<ul>
<li>I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)</li>
<li>I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?</li>
<li>I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&amp;limit=100&amp;offset=0'
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
0.24s user 0.02s system 1% cpu 22.496 total
0.22s user 0.03s system 1% cpu 22.720 total
0.23s user 0.03s system 1% cpu 22.632 total
</code></pre>
<ul>
<li>If I make a request without the expands it is roughly eight times faster (about 3 seconds instead of 22 to 25):</li>
</ul>
<pre><code>$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&amp;offset=0'
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
0.21s user 0.05s system 9% cpu 2.787 total
0.23s user 0.02s system 8% cpu 2.896 total
</code></pre>
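<ul>
<li>To quantify this more repeatably than running <code>time</code> by hand, the comparison could be scripted. What follows is only a minimal Python sketch, not something I ran for the numbers above: the helper names (<code>build_url</code>, <code>fetch</code>, <code>time_calls</code>, <code>summarize</code>) are hypothetical, and unlike <code>http --print h</code> it downloads the whole response body. Only the URL and the query parameters come from the requests above.</li>
</ul>
<pre><code class="language-python">import statistics
import time
import urllib.parse
import urllib.request

# Assumption: same endpoint as the manual tests above; everything else is illustrative.
BASE_URL = "https://dspacetest.cgiar.org/rest/items"


def build_url(base, expand=None, limit=100, offset=0):
    """Build an items URL, adding the expand parameter only when requested."""
    params = {"limit": limit, "offset": offset}
    if expand:
        params["expand"] = ",".join(expand)
    # Keep commas readable in the expand list instead of percent-encoding them
    return base + "?" + urllib.parse.urlencode(params, safe=",")


def fetch(url):
    """Download the full response body so the timing includes the transfer."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def time_calls(func, runs=5):
    """Call func() several times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to min, median, and max."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    expands = ["metadata", "bitstreams", "parentCommunityList"]
    for label, url in [
        ("with expands", build_url(BASE_URL, expand=expands)),
        ("without expands", build_url(BASE_URL)),
    ]:
        result = summarize(time_calls(lambda: fetch(url)))
        print(label, {key: round(value, 3) for key, value in result.items()})
</code></pre>
<ul>
<li>Reporting the median next to min and max should make a single slow outlier (like the 21 second run on DSpace Test) easier to spot than eyeballing five separate <code>time</code> lines</li>
</ul>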
<ul>
<li>I sent a mail to dspace-tech to ask how to profile this…</li>