Add notes for 2020-01-27

2020-01-27 16:20:44 +02:00
parent 207ace0883
commit 8feb93be39
112 changed files with 11466 additions and 5158 deletions

@ -69,7 +69,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.62.2" />
<meta name="generator" content="Hugo 0.63.1" />
@ -99,7 +99,7 @@ sys 0m1.979s
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy&#43;piAwENoVPTw=" crossorigin="anonymous">
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I&#43;LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
<!-- RSS 2.0 feed -->
@ -146,7 +146,7 @@ sys 0m1.979s
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2019-02/">February, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-02-01T21:37:30&#43;02:00">Fri Feb 01, 2019</time> by Alan Orth in
<i class="fa fa-folder" aria-hidden="true"></i>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
</p>
@ -179,7 +179,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
</code></pre><ul>
<li>Normally I'd say this was very high, but <a href="/cgspace-notes/2018-02/">about this time last year</a> I remember thinking the same thing when we had 3.1 million&hellip;</li>
<li>Normally I&rsquo;d say this was very high, but <a href="/cgspace-notes/2018-02/">about this time last year</a> I remember thinking the same thing when we had 3.1 million&hellip;</li>
<li>I will have to keep an eye on this to see if there is some error in Solr&hellip;</li>
<li>Atmire sent their <a href="https://github.com/ilri/DSpace/pull/407">pull request to re-enable the Metadata Quality Module (MQM) on our <code>5_x-dev</code> branch</a> today
<ul>
@ -292,7 +292,7 @@ COPY 321
4658 205.186.128.185
4658 70.32.83.92
</code></pre><ul>
<li>At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there's nothing we can do to improve REST API performance!</li>
<li>At this rate I think I just need to stop paying attention to these alerts—DSpace gets thrashed when people use the APIs properly and there&rsquo;s nothing we can do to improve REST API performance!</li>
<li>Perhaps I just need to keep increasing the Linode alert threshold (currently 300%) for this host?</li>
</ul>
<h2 id="2019-02-05">2019-02-05</h2>
@ -461,7 +461,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
848 66.249.66.219
</code></pre><ul>
<li>So it seems that the load issue comes from the REST API, not the XMLUI</li>
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don't get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don&rsquo;t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)</li>
<li>Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:</li>
</ul>
<pre><code>Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
@ -470,7 +470,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
</ul>
<p><img src="/cgspace-notes/2019/02/iita-workflow-step1-empty.png" alt="IITA Posters and Presentations workflow step 1 empty"></p>
<ul>
<li>IITA editors or approvers should be added to that step (though I'm curious why nobody is in that group currently)</li>
<li>IITA editors or approvers should be added to that step (though I&rsquo;m curious why nobody is in that group currently)</li>
<li>Abenet says we are not using the &ldquo;Accept/Reject&rdquo; step so this group should be deleted</li>
<li>Bizuwork asked about the &ldquo;DSpace Submission Approved and Archived&rdquo; emails that stopped working last month</li>
<li>I tried the <code>test-email</code> command on DSpace and it indeed is not working:</li>
@ -489,7 +489,7 @@ Error sending email:
Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>I can't connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what's up</li>
<li>I can&rsquo;t connect to TCP port 25 on that server so I sent a mail to CGNET support to ask what&rsquo;s up</li>
<li>CGNET said these servers were discontinued in 2018-01 and that I should use <a href="https://docs.microsoft.com/en-us/exchange/mail-flow-best-practices/how-to-set-up-a-multifunction-device-or-application-to-send-email-using-office-3">Office 365</a></li>
</ul>
<h2 id="2019-02-08">2019-02-08</h2>
@ -577,18 +577,18 @@ Please see the DSpace documentation for assistance.
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;10/Feb/2019:0(5|6|7|8|9)&quot; | awk '{print $1}' | sort | uniq | wc -l
95
</code></pre><ul>
<li>It's very clear to me now that the API requests are the heaviest!</li>
<li>I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it's becoming a bit of <em>the boy who cried wolf</em> because it alerts like clockwork twice per day!</li>
<li>It&rsquo;s very clear to me now that the API requests are the heaviest!</li>
<li>I think I need to increase the Linode alert threshold from 300 to 350% now so I stop getting some of these alerts—it&rsquo;s becoming a bit of <em>the boy who cried wolf</em> because it alerts like clockwork twice per day!</li>
<li>Add my Python- and shell-based metadata workflow helper scripts as well as the environment settings for pipenv to our DSpace repository (<a href="https://github.com/ilri/DSpace/pull/408">#408</a>) so I can track changes and distribute them more formally instead of just keeping them <a href="https://github.com/ilri/DSpace/wiki/Scripts">collected on the wiki</a></li>
<li>Started adding IITA research theme (<code>cg.identifier.iitatheme</code>) to CGSpace
<ul>
<li>I'm still waiting for feedback from IITA whether they actually want to use &ldquo;SOCIAL SCIENCE &amp; AGRIC BUSINESS&rdquo; because it is listed as <a href="http://www.iita.org/project-discipline/social-science-and-agribusiness/">&ldquo;Social Science and Agribusiness&rdquo;</a> on their website</li>
<li>I&rsquo;m still waiting for feedback from IITA whether they actually want to use &ldquo;SOCIAL SCIENCE &amp; AGRIC BUSINESS&rdquo; because it is listed as <a href="http://www.iita.org/project-discipline/social-science-and-agribusiness/">&ldquo;Social Science and Agribusiness&rdquo;</a> on their website</li>
<li>Also, I think they want to do some mappings of items with existing subjects to these new themes</li>
</ul>
</li>
<li>Update ILRI author name style in the controlled vocabulary (Domelevo Entfellner, Jean-Baka) (<a href="https://github.com/ilri/DSpace/pull/409">#409</a>)
<ul>
<li>I'm still waiting to hear from Bizuwork whether we'll batch update all existing items with the old name style</li>
<li>I&rsquo;m still waiting to hear from Bizuwork whether we&rsquo;ll batch update all existing items with the old name style</li>
<li>No, there is only one entry and Bizu already fixed it</li>
</ul>
</li>
@ -606,7 +606,7 @@ Please see the DSpace documentation for assistance.
<pre><code>Error sending email:
- Error: cannot test email because mail.server.disabled is set to true
</code></pre><ul>
<li>I'm not sure why I didn't know about this configuration option before, and always maintained multiple configurations for development and production
<li>I&rsquo;m not sure why I didn&rsquo;t know about this configuration option before, and always maintained multiple configurations for development and production
<ul>
<li>I will modify the <a href="https://github.com/ilri/rmg-ansible-public">Ansible DSpace role</a> to use this in its <code>build.properties</code> template</li>
</ul>
@ -645,11 +645,11 @@ Please see the DSpace documentation for assistance.
<pre><code>dspacestatistics=# SELECT * FROM items WHERE views &gt; 0 ORDER BY views DESC LIMIT 10;
dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads DESC LIMIT 10;
</code></pre><ul>
<li>I'd have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code></li>
<li>I&rsquo;d have to think about what to make the REST API endpoints, perhaps: <code>/statistics/top/items?limit=10</code> (see the sketch after this list)</li>
<li>But how do I do top items by views / downloads separately?</li>
<li>I re-deployed DSpace 6.3 locally to test the PDFBox thumbnails, especially to see if they handle CMYK files properly
<ul>
<li>The quality is JPEG 75 and I don't see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:</li>
<li>The quality is JPEG 75 and I don&rsquo;t see a way to set the thumbnail dimensions, but the resulting image is indeed sRGB:</li>
</ul>
</li>
</ul>
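<ul>
<li>A rough sketch of how such a top items endpoint could look, re-using the existing <code>items</code> table from the queries above; the Falcon-style resource, the connection details, and the <code>use</code> parameter for switching between views and downloads are only illustrative, not the real implementation:</li>
</ul>
<pre><code>import falcon
import psycopg2
import psycopg2.extras

class TopItemsResource:
    def on_get(self, req, resp):
        # hypothetical query parameters: limit (default 10) and use (views or downloads)
        limit = req.get_param_as_int('limit') or 10
        use = req.get_param('use') or 'views'
        if use not in ('views', 'downloads'):
            raise falcon.HTTPBadRequest(description='use must be views or downloads')

        connection = psycopg2.connect('dbname=dspacestatistics user=dspacestatistics host=localhost')
        with connection.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cursor:
            # the column name is whitelisted above, so formatting it into the query is safe
            cursor.execute('SELECT id, views, downloads FROM items ORDER BY {} DESC LIMIT %s'.format(use), (limit,))
            resp.media = cursor.fetchall()
        connection.close()

api = falcon.API()
api.add_route('/statistics/top/items', TopItemsResource())
</code></pre>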
@ -661,7 +661,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
</ul>
<h2 id="2019-02-13">2019-02-13</h2>
<ul>
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can't get it to send mail from DSpace's <code>test-email</code> utility</li>
<li>ILRI ICT reset the password for the CGSpace mail account, but I still can&rsquo;t get it to send mail from DSpace&rsquo;s <code>test-email</code> utility</li>
<li>I even added extra mail properties to <code>dspace.cfg</code> as suggested by someone on the dspace-tech mailing list:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
@ -671,8 +671,8 @@ dspacestatistics=# SELECT * FROM items WHERE downloads &gt; 0 ORDER BY downloads
<pre><code>Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
</code></pre><ul>
<li>I tried to log into the Outlook 365 web mail and it doesn't work so I've emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace's mail configuration to be simply:</li>
<li>I tried to log into the Outlook 365 web mail and it doesn&rsquo;t work so I&rsquo;ve emailed ILRI ICT again</li>
<li>After reading the <a href="https://javaee.github.io/javamail/FAQ#commonmistakes">common mistakes in the JavaMail FAQ</a> I reconfigured the extra properties in DSpace&rsquo;s mail configuration to be simply:</li>
</ul>
<pre><code>mail.extraproperties = mail.smtp.starttls.enable=true
</code></pre><ul>
@ -707,7 +707,7 @@ $ sudo sysctl kernel.unprivileged_userns_clone=1
$ podman pull postgres:9.6-alpine
$ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
</code></pre><ul>
<li>Which totally works, but Podman's rootless support doesn't work with port mappings yet&hellip;</li>
<li>Which totally works, but Podman&rsquo;s rootless support doesn&rsquo;t work with port mappings yet&hellip;</li>
<li>Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:</li>
</ul>
<pre><code># systemctl stop tomcat7
@ -731,14 +731,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
<pre><code># find /home/cgspace.cgiar.org/solr/ -iname &quot;write.lock&quot; -delete
</code></pre><ul>
<li>After restarting Tomcat the usage statistics are back</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I'm pretty sure that's not supposed to be how locks work&hellip;</li>
<li>Interestingly, many of the locks were from last month, last year, and even 2015! I&rsquo;m pretty sure that&rsquo;s not supposed to be how locks work&hellip;</li>
<li>Help Sarah Kasyoka finish an item submission that she was having issues with due to the file size</li>
<li>I increased the nginx upload limit, but she said she was having problems and couldn't really tell me why</li>
<li>I increased the nginx upload limit, but she said she was having problems and couldn&rsquo;t really tell me why</li>
<li>I logged in as her and completed the submission with no problems&hellip;</li>
</ul>
<h2 id="2019-02-15">2019-02-15</h2>
<ul>
<li>Tomcat was killed around 3AM by the kernel's OOM killer according to <code>dmesg</code>:</li>
<li>Tomcat was killed around 3AM by the kernel&rsquo;s OOM killer according to <code>dmesg</code>:</li>
</ul>
<pre><code>[Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
[Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
@ -748,7 +748,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</ul>
<pre><code>Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
</code></pre><ul>
<li>I suspect it was related to the media-filter cron job that runs at 3AM but I don't see anything particular in the log files</li>
<li>I suspect it was related to the media-filter cron job that runs at 3AM but I don&rsquo;t see anything particular in the log files</li>
<li>I want to try to normalize the <code>text_lang</code> values to make working with metadata easier</li>
<li>We currently have a bunch of weird values that DSpace uses like <code>NULL</code>, <code>en_US</code>, and <code>en</code> and others that have been entered manually by editors:</li>
</ul>
@ -769,19 +769,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
</code></pre><ul>
<li>The majority are <code>NULL</code>, <code>en_US</code>, the blank string, and <code>en</code>—the rest are not enough to be significant</li>
<li>Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!</li>
<li>I'm going to normalize these to <code>NULL</code> at least on DSpace Test for now:</li>
<li>I&rsquo;m going to normalize these to <code>NULL</code> at least on DSpace Test for now:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
UPDATE 1045410
</code></pre><ul>
<li>I started proofing IITA's 2019-01 records that Sisay uploaded this week
<li>I started proofing IITA&rsquo;s 2019-01 records that Sisay uploaded this week
<ul>
<li>There were 259 records in IITA's original spreadsheet, but there are 276 in Sisay's collection</li>
<li>There were 259 records in IITA&rsquo;s original spreadsheet, but there are 276 in Sisay&rsquo;s collection</li>
<li>Also, I found that there are at least twenty duplicates in these records that we will need to address</li>
</ul>
</li>
<li>ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works</li>
<li>Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman's volumes:</li>
<li>Re-create my local PostgreSQL container for the new PostgreSQL version and to use podman&rsquo;s volumes:</li>
</ul>
<pre><code>$ podman pull postgres:9.6-alpine
$ podman volume create dspacedb_data
@ -793,7 +793,7 @@ $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h loca
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
</code></pre><ul>
<li>And it's all running without root!</li>
<li>And it&rsquo;s all running without root!</li>
<li>Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:</li>
</ul>
<pre><code>$ podman volume create artifactory_data
@ -808,7 +808,7 @@ $ podman start artifactory
</ul>
<h2 id="2019-02-17">2019-02-17</h2>
<ul>
<li>I ran DSpace's cleanup task on CGSpace (linode18) and there were errors:</li>
<li>I ran DSpace&rsquo;s cleanup task on CGSpace (linode18) and there were errors:</li>
</ul>
<pre><code>$ dspace cleanup -v
Error: ERROR: update or delete on table &quot;bitstream&quot; violates foreign key constraint &quot;bundle_primary_bitstream_id_fkey&quot; on table &quot;bundle&quot;
@ -946,7 +946,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
<h2 id="2019-02-19">2019-02-19</h2>
<ul>
<li>Linode sent another alert about CPU usage on CGSpace (linode18) averaging 417% this morning</li>
<li>Unfortunately, I don't see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>Unfortunately, I don&rsquo;t see any strange activity in the web server API or XMLUI logs at that time in particular</li>
<li>So far today the top ten IPs in the XMLUI logs are:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;19/Feb/2019:&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
@ -962,9 +962,9 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
14686 143.233.242.130
</code></pre><ul>
<li>143.233.242.130 is in Greece and using the user agent &ldquo;Indy Library&rdquo;, like the top IP yesterday (94.71.244.172)</li>
<li>That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don't know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this</li>
<li>The user is requesting only things like <code>/handle/10568/56199?show=full</code> so it's nothing malicious, only annoying</li>
<li>Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday's nginx rate limiting updates
<li>That user agent is in our Tomcat list of crawlers so at least its resource usage is controlled by forcing it to use a single Tomcat session, but I don&rsquo;t know if DSpace recognizes if this is a bot or not, so the logs are probably skewed because of this</li>
<li>The user is requesting only things like <code>/handle/10568/56199?show=full</code> so it&rsquo;s nothing malicious, only annoying</li>
<li>Otherwise there are still shit loads of IPs from Amazon still hammering the server, though I see HTTP 503 errors now after yesterday&rsquo;s nginx rate limiting updates
<ul>
<li>I should really try to script something around <a href="https://ipapi.co/api/">ipapi.co</a> to get these quickly and easily</li>
</ul>
@ -984,7 +984,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
12360 2a01:7e00::f03c:91ff:fe0a:d645
</code></pre><ul>
<li><code>2a01:7e00::f03c:91ff:fe0a:d645</code> is on Linode, and I can see from the XMLUI access logs that it is Drupal, so I assume it is part of the new ILRI website harvester&hellip;</li>
<li>Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I'm so fucking sick of this</li>
<li>Jesus, Linode just sent another alert as we speak that the load on CGSpace (linode18) has been at 450% the last two hours! I&rsquo;m so fucking sick of this</li>
<li>Our usage stats have exploded the last few months:</li>
</ul>
<p><img src="/cgspace-notes/2019/02/usage-stats.png" alt="Usage stats"></p>
@ -1027,12 +1027,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
<pre><code>Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
</code></pre><ul>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization's name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a></li>
<li>I wrote a quick and dirty Python script called <code>resolve-addresses.py</code> to resolve IP addresses to their owning organization&rsquo;s name, ASN, and country using the <a href="https://ipapi.co">IPAPI.co API</a> (a rough sketch of the approach is below)</li>
</ul>
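<ul>
<li>The general idea is just one JSON lookup per address; a minimal sketch (not the actual script), assuming the ipapi.co response exposes <code>org</code>, <code>asn</code>, and <code>country_name</code> fields:</li>
</ul>
<pre><code>#!/usr/bin/env python3
# Read one IP address per line on stdin (for example the output of the
# awk/sort/uniq pipelines above) and print the owning organization, ASN,
# and country as CSV. The field names are assumptions about the ipapi.co JSON.
import csv
import sys

import requests

def resolve(ip):
    response = requests.get('https://ipapi.co/{}/json/'.format(ip), timeout=10)
    response.raise_for_status()
    data = response.json()
    return {'ip': ip, 'org': data.get('org'), 'asn': data.get('asn'), 'country': data.get('country_name')}

writer = csv.DictWriter(sys.stdout, fieldnames=['ip', 'org', 'asn', 'country'])
writer.writeheader()
for line in sys.stdin:
    ip = line.strip()
    if ip:
        writer.writerow(resolve(ip))
</code></pre>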
<h2 id="2019-02-20">2019-02-20</h2>
<ul>
<li>Ben Hack was asking about getting authors&rsquo; publications programmatically from CGSpace for the new ILRI website</li>
<li>I told him that they should probably try to use the REST API's <code>find-by-metadata-field</code> endpoint</li>
<li>I told him that they should probably try to use the REST API&rsquo;s <code>find-by-metadata-field</code> endpoint</li>
<li>The annoying thing is that you have to match the text language attribute of the field exactly, but it does work (see the Python sketch below the YasGUI screenshot):</li>
</ul>
<pre><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;https://cgspace.cgiar.org/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.creator.id&quot;,&quot;value&quot;: &quot;Alan S. Orth: 0000-0002-1735-7458&quot;, &quot;language&quot;: &quot;&quot;}'
@ -1041,7 +1041,7 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
</code></pre><ul>
<li>This returns six items for me, which is the <a href="https://cgspace.cgiar.org/discover?filtertype_1=orcid&amp;filter_relational_operator_1=contains&amp;filter_1=Alan+S.+Orth%3A+0000-0002-1735-7458&amp;submit_apply_filter=&amp;query=">same I see in a Discovery search</a></li>
<li>Hector Tobon from CIAT asked if it was possible to get item statistics from CGSpace so I told him to use my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a></li>
<li>I was playing with <a href="http://yasgui.org/">YasGUI</a> to query AGROVOC's SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually</li>
<li>I was playing with <a href="http://yasgui.org/">YasGUI</a> to query AGROVOC&rsquo;s SPARQL endpoint, but they must have a cached version or something because I get an HTTP 404 if I try to go to the endpoint manually</li>
<li>I think I want to stick to the regular <a href="http://aims.fao.org/agrovoc/webservices">web services</a> to validate AGROVOC terms</li>
</ul>
<p><img src="/cgspace-notes/2019/02/yasgui-agrovoc.png" alt="YasGUI querying AGROVOC"></p>
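<ul>
<li>A rough Python sketch of working around the text language matching by querying once per <code>text_lang</code> variant and merging on the handle; the set of variants to try is an assumption based on the values we see in the database:</li>
</ul>
<pre><code>#!/usr/bin/env python3
# Not a tested client: call find-by-metadata-field once per text_lang variant
# and merge the results, because the language attribute must match exactly.
import requests

url = 'https://cgspace.cgiar.org/rest/items/find-by-metadata-field'
query = {'key': 'cg.creator.id', 'value': 'Alan S. Orth: 0000-0002-1735-7458'}

items = {}
for language in (None, '', 'en_US', 'en'):
    response = requests.post(url, json=dict(query, language=language),
                             headers={'Accept': 'application/json'}, timeout=30)
    response.raise_for_status()
    for item in response.json():
        items[item['handle']] = item

for handle in sorted(items):
    print(handle, items[handle]['name'])
</code></pre>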
@ -1064,7 +1064,7 @@ $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subje
</ul>
<pre><code>$ cat /tmp/matched-subjects-* | sort | uniq &gt; /tmp/2019-02-21-matched-subjects.txt
</code></pre><ul>
<li>And then a list of all the unique <em>unmatched</em> terms using some utility I've never heard of before called <code>comm</code> or with <code>diff</code>:</li>
<li>And then a list of all the unique <em>unmatched</em> terms using some utility I&rsquo;ve never heard of before called <code>comm</code> or with <code>diff</code>:</li>
</ul>
<pre><code>$ sort /tmp/top-1500-subjects.txt &gt; /tmp/subjects-sorted.txt
$ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt &gt; /tmp/2019-02-21-unmatched-subjects.txt
@ -1077,7 +1077,7 @@ COPY 202
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
COPY 33
</code></pre><ul>
<li>I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it's almost ready so I created a pull request (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)</li>
<li>I did a bit more work on the IITA research theme (adding it to Discovery search filters) and it&rsquo;s almost ready so I created a pull request (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)</li>
<li>I still need to test the batch tagging of IITA items with themes based on their IITA subjects:
<ul>
<li>NATURAL RESOURCE MANAGEMENT research theme to items with NATURAL RESOURCE MANAGEMENT subject</li>
@ -1095,13 +1095,13 @@ COPY 33
<p>Help Udana from WLE with some issues related to CGSpace items on their <a href="https://www.wle.cgiar.org/publications">Publications website</a></p>
<ul>
<li>He wanted some IWMI items to show up in their publications website</li>
<li>The items were mapped into WLE collections, but still weren't showing up on the publications website</li>
<li>The items were mapped into WLE collections, but still weren&rsquo;t showing up on the publications website</li>
<li>I told him that he needs to add the <code>cg.identifier.wletheme</code> to the items so that the website indexer finds them</li>
<li>A few days ago he added the metadata to <a href="https://cgspace.cgiar.org/handle/10568/93011">10568/93011</a> and now I see that the item is present on the <a href="https://www.wle.cgiar.org/resource-recovery-waste-business-models-energy-nutrient-and-water-reuse-low-and-middle-income">WLE publications website</a></li>
</ul>
</li>
<li>
<p>Start looking at IITA's latest round of batch uploads called <a href="https://dspacetest.cgiar.org/handle/10568/108684">&ldquo;IITA_Feb_14&rdquo; on DSpace Test</a></p>
<p>Start looking at IITA&rsquo;s latest round of batch uploads called <a href="https://dspacetest.cgiar.org/handle/10568/108684">&ldquo;IITA_Feb_14&rdquo; on DSpace Test</a></p>
<ul>
<li>One misspelled authorship type</li>
<li>A few dozen incorrect and inconsistent affiliations (I dumped a list of the top 1500 affiliations and reconciled against it, but it was still a lot of work)</li>
@ -1110,7 +1110,7 @@ COPY 33
<li>Some whitespace and consistency issues in sponsorships</li>
<li>Eight items with invalid ISBN: 0-471-98560-3</li>
<li>Two incorrectly formatted ISSNs</li>
<li>Lots of incorrect values in subjects, but that's a difficult problem to do in an automated way</li>
<li>Lots of incorrect values in subjects, but that&rsquo;s a difficult problem to do in an automated way</li>
</ul>
</li>
<li>
@ -1137,8 +1137,8 @@ return &quot;unmatched&quot;
</ul>
<h2 id="2019-02-24">2019-02-24</h2>
<ul>
<li>I decided to try to validate the AGROVOC subjects in IITA's recent batch upload by dumping all their terms, checking them in en/es/fr with <code>agrovoc-lookup.py</code>, then reconciling against the final list using reconcile-csv with OpenRefine</li>
<li>I'm not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>I decided to try to validate the AGROVOC subjects in IITA&rsquo;s recent batch upload by dumping all their terms, checking them in en/es/fr with <code>agrovoc-lookup.py</code>, then reconciling against the final list using reconcile-csv with OpenRefine</li>
<li>I&rsquo;m not sure how to deal with terms like &ldquo;CORN&rdquo; that are alternative labels (<code>altLabel</code>) in AGROVOC where the preferred label (<code>prefLabel</code>) would be &ldquo;MAIZE&rdquo;</li>
<li>For example, <a href="http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=CORN*&amp;lang=en">a query</a> for <code>CORN*</code> returns:</li>
</ul>
<pre><code> &quot;results&quot;: [
@ -1160,7 +1160,7 @@ return &quot;unmatched&quot;
<li>I did a duplicate check of the IITA Feb 14 records on DSpace Test and there were about fifteen or twenty items reported
<ul>
<li>A few of them are actually in previous IITA batch updates, which means they have not been uploaded to CGSpace yet, so I worry that there would be many more</li>
<li>I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I'm not sure I can because the Earlham guys are still testing COPO actively on DSpace Test</li>
<li>I want to re-synchronize CGSpace to DSpace Test to make sure that the duplicate checking is accurate, but I&rsquo;m not sure I can because the Earlham guys are still testing COPO actively on DSpace Test</li>
</ul>
</li>
</ul>
@ -1185,7 +1185,7 @@ return &quot;unmatched&quot;
/home/cgspace.cgiar.org/log/solr.log.2019-02-23.xz:0
/home/cgspace.cgiar.org/log/solr.log.2019-02-24:34
</code></pre><ul>
<li>But I don't see anything interesting in yesterday's Solr log&hellip;</li>
<li>But I don&rsquo;t see anything interesting in yesterday&rsquo;s Solr log&hellip;</li>
<li>I see this in the Tomcat 7 logs yesterday:</li>
</ul>
<pre><code>Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
@ -1209,7 +1209,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]: at java.lang.Throwable.readObje
Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Feb 25 21:37:49 linode18 tomcat7[28363]: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</code></pre><ul>
<li>I don't think that's related&hellip;</li>
<li>I don&rsquo;t think that&rsquo;s related&hellip;</li>
<li>Also, now the Solr admin UI says &ldquo;statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher&rdquo;</li>
<li>In the Solr log I see:</li>
</ul>
@ -1245,12 +1245,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<li>On a hunch I tried adding <code>ulimit -v unlimited</code> to the Tomcat <code>catalina.sh</code> and now Solr starts up with no core errors and I actually have statistics for January and February on <a href="https://cgspace.cgiar.org/handle/10568/16814">some communities</a>, but not <a href="https://cgspace.cgiar.org/handle/10568/1">others</a></li>
<li>I wonder if the address space limits that I added via <code>LimitAS=infinity</code> in the systemd service are somehow not working?</li>
<li>I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the <code>LimitAS</code> setting does work, and the <code>infinity</code> setting in systemd does get translated to &ldquo;unlimited&rdquo; on the service</li>
<li>I thought it might be open file limit, but it seems we're nowhere near the current limit of 16384:</li>
<li>I thought it might be open file limit, but it seems we&rsquo;re nowhere near the current limit of 16384:</li>
</ul>
<pre><code># lsof -u dspace | wc -l
3016
</code></pre><ul>
<li>For what it's worth I see the same errors about <code>solr_update_time_stamp</code> on DSpace Test (linode19)</li>
<li>For what it&rsquo;s worth I see the same errors about <code>solr_update_time_stamp</code> on DSpace Test (linode19)</li>
<li>Update DSpace Test to <a href="https://tomcat.apache.org/tomcat-7.0-doc/changelog.html#Tomcat_7.0.93_(violetagg)">Tomcat 7.0.93</a></li>
<li>Something seems to have happened (some Atmire scheduled task, perhaps the CUA one at 7AM?) on CGSpace because I checked a few communities and collections on CGSpace and there are now statistics for January and February</li>
</ul>
@ -1267,27 +1267,27 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
</code></pre><ul>
<li>According to the <a href="https://cgspace.cgiar.org/rest/collections/1021">REST API</a> collection 1021 appears to be <a href="https://cgspace.cgiar.org/handle/10568/66581">CCAFS Tools, Maps, Datasets and Models</a></li>
<li>I looked at the <code>WORKFLOW_STEP_1</code> (Accept/Reject) and the group is of course empty</li>
<li>As we've seen several times recently, we are not using this step so it should simply be deleted</li>
<li>As we&rsquo;ve seen several times recently, we are not using this step so it should simply be deleted</li>
</ul>
<h2 id="2019-02-27">2019-02-27</h2>
<ul>
<li>Discuss batch uploads with Sisay</li>
<li>He's trying to upload some CTA records, but it's not possible to do collection mapping when using the web UI
<li>He&rsquo;s trying to upload some CTA records, but it&rsquo;s not possible to do collection mapping when using the web UI
<ul>
<li>I sent a mail to the dspace-tech mailing list to ask about the inability to perform mappings when uploading via the XMLUI batch upload</li>
</ul>
</li>
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn't exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file's name:</li>
<li>He asked me to upload the files for him via the command line, but the file he referenced (<code>Thumbnails_feb_2019.zip</code>) doesn&rsquo;t exist</li>
<li>I noticed that the command line batch import functionality is a bit weird when using zip files because you have to specify the directory where the zip file is located as well as the zip file&rsquo;s name:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
</code></pre><ul>
<li>Why don't they just derive the directory from the path to the zip file?</li>
<li>Working on Udana's Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
<li>Why don&rsquo;t they just derive the directory from the path to the zip file?</li>
<li>Working on Udana&rsquo;s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then
<ul>
<li>I also added a few regions because they are obvious for the countries</li>
<li>Also I added some rights fields that I noticed were easily available from the publications pages</li>
<li>I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn't find any</li>
<li>I imported the records into my local environment with a fresh snapshot of the CGSpace database and ran the Atmire duplicate checker against them and it didn&rsquo;t find any</li>
<li>I uploaded fifty-two records to the <a href="https://cgspace.cgiar.org/handle/10568/81592">Restoring Degraded Landscapes collection</a> on CGSpace</li>
</ul>
</li>
@ -1299,7 +1299,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
<pre><code>$ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
</code></pre><ul>
<li>Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out <em>sigh</em></li>
<li>Now I'm getting this message when trying to use DSpace's <code>test-email</code> script:</li>
<li>Now I&rsquo;m getting this message when trying to use DSpace&rsquo;s <code>test-email</code> script:</li>
</ul>
<pre><code>$ dspace test-email
@ -1313,8 +1313,8 @@ Error sending email:
Please see the DSpace documentation for assistance.
</code></pre><ul>
<li>I've tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working</li>
<li>I sent a mail to ILRI ICT to check if we're locked out or reset the password again</li>
<li>I&rsquo;ve tried to log in with the last two passwords that ICT reset it to earlier this month, but they are not working</li>
<li>I sent a mail to ILRI ICT to check if we&rsquo;re locked out or reset the password again</li>
</ul>
<!-- raw HTML omitted -->