Add notes for 2019-12-17

2019-12-17 14:49:24 +02:00
parent d83c951532
commit d54e5b69f1
90 changed files with 1420 additions and 1377 deletions


@@ -27,7 +27,7 @@ I'll update the DSpace role in our Ansible infrastructure playbooks and run
Also, I'll re-run the postgresql tasks because the custom PostgreSQL variables are set dynamically according to the system's RAM, and we never re-ran them after migrating to larger Linodes last month
I'm testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I'm getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.60.1" />
<meta name="generator" content="Hugo 0.61.0" />
@@ -108,7 +108,7 @@ I&#39;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&#
</p>
</header>
<h2 id="20180902">2018-09-02</h2>
<h2 id="2018-09-02">2018-09-02</h2>
<ul>
<li>New <a href="https://jdbc.postgresql.org/documentation/changelog.html#version_42.2.5">PostgreSQL JDBC driver version 42.2.5</a></li>
<li>I'll update the DSpace role in our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a> and run the updated playbooks on CGSpace and DSpace Test</li>
@@ -139,7 +139,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>And the <code>5_x-prod</code> DSpace 5.8 branch does work in Tomcat 8.5.x on my Arch Linux laptop&hellip;</li>
<li>I'm not sure where the issue is then!</li>
</ul>
<h2 id="20180903">2018-09-03</h2>
<h2 id="2018-09-03">2018-09-03</h2>
<ul>
<li>Abenet says she's getting three emails about periodic statistics reports every day since the DSpace 5.8 upgrade last week</li>
<li>They are from the CUA module</li>
@@ -148,7 +148,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
<li>She will try to click the &ldquo;Unsubscribe&rdquo; link in the first two to see if it works, otherwise we should contact Atmire</li>
<li>The only one she remembers subscribing to is the top downloads one</li>
</ul>
<h2 id="20180904">2018-09-04</h2>
<h2 id="2018-09-04">2018-09-04</h2>
<ul>
<li>I'm looking over the latest round of IITA records from Sisay: <a href="https://dspacetest.cgiar.org/handle/10568/104230">Mercy1806_August_29</a>
<ul>
@@ -171,7 +171,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
</li>
<li>Abenet says she hasn't received any more subscription emails from the CUA module since she unsubscribed yesterday, so I think we don't need to create an issue on Atmire's bug tracker anymore</li>
</ul>
<h2 id="20180910">2018-09-10</h2>
<h2 id="2018-09-10">2018-09-10</h2>
<ul>
<li>Playing with <a href="https://github.com/eykhagen/strest">strest</a> to test the DSpace REST API programmatically</li>
<li>For example, given this <code>test.yaml</code>:</li>
@@ -287,7 +287,7 @@ X-XSS-Protection: 1; mode=block
</code></pre><ul>
<li>I will have to keep an eye on it and perhaps add it to the list of &ldquo;bad bots&rdquo; that get rate limited</li>
</ul>
<h2 id="20180912">2018-09-12</h2>
<h2 id="2018-09-12">2018-09-12</h2>
<ul>
<li>Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more</li>
<li>Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:</li>
@@ -301,7 +301,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
<li>I told Sisay to run the XML file through tidy</li>
<li>More testing of the access and usage rights changes</li>
</ul>
<h2 id="20180913">2018-09-13</h2>
<h2 id="2018-09-13">2018-09-13</h2>
<ul>
<li>Peter was communicating with Altmetric about the OAI mapping issue for item <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:cgspace.cgiar.org:10568/82810">10568/82810</a> again</li>
<li>Altmetric said it was somehow related to the OAI <code>dateStamp</code> not getting updated when the mappings changed, but I said that back in <a href="/cgspace-notes/2018-07/">2018-07</a> when this happened it was because the OAI was actually just not reflecting all the item's mappings</li>
@@ -348,12 +348,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>Must have been something like an old DSpace 5.5 file in the spring folder&hellip; weird</li>
<li>But yay, this means we can update DSpace Test to Ubuntu 18.04, Tomcat 8, PostgreSQL 9.6, etc&hellip;</li>
</ul>
<h2 id="20180914">2018-09-14</h2>
<h2 id="2018-09-14">2018-09-14</h2>
<ul>
<li>Sisay uploaded the IITA records to CGSpace, but forgot to remove the old Handles</li>
<li>I explicitly told him not to forget to remove them yesterday!</li>
</ul>
<h2 id="20180916">2018-09-16</h2>
<h2 id="2018-09-16">2018-09-16</h2>
<ul>
<li>Add the DSpace build.properties as a template into my <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> for configuring DSpace machines</li>
<li>One stupid thing there is that I add all the variables in a private vars file, which apparently has higher precedence than host vars, meaning that I can't override them (like the SMTP server) on a per-host basis</li>
@@ -361,7 +361,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>I suggested that we leave access rights (<code>cg.identifier.access</code>) as it is now, with &ldquo;Open Access&rdquo; or &ldquo;Limited Access&rdquo;, and then simply re-brand that as &ldquo;Access rights&rdquo; in the UIs and relevant drop downs</li>
<li>Then we continue as planned to add <code>dc.rights</code> as &ldquo;Usage rights&rdquo;</li>
</ul>
<h2 id="20180917">2018-09-17</h2>
<h2 id="2018-09-17">2018-09-17</h2>
<ul>
<li>Skype meeting with CGSpace team in Addis</li>
<li>Change <code>cg.identifier.status</code> &ldquo;Access rights&rdquo; options to:
@@ -418,7 +418,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr's <code>fq</code> is similar to the regular <code>q</code> query parameter, but its results are cached in Solr's filter cache, so it should be faster for repeated queries (a small example of the difference is sketched below)</li>
</ul>
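<ul>
<li>A minimal sketch of that <code>q</code> vs <code>fq</code> difference using Python and requests (the actual query from above is elided in this diff, so the Solr URL and fields here are only illustrative):</li>
</ul>
<pre><code># Sketch only: the same restriction expressed inside q and then as an fq.
# The Solr URL and the type/isBot fields are assumptions based on the standard
# DSpace statistics core, not the exact query from the notes above.
import requests

solr = 'http://localhost:8081/solr/statistics/select'

# everything in q: parsed and scored as a single query
r1 = requests.get(solr, params={'q': 'type:2 AND isBot:false', 'rows': 0, 'wt': 'json'})

# restriction moved to fq: the matching document set is cached in the filterCache,
# so repeating it alongside different q values can reuse the cached filter
r2 = requests.get(solr, params={'q': '*:*', 'fq': 'type:2 AND isBot:false', 'rows': 0, 'wt': 'json'})

print(r1.json()['response']['numFound'], r2.json()['response']['numFound'])
</code></pre>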
<h2 id="20180918">2018-09-18</h2>
<h2 id="2018-09-18">2018-09-18</h2>
<ul>
<li>I managed to create a simple proof of concept REST API to expose item view and download statistics: <a href="https://github.com/alanorth/cgspace-statistics-api">cgspace-statistics-api</a></li>
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
@@ -439,12 +439,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</code></pre><ul>
<li>The rest of the Falcon tooling will be more difficult&hellip; (a minimal sketch of the basic routing is below)</li>
</ul>
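<ul>
<li>For reference, a minimal sketch of the Falcon routing pattern (this is <em>not</em> the real cgspace-statistics-api code; the route and field names are made up, and in the real API the numbers come from Solr):</li>
</ul>
<pre><code># Minimal Falcon sketch (Falcon 1.x/2.x style; use falcon.App() on Falcon 3+).
# Placeholder values only, to keep the sketch self-contained.
import falcon


class ItemResource:
    def on_get(self, req, resp, item_id):
        # the URI template field is passed in as a keyword argument
        resp.media = {'id': item_id, 'views': 0, 'downloads': 0}


api = falcon.API()
api.add_route('/item/{item_id}', ItemResource())

# run with any WSGI server, e.g.: gunicorn app:api
</code></pre>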
<h2 id="20180919">2018-09-19</h2>
<h2 id="2018-09-19">2018-09-19</h2>
<ul>
<li>I emailed Jane Poole to ask if there is some money we can use from the Big Data Platform (BDP) to fund the purchase of some Atmire credits for CGSpace</li>
<li>I learned that there is an efficient way to do <a href="http://yonik.com/solr/paging-and-deep-paging/">&ldquo;deep paging&rdquo; in large Solr result sets by using <code>cursorMark</code></a>, but it doesn't work with faceting (the basic pattern is sketched below)</li>
</ul>
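<ul>
<li>The basic <code>cursorMark</code> pattern looks roughly like this (a sketch, not a real CGSpace query; the sort field must be the core's uniqueKey, which may differ per core):</li>
</ul>
<pre><code># Deep-paging sketch with cursorMark: needs a sort on the uniqueKey field and,
# as noted above, cannot be combined with faceting.
import requests

solr = 'http://localhost:8081/solr/statistics/select'
params = {
    'q': '*:*',
    'rows': 1000,
    'sort': 'id asc',      # replace 'id' with the core's actual uniqueKey
    'wt': 'json',
    'cursorMark': '*',
}

while True:
    resp = requests.get(solr, params=params).json()
    docs = resp['response']['docs']
    # ... process this page of docs ...
    next_cursor = resp['nextCursorMark']
    if next_cursor == params['cursorMark']:
        break              # the cursor did not advance, so we have read everything
    params['cursorMark'] = next_cursor
</code></pre>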
<h2 id="20180920">2018-09-20</h2>
<h2 id="2018-09-20">2018-09-20</h2>
<ul>
<li>Contact Atmire to ask how we can buy more credits for future development</li>
<li>I researched the Solr <code>filterCache</code> size and found that the formula for calculating the potential memory use of <strong>each entry</strong> in the cache is the following (a worked example appears after this list):</li>
@@ -460,7 +460,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li><a href="https://docs.google.com/document/d/1vl-nmlprSULvNZKQNrqp65eLnLhG9s_ydXQtg9iML10/edit">Article discussing testing methodology for different <code>filterCache</code> sizes</a></li>
<li>Discuss Handle links on Twitter with IWMI</li>
</ul>
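<ul>
<li>The formula itself is not visible in this diff hunk, but a commonly cited estimate is that each <code>filterCache</code> entry can hold a bitset of roughly maxDoc/8 bytes, which allows a quick back-of-the-envelope calculation (the numbers below are made up for illustration):</li>
</ul>
<pre><code># Back-of-the-envelope filterCache sizing, assuming the common maxDoc/8 bytes
# per entry estimate (not necessarily the exact formula from the notes).
max_doc = 4_000_000        # hypothetical number of documents in the core
filter_cache_size = 512    # hypothetical filterCache size setting

bytes_per_entry = max_doc / 8
total_bytes = bytes_per_entry * filter_cache_size

print(f'{bytes_per_entry / 1024:.0f} KiB per entry')
print(f'{total_bytes / 1024 / 1024:.0f} MiB if the cache fills up')
</code></pre>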
<h2 id="20180921">2018-09-21</h2>
<h2 id="2018-09-21">2018-09-21</h2>
<ul>
<li>I see that there was a nice optimization to the ImageMagick PDF CMYK detection in the upstream <code>dspace-5_x</code> branch: <a href="https://github.com/DSpace/DSpace/pull/2204">DS-3664</a></li>
<li>The fix will go into DSpace 5.10, and we are currently on DSpace 5.8 but I think I'll cherry-pick that fix into our <code>5_x-prod</code> branch:
@@ -475,14 +475,14 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</ul>
</li>
</ul>
<h2 id="20190923">2019-09-23</h2>
<h2 id="2019-09-23">2019-09-23</h2>
<ul>
<li>I did more work on my <a href="https://github.com/alanorth/cgspace-statistics-api">cgspace-statistics-api</a>, fixing some item view counts and adding indexing via SQLite during deployment (I'm trying to avoid having to set up <em>yet another</em> database, user, password, etc)</li>
<li>I created a new branch called <code>5_x-upstream-cherry-picks</code> to test and track those cherry-picks from the upstream 5.x branch</li>
<li>Also, I need to test the new LDAP server, so I will deploy that on DSpace Test today</li>
<li>Rename my cgspace-statistics-api to <a href="https://github.com/alanorth/dspace-statistics-api">dspace-statistics-api</a> on GitHub</li>
</ul>
<h2 id="20180924">2018-09-24</h2>
<h2 id="2018-09-24">2018-09-24</h2>
<ul>
<li>Trying to figure out how to get item views and downloads from SQLite in a join</li>
<li>It appears SQLite doesn't support <code>FULL OUTER JOIN</code> so some people on StackOverflow have emulated it with <code>LEFT JOIN</code> and <code>UNION</code>:</li>
@@ -539,7 +539,7 @@ $ createuser -h localhost -U postgres --pwprompt dspacestatistics
$ psql -h localhost -U postgres dspacestatistics
dspacestatistics=&gt; CREATE TABLE IF NOT EXISTS items
dspacestatistics-&gt; (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)
</code></pre><h2 id="20180925">2018-09-25</h2>
</code></pre><h2 id="2018-09-25">2018-09-25</h2>
<ul>
<li>I deployed the DSpace statistics API on CGSpace, but when I ran the indexer it wanted to index 180,000 pages of item views</li>
<li>I'm not even sure how that's possible, as we only have 74,000 items!</li>
@@ -586,7 +586,7 @@ Indexing item downloads (page 260 of 260)
</code></pre><ul>
<li>And now it's fast as hell due to the muuuuch smaller Solr statistics core (a quick way to sanity-check what's in that core is sketched below)</li>
</ul>
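<ul>
<li>A quick way to sanity-check what is actually in a Solr statistics core is to facet on <code>type</code> and <code>isBot</code>; a sketch with requests, assuming the standard DSpace statistics fields (this is only a diagnostic, not what was changed above):</li>
</ul>
<pre><code># Diagnostic sketch: facet the statistics core on 'type' and 'isBot' to see what
# the documents actually are (in DSpace's schema type 2 is an item, type 0 a bitstream).
import requests

solr = 'http://localhost:8081/solr/statistics/select'
params = {
    'q': '*:*',
    'rows': 0,
    'wt': 'json',
    'facet': 'true',
    'facet.field': ['type', 'isBot'],   # requests repeats the parameter for each value
}

resp = requests.get(solr, params=params).json()
print(resp['response']['numFound'], 'documents in total')
print(resp['facet_counts']['facet_fields'])
</code></pre>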
<h2 id="20180926">2018-09-26</h2>
<h2 id="2018-09-26">2018-09-26</h2>
<ul>
<li>Linode emailed to say that CGSpace (linode18) was using 30Mb/sec of outbound bandwidth for two hours around midnight</li>
<li>I don't see anything unusual in the nginx logs, so perhaps it was the cron job that syncs the Solr database to Amazon S3?</li>
@@ -616,7 +616,7 @@ sys 2m18.485s
<li>I updated the dspace-statistics-api to use psycopg2's <code>execute_values()</code> to insert batches of 100 values into PostgreSQL instead of doing every insert individually (a minimal sketch of that pattern is below)</li>
<li>On CGSpace this reduces the total run time of <code>indexer.py</code> from 432 seconds to 400 seconds (most of the time is actually spent in getting the data from Solr though)</li>
</ul>
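<ul>
<li>A minimal sketch of the <code>execute_values()</code> batching pattern (the table layout matches the <code>items</code> table created earlier; the connection details and the ON CONFLICT upsert are placeholders, not the real indexer code):</li>
</ul>
<pre><code># Batched inserts with psycopg2's execute_values(); placeholder data and
# connection settings, with the items(id, views, downloads) table from above.
import psycopg2
from psycopg2.extras import execute_values

rows = [(1, 10, 2), (2, 5, 0), (3, 7, 1)]  # (id, views, downloads) tuples

conn = psycopg2.connect('dbname=dspacestatistics user=dspacestatistics host=localhost')
with conn.cursor() as cursor:
    execute_values(
        cursor,
        'INSERT INTO items (id, views, downloads) VALUES %s '
        'ON CONFLICT (id) DO UPDATE SET views = excluded.views, downloads = excluded.downloads',
        rows,
        page_size=100,  # send the rows in batches of 100 instead of one at a time
    )
conn.commit()
conn.close()
</code></pre>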
<h2 id="20180927">2018-09-27</h2>
<h2 id="2018-09-27">2018-09-27</h2>
<ul>
<li>Linode emailed to say that CGSpace's (linode19) CPU load was high for a few hours last night</li>
<li>Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:</li>
@@ -645,7 +645,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
<li>I will add their IPs to the list of bad bots in nginx so we can add a &ldquo;bot&rdquo; user agent to them and let Tomcat's Crawler Session Manager Valve handle them</li>
<li>I asked Atmire to prepare an invoice for 125 credits</li>
</ul>
<h2 id="20180929">2018-09-29</h2>
<h2 id="2018-09-29">2018-09-29</h2>
<ul>
<li>I merged some changes to author affiliations from Sisay as well as some corrections to organizational names using smart quotes like <code>Université d&rsquo;Abomey Calavi</code> (<a href="https://github.com/ilri/DSpace/pull/388">#388</a>)</li>
<li>Peter sent me a list of 43 author names to fix, but it had the usual encoding errors in names like <code>Belalcázar, John</code> (I will tell him to stop trying to export as UTF-8 because it never seems to work)</li>
@@ -662,7 +662,7 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
<li>It seems to be Moayad trying to do the AReS explorer indexing</li>
<li>He was sending too many (5 or 10) concurrent requests to the server, but still&hellip; why is this shit so slow?!</li>
</ul>
<h2 id="20180930">2018-09-30</h2>
<h2 id="2018-09-30">2018-09-30</h2>
<ul>
<li>Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc</li>
<li>I think I should just batch export and update all languages&hellip;</li>