Add notes for 2019-12-17

2019-12-17 14:49:24 +02:00
parent d83c951532
commit d54e5b69f1
90 changed files with 1420 additions and 1377 deletions


@ -33,7 +33,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.60.1" />
<meta name="generator" content="Hugo 0.61.0" />
@ -114,12 +114,12 @@ Today these are the top 10 IPs:
</p>
</header>
<h2 id="20181101">2018-11-01</h2>
<h2 id="2018-11-01">2018-11-01</h2>
<ul>
<li>Finalize AReS Phase I and Phase II ToRs</li>
<li>Send a note about my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to the dspace-tech mailing list</li>
</ul>
<h2 id="20181103">2018-11-03</h2>
<h2 id="2018-11-03">2018-11-03</h2>
<ul>
<li>Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage</li>
<li>Today these are the top 10 IPs:</li>
@ -218,7 +218,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
<li>I will add them to the list of bot IPs in nginx for now and think about enforcing rate limits in XMLUI later</li>
<li>Also, this is the third (?) time a mysterious IP on Hetzner has done this&hellip; who is this?</li>
</ul>
<h2 id="20181104">2018-11-04</h2>
<h2 id="2018-11-04">2018-11-04</h2>
<ul>
<li>Forward Peter's information about CGSpace financials to Modi from ICRISAT</li>
<li>Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again</li>
@ -313,7 +313,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>I added the &ldquo;most-popular&rdquo; pages to the list that return <code>X-Robots-Tag: none</code> to try to inform bots not to index or follow those pages</li>
<li>Also, I implemented an nginx rate limit of twelve requests per minute on all dynamic pages&hellip; I figure a human user might legitimately request one every five seconds</li>
</ul>
<h2 id="20181105">2018-11-05</h2>
<h2 id="2018-11-05">2018-11-05</h2>
<ul>
<li>I wrote a small Python script <a href="https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5">add-dc-rights.py</a> to add usage rights (<code>dc.rights</code>) to CGSpace items based on the CSV Hector gave me from MARLO:</li>
</ul>
@ -336,7 +336,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!</li>
<li>At least the Tomcat Crawler Session Manager Valve is working now&hellip;</li>
</ul>
<h2 id="20181106">2018-11-06</h2>
<h2 id="2018-11-06">2018-11-06</h2>
<ul>
<li>I updated all the <a href="https://github.com/ilri/DSpace/wiki/Scripts">DSpace helper Python scripts</a> to validate against PEP 8 using Flake8</li>
<li>While I was updating the <a href="https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50">rest-find-collections.py</a> script I noticed it was using <code>expand=all</code> to get the collection and community IDs</li>
@ -346,12 +346,12 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
</code></pre><ul>
<li>Average time with all expands was 14.3 seconds, and 12.8 seconds with <code>collections,subCommunities</code>, so <strong>1.5 seconds difference</strong>!</li>
</ul>
<h2 id="20181107">2018-11-07</h2>
<h2 id="2018-11-07">2018-11-07</h2>
<ul>
<li>Update my <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to use a database management class with Python contexts so that connections and cursors are automatically opened and closed</li>
<li>Tag version 0.7.0 of the dspace-statistics-api</li>
</ul>
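<ul>
<li>A minimal sketch of the context-manager pattern described above, not the actual dspace-statistics-api code: the class name <code>DatabaseManager</code>, the <code>items</code> table, and the use of <code>sqlite3</code> as a stand-in driver are all assumptions made so the example runs on its own</li>
</ul>

```python
import sqlite3


class DatabaseManager:
    """Open a connection and cursor on entry, close both on exit.

    Hypothetical class for illustration; sqlite3 stands in for whatever
    driver the real API uses.
    """

    def __init__(self, path=":memory:"):
        self.path = path
        self.connection = None
        self.cursor = None

    def __enter__(self):
        self.connection = sqlite3.connect(self.path)
        self.cursor = self.connection.cursor()
        return self.cursor

    def __exit__(self, exc_type, exc_value, traceback):
        # Commit on success, roll back on error, and always release resources
        if exc_type is None:
            self.connection.commit()
        else:
            self.connection.rollback()
        self.cursor.close()
        self.connection.close()
        return False  # do not swallow exceptions


def count_rows():
    """Create a throwaway table (hypothetical schema) and count its rows."""
    with DatabaseManager() as cursor:
        cursor.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INTEGER)")
        cursor.executemany("INSERT INTO items (views) VALUES (?)", [(3,), (5,)])
        cursor.execute("SELECT COUNT(*) FROM items")
        return cursor.fetchone()[0]
```

<ul>
<li>Because cleanup lives in <code>__exit__</code>, callers cannot forget to close a connection, which is the kind of change that should be visible in the number of open database connections</li>
</ul>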
<h2 id="20181108">2018-11-08</h2>
<h2 id="2018-11-08">2018-11-08</h2>
<ul>
<li>I deployed version 0.7.0 of the dspace-statistics-api on DSpace Test (linode19) so I can test it for a few days (and check the Munin stats to see the change in database connections) before deploying on CGSpace</li>
<li>I also enabled systemd's persistent journal by setting <a href="https://www.freedesktop.org/software/systemd/man/journald.conf.html"><code>Storage=persistent</code> in <em>journald.conf</em></a></li>
@ -362,12 +362,12 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
</ul>
</li>
</ul>
<h2 id="20181111">2018-11-11</h2>
<h2 id="2018-11-11">2018-11-11</h2>
<ul>
<li>I added tests to the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>!</li>
<li>It runs with Python 3.5, 3.6, and 3.7 using pytest, including automatically on Travis CI!</li>
</ul>
<h2 id="20181113">2018-11-13</h2>
<h2 id="2018-11-13">2018-11-13</h2>
<ul>
<li>Help troubleshoot an issue with Judy Kimani submitting to the <a href="https://cgspace.cgiar.org/handle/10568/78">ILRI project reports, papers and documents</a> collection on CGSpace</li>
<li>For some reason there is an existing group for the &ldquo;Accept/Reject&rdquo; workflow step, but it's empty</li>
@ -377,21 +377,21 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
<li>As for the collection mappings I think I need to export the CSV from DSpace Test, add mappings for each type (ie Books go to IITA books collection, etc), then re-import to DSpace Test, then export from DSpace command line in &ldquo;migrate&rdquo; mode&hellip;</li>
<li>From there I should be able to script the removal of the old DSpace Test collection so they just go to the correct IITA collections on import into CGSpace</li>
</ul>
<h2 id="20181114">2018-11-14</h2>
<h2 id="2018-11-14">2018-11-14</h2>
<ul>
<li>Finally import the 277 IITA (ALIZZY1802) records to CGSpace</li>
<li>I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace's <code>dspace export</code> command doesn't include the collections for the items!</li>
<li>Delete all old IITA collections on DSpace Test and run <code>dspace cleanup</code> to get rid of all the bitstreams</li>
</ul>
<h2 id="20181115">2018-11-15</h2>
<h2 id="2018-11-15">2018-11-15</h2>
<ul>
<li>Deploy version 0.8.1 of the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> to CGSpace (linode18)</li>
</ul>
<h2 id="20181118">2018-11-18</h2>
<h2 id="2018-11-18">2018-11-18</h2>
<ul>
<li>Request invoice from Wild Jordan for their meeting venue in January</li>
</ul>
<h2 id="20181119">2018-11-19</h2>
<h2 id="2018-11-19">2018-11-19</h2>
<ul>
<li>Testing corrections and deletions for AGROVOC (<code>dc.subject</code>) that Sisay and Peter were working on earlier this month:</li>
</ul>
@ -405,7 +405,7 @@ $ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m
<li>Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:</li>
</ul>
<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
</code></pre><h2 id="20181120">2018-11-20</h2>
</code></pre><h2 id="2018-11-20">2018-11-20</h2>
<ul>
<li>The Discovery re-indexing on CGSpace never finished yesterday&hellip; the command died after six minutes</li>
<li>The <code>dspace.log.2018-11-19</code> shows this at the time:</li>
@ -432,7 +432,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<ul>
<li>these items will go to the <a href="https://dspacetest.cgiar.org/handle/10568/81592">Restoring Degraded Landscapes collection</a></li>
<li>a few items missing DOIs, but they are easily available on the publication page</li>
<li>clean up DOIs to use &ldquo;<a href="https://doi.org">https://doi.org</a>&rdquo; format</li>
<li>clean up DOIs to use &ldquo;<a href="https://doi.org%22">https://doi.org&quot;</a> format</li>
<li>clean up some cg.identifier.url to remove unnecessary query strings</li>
<li>remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)</li>
<li>fix column with invalid spaces in metadata field name (cg. subject. wle)</li>
@ -446,16 +446,16 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<li>these items will go to the <a href="https://dspacetest.cgiar.org/handle/10568/81589">Variability, Risks and Competing Uses collection</a></li>
<li>trim and collapse whitespace in all fields (lots in WLE subject!)</li>
<li>clean up some cg.identifier.url fields that had unnecessary anchors in their links</li>
<li>clean up DOIs to use &ldquo;<a href="https://doi.org">https://doi.org</a>&rdquo; format</li>
<li>clean up DOIs to use &ldquo;<a href="https://doi.org%22">https://doi.org&quot;</a> format</li>
<li>fix column with invalid spaces in metadata field name (cg. subject. wle)</li>
<li>remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)</li>
<li>remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: <code>value.replace('<27>','')</code></li>
<li>I notice a few items using DOIs pointing at ICARDA's DSpace like: <a href="https://doi.org/20.500.11766/8178">https://doi.org/20.500.11766/8178</a>, which then points at the &ldquo;real&rdquo; DOI on the publisher's site&hellip; these should be using the real DOI instead of ICARDA's &ldquo;fake&rdquo; Handle DOI</li>
<li>I notice a few items using DOIs pointing at ICARDA's DSpace like: <a href="https://doi.org/20.500.11766/8178,">https://doi.org/20.500.11766/8178,</a> which then points at the &ldquo;real&rdquo; DOI on the publisher's site&hellip; these should be using the real DOI instead of ICARDA's &ldquo;fake&rdquo; Handle DOI</li>
<li>Some items missing DOIs, but they clearly have them if you look at the publisher's site</li>
</ul>
</li>
</ul>
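<ul>
<li>The cleanup steps listed above (removing 0xfffd characters, trimming and collapsing whitespace, and rewriting DOIs to the <code>https://doi.org</code> form) could be sketched in Python roughly as follows; the function names, the set of DOI prefixes handled, and the placeholder DOI <code>10.1000/xyz</code> are assumptions for illustration, not the OpenRefine steps that were actually run</li>
</ul>

```python
import re

# Prefixes that are assumed to occur in the data; extend as needed
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def clean_text(value):
    """Drop U+FFFD replacement characters, then trim and collapse whitespace."""
    value = value.replace("\ufffd", "")
    return re.sub(r"\s+", " ", value).strip()


def normalize_doi(value):
    """Rewrite common DOI spellings to the https://doi.org/ form."""
    bare = DOI_PREFIX.sub("", clean_text(value))
    return "https://doi.org/" + bare


if __name__ == "__main__":
    print(clean_text("  Soil\ufffd   water  "))
    print(normalize_doi("http://dx.doi.org/10.1000/xyz"))
```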
<h2 id="20181122">2018-11-22</h2>
<h2 id="2018-11-22">2018-11-22</h2>
<ul>
<li>Tezira is having problems submitting to the <a href="https://cgspace.cgiar.org/handle/10568/24452">ILRI brochures</a> collection for some reason
<ul>
@ -466,7 +466,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
</ul>
</li>
</ul>
<h2 id="20181126">2018-11-26</h2>
<h2 id="2018-11-26">2018-11-26</h2>
<ul>
<li><a href="https://cgspace.cgiar.org/handle/10568/97709">This WLE item</a> is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the <a href="https://cgspace.cgiar.org/handle/10568/41888">WLE R4D Learning Series</a> collection on CGSpace for some reason, and therefore does not show up on the WLE publication website</li>
<li>I tried to remove that collection from Discovery and do a simple re-index:</li>
@ -484,7 +484,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
<li>More work on the AReS terms of reference for CodeObia</li>
<li>Erica from AgriKnowledge emailed me to say that they have implemented the changes in their item page UI so that they include the permanent identifier on items harvested from CGSpace, for example: <a href="https://www.agriknowledge.org/concern/generics/wd375w33s">https://www.agriknowledge.org/concern/generics/wd375w33s</a></li>
</ul>
<h2 id="20181127">2018-11-27</h2>
<h2 id="2018-11-27">2018-11-27</h2>
<ul>
<li>Linode alerted me that the outbound traffic rate on CGSpace (linode18) was very high</li>
<li>The top users this morning are:</li>
@ -519,7 +519,7 @@ $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
</li>
<li>Help Marianne troubleshoot some issue with items in their WLE collections and the WLE publications website</li>
</ul>
<h2 id="20181128">2018-11-28</h2>
<h2 id="2018-11-28">2018-11-28</h2>
<ul>
<li>Change the usage rights text a bit based on Maria Garruccio's feedback on &ldquo;all rights reserved&rdquo; (<a href="https://github.com/ilri/DSpace/pull/404">#404</a>)</li>
<li>Run all system updates on DSpace Test (linode19) and reboot the server</li>