<li>Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
<ul>
<li>He was hitting <ahref="https://github.com/ilri/OpenRXV/issues/66">a bug on AReS</a> and also he only needed stats for 2020, and AReS currently only gives all-time stats</li>
</ul>
</li>
<li>I cleaned up about 230 ISSNs on CGSpace in OpenRefine
<ul>
<li>I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: <code>isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))</code></li>
<li>Then I applied them on CGSpace with the <code>fix-metadata-values.py</code> script:</li>
<li>For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc, but there are still lots of invalid ones which need more manual work:
<ul>
<li>Too few characters</li>
<li>Too many characters</li>
<li>ISBNs</li>
</ul>
</li>
<li>Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps</li>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>, not <code>openrxv-temp</code>… I will have to fix that manually</li>
<li>Enrico asked for more information on the RTB stats I gave him yesterday
<ul>
<li>I remembered (again) that we can’t filter Atmire’s CUA stats by date issued</li>
<li>To show, for example, views/downloads in the year 2020 for RTB issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range</li>
<li>I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:</li>
csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
csvgrep -c issued -m 2020 | \
csvcut -c id | \
sed '1d' | \
sort | \
uniq
</code></pre><ul>
<li>So I remember in the future, this basically does the following:
<ul>
<li>Use csvcut to extract the id and all date issued columns from the CSV</li>
<li>Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)</li>
<li>Use csvsql to concatenate the various date issued columns (coalescing where null)</li>
<li>Use csvgrep to filter items by date issued in 2020</li>
<li>Use csvcut to extract the id column</li>
<li>Use sed to delete the header row</li>
<li>Use sort and uniq to filter out any duplicate IDs (there were three)</li>
</ul>
</li>
<li>Then I have a list of 296 IDs for RTB items issued in 2020</li>
<li>I constructed a JSON file to post to the DSpace Statistics API:</li>
<li>Around that time in the dspace log I see nothing unusual, but maybe these?</li>
</ul>
<pre><codeclass="language-console"data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
</code></pre><ul>
<li>(BTW what is the deal with the “200/127”? I should send a comment to Atmire)
<ul>
<li>I file a ticket with Atmire: <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets</a></li>
</ul>
</li>
<li>I restarted the PostgreSQL and Tomcat services and now I see less connections, but still WAY high:</li>
</ul>
<pre><codeclass="language-console"data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
</code></pre><ul>
<li>After ten minutes or so it went back down…</li>
<li>And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:</li>
</ul>
<pre><codeclass="language-console"data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
</code></pre><ul>
<li>I sent some notes and a log to Atmire on our existing issue about the database stuff
<ul>
<li>Also I asked them about the possibility of doing a formal review of Hibernate</li>
</ul>
</li>
<li>Falcon 3.0.0 was released so I updated the 3.0.0 branch for dspace-statistics-api and merged it to <code>v6_x</code>
<ul>
<li>I also fixed one minor (unrelated) bug in the tests</li>
<li>Then I deployed the new version on DSpace Test</li>
</ul>
</li>
<li>I had a meeting with Peter and Abenet about CGSpace TODOs</li>
<li>CGSpace went down again and the PostgreSQL locks are through the roof:</li>
</ul>
<pre><codeclass="language-console"data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
</code></pre><ul>
<li>I don’t see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:</li>
<li>Also annoying, I see tons of what look like penetration testing requests from Qualys:</li>
</ul>
<pre><codeclass="language-console"data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
</code></pre><ul>
<li>I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years
<ul>
<li>All development is happening in <ahref="https://github.com/ilri/openRXV">https://github.com/ilri/openRXV</a> now</li>
<li>Then I realized I have to clone the backup index directly to <code>openrxv-items-final</code>, and re-create the <code>openrxv-items</code> alias:</li>
<li>Then I started a fresh harvesting in the AReS Explorer admin dashboard</li>
</ul>
<h2id="2021-04-12">2021-04-12</h2>
<ul>
<li>The harvesting on AReS finished last night, but the indexes got messed up again
<ul>
<li>I will have to fix them manually next time…</li>
</ul>
</li>
</ul>
<h2id="2021-04-13">2021-04-13</h2>
<ul>
<li>Looking into the logs on 2021-04-06 on CGSpace and DSpace Test to see if there is anything specific that stands out about the activty on those days that would cause the PostgreSQL issues
<ul>
<li>Digging into the Munin graphs for the last week I found a few other things happening on that morning:</li>
</ul>
</li>
</ul>
<p><imgsrc="/cgspace-notes/2021/04/sda-week.png"alt="/dev/sda disk latency week">
Purging 13159 hits from SomeRandomText in statistics
Total number of bot hits purged: 13159
</code></pre><ul>
<li>I noticed there were 78 items submitted in the hour before CGSpace crashed:</li>
</ul>
<pre><codeclass="language-console"data-lang="console"># grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
</code></pre><ul>
<li>Of those 78, 77 of them were from Udana</li>
<li>Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:</li>
</ul>
<pre><codeclass="language-console"data-lang="console"># for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a