<li>Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
<ul>
<li>He was hitting <a href="https://github.com/ilri/OpenRXV/issues/66">a bug on AReS</a> and he also only needed stats for 2020, but AReS currently only gives all-time stats</li>
</ul>
</li>
<li>I cleaned up about 230 ISSNs on CGSpace in OpenRefine
<ul>
<li>I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: <code>isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))</code></li>
<li>Then I applied them on CGSpace with the <code>fix-metadata-values.py</code> script (see the sketch after this list)</li>
<li>For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc., but there are still lots of invalid ones that need more manual work:
<ul>
<li>Too few characters</li>
<li>Too many characters</li>
<li>ISBNs</li>
</ul>
</li>
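<li>For reference, the invocation looks roughly like the block below; the CSV path, credentials, and metadata field ID are illustrative, not the exact values I used:</li>
</ul>
<pre><code class="language-console" data-lang="console"># illustrative invocation; adjust the CSV path, field name, and metadata field ID
$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-issns.csv -db dspace -u dspace -p 'password' -f cg.issn -t correct -m 253
</code></pre><ul>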
<li>Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps</li>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>, not <code>openrxv-items-temp</code>… I will have to fix that manually</li>
<li>Enrico asked for more information on the RTB stats I gave him yesterday
<ul>
<li>I remembered (again) that we can’t filter Atmire’s CUA stats by date issued</li>
<li>To show, for example, views/downloads in the year 2020 for RTB issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range</li>
<li>I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:</li>
<pre><code class="language-console" data-lang="console">csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
csvgrep -c issued -m 2020 | \
csvcut -c id | \
sed '1d' | \
sort | \
uniq
</code></pre><ul>
<li>So that I remember in the future, this basically does the following:
<ul>
<li>Use csvcut to extract the id and all date issued columns from the CSV</li>
<li>Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)</li>
<li>Use csvsql to concatenate the various date issued columns (coalescing where null)</li>
<li>Use csvgrep to filter items by date issued in 2020</li>
<li>Use csvcut to extract the id column</li>
<li>Use sed to delete the header row</li>
<li>Use sort and uniq to filter out any duplicate IDs (there were three)</li>
</ul>
</li>
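<li>For future reference, the whole pipeline looks something like the following (the CSV path and the exact date issued column names are illustrative):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' /tmp/rtb.csv | \
  sed '1d' | \
  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
  csvgrep -c issued -m 2020 | \
  csvcut -c id | \
  sed '1d' | \
  sort | \
  uniq
</code></pre><ul>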
<li>Then I have a list of 296 IDs for RTB items issued in 2020</li>
<li>I constructed a JSON file to post to the DSpace Statistics API:</li>
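<li>Something like this, going by the dspace-statistics-api README; the UUIDs, file name, and local URL below are placeholders:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ cat /tmp/rtb-2020.json
{
  "limit": 100,
  "page": 0,
  "dateFrom": "2020-01-01T00:00:00Z",
  "dateTo": "2020-12-31T00:00:00Z",
  "items": ["b8f86a2e-..."]
}
$ curl -s -X POST -H "Content-Type: application/json" -d @/tmp/rtb-2020.json 'http://localhost:8000/items'
</code></pre><ul>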
<li>Around that time in the dspace log I see nothing unusual, but maybe these?</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
</code></pre><ul>
<li>(BTW what is the deal with the “200/127”? I should send a comment to Atmire)
<ul>
<li>I filed a ticket with Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets</a></li>
</ul>
</li>
<li>I restarted the PostgreSQL and Tomcat services and now I see fewer connections, but still WAY high:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
</code></pre><ul>
<li>After ten minutes or so it went back down…</li>
<li>And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
</code></pre><ul>
<li>I sent some notes and a log to Atmire on our existing issue about the database stuff
<ul>
<li>Also I asked them about the possibility of doing a formal review of Hibernate</li>
</ul>
</li>
<li>Falcon 3.0.0 was released, so I updated the 3.0.0 branch for dspace-statistics-api and merged it to <code>v6_x</code>
<ul>
<li>I also fixed one minor (unrelated) bug in the tests</li>
<li>Then I deployed the new version on DSpace Test</li>
</ul>
</li>
<li>I had a meeting with Peter and Abenet about CGSpace TODOs</li>
<li>CGSpace went down again and the PostgreSQL locks are through the roof:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
</code></pre><ul>
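<li>A generic query to see where those connections are coming from (not the exact one I ran) groups <code>pg_stat_activity</code> by application name:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT application_name, state, COUNT(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY COUNT(*) DESC;'
</code></pre><ul>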
<li>I don’t see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:</li>
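<li>A count like that can be derived by grepping unique session IDs out of the DSpace log, roughly like this (the time window below is illustrative):</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 1(0|1|2|3):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
</code></pre><ul>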
<li>Also annoying, I see tons of what look like penetration testing requests from Qualys:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
</code></pre><ul>
<li>I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years
<ul>
<li>All development is happening in <a href="https://github.com/ilri/openRXV">https://github.com/ilri/openRXV</a> now</li>
<li>Then I realized I have to clone the backup index directly to <code>openrxv-items-final</code>, and re-create the <code>openrxv-items</code> alias:</li>
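<li>A rough sketch of that clone-and-alias fix, assuming the standard Elasticsearch clone and alias APIs (the backup index name here is hypothetical):</li>
</ul>
<pre><code class="language-console" data-lang="console"># make the backup index read-only so it can be cloned, then clone it and point the alias at the clone
$ curl -s -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.write": true}'
$ curl -s -X POST "localhost:9200/openrxv-items-backup/_clone/openrxv-items-final"
$ curl -s -X PUT "localhost:9200/openrxv-items-final/_alias/openrxv-items"
</code></pre><ul>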
<li>Then I started a fresh harvesting in the AReS Explorer admin dashboard</li>
</ul>
</li>
</ul>
<h2id="2021-04-12">2021-04-12</h2>
<ul>
<li>The harvesting on AReS finished last night, but the indexes got messed up again
<ul>
<li>I will have to fix them manually next time…</li>
</ul>
</li>
</ul>
<h2id="2021-04-13">2021-04-13</h2>
<ul>
<li>Looking into the logs on 2021-04-06 on CGSpace and DSpace Test to see if there is anything specific that stands out about the activity on those days that would cause the PostgreSQL issues
<ul>
<li>Digging into the Munin graphs for the last week I found a few other things happening on that morning:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2021/04/sda-week.png" alt="/dev/sda disk latency week"></p>
<pre><code class="language-console" data-lang="console">Purging 13159 hits from SomeRandomText in statistics
Total number of bot hits purged: 13159
</code></pre><ul>
<li>I noticed there were 78 items submitted in the hour before CGSpace crashed:</li>
</ul>
<pre><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
</code></pre><ul>
<li>Of those 78, 77 were from Udana</li>
<li>Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:</li>
</ul>
<pre><code class="language-console" data-lang="console"># for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
</code></pre><ul>
<li>Release v1.4.2 of the DSpace Statistics API on GitHub: <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2">https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2</a>
<ul>
<li>This has been running on DSpace Test for the last week or so, and mostly contains the Falcon 3.0.0 changes</li>
</ul>
</li>
<li>Re-sync DSpace Test with data from CGSpace
<ul>
<li>Run system updates on DSpace Test (linode26) and reboot the server</li>
</ul>
</li>
<li>Update the PostgreSQL JDBC driver on DSpace Test (linode26) to 42.2.19
<ul>
<li>It has been a few months since we updated this, and there have been a few releases since 42.2.14, which is the version we are currently using</li>
</ul>
</li>
<li>Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
</code></pre><ul>
<li>I added the account to the Alliance Admins account, which should allow him to submit to any Alliance collection
<ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">yellow open openrxv-values ChyhGwMDQpevJtlNWO1vcw 1 1 1579 0 537.6kb 537.6kb
yellow open openrxv-items-temp PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-shared J_8cxIz6QL6XTRZct7UBBQ 1 1 127 0 115.7kb 115.7kb
yellow open openrxv-values-00001 jAoXTLR0R9mzivlDVbQaqA 1 1 3903 0 696.2kb 696.2kb
green open .kibana_task_manager_1 O1zgJ0YlQhKCFAwJZaNSIA 1 0 2 2 20.6kb 20.6kb
yellow open openrxv-users 1hWGXh9kS_S6YPxAaBN8ew 1 1 5 0 28.6kb 28.6kb
green open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0 0 0 283b 283b
yellow open openrxv-items-final sgk-s8O-RZKdcLRoWt3G8A 1 1 970 0 2.3mb 2.3mb
green open .kibana_1 HHPN7RD_T7qe0zDj4rauQw 1 0 25 7 36.8kb 36.8kb
yellow open users M0t2LaZhSm2NrF5xb64dnw 1 1 2 0 11.6kb 11.6kb
</code></pre><ul>
<li>Somehow the <code>openrxv-items-final</code> index only has a few items and the majority are in <code>openrxv-items-temp</code>, via the <code>openrxv-items</code> alias (which is in the temp index):</li>
<li>I deleted the <code>openrxv-items-final</code> index and <code>openrxv-items-temp</code> indexes and then restored the mappings to <code>openrxv-items-final</code>, added the <code>openrxv-items</code> alias, and started restoring the data to <code>openrxv-items</code> with elasticdump:</li>
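<li>With elasticdump that looks roughly like the following (the backup file paths are hypothetical):</li>
</ul>
<pre><code class="language-console" data-lang="console"># restore the mappings, re-create the alias, then restore the data through the alias
$ elasticdump --input=/home/backup/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X PUT "localhost:9200/openrxv-items-final/_alias/openrxv-items"
$ elasticdump --input=/home/backup/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --type=data --limit=1000
</code></pre><ul>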
<li>AReS seems to be working fine after that, so I created the <code>openrxv-items-temp</code> index and then started a fresh harvest on AReS Explorer:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -X PUT "localhost:9200/openrxv-items-temp"
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc., and then reboot the system</li>
</ul>
<pre><code class="language-console" data-lang="console">yellow open openrxv-items-temp kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
yellow open openrxv-items-final HFc3uytTRq2GPpn13vkbmg 1 1 970 0 2.3mb 2.3mb
</code></pre><ul>
<li>The indices endpoint doesn’t include the <code>openrxv-items</code> alias, but it is currently in the <code>openrxv-items-temp</code> index so the number of items is the same:</li>
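<li>A quick way to verify which index the alias points to is the cat aliases API:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_cat/aliases?v'
</code></pre><ul>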
<li>I am curious to check the JVM stats in a few days to see if there is a marked change</li>
</ul>
</li>
<li>Work on minor changes to get DSpace working on Ubuntu 20.04 for our <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>Send Abdullah feedback on the <a href="https://github.com/ilri/OpenRXV/pull/91">filter on click pull request</a> for OpenRXV
<ul>
<li>I see it adds a new “allow filter on click” checkbox in the layout settings, but it doesn’t modify the filters</li>
<li>Also, it seems to have broken the existing clicking of the countries on the map</li>
</ul>
</li>
<li>Atmire recently sent feedback about the CUA duplicates processor
<ul>
<li>Last month when I ran it, it got stuck on the storage reports, apparently, so I will try again (with a fresh Solr statistics core from production) and skip the storage reports (<code>-g</code>):</li>
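<li>From memory the invocation is something like the block below, via <code>dspace dsrun</code>; I am not certain of the exact class path or flags other than <code>-g</code>, so treat it as illustrative:</li>
</ul>
<pre><code class="language-console" data-lang="console"># class path and thread/core flags recalled from memory; -g skips the storage reports
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics -g
</code></pre><ul>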
<li>I made a backup of the indexes, then deleted the <code>openrxv-items-final</code> and <code>openrxv-items-temp</code> indexes, re-created the <code>openrxv-items</code> alias, and restored the data into <code>openrxv-items</code>:</li>
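<li>Backing up the item indexes with elasticdump before touching anything looks roughly like this (the output paths are hypothetical):</li>
</ul>
<pre><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/backup/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/backup/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>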
<li>Start testing DSpace 7.0 Beta 5 so I can evaluate if it solves some of the problems we are having on DSpace 6, and if it’s missing things like multiple handle resolvers, etc
<ul>
<li>I see it needs Java JDK 11, Tomcat 9, Solr 8, and PostgreSQL 11</li>
<li>Also, according to the <a href="https://wiki.lyrasis.org/display/DSDOC7x/Installing+DSpace">installation notes</a> I see you can install the old DSpace 6 REST API, so that’s potentially useful for us</li>
<li>I see that all web applications on the backend are now rolled into just one “server” application</li>
<li>The build process took 11 minutes the first time (due to downloading the world with Maven) and ~2 minutes the second time</li>
<li>The <code>local.cfg</code> content and syntax is very similar to DSpace 6’s</li>
</ul>
</li>
<li>I got the basic <code>fresh_install</code> up and running
<ul>
<li>Then I tried to import a DSpace 6 database from production</li>
</ul>
</li>
<li>I tried to delete all the Atmire SQL migrations:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
</code></pre><ul>
<li>But I got an error when running <code>dspace database migrate</code>:</li>
</ul>
<pre><code class="language-console" data-lang="console">Detected applied migration not resolved locally: 5.0.2017.09.25
Detected applied migration not resolved locally: 6.0.2017.01.30
Detected applied migration not resolved locally: 6.0.2017.09.25
at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
at org.flywaydb.core.Flyway.execute(Flyway.java:527)
at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
... 9 more
</code></pre><ul>
<li>I deleted those migrations:</li>
</ul>
<pre><code class="language-console" data-lang="console">localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
</code></pre><ul>
<li>Then when I ran the migration again it failed for a new reason, related to the configurable workflow:</li>
</ul>
<pre><code class="language-console" data-lang="console">Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
...
</code></pre><ul>
<li>The <a href="https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace">DSpace 7 upgrade docs</a> say I need to apply these previously optional migrations:</li>
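<li>As far as I recall that is done with <code>dspace database migrate ignored</code> (sketch below), after which the full Discovery reindex produced the timing that follows:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate ignored
</code></pre>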
<pre><code class="language-console" data-lang="console">~/dspace7b5/bin/dspace index-discovery -b 25156.71s user 64.22s system 97% cpu 7:11:09.94 total
</code></pre><ul>
<li>Not good, that shit took almost seven hours!</li>
</ul>
<h2id="2021-04-27">2021-04-27</h2>
<ul>
<li>Peter sent me a list of 500+ DOIs from CGSpace with no Altmetric score
<ul>
<li>I used csvgrep (with Windows encoding!) to extract those without our handle and save the DOIs to a text file, then got their handles with my <code>doi-to-handle.py</code> script:</li>