cgspace-notes/docs/2020-11/index.html
2021-11-01 10:07:11 +02:00

<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2020" />
<meta property="og:description" content="2020-11-01
Continue with processing the statistics-2019 Solr core with the AtomicStatisticsUpdateCLI tool on DSpace Test
So far we&rsquo;ve spent at least fifty hours to process the statistics and statistics-2019 core&hellip; wow.
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-11/" />
<meta property="article:published_time" content="2020-11-01T13:11:54+02:00" />
<meta property="article:modified_time" content="2020-11-30T20:12:55+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2020"/>
<meta name="twitter:description" content="2020-11-01
Continue with processing the statistics-2019 Solr core with the AtomicStatisticsUpdateCLI tool on DSpace Test
So far we&rsquo;ve spent at least fifty hours to process the statistics and statistics-2019 core&hellip; wow.
"/>
<meta name="generator" content="Hugo 0.88.1" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-11/",
"wordCount": "3655",
"datePublished": "2020-11-01T13:11:54+02:00",
"dateModified": "2020-11-30T20:12:55+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-11/">
<title>November, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-11/">November, 2020</a></h2>
<p class="blog-post-meta">
<time datetime="2020-11-01T13:11:54+02:00">Sun Nov 01, 2020</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-11-01">2020-11-01</h2>
<ul>
<li>Continue with processing the statistics-2019 Solr core with the AtomicStatisticsUpdateCLI tool on DSpace Test
<ul>
<li>So far we&rsquo;ve spent at least fifty hours to process the statistics and statistics-2019 core&hellip; wow.</li>
</ul>
</li>
</ul>
<h2 id="2020-11-02">2020-11-02</h2>
<ul>
<li>Talk to Moayad and fix a few issues on OpenRXV:
<ul>
<li>Incorrect views and downloads (caused by Elasticsearch&rsquo;s default result set size of 10)</li>
<li>Invalid share link</li>
<li>Missing &ldquo;https://&rdquo; for Handles in the Simple Excel report (caused by using the <code>handle</code> instead of the <code>uri</code>)</li>
<li>Sorting the list of items by views</li>
</ul>
</li>
<li>I resumed the processing of the statistics-2018 Solr core after it spent 20 hours to get to 60%</li>
</ul>
<h2 id="2020-11-04">2020-11-04</h2>
<ul>
<li>After 29 hours the statistics-2017 core finished processing so I started the statistics-2016 core on DSpace Test</li>
</ul>
<h2 id="2020-11-05">2020-11-05</h2>
<ul>
<li>Peter sent me corrections and deletions for the author affiliations
<ul>
<li>I quickly proofed them for UTF-8 issues in OpenRefine and csv-metadata-quality, tested them locally, and then applied them on CGSpace:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><ul>
<li>Then I started a Discovery re-index on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m24.993s
user 8m11.858s
sys 2m26.931s
</code></pre><h2 id="2020-11-06">2020-11-06</h2>
<ul>
<li>Restart the AtomicStatisticsUpdateCLI processing of the statistics-2016 core on DSpace Test after 20 hours&hellip;
<ul>
<li>This phase finished after five hours so I started it on the statistics-2015 core</li>
</ul>
</li>
</ul>
<h2 id="2020-11-07">2020-11-07</h2>
<ul>
<li>Atmire responded about the issue with duplicate values in owningComm and containerCommunity etc
<ul>
<li>I told them to please look into it and use some of our credits if need be</li>
</ul>
</li>
<li>The statistics-2015 core finished after 20 hours so I started the statistics-2014 core</li>
</ul>
<h2 id="2020-11-08">2020-11-08</h2>
<ul>
<li>Add &ldquo;Data Paper&rdquo; to types on CGSpace</li>
<li>Add &ldquo;SCALING CLIMATE-SMART AGRICULTURE&rdquo; to CCAFS subjects on CGSpace</li>
<li>Add &ldquo;ANDEAN ROOTS AND TUBERS&rdquo; to CIP subjects on CGSpace</li>
<li>Add CGIAR System subjects to Discovery sidebar facets on CGSpace
<ul>
<li>Also add the System subject to item view on CGSpace</li>
</ul>
</li>
<li>The statistics-2014 core finished processing after five hours, so I started processing the statistics-2013 core on DSpace Test</li>
<li>Since I was going to restart CGSpace and update the Discovery indexes anyway, I decided to check for any straggling uppercase AGROVOC entries and lowercase them:</li>
</ul>
<pre tabindex="0"><code>dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 164
dspace=# COMMIT;
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>I had to restart Tomcat once after the machine started up to get all Solr statistics cores to load properly</li>
</ul>
</li>
<li>After about ten more hours the rest of the Solr statistics cores finished processing on DSpace Test and I started optimizing them in Solr admin UI</li>
</ul>
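<p>The AGROVOC cleanup in this entry selects values with the PostgreSQL regex <code>[[:upper:]]</code> and lowercases them. A minimal sketch of the same check for previewing changes before running the <code>UPDATE</code> follows. The function names and sample subjects are hypothetical, and Python&rsquo;s <code>str.isupper()</code> is Unicode-aware, so it may not match PostgreSQL&rsquo;s locale-dependent character class exactly.</p>

```python
def needs_lowercasing(value: str) -> bool:
    """Return True if the value contains at least one upper-case character."""
    return any(ch.isupper() for ch in value)


def lowercase_stragglers(values):
    """Return (original, lowercased) pairs for the values that would change."""
    return [(v, v.lower()) for v in values if needs_lowercasing(v)]


# Hypothetical sample subjects, not taken from the CGSpace database
sample = ["climate change", "Food Security", "MAIZE", "soil"]
changes = lowercase_stragglers(sample)
for original, fixed in changes:
    print(f"{original} -> {fixed}")
```

<p>Reviewing the pairs first gives a rough idea of what the 164 updated rows would look like without touching the database.</p>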
<h2 id="2020-11-10">2020-11-10</h2>
<ul>
<li>I am noticing that CGSpace doesn&rsquo;t have any statistics showing for years before 2020, but all cores are loaded successfully in Solr Admin UI&hellip; strange
<ul>
<li>I restarted Tomcat and I see in Solr Admin UI that the statistics-2015 core failed to load</li>
<li>Looking in the DSpace log I see:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-11-10 08:43:59,634 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:43:59,687 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:43:59,707 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:44:00,004 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,325 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
</code></pre><ul>
<li>Seems that the core gets probed twice&hellip; perhaps a threading issue?
<ul>
<li>The only thing I can think of is the <code>acceptorThreadCount</code> parameter in Tomcat&rsquo;s server.xml, which has been set to 2 since 2018-01 (we started sharding the Solr statistics cores in 2019-01 and that&rsquo;s when this problem arose)</li>
<li>I will try reducing that to 1</li>
<li>Wow, now it&rsquo;s even worse:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-11-10 08:51:03,007 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
2020-11-10 08:51:03,008 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,137 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:51:03,153 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,289 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
2020-11-10 08:51:03,289 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2010
2020-11-10 08:51:03,475 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2010
2020-11-10 08:51:03,475 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2016
2020-11-10 08:51:03,730 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2020-11-10 08:51:03,731 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2017
2020-11-10 08:51:03,992 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2017
2020-11-10 08:51:03,992 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2011
2020-11-10 08:51:04,178 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2011
2020-11-10 08:51:04,178 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2012
</code></pre><ul>
<li>Could it be because we have two Tomcat connectors?
<ul>
<li>I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01&hellip; hmmmmm</li>
</ul>
</li>
<li>I added a <a href="https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee">lowercase formatter to OpenRXV</a> so that we can lowercase AGROVOC subjects during harvesting</li>
</ul>
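<p>To spot cores that get probed twice without reading the DSpace log by eye, the <code>SolrLogger</code> messages can be tallied per core. This is only a sketch: the function names are hypothetical, and it assumes the messages always look like the ones in the excerpts from this entry.</p>

```python
import re
from collections import Counter

# Matches the SolrLogger messages quoted in the log excerpts above
CORE_EVENT = re.compile(r"SolrLogger @ (Loading|Created) core with name: (\S+)")


def count_core_events(log_lines):
    """Count 'Loading' and 'Created' messages per statistics core."""
    loading, created = Counter(), Counter()
    for line in log_lines:
        match = CORE_EVENT.search(line)
        if match is None:
            continue  # unrelated line, e.g. the atmire-datatables warnings
        event, core = match.groups()
        if event == "Loading":
            loading[core] += 1
        else:
            created[core] += 1
    return loading, created


def cores_loaded_more_than_once(log_lines):
    """Return the sorted names of cores with more than one 'Loading' message."""
    loading, _ = count_core_events(log_lines)
    return sorted(core for core, n in loading.items() if n > 1)


# Lines copied from the first log excerpt in this entry
LOG_EXCERPT = """\
2020-11-10 08:43:59,634 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:43:59,687 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:43:59,707 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:44:00,004 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,325 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
"""

suspects = cores_loaded_more_than_once(LOG_EXCERPT.splitlines())
print(suspects)
```

<p>A core that shows up in this list is only a candidate for the failed load; the sketch does not explain why the second probe happens.</p>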
<h2 id="2020-11-11">2020-11-11</h2>
<ul>
<li>Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
<ul>
<li>I told them to proceed, as it&rsquo;s within our budget of credits</li>
<li>They will write a processor for DSpace 6 to remove the duplicates</li>
</ul>
</li>
<li>I did some tests to add a usage statistics chart to the item views on DSpace Test
<ul>
<li>It is inspired by Salem&rsquo;s work on WorldFish&rsquo;s repository, and it hits the dspace-statistics-api for the current item and displays a graph</li>
<li>I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js is HTML5 canvas and doesn&rsquo;t allow theming via CSS (so our Bootstrap brand colors for each theme won&rsquo;t work)
<ul>
<li>Hmm, Highcharts is not licensed under an open source license so I will not use it</li>
<li>Perhaps I&rsquo;ll use Chartist with the popover plugin&hellip;</li>
</ul>
</li>
<li>I think I&rsquo;ll pursue this after the DSpace 6 upgrade&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-11-12">2020-11-12</h2>
<ul>
<li>I was looking at Solr again trying to find a way to get community and collection stats by faceting on <code>owningComm</code> and <code>owningColl</code> and it seems to work actually
<ul>
<li>The duplicated values in the multi-value fields don&rsquo;t seem to affect the counts, contrary to what I had thought previously (though we should still get rid of them)</li>
<li>One major difference between the raw numbers I was looking at and Atmire&rsquo;s numbers is that Atmire&rsquo;s code filters &ldquo;Internal&rdquo; IP addresses&hellip;</li>
<li>Also, instead of doing <code>isBot:false</code> I think I should do <code>-isBot:true</code> because it&rsquo;s not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true</li>
</ul>
</li>
<li>First we get the total number of communities with stats (using calcdistinct):</li>
</ul>
<pre tabindex="0"><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=owningComm&amp;stats.calcdistinct=true&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>Then get stats themselves, iterating 100 items at a time with limit and offset:</li>
</ul>
<pre tabindex="0"><code>facet=true&amp;facet.field=owningComm&amp;facet.mincount=1&amp;facet.limit=100&amp;facet.offset=0&amp;shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
</code></pre><ul>
<li>I was surprised to see 10,000,000 docs with <code>isBot:true</code> when I was testing on DSpace Test&hellip;
<ul>
<li>This has got to be a mistake of some kind, as I see 4 million in 2014 that are from <code>dns:localhost.</code> (perhaps that&rsquo;s from when we didn&rsquo;t have useProxies set up correctly?)</li>
<li>I don&rsquo;t see the same thing on CGSpace&hellip; I wonder what happened?</li>
<li>Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with <code>isBot:true</code> in the future and not automatically purge these!!!</li>
</ul>
</li>
<li>I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc&hellip;
<ul>
<li>I hadn&rsquo;t seen monit before, but the others have been in DSpace&rsquo;s spider agents lists for some time, so they probably only appear in older stats cores</li>
<li>The issue with purging these using <code>check-spider-hits.sh</code> is that it can&rsquo;t do case-insensitive regexes and some metacharacters like <code>\s</code> don&rsquo;t work, so I added case-sensitive patterns to a local agents file and purged them with the script</li>
</ul>
</li>
</ul>
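<p>The community faceting queries from this entry can be generated rather than written by hand. The sketch below builds the same <code>facet.*</code> and <code>shards</code> parameters and pages through the facet values with limit and offset. The names <code>facet_params</code> and <code>facet_pages</code> are hypothetical, and the <code>q</code>, <code>rows</code>, and <code>fq=-isBot:true</code> parameters are my assumptions; they do not appear in the queries quoted in this entry.</p>

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8081/solr"
# The current core plus the yearly shards, 2019 down to 2010
CORES = ["statistics"] + [f"statistics-{year}" for year in range(2019, 2009, -1)]


def facet_params(field, limit=100, offset=0, exclude_bots=True):
    """Build Solr parameters for one page of facet values on a field."""
    params = {
        "q": "*:*",  # assumption: match all documents
        "rows": 0,  # assumption: only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": limit,
        "facet.offset": offset,
        "shards": ",".join(f"{SOLR_BASE}/{core}" for core in CORES),
    }
    if exclude_bots:
        # Exclude documents flagged as bots instead of requiring isBot:false,
        # since not every document is guaranteed to have the field
        params["fq"] = "-isBot:true"
    return params


def facet_pages(field, total, page_size=100):
    """Yield one URL-encoded query string per page of facet values."""
    for offset in range(0, total, page_size):
        yield urlencode(facet_params(field, limit=page_size, offset=offset))


# Hypothetical total; in practice it would come from the calcdistinct query
pages = list(facet_pages("owningComm", total=250))
print(len(pages))
```

<p>Each query string would be appended to the <code>select</code> handler of one core. Swapping <code>owningComm</code> for <code>owningColl</code> should give the collection counts in the same way.</p>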
<h2 id="2020-11-15">2020-11-15</h2>
<ul>
<li>Upgrade CGSpace to DSpace 6.3
<ul>
<li>First build, update, and migrate the database:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace cleanup -v
$ git checkout origin/6_x-dev-atmire-modules
$ npm install -g yarn
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package
$ sudo su - postgres
$ psql dspace -c 'CREATE EXTENSION pgcrypto;'
$ psql dspace -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ exit
$ rm -rf /home/cgspace/config/spring
$ ant update
$ dspace database info
$ dspace database migrate
$ sudo systemctl start tomcat7
</code></pre><ul>
<li>After starting Tomcat DSpace should start up OK and begin Discovery indexing, but I want to also upgrade from PostgreSQL 9.6 to 10
<ul>
<li>I installed and configured PostgreSQL 10 using the Ansible playbooks, then migrated the database manually:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# systemctl start postgresql
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
</code></pre><ul>
<li>Then I ran all system updates and rebooted the server&hellip;</li>
<li>After the server came back up I re-ran the Ansible playbook to make sure all configs and services were updated</li>
<li>I disabled the dspace-statistics-api for now because it won&rsquo;t work until I migrate all the Solr statistics anyway</li>
<li>Start a full Discovery re-indexing:</li>
</ul>
<pre tabindex="0"><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 211m30.726s
user 134m40.124s
sys 2m17.979s
</code></pre><ul>
<li>Towards the end of the indexing there were a few dozen of these messages:</li>
</ul>
<pre tabindex="0"><code>2020-11-15 13:23:21,685 INFO com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
</code></pre><ul>
<li>I updated all the Ansible infrastructure and DSpace branches to be the DSpace 6 ones</li>
<li>I will wait until the Discovery indexing is finished to start doing the Solr statistics migration</li>
<li>I tested the email functionality and it seems to need more configuration:</li>
</ul>
<pre tabindex="0"><code>$ dspace test-email
About to send test email:
- To: blah@cgiar.org
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: com.sun.mail.smtp.SMTPSendFailedException: 451 5.7.3 STARTTLS is required to send mail [AM4PR0701CA0003.eurprd07.prod.outlook.com]
</code></pre><ul>
<li>I copied the <code>mail.extraproperties = mail.smtp.starttls.enable=true</code> setting from the old DSpace 5 <code>dspace.cfg</code> and now the emails are working</li>
<li>After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
</code></pre><ul>
<li>After about 6,000,000 records I got the same error that I&rsquo;ve gotten every time I test this migration process:</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><h2 id="2020-11-16">2020-11-16</h2>
<ul>
<li>Users are having issues submitting items to CGSpace
<ul>
<li>Looking at the data I see that connections skyrocketed since the DSpace 6 upgrade yesterday, and they are all in &ldquo;waiting for lock&rdquo; state:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/11/postgres_connections_ALL-week.png" alt="PostgreSQL connections week">
<img src="/cgspace-notes/2020/11/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
<ul>
<li>There are almost 1,500 locks:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1494
</code></pre><ul>
<li>I sent a mail to the dspace-tech mailing list to ask for help&hellip;
<ul>
<li>For now I just restarted PostgreSQL and a few users were able to complete submissions&hellip;</li>
</ul>
</li>
<li>While processing the statistics-2018 Solr core I got the <em>same</em> memory error that I have gotten every time I processed this core in testing:</li>
</ul>
<pre tabindex="0"><code>Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuffer.append(StringBuffer.java:270)
at java.io.StringWriter.write(StringWriter.java:101)
at org.apache.solr.common.util.XML.writeXML(XML.java:133)
at org.apache.solr.client.solrj.util.ClientUtils.writeVal(SourceFile:160)
at org.apache.solr.client.solrj.util.ClientUtils.writeXML(SourceFile:128)
at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:365)
at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:281)
at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:67)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:95)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:105)
at org.apache.solr.client.solrj.impl.HttpSolrServer.createMethod(HttpSolrServer.java:302)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>I increased the Java heap memory to 4096MB and restarted the processing
<ul>
<li>After a few hours I got the following error, which I have gotten several times over the last few months:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><h2 id="2020-11-17">2020-11-17</h2>
<ul>
<li>Chat with Peter about using some remaining CRP Livestock open access money to fund more work on OpenRXV / AReS
<ul>
<li>I will create GitHub issues for each of the things we talked about and then create ToRs to send to CodeObia for a quote</li>
</ul>
</li>
<li>Continue migrating Solr statistics to DSpace 6 UUID format after the upgrade on Sunday</li>
<li>Regarding the IWMI issue about flagships and strategic priorities we can use CRP Livestock as an example because all their <a href="https://cgspace.cgiar.org/handle/10568/80102">flagships are mapped to collections</a></li>
<li>Database issues are worse today&hellip;</li>
</ul>
<p><img src="/cgspace-notes/2020/11/postgres_connections_ALL-week2.png" alt="PostgreSQL connections week"></p>
<ul>
<li>There are over 2,000 locks:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2071
</code></pre><h2 id="2020-11-18">2020-11-18</h2>
<ul>
<li>I decided to enable the <code>rollbackOnReturn=true</code> option in <a href="https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html">Tomcat&rsquo;s JDBC connection pool parameters</a> because I noticed that all of the &ldquo;idle in transaction&rdquo; connections waiting for locks were SELECT queries
<ul>
<li>There are many posts on the Internet about people having this issue with Hibernate</li>
<li>The locks are lower now, but Peter and Abenet are still having issues approving items and Tezira forwarded one strange case where an item was &ldquo;approved&rdquo; and was assigned a handle, but it doesn&rsquo;t exist&hellip;</li>
<li>I sent another mail to the dspace-tech mailing list to ask for help</li>
<li>I reverted the <code>rollbackOnReturn</code> change in Tomcat&hellip;</li>
<li>I sent a message to Atmire to ask for urgent help</li>
</ul>
</li>
<li>Call with IWMI and Abenet about them potentially moving from InMagic to CGSpace
<ul>
<li>They have questions about the reporting on AReS</li>
<li>We told them that we can use collections to infer Strategic Priorities and Research Groups and WLE Flagships</li>
<li>It sounds like we will create this structure under the top-level IWMI community:
<ul>
<li>IWMI Strategic Priorities (sub-community)
<ul>
<li>Water, Food and Ecosystems (sub-community)
<ul>
<li>Sustainable and Resilient Food Production Systems (collection)</li>
<li>Sustainable Water infrastructure and Ecosystems (collection)</li>
<li>Integrated Basin and Aquifer Management</li>
</ul>
</li>
<li>Water, Climate Change and Resilience (sub-community)
<ul>
<li>Climate Change Adaptation and Resilience (collection)</li>
</ul>
</li>
<li>etc&hellip;</li>
</ul>
</li>
</ul>
</li>
<li>They will submit items to their normal output type collections and map to these</li>
</ul>
</li>
<li>In other news I finally finished processing the Solr statistics for UUIDs and re-indexed the stats with the dspace-statistics-api
<ul>
<li>I started the Atmire stats processing, notes in the dedicated <a href="/cgspace-notes/cgspace-dspace6-upgrade/">CGSpace DSpace 6 Upgrade section</a></li>
</ul>
</li>
<li>Peter got a strange message this evening when trying to update metadata:</li>
</ul>
<pre tabindex="0"><code>2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,316 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [13]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,385 INFO org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl @ HHH000010: On release of batch it still contained JDBC statements
</code></pre><ul>
<li>Minor bug fixes to limit parameter in DSpace Statistics API
<ul>
<li>Release <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.3.2">version 1.3.2</a></li>
</ul>
</li>
<li>Send a list of potential ToRs for a next phase of OpenRXV development to Michael Victor for feedback:
<ul>
<li>Enable advanced reporting templates using &ldquo;Angular expressions&rdquo; in Docxtemplater (would be used immediately for IWMI and BioversityCIAT)</li>
<li>Enable embedding of charts like world map and word cloud in reports</li>
<li>Enable embedding of item thumbnails in reports, similar to the &ldquo;list of information products&rdquo;</li>
<li>Enable something like the &ldquo;Statistics&rdquo; Excel report Peter wanted in 2019 so we can get community and collection statistics reports</li>
<li>Add a new &ldquo;metrics&rdquo; block with statistics about top authors and items by number of views and downloads for the current search terms</li>
<li>Add ability to change the explorer UI to &ldquo;Usage Statistics&rdquo; mode where lists of authors, affiliations, sponsors, CRPs, communities, collections, etc are sorted according to the number of views or downloads for the current search results, rather than by number of occurrences of metadata values</li>
<li>Add ability to &ldquo;drill down&rdquo; or modify search filter terms by clicking on countries in the map</li>
<li>Enable date-based usage statistics (currently only &ldquo;all time&rdquo; statistics are available)</li>
<li>Fixing minor bugs for all issues filed on GitHub</li>
</ul>
</li>
<li>I also added GitHub issues for each of them</li>
</ul>
<h2 id="2020-11-19">2020-11-19</h2>
<ul>
<li>I started a fresh reharvest on AReS and when it was done I noticed that the metadata from CGSpace is fine, but the views and downloads don&rsquo;t seem to be working</li>
<li>Peter said he was able to approve a few items on CGSpace immediately &ldquo;like old times&rdquo; this morning</li>
<li>The PostgreSQL status looks much better now, though I haven&rsquo;t changed anything</li>
</ul>
<p><img src="/cgspace-notes/2020/11/postgres_connections_ALL-week3.png" alt="PostgreSQL connections week">
<img src="/cgspace-notes/2020/11/postgres_locks_ALL-week2.png" alt="PostgreSQL locks week">
<img src="/cgspace-notes/2020/11/postgres_xlog-week.png" alt="PostgreSQL transaction log week">
<img src="/cgspace-notes/2020/11/postgres_transactions_ALL-week.png" alt="PostgreSQL transactions week"></p>
<ul>
<li>Very curious that there was such a high number of rolled back transactions after the update</li>
</ul>
<h2 id="2020-11-22">2020-11-22</h2>
<ul>
<li>PostgreSQL situation on CGSpace (linode18) looks much better now:</li>
</ul>
<p><img src="/cgspace-notes/2020/11/postgres_locks_ALL-week3.png" alt="PostgreSQL locks week">
<img src="/cgspace-notes/2020/11/postgres_xlog-week2.png" alt="PostgreSQL transaction log week"></p>
<ul>
<li>In other news, I noticed that harvesting DSpace 6 works fine in OpenRXV, but the statistics fail on page 1
<ul>
<li>I filed an issue: <a href="https://github.com/ilri/OpenRXV/issues/59">https://github.com/ilri/OpenRXV/issues/59</a></li>
</ul>
</li>
<li>Abenet asked for help trying to add a new user to the Bioversity and CIAT groups on CGSpace
<ul>
<li>I see that the user search results are paginated five per page, so the user in question appears on page 2</li>
<li>I asked Abenet if she was getting an error or if it was simply this&hellip;</li>
</ul>
</li>
<li>Maria Garuccio sent me an example report that she wants to be able to generate from AReS
<ul>
<li>First, she would like to have the option to group by output type</li>
<li>Second, she would like to be able to control the sorting in the template, like sorting the citation alphabetically</li>
<li>I filed an issue: <a href="https://github.com/ilri/OpenRXV/issues/60">https://github.com/ilri/OpenRXV/issues/60</a></li>
</ul>
</li>
<li>Mohammad Salem had asked if there was an item ID to UUID mapping for CGSpace
<ul>
<li>I found a thread on the dspace-tech mailing list that pointed out that there is a new <code>uuid</code> column in the item table</li>
<li>Only old items have an <code>item_id</code> so we can get a mapping easily:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
COPY 87411
</code></pre><ul>
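<li>A minimal sketch of how the exported CSV could be loaded for lookups, assuming the <code>item_id,uuid</code> header written by the <code>\COPY</code> command above; the function name and the sample rows are made up for illustration:</li>
</ul>

```python
import csv
import io


def load_item_id_to_uuid(csv_file):
    """Return a dict mapping each legacy integer item_id to its UUID string."""
    reader = csv.DictReader(csv_file)
    return {int(row["item_id"]): row["uuid"] for row in reader}


# Made-up sample rows in the same shape as the exported file (not real CGSpace data)
sample = io.StringIO(
    "item_id,uuid\n"
    "1,11111111-2222-3333-4444-555555555555\n"
    "2,66666666-7777-8888-9999-000000000000\n"
)
mapping = load_item_id_to_uuid(sample)
print(mapping[2])
```

<ul>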
<li>Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API</li>
<li>Facet by owningComm to see total number of distinct communities (136):</li>
</ul>
<pre tabindex="0"><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=1&amp;facet.offset=0&amp;stats=true&amp;stats.field=id&amp;stats.calcdistinct=true
</code></pre><ul>
<li>Facet by owningComm and get the first 5 distinct:</li>
</ul>
<pre tabindex="0"><code> facet=true&amp;facet.mincount=1&amp;facet.field=owningComm&amp;facet.limit=5&amp;facet.offset=0&amp;facet.pivot=id,countryCode
</code></pre><ul>
<li>Facet by owningComm and countryCode using facet.pivot and maybe I can just skip the normal facet params?</li>
</ul>
<pre tabindex="0"><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;facet.pivot=owningComm,countryCode
</code></pre><ul>
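<li>A rough sketch of how these pivot facet parameters might be assembled in code, for example in the DSpace Statistics API; the helper name and its arguments are hypothetical, and only the Solr parameter names come from the queries in these notes:</li>
</ul>

```python
from urllib.parse import urlencode


def pivot_facet_params(pivot_fields, limits=None, offsets=None):
    """Build a Solr query string for a pivot facet with per-field limits and offsets."""
    params = [("facet", "true"), ("facet.pivot", ",".join(pivot_fields))]
    # Per-field overrides use Solr's f.<field>.facet.<param> syntax
    for field, limit in (limits or {}).items():
        params.append((f"f.{field}.facet.limit", str(limit)))
    for field, offset in (offsets or {}).items():
        params.append((f"f.{field}.facet.offset", str(offset)))
    # Keep the comma in facet.pivot unescaped so it reads like the hand-written queries
    return urlencode(params, safe=",")


query = pivot_facet_params(
    ["owningComm", "countryCode"],
    limits={"owningComm": 5, "countryCode": 5},
    offsets={"owningComm": 5},
)
print(query)
```

<ul>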
<li>Facet by owningComm and countryCode using facet.pivot and limiting to top five countries&hellip; fuck it&rsquo;s possible!</li>
</ul>
<pre tabindex="0"><code>facet=true&amp;f.owningComm.facet.limit=5&amp;f.owningComm.facet.offset=5&amp;f.countryCode.facet.limit=5&amp;facet.pivot=owningComm,countryCode
</code></pre><h2 id="2020-11-23">2020-11-23</h2>
<ul>
<li>I created the sub-communities and collections for IWMI&rsquo;s Strategic Priorities and Research Groups on CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/110259">https://cgspace.cgiar.org/handle/10568/110259</a></li>
</ul>
<h2 id="2020-11-24">2020-11-24</h2>
<ul>
<li>Yesterday Abenet asked me to investigate why AReS shows only 9,000 &ldquo;livestock&rdquo; terms in the ILRI community, while on CGSpace we have over 10,000
<ul>
<li>I added the lowercase formatter to all center and CRP subjects fields and re-harvested</li>
<li>Now I see there are 9,999, which seems suspicious</li>
<li>I filed a bug on GitHub: <a href="https://github.com/ilri/OpenRXV/issues/61">https://github.com/ilri/OpenRXV/issues/61</a></li>
</ul>
</li>
<li>Help Abenet map an item on CGSpace for CIAT
<ul>
<li>Searching for the entire item title returned no results, but I noticed that the title contains a &ldquo;:&rdquo;, so I searched for part of the title without the colon and it worked</li>
<li>It is a mystery to me that you can&rsquo;t map an item using its Handle&hellip;</li>
</ul>
</li>
<li>I started processing the statistics-2011 core with Atmire&rsquo;s AtomicStatisticsUpdateCLI tool</li>
<li>I called Moayad and we worked on the views/downloads issue on OpenRXV
<ul>
<li>It turns out to be a mapping (schema) issue in Elasticsearch due to DSpace 6 UUIDs (LOL!!)</li>
</ul>
</li>
</ul>
<h2 id="2020-11-25">2020-11-25</h2>
<ul>
<li>Zoom meeting with ILRI communicators about CGSpace, Altmetric, and AReS</li>
<li>Send an email to Richard Fulss and Paola Camargo Paz at CIMMYT about having them work more closely with us on AReS</li>
<li>Send an email to Usman at CIFOR to ask how his DSpace stuff is going</li>
<li>The Atmire AtomicStatisticsUpdateCLI tool finished processing the statistics-2017 core</li>
<li>Atmire responded about the duplicate fields in Solr and said they don&rsquo;t see them
<ul>
<li>I sent a few examples that I found after thirty seconds of randomly looking in several Solr cores</li>
</ul>
</li>
</ul>
<h2 id="2020-11-27">2020-11-27</h2>
<ul>
<li>I finished processing the statistics-2016 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2015 core</li>
</ul>
<h2 id="2020-11-28">2020-11-28</h2>
<ul>
<li>I finished processing the statistics-2015 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2014 core</li>
<li>I finished processing the statistics-2014 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2013 core</li>
<li>I finished processing the statistics-2013 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2012 core</li>
<li>I finished processing the statistics-2012 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2010 core</li>
</ul>
<h2 id="2020-11-29">2020-11-29</h2>
<ul>
<li>Peter told me that he can&rsquo;t find the <a href="https://cgspace.cgiar.org/handle/10568/80099">CGIAR Research Program on Livestock</a> community in the community filters on AReS
<ul>
<li>I looked briefly and couldn&rsquo;t find it either so I filed an issue on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/62">https://github.com/ilri/OpenRXV/issues/62</a></li>
</ul>
</li>
</ul>
<h2 id="2020-11-30">2020-11-30</h2>
<ul>
<li>Ben Hack asked for the ILRI subjects we are using on CGSpace
<ul>
<li>I linked him to the input-forms.xml file and also sent him a list of 112 terms extracted with <code>xml</code> from the xmlstarlet package:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ xml sel -t -m '//value-pairs[@value-pairs-name=&quot;ilrisubject&quot;]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
</code></pre><ul>
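<li>For reference, roughly the same extraction could be sketched in Python with the standard library, assuming the <code>value-pairs</code> / <code>pair</code> / <code>displayed-value</code> layout that the XPath above implies; the function name and the sample XML are made up for illustration:</li>
</ul>

```python
import xml.etree.ElementTree as ET


def displayed_values(xml_text, list_name):
    """Return the displayed-value text of every pair in the named value-pairs list."""
    root = ET.fromstring(xml_text)
    values = []
    for value_pairs in root.iter("value-pairs"):
        if value_pairs.get("value-pairs-name") != list_name:
            continue
        for pair in value_pairs.findall("pair"):
            text = pair.findtext("displayed-value")
            if text:
                values.append(text.strip())
    return values


# Made-up sample in the assumed input-forms.xml shape (not the real term list)
sample_xml = """
<input-forms>
  <form-value-pairs>
    <value-pairs value-pairs-name="ilrisubject">
      <pair><displayed-value>TERM ONE</displayed-value><stored-value>TERM ONE</stored-value></pair>
      <pair><displayed-value>TERM TWO</displayed-value><stored-value>TERM TWO</stored-value></pair>
    </value-pairs>
    <value-pairs value-pairs-name="other">
      <pair><displayed-value>IGNORED</displayed-value><stored-value>IGNORED</stored-value></pair>
    </value-pairs>
  </form-value-pairs>
</input-forms>
"""
terms = displayed_values(sample_xml, "ilrisubject")
print(len(terms), terms)
```

<ul>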
<li>IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2020-11-30-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
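<li>The <code>grep | sort | uniq</code> step above could be sketched in Python as follows, using the same pattern as the <code>grep -oE</code> call; the function name is hypothetical, and Python sorts by code point, which may differ from the locale-dependent order of <code>sort</code>:</li>
</ul>

```python
import re

# Same pattern as the grep -oE call: four dash-separated groups of [A-Z0-9]{4}
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(texts):
    """Collect every ORCID-like identifier from the given strings, deduplicated and sorted."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


# Inputs stand in for the contents of the vocabulary file and the text files
inputs = [
    '<node id="Vinay Nangia: 0000-0001-5148-8614"/>',
    "0000-0003-1549-2733\n0000-0001-5148-8614\n",
]
orcids = unique_orcids(inputs)
print("\n".join(orcids))
```

<ul>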
<li>I used my <code>fix-metadata-values.py</code> script to update the old occurrences of Hung&rsquo;s ORCID and some others that I see have changed:</li>
</ul>
<pre tabindex="0"><code>$ cat 2020-11-30-fix-hung-orcid.csv
cg.creator.id,correct
&quot;Hung Nguyen-Viet: 0000-0001-9877-0596&quot;,&quot;Hung Nguyen-Viet: 0000-0003-1549-2733&quot;
&quot;Adriana Tofiño: 0000-0001-7115-7169&quot;,&quot;Adriana Tofiño Rivera: 0000-0001-7115-7169&quot;
&quot;Cristhian Puerta Rodriguez: 0000-0001-5992-1697&quot;,&quot;David Puerta: 0000-0001-5992-1697&quot;
&quot;Ermias Betemariam: 0000-0002-1955-6995&quot;,&quot;Ermias Aynekulu: 0000-0002-1955-6995&quot;
&quot;Hirut Betaw: 0000-0002-1205-3711&quot;,&quot;Betaw Hirut: 0000-0002-1205-3711&quot;
&quot;Megan Zandstra: 0000-0002-3326-6492&quot;,&quot;Megan McNeil Zandstra: 0000-0002-3326-6492&quot;
&quot;Tolu Eyinla: 0000-0003-1442-4392&quot;,&quot;Toluwalope Emmanuel: 0000-0003-1442-4392&quot;
&quot;VInay Nangia: 0000-0001-5148-8614&quot;,&quot;Vinay Nangia: 0000-0001-5148-8614&quot;
$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -f cg.creator.id -t 'correct' -m 240
</code></pre>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-09/">September, 2021</a></li>
<li><a href="/cgspace-notes/2021-08/">August, 2021</a></li>
<li><a href="/cgspace-notes/2021-07/">July, 2021</a></li>
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>