<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="May, 2020" />
<meta property="og:description" content="2020-05-02
Peter said that CTA is having problems submitting an item to CGSpace
Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as I see the number of connections in &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; state are increasing again
I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-05/" />
<meta property="article:published_time" content="2020-05-02T09:52:04+03:00" />
<meta property="article:modified_time" content="2020-06-01T13:55:08+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2020"/>
<meta name="twitter:description" content="2020-05-02
Peter said that CTA is having problems submitting an item to CGSpace
Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as I see the number of connections in &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; state are increasing again
I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)
"/>
<meta name="generator" content="Hugo 0.84.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "May, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-05/",
"wordCount": "2154",
"datePublished": "2020-05-02T09:52:04+03:00",
"dateModified": "2020-06-01T13:55:08+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-05/">
<title>May, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.ffbfea088a9a1666ec65c3a8cb4906e2a0e4f92dc70dbbf400a125ad2422123a.js" integrity="sha256-/7/qCIqaFmbsZcOoy0kG4qDk&#43;S3HDbv0AKElrSQiEjo=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-05/">May, 2020</a></h2>
<p class="blog-post-meta">
<time datetime="2020-05-02T09:52:04+03:00">Sat May 02, 2020</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-05-02">2020-05-02</h2>
<ul>
<li>Peter said that CTA is having problems submitting an item to CGSpace
<ul>
<li>Looking at the PostgreSQL stats it seems to be the same issue that Tezira was having last week, as I see the number of connections in &lsquo;idle in transaction&rsquo; and &lsquo;waiting for lock&rsquo; state are increasing again</li>
<li>I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2.11, and there were some bugs related to transactions fixed in 42.2.12 (which I had updated in the Ansible playbooks, but not deployed yet)</li>
</ul>
</li>
</ul>
<h2 id="2020-05-03">2020-05-03</h2>
<ul>
<li>Purge a few remaining bots from CGSpace Solr statistics that I had identified a few months ago
<ul>
<li><code>lua-resty-http/0.10 (Lua) ngx_lua/10000</code></li>
<li><code>omgili/0.5 +http://omgili.com</code></li>
<li><code>IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)</code></li>
<li><code>Twurly v1.1 (https://twurly.org)</code></li>
<li><code>Pattern/2.6 +http://www.clips.ua.ac.be/pattern</code></li>
<li><code>CyotekWebCopy/1.7 CyotekHTTP/2.0</code></li>
</ul>
</li>
<li>This is only about 2,500 hits total from the last ten years, and half of these bots no longer seem to exist, so I won&rsquo;t bother submitting them to the COUNTER-Robots project</li>
<li>I noticed that our custom themes were incorrectly linking to the OpenSearch XML file
<ul>
<li>The bug <a href="https://jira.lyrasis.org/browse/DS-2592">was fixed</a> for Mirage2 in 2015</li>
<li>Note that this did not prevent OpenSearch itself from working</li>
<li>I will patch this on our DSpace 5.x and 6.x branches</li>
</ul>
</li>
</ul>
<h2 id="2020-05-06">2020-05-06</h2>
<ul>
<li>Atmire responded asking for more information about the Solr statistics processing bug in CUA so I sent them some full logs
<ul>
<li>Also I asked again about the Maven variable interpolation issue for <code>cua.version.number</code>, and if they would be willing to upgrade CUA to use Font Awesome 5 instead of 4.</li>
</ul>
</li>
</ul>
<h2 id="2020-05-07">2020-05-07</h2>
<ul>
<li>Linode sent an alert that there was high CPU usage on CGSpace (linode18) early this morning
<ul>
<li>I looked at the nginx logs using goaccess and I found a few IPs making lots of requests around then:</li>
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;07/May/2020:(01|03|04)&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
<ul>
<li>The first is in Russia and it is hitting mostly XMLUI Discover links using <em>dozens</em> of different user agents, a total of 20,000 requests this week</li>
<li>The second IP is CodeObia testing AReS, a total of 171,000 hits this month</li>
<li>I will purge both of those IPs from the Solr stats using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
Purging 171641 hits from 212.34.8.188 in statistics
Purging 20691 hits from 188.134.31.88 in statistics
Total number of bot hits purged: 192332
</code></pre><ul>
<li>And then I will add 188.134.31.88 to the nginx bad bot list and tell CodeObia to please use a &ldquo;bot&rdquo; user agent</li>
<li>I also changed the nginx config to block requests with blank user agents</li>
</ul>
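<p>For a quick look without goaccess, the per-IP tally above can be approximated in a few lines of Python. This is only a sketch: it assumes the nginx logs use the standard combined format, and the names <code>top_ips</code> and <code>has_blank_agent</code> are made up for illustration (they are not part of <code>check-spider-ip-hits.sh</code> or the nginx config):</p>
<pre><code>import re
from collections import Counter

# Combined log format (assumed):
# ip - user [time] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def top_ips(lines, day, hours, limit=5):
    """Count requests per IP for one day (like 07/May/2020) and a list of hours."""
    prefixes = tuple(f"{day}:{hour}" for hour in hours)
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        ip, timestamp, _agent = match.groups()
        if timestamp.startswith(prefixes):
            counts[ip] += 1
    return counts.most_common(limit)


def has_blank_agent(line):
    """True when the user agent field is empty or a lone dash."""
    match = LINE_RE.match(line)
    return match is not None and match.group(3) in ("", "-")
</code></pre>
<p>Calling <code>top_ips(open("access.log"), "07/May/2020", ["01", "03", "04"])</code> would mirror the grep pattern used above (the file name is a placeholder), and <code>has_blank_agent</code> flags the kind of request that the new nginx rule rejects.</p>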
<h2 id="2020-05-11">2020-05-11</h2>
<ul>
<li>Bizu said she was having issues submitting to CGSpace last week
<ul>
<li>The issue sounds like the one Tezira and CTA were having in the last few weeks</li>
<li>I looked at the PostgreSQL graphs and see there are a lot of connections in &ldquo;idle in transaction&rdquo; and &ldquo;waiting for lock&rdquo; state:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/05/postgres_connections_cgspace-week.png" alt="PostgreSQL connections"></p>
<ul>
<li>I think I&rsquo;ll downgrade the PostgreSQL JDBC driver from 42.2.12 to 42.2.10, which was the version we were using before these issues started happening</li>
<li>Atmire sent some feedback about my ongoing issues with their CUA module, but none of it was conclusive yet
<ul>
<li>Regarding Font Awesome 5 they will check how much work it will take and give me a quote</li>
</ul>
</li>
<li>Abenet said some users are questioning why the statistics dropped so much lately, so I made a <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=674923030216704">post to Yammer</a> to explain about the robots</li>
<li>Last week Peter had asked me to add a new ILRI author&rsquo;s ORCID iD
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~11 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-11-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Lutakome, P.&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
&quot;Lutakome, Pius&quot;,&quot;Pius Lutakome: 0000-0002-0804-2649&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-11-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>I had to restart Tomcat five times before all Solr statistics cores came up OK, ugh.</li>
</ul>
</li>
</ul>
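<p>Before running a CSV like the one above through <code>add-orcid-identifiers-csv.py</code>, it can be worth checking that each ORCID iD is well formed. The following is a minimal sketch, not part of that script: the function names are hypothetical, and it only verifies the ISO 7064 MOD 11-2 check character that ORCID iDs carry as their last character:</p>
<pre><code>import csv
import io
import re

# An ORCID iD at the end of the "Name: 0000-0000-0000-0000" value
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_checksum_ok(orcid):
    """Verify the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:-1]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def check_rows(csv_text):
    """Return (author, identifier, ok) for each row of the CSV text."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        identifier = row["cg.creator.id"]
        match = ORCID_RE.search(identifier)
        ok = match is not None and orcid_checksum_ok(match.group(0))
        results.append((row["dc.contributor.author"], identifier, ok))
    return results
</code></pre>
<p>Passing the contents of <code>2020-05-11-add-orcids.csv</code> to <code>check_rows</code> should mark both rows as valid. A mistyped digit would almost always fail the checksum, though this says nothing about whether the iD belongs to the right person.</p>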
<h2 id="2020-05-12">2020-05-12</h2>
<ul>
<li>Peter noticed that CGSpace is no longer on AReS, because I blocked all requests that don&rsquo;t specify a user agent
<ul>
<li>I&rsquo;ve temporarily disabled that restriction and asked Moayad to look into how he can specify a user agent in the AReS harvester</li>
</ul>
</li>
</ul>
<h2 id="2020-05-13">2020-05-13</h2>
<ul>
<li>Atmire responded about Font Awesome and said they can switch to version 5 for 16 credits
<ul>
<li>I told them to go ahead</li>
</ul>
</li>
<li>Also, Atmire gave me a small workaround for the <code>cua.version.number</code> interpolation issue and said they would look into the crash that happens when processing our Solr stats</li>
<li>Run system updates and reboot AReS server (linode20) for the first time in almost 100 days
<ul>
<li>I notice that AReS now has some of CGSpace&rsquo;s data in it (but not all) since I dropped the user-agent restriction on the REST API yesterday</li>
</ul>
</li>
</ul>
<h2 id="2020-05-17">2020-05-17</h2>
<ul>
<li>Create an issue in the OpenRXV project for Moayad to change the default harvester user agent (<a href="https://github.com/ilri/OpenRXV/issues/36">#36</a>)</li>
</ul>
<h2 id="2020-05-18">2020-05-18</h2>
<ul>
<li>Atmire responded and said they still can&rsquo;t figure out the CUA statistics issue, though they seem to only be trying to understand what&rsquo;s going on using static analysis
<ul>
<li>I told them that they should try to run the code with the Solr statistics that I shared with them a few weeks ago</li>
</ul>
</li>
</ul>
<h2 id="2020-05-19">2020-05-19</h2>
<ul>
<li>Add ORCID identifier for Sirak Bahta
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~40 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-19-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Bahta, Sirak T.&quot;,&quot;Sirak Bahta: 0000-0002-5728-2489&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>An IITA user is having issues submitting to CGSpace and I see a rising number of PostgreSQL connections in &ldquo;idle in transaction&rdquo; and &ldquo;waiting for lock&rdquo; state:</li>
</ul>
<p><img src="/cgspace-notes/2020/05/postgres_connections_cgspace-week2.png" alt="PostgreSQL connections"></p>
<ul>
<li>This is the same issue Tezira, Bizu, and CTA were having in the last few weeks, and I already downgraded the PostgreSQL JDBC driver to the last version I was using before this started (42.2.10)
<ul>
<li>I will downgrade it to version 42.2.9 for now&hellip;</li>
<li>The only other thing I can think of is that I upgraded Tomcat to 7.0.103 in March</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode26) and reboot it</li>
<li>Run system updates on CGSpace (linode18) and reboot it
<ul>
<li>After the system came back up I had to restart Tomcat 7 three times before all the Solr statistics cores came up OK</li>
</ul>
</li>
<li>Send Atmire a snapshot of the CGSpace database for them to possibly troubleshoot the CUA issue with DSpace 6</li>
</ul>
<h2 id="2020-05-20">2020-05-20</h2>
<ul>
<li>Send CodeObia some logos and footer text for the next phase of OpenRXV development (<a href="https://github.com/ilri/OpenRXV/issues/18">#18</a>)</li>
</ul>
<h2 id="2020-05-25">2020-05-25</h2>
<ul>
<li>Add ORCID identifier for CIAT author Manuel Francisco
<ul>
<li>I added it to the controlled vocabulary and tagged the user&rsquo;s existing ~27 items in CGSpace using this CSV file with my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-05-25-add-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Díaz, Manuel F.&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
&quot;Díaz, Manuel Francisco&quot;,&quot;Manuel Francisco Diaz Baca: 0000-0001-8996-5092&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-05-25-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
<li>Last week Maria asked again about searching for items by accession or issue date
<ul>
<li>A few months ago I had told her to search for the ISO8601 date in Discovery search, which appears to work because it filters the results down quite a bit</li>
<li>She pointed out that the results include hits that don&rsquo;t exactly match, for example if part of the search string appears elsewhere like in the timestamp</li>
<li>I checked in Solr and the results are the same, so perhaps it&rsquo;s a limitation in Solr&hellip;?</li>
<li>So this effectively means that we don&rsquo;t have a way to create reports for items in an arbitrary date range shorter than a year:
<ul>
<li>DSpace advanced search is buggy or simply not designed to work like that</li>
<li>AReS Explorer currently only allows filtering by year, but will allow months soon</li>
<li>Atmire Listings and Reports only allows a &ldquo;Timespan&rdquo; of a year</li>
</ul>
</li>
</ul>
</li>
</ul>
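<p>One thing that might be worth trying is a Solr range query against a date field instead of a text search for the ISO 8601 string. The sketch below only builds the query parameters. The field name <code>dc.date.accessioned_dt</code> and the function name are assumptions, and it has not been checked against the CGSpace Discovery core:</p>
<pre><code>from datetime import date
from urllib.parse import urlencode


def accession_range_query(start, end, field="dc.date.accessioned_dt"):
    """Build Solr parameters for items dated from start to end, inclusive."""
    # Square brackets make both ends of a Solr range inclusive
    date_range = f"[{start.isoformat()}T00:00:00Z TO {end.isoformat()}T23:59:59Z]"
    return {"q": "*:*", "fq": f"{field}:{date_range}", "rows": 0}


params = accession_range_query(date(2020, 1, 1), date(2020, 3, 31))
# The encoded string can be appended to a Solr select URL
print(urlencode(params))
</code></pre>
<p>With <code>rows</code> set to zero the response would only carry the match count, which is enough to see whether the range filter behaves differently from the text search.</p>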
<h2 id="2020-05-29">2020-05-29</h2>
<ul>
<li>Linode alerted to say that the CPU load on CGSpace (linode18) was high for a few hours this morning
<ul>
<li>Looking at the nginx logs for this morning with goaccess:</li>
</ul>
</li>
</ul>
<pre><code># cat /var/log/nginx/*.log.1 | grep -E &quot;29/May/2020:(02|03|04|05)&quot; | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>The top IP is 172.104.229.92, which is the AReS harvester (still not sending a user agent, but it&rsquo;s tagged as a bot in the nginx mapping)</li>
<li>Second is 188.134.31.88, which is a Russian host that we also saw in the last few weeks, using a browser user agent and hitting the XMLUI (but it is tagged as a bot in nginx as well)</li>
<li>Another one is 51.158.106.4, which is some Scaleway IP making requests to XMLUI with different browser user agents that I am pretty sure I have seen before but never blocked
<ul>
<li>According to Solr it has made about 800 requests this year, but still&hellip; it&rsquo;s a bot.</li>
</ul>
</li>
<li>One I don&rsquo;t think I&rsquo;ve seen before is 95.217.58.146, which is making requests to XMLUI with a Drupal user agent
<ul>
<li>According to <a href="https://viewdns.info/reverseip/?host=95.217.58.146&amp;t=1">viewdns.info</a> it belongs to <a href="https://landvoc.org/">landvoc.org</a></li>
<li>I should add Drupal to the list of bots&hellip;</li>
</ul>
</li>
<li>Atmire got back to me about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">Solr CUA issue in the DSpace 6 upgrade</a> and they cannot reproduce the error
<ul>
<li>The next step is for me to migrate DSpace Test (linode26) to DSpace 6 and try to reproduce the error there</li>
</ul>
</li>
</ul>
<h2 id="2020-05-31">2020-05-31</h2>
<ul>
<li>Start preparing to migrate DSpace Test (linode26) to the <code>6_x-dev-atmire-modules</code> branch
<ul>
<li>Run all system updates and reboot</li>
<li>For now I will disable all yearly Solr statistics cores except the current <code>statistics</code> one</li>
<li>Prepare PostgreSQL with a clean snapshot of CGSpace&rsquo;s DSpace 5.8 database:</li>
</ul>
</li>
</ul>
<pre><code>$ sudo su - postgres
$ dropdb dspacetest
$ createdb -O dspacetest --encoding=UNICODE dspacetest
$ psql dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -d dspacetest -O --role=dspacetest /tmp/cgspace_2020-05-31.backup
$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
# run DSpace 5 version of update-sequences.sql!!!
$ psql -f /home/dspace/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql dspacetest -c &quot;DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');&quot;
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ exit
</code></pre><ul>
<li>Now switch to the DSpace 6.x branch and start a build:</li>
</ul>
<pre><code>$ chrt -i 0 ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false package
...
[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:6.3: Failed to collect dependencies at com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Failed to read artifact descriptor for com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Could not transfer artifact com.atmire:atmire-listings-and-reports-api:pom:6.x-2.10.8-0-SNAPSHOT from/to atmire.com-snapshots (https://atmire.com/artifactory/atmire.com-snapshots): Not authorized , ReasonPhrase:Unauthorized. -&gt; [Help 1]
</code></pre><ul>
<li>Great! I will have to send Atmire a note about this&hellip; but for now I can sync over my local <code>~/.m2</code> directory and the build completes</li>
<li>After the Maven build completed successfully I installed the updated code with Ant (make sure to delete the old spring directory):</li>
</ul>
<pre><code>$ cd dspace/target/dspace-installer
$ rm -rf /blah/dspacetest/config/spring
$ ant update
</code></pre><ul>
<li>Database migrations take 10:18.287s during the first startup&hellip;
<ul>
<li>Perhaps when we do the production CGSpace migration I can do this in advance and tell users not to make any submissions?</li>
</ul>
</li>
<li>I had a mistake in my Solr internal URL parameter so DSpace couldn&rsquo;t find it, but once I fixed that DSpace starts up OK!</li>
<li>Once the initial Discovery reindexing was completed (after three hours or so!) I started the Solr statistics UUID migration:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ dspace solr-upgrade-statistics-6x -i statistics -n 250000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
$ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
...
</code></pre><ul>
<li>It&rsquo;s taking about 35 minutes for 1,000,000 records&hellip;</li>
<li>Some issues towards the end of this core:</li>
</ul>
<pre><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>So basically there are some documents that have IDs that have <em>not</em> been converted to UUID, and have <em>not</em> been labeled as &ldquo;unmigrated&rdquo; either&hellip;
<ul>
<li>Of these 101,257 documents, 90,000 are of type 5 (search), 9,000 are type storage, and 800 are type view, but it&rsquo;s weird because if I look at their type/statistics_type using a facet the storage ones disappear&hellip;</li>
<li>For now I will export these documents from the statistics core and then delete them:</li>
</ul>
</li>
</ul>
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<li>Now the UUID conversion script says there is nothing left to convert, so I can try to run the Atmire CUA conversion utility:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Xmx1024m -Dfile.encoding=UTF-8&quot;
$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 1
</code></pre><ul>
<li>The processing is very slow and there are lots of errors like this:</li>
</ul>
<pre><code>Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(AtomicStatisticsUpdater.java:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(AtomicStatisticsUpdater.java:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(AtomicStatisticsUpdateCLI.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.NullPointerException
</code></pre><ul>
<li>Experiment a bit with the Python <a href="https://pypi.org/project/country-converter/">country-converter</a> library as it can convert between different formats (like ISO 3166 and UN M49)
<ul>
<li>We need to eventually find a format we can use for all CGIAR DSpaces&hellip;</li>
</ul>
</li>
</ul>
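<p>Going back to the leftover statistics documents from the UUID migration earlier in this entry: the Solr query used for the export and delete selects every <code>id</code> that is neither 36 characters long nor suffixed with <code>-unmigrated</code>. A small Python sketch of the same classification, assuming the IDs have already been exported to a list (the function names are hypothetical):</p>
<pre><code>import re
from collections import Counter

# Same patterns as the Solr query; like Lucene regexes they must match the whole value
MIGRATED = re.compile(r".{36}")
UNMIGRATED = re.compile(r".+-unmigrated")


def classify_id(doc_id):
    """Label an id the way the export and delete query partitions documents."""
    if UNMIGRATED.fullmatch(doc_id):
        return "unmigrated"
    if MIGRATED.fullmatch(doc_id):
        return "migrated"
    return "leftover"


def summarize(ids):
    """Count how many ids fall into each category."""
    return Counter(classify_id(doc_id) for doc_id in ids)
</code></pre>
<p>Anything counted as <code>leftover</code> corresponds to the documents that were exported and then deleted. Note that the length test only checks for 36 characters, as the Solr query does, not for a valid UUID.</p>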
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2021-06/">June, 2021</a></li>
<li><a href="/cgspace-notes/2021-05/">May, 2021</a></li>
<li><a href="/cgspace-notes/2021-04/">April, 2021</a></li>
<li><a href="/cgspace-notes/2021-03/">March, 2021</a></li>
<li><a href="/cgspace-notes/cgspace-cgcorev2-migration/">CGSpace CG Core v2 Migration</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>