<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="January, 2017" />
<meta property="og:description" content="2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2017-01/" />
<meta property="article:published_time" content="2017-01-02T10:43:00+03:00" />
<meta property="article:modified_time" content="2018-03-09T22:10:33+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="January, 2017"/>
<meta name="twitter:description" content="2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.97.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-01/",
"wordCount": "1594",
"datePublished": "2017-01-02T10:43:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2017-01/">
<title>January, 2017 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-01/">January, 2017</a></h2>
<p class="blog-post-meta">
<time datetime="2017-01-02T10:43:00+03:00">Mon Jan 02, 2017</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2017-01-02">2017-01-02</h2>
<ul>
<li>I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error</li>
<li>I tested on DSpace Test as well and it doesn&rsquo;t work there either</li>
<li>I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years</li>
</ul>
<h2 id="2017-01-04">2017-01-04</h2>
<ul>
<li>I tried to shard my local dev instance and it fails the same way:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xms768m -Xmx768m -Dfile.encoding=UTF-8&#34; ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291)
at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
... 14 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
... 16 more
</code></pre><ul>
<li>And the DSpace log shows:</li>
</ul>
<pre tabindex="0"><code>2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-&gt;http://localhost:8081: Broken pipe (Write failed)
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-&gt;http://localhost:8081
</code></pre><ul>
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
<li>The Tomcat access logs show more:</li>
</ul>
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=type%3A2+AND+id%3A1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/statistics/select?q=*%3A*&amp;rows=0&amp;facet=true&amp;facet.range=time&amp;facet.range.start=NOW%2FYEAR-17YEARS&amp;facet.range.end=NOW%2FYEAR%2B0YEARS&amp;facet.range.gap=%2B1YEAR&amp;facet.mincount=1&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=STATUS&amp;core=statistics-2016&amp;indexInfo=true&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] &#34;GET /solr/admin/cores?action=CREATE&amp;name=statistics-2016&amp;instanceDir=statistics&amp;dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/select?csv.mv.separator=%7C&amp;q=*%3A*&amp;fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&amp;rows=10000&amp;wt=csv HTTP/1.1&#34; 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;GET /solr/statistics/admin/luke?show=schema&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] &#34;POST /solr//statistics-2016/update/csv?commit=true&amp;softCommit=false&amp;waitSearcher=true&amp;f.previousWorkflowStep.split=true&amp;f.previousWorkflowStep.separator=%7C&amp;f.previousWorkflowStep.encapsulator=%22&amp;f.actingGroupId.split=true&amp;f.actingGroupId.separator=%7C&amp;f.actingGroupId.encapsulator=%22&amp;f.containerCommunity.split=true&amp;f.containerCommunity.separator=%7C&amp;f.containerCommunity.encapsulator=%22&amp;f.range.split=true&amp;f.range.separator=%7C&amp;f.range.encapsulator=%22&amp;f.containerItem.split=true&amp;f.containerItem.separator=%7C&amp;f.containerItem.encapsulator=%22&amp;f.p_communities_map.split=true&amp;f.p_communities_map.separator=%7C&amp;f.p_communities_map.encapsulator=%22&amp;f.ngram_query_search.split=true&amp;f.ngram_query_search.separator=%7C&amp;f.ngram_query_search.encapsulator=%22&amp;f.containerBitstream.split=true&amp;f.containerBitstream.separator=%7C&amp;f.containerBitstream.encapsulator=%22&amp;f.owningItem.split=true&amp;f.owningItem.separator=%7C&amp;f.owningItem.encapsulator=%22&amp;f.actingGroupParentId.split=true&amp;f.actingGroupParentId.separator=%7C&amp;f.actingGroupParentId.encapsulator=%22&amp;f.text.split=true&amp;f.text.separator=%7C&amp;f.text.encapsulator=%22&amp;f.simple_query_search.split=true&amp;f.simple_query_search.separator=%7C&amp;f.simple_query_search.encapsulator=%22&amp;f.owningComm.split=true&amp;f.owningComm.separator=%7C&amp;f.owningComm.encapsulator=%22&amp;f.owner.split=true&amp;f.owner.separator=%7C&amp;f.owner.encapsulator=%22&amp;f.filterquery.split=true&amp;f.filterquery.separator=%7C&amp;f.filterquery.encapsulator=%22&amp;f.p_group_map.split=true&amp;f.p_group_map.separator=%7C&amp;f.p_group_map.encapsulator=%22&amp;f.actorMemberGroupId.split=true&amp;f.actorMemberGroupId.separator=%7C&amp;f.actorMemberGroupId.encapsulator=%22&amp;f.bitstreamId.split=true&amp;f.bitstreamId.separator=%7C&amp;f.bitstreamId.encapsulator=%22&amp;
f.group_name.split=true&amp;f.group_name.separator=%7C&amp;f.group_name.encapsulator=%22&amp;f.p_communities_name.split=true&amp;f.p_communities_name.separator=%7C&amp;f.p_communities_name.encapsulator=%22&amp;f.query.split=true&amp;f.query.separator=%7C&amp;f.query.encapsulator=%22&amp;f.workflowStep.split=true&amp;f.workflowStep.separator=%7C&amp;f.workflowStep.encapsulator=%22&amp;f.containerCollection.split=true&amp;f.containerCollection.separator=%7C&amp;f.containerCollection.encapsulator=%22&amp;f.complete_query_search.split=true&amp;f.complete_query_search.separator=%7C&amp;f.complete_query_search.encapsulator=%22&amp;f.p_communities_id.split=true&amp;f.p_communities_id.separator=%7C&amp;f.p_communities_id.encapsulator=%22&amp;f.rangeDescription.split=true&amp;f.rangeDescription.separator=%7C&amp;f.rangeDescription.encapsulator=%22&amp;f.group_id.split=true&amp;f.group_id.separator=%7C&amp;f.group_id.encapsulator=%22&amp;f.bundleName.split=true&amp;f.bundleName.separator=%7C&amp;f.bundleName.encapsulator=%22&amp;f.ngram_simplequery_search.split=true&amp;f.ngram_simplequery_search.separator=%7C&amp;f.ngram_simplequery_search.encapsulator=%22&amp;f.group_map.split=true&amp;f.group_map.separator=%7C&amp;f.group_map.encapsulator=%22&amp;f.owningColl.split=true&amp;f.owningColl.separator=%7C&amp;f.owningColl.encapsulator=%22&amp;f.p_group_id.split=true&amp;f.p_group_id.separator=%7C&amp;f.p_group_id.encapsulator=%22&amp;f.p_group_name.split=true&amp;f.p_group_name.separator=%7C&amp;f.p_group_name.encapsulator=%22&amp;wt=javabin&amp;version=2 HTTP/1.1&#34; 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update?wt=javabin&amp;version=2 HTTP/1.1&#34; 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] &#34;POST /solr/datatables/update HTTP/1.1&#34; 200 40
</code></pre><ul>
<li>Very interesting&hellip; it creates the core and then fails somehow</li>
</ul>
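<ul>
<li>As a reading aid, here is a minimal Python sketch (not part of the original debugging) that decodes the URL-encoded <code>fq</code> filter from the export request in the access log above and checks the failing <code>POST</code> path for an empty segment. Both strings are copied from the log; the variable names are only illustrative, and the sketch cannot tell whether the doubled slash has anything to do with the 409 response:</li>
</ul>
<pre tabindex="0"><code>from urllib.parse import unquote_plus

# fq parameter copied from the GET /solr/statistics/select request in the access log
encoded_fq = (
    "time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+"
    "2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+"
    "2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29"
)

# unquote_plus turns "+" into spaces and %XX escapes back into characters
decoded_fq = unquote_plus(encoded_fq)
print(decoded_fq)

# Path of the failing POST; splitting on "/" exposes the doubled slash
update_path = "/solr//statistics-2016/update/csv"
has_empty_segment = "" in update_path.strip("/").split("/")
print(has_empty_segment)
</code></pre>
<ul>
<li>Decoded, the filter selects everything from the start of 2016 up to, but not including, the start of 2017, which matches the 9318 records the task says it is moving</li>
</ul>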
<h2 id="2017-01-08">2017-01-08</h2>
<ul>
<li>Put Sisay&rsquo;s <code>item-view.xsl</code> code to show mapped collections on CGSpace (<a href="https://github.com/ilri/DSpace/pull/295">#295</a>)</li>
</ul>
<h2 id="2017-01-09">2017-01-09</h2>
<ul>
<li>A user wrote to tell me that the new display of an item&rsquo;s mappings had a crazy bug for at least one item: <a href="https://cgspace.cgiar.org/handle/10568/78596">https://cgspace.cgiar.org/handle/10568/78596</a></li>
<li>She said she only mapped it once, but it appears to be mapped 184 times</li>
</ul>
<p><img src="/cgspace-notes/2017/01/mapping-crazy-duplicate.png" alt="Crazy item mapping"></p>
<h2 id="2017-01-10">2017-01-10</h2>
<ul>
<li>I tried to clean up the duplicate mappings by exporting the item&rsquo;s metadata to CSV, editing, and re-importing, but DSpace said &ldquo;no changes were detected&rdquo;</li>
<li>I&rsquo;ve asked on the dspace-tech mailing list to see if anyone can help</li>
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80596&#39;;
</code></pre><ul>
<li>Then I deleted the others:</li>
</ul>
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
</code></pre><ul>
<li>And in the item view it now shows the correct mappings</li>
<li>I will have to ask the DSpace people if this is a valid approach</li>
<li>Finish looking at the Journal Title corrections of the top 500 Journal Titles so we can make a controlled vocabulary from it</li>
</ul>
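<ul>
<li>The duplicate-mapping clean-up above can be rehearsed safely before touching the real database. This is only a sketch: it uses an in-memory SQLite table as a stand-in for the PostgreSQL one, the <code>collection_id</code> column is an assumption, and every row except the three kept ids is made up:</li>
</ul>
<pre tabindex="0"><code>import sqlite3

# In-memory stand-in for the DSpace table; the collection_id column and all
# row values except the three kept ids are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection2item (id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER)"
)
real_ids = (90792, 90806, 90807)
rows = [(row_id, 1, 80596) for row_id in real_ids]
# Ten hypothetical duplicate mappings for the same item
rows += [(91000 + n, 1, 80596) for n in range(10)]
# One mapping for an unrelated item that must survive the clean-up
rows.append((92000, 2, 12345))
conn.executemany("INSERT INTO collection2item VALUES (?, ?, ?)", rows)

def count_mappings(item_id):
    cursor = conn.execute(
        "SELECT COUNT(*) FROM collection2item WHERE item_id = ?", (item_id,)
    )
    return cursor.fetchone()[0]

before = count_mappings(80596)

# Same shape as the DELETE in the notes, with bound parameters; the
# connection context manager commits on success and rolls back on error
with conn:
    conn.execute(
        "DELETE FROM collection2item WHERE item_id = ? AND id NOT IN (?, ?, ?)",
        (80596, *real_ids),
    )

after = count_mappings(80596)
print(before, after, count_mappings(12345))
</code></pre>
<ul>
<li>Counting before and after, and checking that an unrelated item keeps its mapping, gives some confidence that the <code>NOT IN</code> list is scoped to the one item</li>
</ul>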
<h2 id="2017-01-11">2017-01-11</h2>
<ul>
<li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li>
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li>
</ul>
<pre tabindex="0"><code>Traceback (most recent call last):
File &#34;./fix-metadata-values.py&#34;, line 80, in &lt;module&gt;
print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0]))
UnicodeEncodeError: &#39;ascii&#39; codec can&#39;t encode character u&#39;\xe4&#39; in position 15: ordinal not in range(128)
</code></pre><ul>
<li>Seems we need to encode as UTF-8 before printing to screen, i.e.:</li>
</ul>
<pre tabindex="0"><code>print(&#34;Fixing {} occurences of: {}&#34;.format(records_to_fix, record[0].encode(&#39;utf-8&#39;)))
</code></pre><ul>
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li>
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Now get the top 500 journal titles:</li>
</ul>
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li>
<li>I will have to go through these and fix some more before making the controlled vocabulary</li>
<li>Added 30 more corrections or so, now there are 49 total and I&rsquo;ll have to get the top 500 after applying them</li>
</ul>
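<ul>
<li>The <code>UnicodeEncodeError</code> earlier in this entry comes from Python 2 implicitly encoding a unicode string with the ASCII codec when printing. A small sketch of the same failure, written for Python 3 where the encoding step is explicit; the sample string and the helper name are only illustrative:</li>
</ul>
<pre tabindex="0"><code># Sample value with a non-ASCII character, similar to the journal title above
title = "Ländlicher Raum"

def try_ascii(text):
    # The ASCII codec only covers code points 0 to 127, so "ä" cannot be encoded
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as error:
        return "failed at position {}".format(error.start)

ascii_result = try_ascii(title)

# UTF-8 can represent every code point; "ä" becomes the two bytes C3 A4
utf8_bytes = title.encode("utf-8")

print(ascii_result)
print(utf8_bytes)
</code></pre>
<ul>
<li>Decoding <code>utf8_bytes</code> as UTF-8 returns the original string, so nothing is lost by encoding before output</li>
</ul>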
<h2 id="2017-01-13">2017-01-13</h2>
<ul>
<li>Add <code>FOOD SYSTEMS</code> to CIAT subjects, waiting to merge: <a href="https://github.com/ilri/DSpace/pull/296">https://github.com/ilri/DSpace/pull/296</a></li>
</ul>
<h2 id="2017-01-16">2017-01-16</h2>
<ul>
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
</ul>
<pre tabindex="0"><code>/* 184 incorrect mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = &#39;80596&#39; and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = &#39;91082&#39;;
</code></pre><h2 id="2017-01-17">2017-01-17</h2>
<ul>
<li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li>
<li>There are about 30 files with <code>%20</code> (space) and Spanish accents in the file name</li>
<li>At first I thought we should fix these, but actually the <a href="https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1">W3C&rsquo;s HTML 4 specification recommends converting such characters to UTF-8 and URL-encoding them</a>!</li>
<li>And the file names don&rsquo;t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
</ul>
<pre tabindex="0"><code>value.replace(&#34;&#39;&#34;,&#39;%27&#39;)
</code></pre><ul>
<li>Add the item&rsquo;s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
</ul>
<pre tabindex="0"><code>value + &#34;__description:&#34; + cells[&#34;dc.type&#34;].value
</code></pre><ul>
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><ul>
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
<li>These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:</li>
</ul>
<pre tabindex="0"><code>$ convert -compress Zip -density 150x150 input.pdf output.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
</code></pre><ul>
<li>A page somewhere on the Internet suggested using a DPI of 144</li>
</ul>
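<ul>
<li>To illustrate the file name encoding discussed earlier in this entry (convert to UTF-8, then URL-encode), here is a short Python sketch. The file name is hypothetical, and this is not how the records were actually processed, since the apostrophe replacement above was done with an expression in the spreadsheet tool:</li>
</ul>
<pre tabindex="0"><code>from urllib.parse import quote, unquote

# Hypothetical file name with spaces, a Spanish accent and an apostrophe
file_name = "Evaluación d'impacto 2016.pdf"

# quote() encodes the text as UTF-8 and percent-encodes everything outside
# letters, digits and a few safe characters, so the apostrophe becomes %27
# and each space becomes %20
encoded_name = quote(file_name)
print(encoded_name)

# Decoding restores the original name
round_trip = unquote(encoded_name)
print(round_trip)
</code></pre>
<ul>
<li>The accented character turns into two percent-escapes because it takes two bytes in UTF-8</li>
</ul>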
<h2 id="2017-01-19">2017-01-19</h2>
<ul>
<li>In testing a random sample of CIAT&rsquo;s PDFs for compressibility, it looks like all of these methods generally increase the file size, so we will just import them as they are</li>
<li>Import 232 CIAT records into CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ JAVA_OPTS=&#34;-Xmx512m -Dfile.encoding=UTF-8&#34; /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &amp;&gt; /tmp/ciat.log
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
<ul>
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel&rsquo;s CSV exporter)</li>
<li>There were also some issues with an invalid dc.date.issued field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like ?show=full</li>
</ul>
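<ul>
<li>The kind of clean-up described above can be sketched in Python. The helper names and sample values are hypothetical, and note that <code>clean_cell</code> also collapses whitespace inside the value, which goes slightly beyond trimming the ends:</li>
</ul>
<pre tabindex="0"><code>from urllib.parse import urlsplit, urlunsplit

def clean_cell(value):
    # split() with no arguments breaks on any run of whitespace, including
    # carriage returns and newlines, and drops leading and trailing runs
    return " ".join(value.split())

def drop_query(url):
    # Keep scheme, host and path; discard the query string and fragment
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

title = clean_cell("  Livestock \r\n and   climate\n")
link = drop_query("https://cgspace.cgiar.org/handle/10568/78596?show=full")
print(title)
print(link)
</code></pre>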
<h2 id="2017-01-23">2017-01-23</h2>
<ul>
<li>I merged Atmire&rsquo;s pull request into the development branch so they can deploy it on DSpace Test</li>
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
</ul>
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child=&#34;$community&#34; &amp;&amp; /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child=&#34;$community&#34;; done
</code></pre><ul>
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
</ul>
<pre tabindex="0"><code>10568/42161 10568/171 10568/79341
10568/41914 10568/171 10568/79340
</code></pre><h2 id="2017-01-24">2017-01-24</h2>
<ul>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Run fixes for Journal titles on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p &#39;password&#39;
</code></pre><ul>
<li>Create a new list of the top 500 journal titles from the database:</li>
</ul>
<pre tabindex="0"><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
</code></pre><ul>
<li>Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (<a href="https://github.com/ilri/DSpace/pull/69">#69</a>)</li>
</ul>
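<ul>
<li>The vocabulary above was built by hand in OpenRefine, but the same step could probably be scripted. A rough Python sketch, assuming the CSV has the <code>text_value,count</code> layout produced by the query and assuming the vocabulary uses nested <code>node</code> elements inside <code>isComposedBy</code>; the sample titles, the root <code>id</code> and the root <code>label</code> are made up:</li>
</ul>
<pre tabindex="0"><code>import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the text_value,count layout of the exported CSV
csv_text = "Nature,12\nField Crops Research,40\nAgricultural Systems,25\n"

# Keep only the titles and sort them alphabetically
titles = sorted(row[0] for row in csv.reader(io.StringIO(csv_text)))

# Assumed vocabulary layout: a root node whose isComposedBy child holds
# one node per term; the root id and label here are placeholders
root = ET.Element("node", {"id": "dc-source", "label": "Journal titles"})
composed = ET.SubElement(root, "isComposedBy")
for title in titles:
    ET.SubElement(composed, "node", {"id": title, "label": title})

labels = [node.get("label") for node in composed]
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
</code></pre>
<ul>
<li>Building the tree with <code>ElementTree</code> rather than string concatenation means characters such as ampersands in journal titles get escaped automatically</li>
</ul>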
<h2 id="2017-01-25">2017-01-25</h2>
<ul>
<li>Atmire says the <code>com.atmire.statistics.util.UpdateSolrStorageReports</code> and <code>com.atmire.utils.ReportSender</code> are no longer necessary because they are using a Spring scheduler for these tasks now</li>
<li>Pull request to remove them from the Ansible templates: <a href="https://github.com/ilri/rmg-ansible-public/pull/80">https://github.com/ilri/rmg-ansible-public/pull/80</a></li>
<li>Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed:
<ul>
<li>XLS Export from Content statistics</li>
<li>Most popular items</li>
<li>Show statistics on collection pages</li>
</ul>
</li>
<li>But now we have a new issue with the &ldquo;Types&rdquo; in Content statistics not being respected—we only get the defaults, despite having custom settings in <code>dspace/config/modules/atmire-cua.cfg</code></li>
</ul>
<h2 id="2017-01-27">2017-01-27</h2>
<ul>
<li>Magdalena pointed out that somehow the Anonymous group had been added to the Administrators group on CGSpace (!)</li>
<li>Discuss plans to update CCAFS metadata and communities for their new flagships and phase II project identifiers</li>
<li>The flagships are in <code>cg.subject.ccafs</code>, and we probably need to make a new field for the phase II project identifiers</li>
</ul>
<h2 id="2017-01-28">2017-01-28</h2>
<ul>
<li>Merge controlled vocabulary for journal titles (<code>dc.source</code>) into CGSpace (<a href="https://github.com/ilri/DSpace/pull/298">#298</a>)</li>
<li>Merge new CIAT subject into CGSpace (<a href="https://github.com/ilri/DSpace/pull/296">#296</a>)</li>
</ul>
<h2 id="2017-01-29">2017-01-29</h2>
<ul>
<li>Run all system updates on DSpace Test, redeploy DSpace code, and reboot the server</li>
<li>Run all system updates on CGSpace, redeploy DSpace code, and reboot the server</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2022-03/">March, 2022</a></li>
<li><a href="/cgspace-notes/2022-02/">February, 2022</a></li>
<li><a href="/cgspace-notes/2022-01/">January, 2022</a></li>
<li><a href="/cgspace-notes/2021-12/">December, 2021</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>