<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "January, 2017" / >
< meta property = "og:description" content = "2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn’t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2017-01/" / >
< meta property = "article:published_time" content = "2017-01-02T10:43:00+03:00" / >
< meta property = "article:modified_time" content = "2018-03-09T22:10:33+02:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "January, 2017" / >
< meta name = "twitter:description" content = "2017-01-02
I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
I tested on DSpace Test as well and it doesn’t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
"/>
< meta name = "generator" content = "Hugo 0.55.6" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "January, 2017",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2017-01\/",
"wordCount": "1594",
"datePublished": "2017-01-02T10:43:00+03:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2017-01/" >
< title > January, 2017 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.css" rel = "stylesheet" integrity = "sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin = "anonymous" >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" > < a href = "https://alanorth.github.io/cgspace-notes/2017-01/" > January, 2017< / a > < / h2 >
< p class = "blog-post-meta" > < time datetime = "2017-01-02T10:43:00+03:00" > Mon Jan 02, 2017< / time > by Alan Orth in
< i class = "fa fa-tag" aria-hidden = "true" > < / i > < a href = "/cgspace-notes/tags/notes" rel = "tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2017-01-02" > 2017-01-02< / h2 >
< ul >
< li > I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error< / li >
< li > I tested on DSpace Test as well and it doesn’t work there either< / li >
< li > I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years< / li >
< / ul >
< h2 id = "2017-01-04" > 2017-01-04< / h2 >
< ul >
< li > < p > I tried to shard my local dev instance and it fails the same way:< / p >
< pre > < code > $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
Moving: 9318 into core statistics-2016
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291)
at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Caused by: org.apache.http.client.ClientProtocolException
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
... 14 more
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124)
at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181)
at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
... 16 more
< / code > < / pre > < / li >
< li > < p > And the DSpace log shows:< / p >
< pre > < code > 2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}-> http://localhost:8081: Broken pipe (Write failed)
2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}-> http://localhost:8081
< / code > < / pre > < / li >
< li > < p > Despite failing instantly, a < code > statistics-2016< / code > directory was created, but it only has a data dir (no conf)< / p > < / li >
< li > < p > The Tomcat access logs show more:< / p >
< pre > < code > 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
< / code > < / pre > < / li >
< li > < p > Very interesting… it creates the core and then fails somehow< / p > < / li >
< / ul >
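< p > A minimal sketch, not part of the original notes, of how the failing request could be picked out of a Tomcat access log like the one above. It assumes the common log format seen in that excerpt; the function name find_failed_requests is hypothetical, and the second sample line has its query string shortened for readability.< / p >

```python
import re

# Common log format, as in the access log excerpt:
# host ident user [timestamp] "METHOD path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def find_failed_requests(lines):
    """Return (method, path without query string, status) for each non-2xx entry."""
    failures = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an access log entry
        status = int(match.group("status"))
        if not 200 <= status < 300:
            path = match.group("path").split("?", 1)[0]
            failures.append((match.group("method"), path, status))
    return failures


# Sample lines based on the log above; the POST query string is abbreviated here.
sample = [
    '127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248',
    '127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&wt=javabin&version=2 HTTP/1.1" 409 156',
]
print(find_failed_requests(sample))
```

< p > Run against the excerpt, this would single out the POST to /solr//statistics-2016/update/csv that returned HTTP 409, which is the only request in that log that did not succeed.< / p >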
< h2 id = "2017-01-08" > 2017-01-08< / h2 >
< ul >
< li > Put Sisay’s < code > item-view.xsl< / code > code to show mapped collections on CGSpace (< a href = "https://github.com/ilri/DSpace/pull/295" > #295< / a > )< / li >
< / ul >
< h2 id = "2017-01-09" > 2017-01-09< / h2 >
< ul >
< li > A user wrote to tell me that the new display of an item’s mappings had a crazy bug for at least one item: < a href = "https://cgspace.cgiar.org/handle/10568/78596" > https://cgspace.cgiar.org/handle/10568/78596< / a > < / li >
< li > She said she only mapped it once, but it appears to be mapped 184 times< / li >
< / ul >
< p > < img src = "/cgspace-notes/2017/01/mapping-crazy-duplicate.png" alt = "Crazy item mapping" / > < / p >
< h2 id = "2017-01-10" > 2017-01-10< / h2 >
< ul >
< li > I tried to clean up the duplicate mappings by exporting the item’s metadata to CSV, editing, and re-importing, but DSpace said “no changes were detected”< / li >
< li > I’ve asked on the dspace-tech mailing list to see if anyone can help< / li >
< li > I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help< / li >
< li > < p > For example, this shows 186 mappings for the item, the first three of which are real:< / p >
< pre > < code > dspace=# select * from collection2item where item_id = '80596';
< / code > < / pre > < / li >
< li > < p > Then I deleted the others:< / p >
< pre > < code > dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
< / code > < / pre > < / li >
< li > < p > And in the item view it now shows the correct mappings< / p > < / li >
< li > < p > I will have to ask the DSpace people if this is a valid approach< / p > < / li >
< li > < p > Finish looking at the Journal Title corrections of the top 500 Journal Titles so we can make a controlled vocabulary from it< / p > < / li >
< / ul >
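< p > A hedged sketch of the same clean-up done a little more defensively: confirm that the mappings to keep really exist before deleting the rest, and do both in one transaction. This is an illustration only. It uses an in-memory SQLite table rather than the real PostgreSQL database, the collection_id column and the duplicate row ids are assumptions, and delete_extra_mappings is a hypothetical helper.< / p >

```python
import sqlite3


def delete_extra_mappings(conn, item_id, keep_ids):
    """Delete every collection2item row for item_id except the ids in keep_ids.

    Returns the number of deleted rows. Nothing is deleted if any of the
    rows that should be kept cannot be found.
    """
    placeholders = ",".join("?" for _ in keep_ids)
    with conn:  # commits on success, rolls back if an exception is raised
        found = conn.execute(
            f"SELECT COUNT(*) FROM collection2item "
            f"WHERE item_id = ? AND id IN ({placeholders})",
            [item_id, *keep_ids],
        ).fetchone()[0]
        if found != len(keep_ids):
            raise ValueError("not all mappings to keep exist for this item")
        cursor = conn.execute(
            f"DELETE FROM collection2item "
            f"WHERE item_id = ? AND id NOT IN ({placeholders})",
            [item_id, *keep_ids],
        )
        return cursor.rowcount


# Stand-in data: three real mappings plus 183 made-up duplicates (186 rows in total).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection2item "
    "(id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER)"
)
real = [(90792, 1, 80596), (90806, 2, 80596), (90807, 3, 80596)]
dupes = [(91000 + n, 3, 80596) for n in range(183)]
conn.executemany("INSERT INTO collection2item VALUES (?, ?, ?)", real + dupes)

deleted = delete_extra_mappings(conn, 80596, [90792, 90806, 90807])
remaining = conn.execute(
    "SELECT COUNT(*) FROM collection2item WHERE item_id = 80596"
).fetchone()[0]
print(deleted, remaining)
```

< p > Whether deleting rows from collection2item directly is a valid approach in DSpace is still the open question noted above; the sketch only shows the guard around the delete.< / p >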
< h2 id = "2017-01-11" > 2017-01-11< / h2 >
< ul >
< li > Maria found another item with duplicate mappings: < a href = "https://cgspace.cgiar.org/handle/10568/78658" > https://cgspace.cgiar.org/handle/10568/78658< / a > < / li >
< li > < p > Error in < code > fix-metadata-values.py< / code > when it tries to print the value for Entwicklung & Ländlicher Raum:< / p >
< pre > < code > Traceback (most recent call last):
  File "./fix-metadata-values.py", line 80, in &lt;module&gt;
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
< / code > < / pre > < / li >
< li > < p > It seems we need to encode the value as UTF-8 before printing it to the screen, i.e.:< / p >
< pre > < code > print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
< / code > < / pre > < / li >
< li > < p > See: < a href = "http://stackoverflow.com/a/36427358/487333" > http://stackoverflow.com/a/36427358/487333< / a > < / p > < / li >
< li > < p > I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before< / p > < / li >
< li > < p > Now back to cleaning up some journal titles so we can make the controlled vocabulary:< / p >
< pre > < code > $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
< / code > < / pre > < / li >
< li > < p > Now get the top 500 journal titles:< / p >
< pre > < code > dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
< / code > < / pre > < / li >
< li > < p > The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November< / p > < / li >
< li > < p > I will have to go through these and fix some more before making the controlled vocabulary< / p > < / li >
< li > < p > Added 30 more corrections or so, now there are 49 total and I’ll have to get the top 500 after applying them< / p > < / li >
< / ul >
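< p > The traceback above comes from Python 2 (note the u'\xe4'). As a small illustration of the underlying problem, the following Python 3 sketch reproduces the failure for the same journal title: the ASCII codec cannot represent the character at index 15, while UTF-8 can. The helper safe_ascii is hypothetical and only shows one way to print such values on an ASCII-only terminal.< / p >

```python
# The journal title from the traceback; \u00e4 is the letter a-umlaut.
title = "Entwicklung & L\u00e4ndlicher Raum"


def safe_ascii(text):
    """Return text with every non-ASCII character replaced by a backslash escape."""
    return text.encode("ascii", errors="backslashreplace").decode("ascii")


try:
    title.encode("ascii")
except UnicodeEncodeError as error:
    failed_at = error.start  # index of the first character that cannot be encoded
    print("cannot encode {!r} at position {}".format(title[failed_at], failed_at))

utf8_bytes = title.encode("utf-8")  # UTF-8 can represent every character in the title
print(len(title), len(utf8_bytes))
print(safe_ascii(title))
```

< p > The UTF-8 form is one byte longer than the string because the umlaut takes two bytes, which is why an ASCII-only output stream rejects it.< / p >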
< h2 id = "2017-01-13" > 2017-01-13< / h2 >
< ul >
< li > Add < code > FOOD SYSTEMS< / code > to CIAT subjects, waiting to merge: < a href = "https://github.com/ilri/DSpace/pull/296" > https://github.com/ilri/DSpace/pull/296< / a > < / li >
< / ul >
< h2 id = "2017-01-16" > 2017-01-16< / h2 >
< ul >
< li > < p > Fix the two items Maria found with duplicate mappings with this script:< / p >
< pre > < code > /* 184 incorrect mappings: https://cgspace.cgiar.org/handle/10568/78596 */
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
delete from collection2item where id = '91082';
< / code > < / pre > < / li >
< / ul >
< h2 id = "2017-01-17" > 2017-01-17< / h2 >
< ul >
< li > Helping clean up some file names in the 232 CIAT records that Sisay worked on last week< / li >
< li > There are about 30 files with < code > %20< / code > (space) and Spanish accents in the file name< / li >
< li > At first I thought we should fix these, but actually it is < a href = "https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1" > recommended by the W3C to convert these to UTF-8 and URL encode them< / a > !< / li >
< li > And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore< / li >
< li > < p > Seems like the only ones I should replace are the < code > '< / code > apostrophe characters, as < code > %27< / code > :< / p >
< pre > < code > value.replace("'",'%27')
< / code > < / pre > < / li >
< li > < p > Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:< / p >
< pre > < code > value + "__description:" + cells["dc.type"].value
< / code > < / pre > < / li >
< li > < p > Test importing of the new CIAT records (actually there are 232, not 234):< / p >
< pre > < code > $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
< / code > < / pre > < / li >
< li > < p > Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB< / p > < / li >
< li > < p > These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:< / p >
< pre > < code > $ convert -compress Zip -density 150x150 input.pdf output.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
< / code > < / pre > < / li >
< li > < p > Someone on the Internet suggested using a DPI of 144< / p > < / li >
< / ul >
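< p > The two filename expressions above were written for the spreadsheet clean-up tool. A rough Python equivalent, offered only as a sketch, might look like the following. The function name saf_filename and the example file names are made up; the < code > __description:< / code > hint and the < code > %27< / code > replacement come from the notes above. The last line shows the UTF-8 percent-encoding that the W3C note describes, using the standard library.< / p >

```python
from urllib.parse import quote


def saf_filename(filename, item_type):
    """Escape apostrophes as %27 and append the item type as a description hint."""
    escaped = filename.replace("'", "%27")
    return escaped + "__description:" + item_type


# Hypothetical file name and type, for illustration only.
print(saf_filename("L'eau.pdf", "Report"))

# Percent-encode a name with a space and an accent (UTF-8 bytes, then %XX escapes).
print(quote("Evaluaci\u00f3n final.pdf"))
```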
< h2 id = "2017-01-19" > 2017-01-19< / h2 >
< ul >
< li > In testing a random sample of CIAT’s PDFs for compressibility, it looks like all of these methods generally increase the file size, so we will just import them as they are< / li >
< li > < p > Import 232 CIAT records into CGSpace:< / p >
< pre > < code > $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
< / code > < / pre > < / li >
< / ul >
< h2 id = "2017-01-22" > 2017-01-22< / h2 >
< ul >
< li > Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)< / li >
< li > There were also some issues with an invalid dc.date.issued field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like ?show=full< / li >
< / ul >
< h2 id = "2017-01-23" > 2017-01-23< / h2 >
< ul >
< li > I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test< / li >
< li > < p > Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):< / p >
< pre > < code > $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
< / code > < / pre > < / li >
< li > < p > Move some collections with < a href = "https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515" > < code > move-collections.sh< / code > < / a > using the following config:< / p >
< pre > < code > 10568/42161 10568/171 10568/79341
10568/41914 10568/171 10568/79340
< / code > < / pre > < / li >
< / ul >
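< p > For reviewing a batch move like the community-filiator loop above before running it, a dry run that only prints the commands can help. The sketch below is an assumption-laden illustration: filiator_commands is a hypothetical helper, and it builds the same --remove and --set invocations as the shell loop without executing anything.< / p >

```python
import shlex


def filiator_commands(dspace_bin, old_parent, new_parent, children):
    """Return one (remove, attach) pair of argument lists per child community."""
    pairs = []
    for child in children:
        remove = [dspace_bin, "community-filiator", "--remove",
                  f"--parent={old_parent}", f"--child={child}"]
        attach = [dspace_bin, "community-filiator", "--set",
                  f"--parent={new_parent}", f"--child={child}"]
        pairs.append((remove, attach))
    return pairs


children = ["10568/171", "10568/27868", "10568/231", "10568/27869",
            "10568/150", "10568/230", "10568/32724", "10568/172"]
pairs = filiator_commands("/home/cgspace.cgiar.org/bin/dspace",
                          "10568/27866", "10568/79164", children)

for remove, attach in pairs:
    # Dry run only. As with the && in the shell loop, a real run should
    # attach a community only after its removal succeeded.
    print(shlex.join(remove), "&&", shlex.join(attach))
```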
< h2 id = "2017-01-24" > 2017-01-24< / h2 >
< ul >
< li > Run all updates on DSpace Test and reboot the server< / li >
< li > < p > Run fixes for Journal titles on CGSpace:< / p >
< pre > < code > $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
< / code > < / pre > < / li >
< li > < p > Create a new list of the top 500 journal titles from the database:< / p >
< pre > < code > dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
< / code > < / pre > < / li >
< li > < p > Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (< a href = "https://github.com/ilri/DSpace/pull/298" > #298< / a > )< / p > < / li >
< li > < p > This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (< a href = "https://github.com/ilri/DSpace/pull/69" > #69< / a > )< / p > < / li >
< / ul >
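< p > The XML markup for the controlled vocabulary was added by hand, but the step could probably be scripted. The sketch below reads a CSV in the shape produced by the \copy command (title, count) and emits nested node elements. The node / isComposedBy layout reflects my understanding of DSpace's controlled vocabulary files and should be checked against an existing vocabulary; the vocabulary id dc-source, the function name, and the sample rows with their counts are invented for illustration.< / p >

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_vocabulary(csv_text, vocabulary_id):
    """Build a vocabulary XML string from CSV rows of the form: title,count."""
    root = ET.Element("node", {"id": vocabulary_id, "label": ""})
    composed = ET.SubElement(root, "isComposedBy")
    rows = csv.reader(io.StringIO(csv_text))
    # Keep only the title column, drop blanks and duplicates, and sort.
    titles = sorted({row[0].strip() for row in rows if row and row[0].strip()})
    for title in titles:
        ET.SubElement(composed, "node", {"id": title, "label": title})
    return ET.tostring(root, encoding="unicode")


# Made-up sample rows; real input would be /tmp/journal-titles.csv.
sample_csv = 'Nature,120\nScience,95\n"Agriculture, Ecosystems & Environment",80\n'
vocabulary_xml = build_vocabulary(sample_csv, "dc-source")
print(vocabulary_xml)
```

< p > Using an XML library rather than string concatenation means characters such as the ampersand in a journal title are escaped automatically.< / p >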
< h2 id = "2017-01-25" > 2017-01-25< / h2 >
< ul >
< li > Atmire says the < code > com.atmire.statistics.util.UpdateSolrStorageReports< / code > and < code > com.atmire.utils.ReportSender< / code > are no longer necessary because they are using a Spring scheduler for these tasks now< / li >
< li > Pull request to remove them from the Ansible templates: < a href = "https://github.com/ilri/rmg-ansible-public/pull/80" > https://github.com/ilri/rmg-ansible-public/pull/80< / a > < / li >
< li > Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed:
< ul >
< li > XLS Export from Content statistics< / li >
< li > Most popular items< / li >
< li > Show statistics on collection pages< / li >
< / ul > < / li >
< li > But now we have a new issue with the “Types” in Content statistics not being respected—we only get the defaults, despite having custom settings in < code > dspace/config/modules/atmire-cua.cfg< / code > < / li >
< / ul >
< h2 id = "2017-01-27" > 2017-01-27< / h2 >
< ul >
< li > Magdalena pointed out that somehow the Anonymous group had been added to the Administrators group on CGSpace (!)< / li >
< li > Discuss plans to update CCAFS metadata and communities for their new flagships and phase II project identifiers< / li >
< li > The flagships are in < code > cg.subject.ccafs< / code > , and we need to probably make a new field for the phase II project identifiers< / li >
< / ul >
< h2 id = "2017-01-28" > 2017-01-28< / h2 >
< ul >
< li > Merge controlled vocabulary for journal titles (< code > dc.source< / code > ) into CGSpace (< a href = "https://github.com/ilri/DSpace/pull/298" > #298< / a > )< / li >
< li > Merge new CIAT subject into CGSpace (< a href = "https://github.com/ilri/DSpace/pull/296" > #296< / a > )< / li >
< / ul >
< h2 id = "2017-01-29" > 2017-01-29< / h2 >
< ul >
< li > Run all system updates on DSpace Test, redeploy DSpace code, and reboot the server< / li >
< li > Run all system updates on CGSpace, redeploy DSpace code, and reboot the server< / li >
< / ul >
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2019-05/" > May, 2019< / a > < / li >
< li > < a href = "/cgspace-notes/posts/" > Posts< / a > < / li >
< li > < a href = "/cgspace-notes/2019-04/" > April, 2019< / a > < / li >
< li > < a href = "/cgspace-notes/2019-03/" > March, 2019< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >