January, 2017
2017-01-02
- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
- I tested on DSpace Test as well and it doesn’t work there either
- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
2017-01-04
I tried to shard my local dev instance and it fails the same way:
$ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s Moving: 9318 into core statistics-2016 Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016 org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) at org.dspace.statistics.SolrLogger.shardSolrIndex(SourceFile:2291) at org.dspace.statistics.util.StatisticsClient.main(StatisticsClient.java:106) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78) Caused by: org.apache.http.client.ClientProtocolException at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448) ... 10 more Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed. at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) ... 14 more Caused by: java.net.SocketException: Broken pipe (Write failed) at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) at java.net.SocketOutputStream.write(SocketOutputStream.java:153) at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181) at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:124) at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:181) at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:132) at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89) at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108) at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117) at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265) at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203) at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685) ... 16 more
And the DSpace log shows:
2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016 2017-01-04 22:39:05,412 INFO org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016 2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}->http://localhost:8081: Broken pipe (Write failed) 2017-01-04 22:39:07,310 INFO org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
Despite failing instantly, a
statistics-2016
directory was created, but it only has a data dir (no conf)The Tomcat access logs show more:
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77 127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248 127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.query.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156 127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41 127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
Very interesting… it creates the core and then fails somehow
2017-01-08
- Put Sisay’s
item-view.xsl
code to show mapped collections on CGSpace (#295)
2017-01-09
- A user wrote to tell me that the new display of an item’s mappings had a crazy bug for at least one item: https://cgspace.cgiar.org/handle/10568/78596
- She said she only mapped it once, but it appears to be mapped 184 times
2017-01-10
- I tried to clean up the duplicate mappings by exporting the item’s metadata to CSV, editing, and re-importing, but DSpace said “no changes were detected”
- I’ve asked on the dspace-tech mailing list to see if anyone can help
- I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help
For example, this shows 186 mappings for the item, the first three of which are real:
dspace=# select * from collection2item where item_id = '80596';
Then I deleted the others:
dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
And in the item view it now shows the correct mappings
I will have to ask the DSpace people if this is a valid approach
Finish looking at the Journal Title corrections of the top 500 Journal Titles so we can make a controlled vocabulary from it
2017-01-11
- Maria found another item with duplicate mappings: https://cgspace.cgiar.org/handle/10568/78658
Error in
fix-metadata-values.py
when it tries to print the value for Entwicklung & Ländlicher Raum:Traceback (most recent call last): File "./fix-metadata-values.py", line 80, in <module> print("Fixing {} occurences of: {}".format(records_to_fix, record[0])) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
Seems we need to encode as UTF-8 before printing to screen, ie:
print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before
Now back to cleaning up some journal titles so we can make the controlled vocabulary:
$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
Now get the top 500 journal titles:
dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November
I will have to go through these and fix some more before making the controlled vocabulary
Added 30 more corrections or so, now there are 49 total and I’ll have to get the top 500 after applying them
2017-01-13
- Add
FOOD SYSTEMS
to CIAT subjects, waiting to merge: https://github.com/ilri/DSpace/pull/296
2017-01-16
Fix the two items Maria found with duplicate mappings with this script:
/* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */ delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807); /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */ delete from collection2item where id = '91082';
2017-01-17
- Helping clean up some file names in the 232 CIAT records that Sisay worked on last week
- There are about 30 files with
%20
(space) and Spanish accents in the file name - At first I thought we should fix these, but actually it is prescribed by the W3 working group to convert these to UTF8 and URL encode them!
- And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
Seems like the only ones I should replace are the
'
apostrophe characters, as%27
:value.replace("'",'%27')
Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
value + "__description:" + cells["dc.type"].value
Test importing of the new CIAT records (actually there are 232, not 234):
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB
These are scanned from paper and likely have no compression, so we should try to test if these compression techniques help without comprimising the quality too much:
$ convert -compress Zip -density 150x150 input.pdf output.pdf $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
Somewhere on the Internet suggested using a DPI of 144
2017-01-19
- In testing a random sample of CIAT’s PDFs for compressability, it looks like all of these methods generally increase the file size so we will just import them as they are
Import 232 CIAT records into CGSpace:
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
2017-01-22
- Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)
- There were also some issues with an invalid dc.date.issued field, and I trimmed leading / trailing whitespace and cleaned up some URLs with unneeded parameters like ?show=full
2017-01-23
- I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test
Move some old ILRI Program communities to a new subcommunity for former programs (10568⁄79164):
$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
Move some collections with
move-collections.sh
using the following config:10568/42161 10568/171 10568/79341 10568/41914 10568/171 10568/79340
2017-01-24
- Run all updates on DSpace Test and reboot the server
Run fixes for Journal titles on CGSpace:
$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
Create a new list of the top 500 journal titles from the database:
dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (#298)
This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (#69)
2017-01-25
- Atmire says the
com.atmire.statistics.util.UpdateSolrStorageReports
andcom.atmire.utils.ReportSender
are no longer necessary because they are using a Spring scheduler for these tasks now - Pull request to remove them from the Ansible templates: https://github.com/ilri/rmg-ansible-public/pull/80
- Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed:
- XLS Export from Content statistics
- Most popular items
- Show statistics on collection pages
- But now we have a new issue with the “Types” in Content statistics not being respected—we only get the defaults, despite having custom settings in
dspace/config/modules/atmire-cua.cfg
2017-01-27
- Magdalena pointed out that somehow the Anonymous group had been added to the Administrators group on CGSpace (!)
- Discuss plans to update CCAFS metadata and communities for their new flagships and phase II project identifiers
- The flagships are in
cg.subject.ccafs
, and we need to probably make a new field for the phase II project identifiers
2017-01-28
- Merge controlled vocabulary for journal titles (
dc.source
) into CGSpace (#298) - Merge new CIAT subject into CGSpace (#296)
2017-01-29
- Run all system updates on DSpace Test, redeploy DSpace code, and reboot the server
- Run all system updates on CGSpace, redeploy DSpace code, and reboot the server