- I checked to see if the Solr sharding task that is supposed to run on January 1st had run and saw there was an error
- I tested on DSpace Test as well and it doesn't work there either
- I asked on the dspace-tech mailing list because it seems to be broken, and actually now I'm not sure if we've ever had the sharding task run successfully over all these years
```
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:867)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request failed.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:659)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
```
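- For reference, the yearly sharding can also be kicked off manually with DSpace's `stats-util` tool, which is handy for reproducing the error on DSpace Test (a sketch, assuming `-s` is the shard option on our DSpace 5 setup and `[dspace]` is the installation directory):
```
$ [dspace]/bin/dspace stats-util -s
```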
- A user wrote to tell me that the new display of an item's mappings had a crazy bug for at least one item: https://cgspace.cgiar.org/handle/10568/78596
- She said she only mapped it once, but it appears to be mapped 184 times
- I tried to clean up the duplicate mappings by exporting the item's metadata to CSV, editing, and re-importing, but DSpace said "no changes were detected"
- I've asked on the dspace-tech mailing list to see if anyone can help
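- The export/edit/re-import round trip can be done with DSpace's batch metadata editing CLI, roughly like this (a sketch; the file path and eperson address are placeholders, and `[dspace]` is the installation directory):
```
$ [dspace]/bin/dspace metadata-export -i 10568/78596 -f /tmp/78596.csv
$ # edit the collection mappings in /tmp/78596.csv, then re-import as an administrator:
$ [dspace]/bin/dspace metadata-import -f /tmp/78596.csv -e admin@example.com
```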
- Helping clean up some file names in the 232 CIAT records that Sisay worked on last week
- There are about 30 files with `%20` (space) and Spanish accents in the file name
- At first I thought we should fix these, but actually it is [recommended by the W3C to convert these to UTF-8 and URL encode them](https://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1)!
- And the file names don't really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
- Seems like the only ones I should replace are the `'` apostrophe characters, as `%27`:
```
value.replace("'",'%27')
```
- Add the item's Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
```
value + "__description:" + cells["dc.type"].value
```
- Test importing of the new CIAT records (actually there are 232, not 234):
- Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB
- These are scanned from paper and likely have no compression, so we should test whether compression techniques like the following help without compromising the quality too much:
```
$ convert -compress Zip -density 150x150 input.pdf output.pdf
```
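- To run that over a random sample of the PDFs and compare sizes, a loop along these lines works (a sketch; the directory names are illustrative and the PDFs are assumed to be in the current directory):
```
$ mkdir -p /tmp/compressed
$ for pdf in *.pdf; do convert -compress Zip -density 150x150 "$pdf" "/tmp/compressed/$pdf"; done
$ du -sh . /tmp/compressed    # compare total size of originals vs compressed copies
```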
- In testing a random sample of CIAT's PDFs for compressibility, it looks like these methods generally increase the file size, so we will just import them as they are
- Looking at some records that Sisay is having problems importing into DSpace Test (it seems to be because of copious whitespace and carriage-return characters from Excel's CSV exporter)
- There were also some issues with an invalid dc.date.issued field, and I trimmed leading/trailing whitespace and cleaned up some URLs with unneeded parameters like `?show=full`
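- That kind of cleanup can also be scripted with sed before import (a sketch; the CSV file name is hypothetical, and fields with embedded newlines still need a CSV-aware tool like OpenRefine):
```
$ sed -i -e 's/\r//g' -e 's/[[:space:]]*$//' -e 's/?show=full//g' /tmp/ciat-records.csv    # strip CRs, trailing whitespace, and ?show=full
```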
- Create a new list of the top 500 journal titles from the database:
```
dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
```
- Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup ([pull request #298](https://github.com/ilri/DSpace/pull/298))
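- A rough way to generate the per-title entries for the vocabulary file from that CSV, assuming DSpace's usual controlled-vocabulary `<node>` layout (the column handling and output path are illustrative, and titles containing commas or other XML-special characters need manual attention either way):
```
$ cut -d, -f1 /tmp/journal-titles.csv | sed 's/&/\&amp;/g' | awk '{printf "<node id=\"%s\" label=\"%s\"/>\n", $0, $0}' > /tmp/journal-nodes.xml
```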
- This would be the last issue remaining to close the meta issue about switching to controlled vocabularies ([#69](https://github.com/ilri/DSpace/pull/69))
- Atmire says the `com.atmire.statistics.util.UpdateSolrStorageReports` and `com.atmire.utils.ReportSender` are no longer necessary because they are using a Spring scheduler for these tasks now
- Pull request to remove them from the Ansible templates: https://github.com/ilri/rmg-ansible-public/pull/80
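- For context, those two jobs are the sort of thing the cron templates invoke via `dspace dsrun` (a sketch; the actual schedule and wrapper in the Ansible templates may differ), which is now redundant since Atmire schedules them internally via Spring:
```
$ [dspace]/bin/dspace dsrun com.atmire.statistics.util.UpdateSolrStorageReports
$ [dspace]/bin/dspace dsrun com.atmire.utils.ReportSender
```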
- Still testing the Atmire modules on DSpace Test, and it looks like a few issues we had reported are now fixed:
- XLS Export from Content statistics
- Most popular items
- Show statistics on collection pages
- But now we have a new issue with the "Types" in Content statistics not being respected—we only get the defaults, despite having custom settings in `dspace/config/modules/atmire-cua.cfg`