- CGSpace was down for five hours in the morning while I was sleeping
- While looking in the logs for errors, I see tons of warnings about Atmire MQM:
```
2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
```
- I see thousands of them in the logs for the last few months, so it's not related to the DSpace 5.5 upgrade
- I've raised a ticket with Atmire to ask
- Another worrying error from dspace.log is:
```
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:789)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:111)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter.doFilter(DSpaceCocoonServletFilter.java:274)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.app.xmlui.cocoon.servlet.multipart.DSpaceMultipartFilter.doFilter(DSpaceMultipartFilter.java:119)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
at com.googlecode.psiprobe.Tomcat70AgentValve.invoke(Tomcat70AgentValve.java:44)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:180)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
- Also, the disk is nearly full because of log file issues, so I'm running some compression on DSpace logs
- Normally these stay uncompressed for a month just in case we need to look at them, so now I've just compressed anything older than 2 weeks so we can get some disk space back
- I did some basic benchmarking on a local DSpace before and after the JVM settings above, but there wasn't anything amazingly obvious
- I want to make the changes on DSpace Test and monitor the JVM heap graphs for a few days to see if they change the JVM GC patterns or anything (munin graphs)
- Spin up new CGSpace server on Linode
- I did a few traceroutes from Jordan and Kenya and it seems that Linode's Frankfurt datacenter is a few less hops and perhaps less packet loss than the London one, so I put the new server in Frankfurt
- Some author authority corrections and name standardizations for Peter:
```
dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
UPDATE 11
dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
UPDATE 36
dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
UPDATE 14
dspace=# update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
UPDATE 42
dspace=# update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
UPDATE 360
dspace=# update metadatavalue set text_value='Grace, Delia', authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
```
- Pay attention to the regex to prevent false positives in tricky cases with Dutch names!
- I will run these updates on DSpace Test and then force a Discovery reindex, and then run them on CGSpace next week
- More work on the KM4Dev Journal article
- In other news, it seems the batch edit patch is working, there are no more WARN errors in the logs and the batch edit seems to work
- I need to check the CGSpace logs to see if there are still errors there, and then deploy/monitor it there
- Paola from CCAFS mentioned she also has the "take task" bug on CGSpace
- Reading about [`shared_buffers` in PostgreSQL configuration](https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html) (default is 128MB)
- Looks like we have ~5GB of memory used by caches on the test server (after OS and JVM heap!), so we might as well bump up the buffers for Postgres
- The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn't dedicated (also runs Solr, which can benefit from OS cache) so let's try 1024MB
- In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):
```
$ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
"msg":"Unexpected character '[' (code 91) in prolog; expected '<'\n at [row,col {unknown-source}]: [1,1]",
"code":400}}
```
- When I try using the XML format I get an error that the `updateLog` needs to be configured for that core
- Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?
```
dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
UPDATE 561
```
- Then I'll reindex discovery and authority and see how the authority Solr core looks
- After this, now there are authorities for some of the "Grace, D." and "Grace, Delia" text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):
- So now I could set them all to this ID and the name would be ok, but there has to be a better way!
- In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!
- Better to use:
```
dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
```
- This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!
- Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
- Deploy MQM WARN fix on CGSpace ([#289](https://github.com/ilri/DSpace/pull/289))
- Deploy "take task" hack/fix on CGSpace ([#290](https://github.com/ilri/DSpace/pull/290))
- I ran the following author corrections and then reindexed discovery:
```
update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
update metadatavalue set authority='0d8369bb-57f7-4b2f-92aa-af820b183aca', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thornton, P%';
update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
- Something weird happened and Peter Thorne's names all ended up as "Thorne", I guess because the original authority had that as its name value:
```
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
Thorne, P. | 18349f29-61b1-44d7-ac60-89e55546e812 | 600
(6 rows)
```
- I generated a new UUID using `uuidgen | tr [A-Z] [a-z]` and set it along with correct name variation for all records:
```
dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
UPDATE 43
```
- Apparently we also need to normalize Phil Thornton's names to `Thornton, Philip K.`:
```
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
Thornton, Philip K | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, Philip K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
Thornton, P. K. | 0d8369bb-57f7-4b2f-92aa-af820b183aca | 600
(10 rows)
```
- Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:
```
dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
UPDATE 362
```
- It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)
- Everything looks ok after authority and discovery reindex
- In other news, I think we should really be using more RAM for PostgreSQL's `shared_buffers`
- The [PostgreSQL documentation](https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html) recommends using 25% of the system's RAM on dedicated systems, but we should use a bit less since we also have a massive JVM heap and also benefit from some RAM being used by the OS cache
## 2016-12-09
- More work on finishing rough draft of KM4Dev article
- Set PostgreSQL's `shared_buffers` on CGSpace to 10% of system RAM (1200MB)