cgspace-notes/content/2016-03.md

+++
date = "2016-03-02T16:50:00+03:00"
author = "Alan Orth"
title = "March, 2016"
tags = ["notes"]
image = "../images/bg.jpg"

+++
## 2016-03-02

- Looking at issues with author authorities on CGSpace
- For some reason we still have the `index-lucene-update` cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server

## 2016-03-07

- Troubleshooting the issues with the slew of commits for Atmire modules in [#182](https://github.com/ilri/DSpace/pull/182)
- Their changes on `5_x-dev` branch work, but it is messy as hell with merge commits and old branch base
- When I rebase their branch on the latest `5_x-prod` I get blank white pages
- I identified one commit that causes the issue and let them know
- Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:

```
Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
```

## 2016-03-08

- Add a few new filters to Atmire's Listings and Reports module ([#180](https://github.com/ilri/DSpace/issues/180))
- We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were

## 2016-03-10

- Disable the lucene cron job on CGSpace as it shouldn't be needed anymore
- Discuss ORCiD and duplicate authors on Yammer
- Request new documentation for Atmire CUA and L&R modules, as ours are from 2013
- Walk Sisay through some data cleaning workflows in OpenRefine
- Start cleaning up the configuration for Atmire's CUA module ([#184](https://github.com/ilri/DSpace/issues/185))
- It is very messed up because some labels are incorrect, fields are missing, etc

![Mixed up label in Atmire CUA](../images/2016/03/cua-label-mixup.png)

- Update documentation for Atmire modules

## 2016-03-11

- As I was looking at the CUA config I realized our Discovery config is all messed up and confusing
- I've opened an issue to track some of that work ([#186](https://github.com/ilri/DSpace/issues/186))
- I did some major cleanup work on Discovery and XMLUI stuff related to the `dc.type` indexes ([#187](https://github.com/ilri/DSpace/pull/187))
- We had been confusing `dc.type` (a Dublin Core value) with `dc.type.output` (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.
- There is still some more work to be done to remove references to old `outputtype` and `output`

## 2016-03-14

- Fix some items that had invalid dates (I noticed them in the log during a re-indexing)
- Reset `search.index.*` to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): [#188](https://github.com/ilri/DSpace/pull/188)
- Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) ([#186](https://github.com/ilri/DSpace/issues/186))
- Also four or so center-specific subject strings were missing for Discovery

![Missing XMLUI string](../images/2016/03/missing-xmlui-string.png)

## 2016-03-15

- Create simple theme for new AVCD community just for a unique Google Tracking ID ([#191](https://github.com/ilri/DSpace/pull/191))

## 2016-03-16

- Still having problems deploying Atmire's CUA updates and fixes from January!
- More discussion on the GitHub issue here: https://github.com/ilri/DSpace/pull/182
- Clean up Atmire CUA config ([#193](https://github.com/ilri/DSpace/pull/193))
- Help Sisay with some PostgreSQL queries to clean up the incorrect `dc.contributor.corporateauthor` field
- I noticed that we have some weird values in `dc.language`:

```
# select * from metadatavalue where metadata_field_id=37;
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           1942571 |       35342 |                37 | hi         |           |     1 |           |         -1 |                2
           1942468 |       35345 |                37 | hi         |           |     1 |           |         -1 |                2
           1942479 |       35337 |                37 | hi         |           |     1 |           |         -1 |                2
           1942505 |       35336 |                37 | hi         |           |     1 |           |         -1 |                2
           1942519 |       35338 |                37 | hi         |           |     1 |           |         -1 |                2
           1942535 |       35340 |                37 | hi         |           |     1 |           |         -1 |                2
           1942555 |       35341 |                37 | hi         |           |     1 |           |         -1 |                2
           1942588 |       35343 |                37 | hi         |           |     1 |           |         -1 |                2
           1942610 |       35346 |                37 | hi         |           |     1 |           |         -1 |                2
           1942624 |       35347 |                37 | hi         |           |     1 |           |         -1 |                2
           1942639 |       35339 |                37 | hi         |           |     1 |           |         -1 |                2
```

- It seems this `dc.language` field isn't really used, but we should delete these values
- Also, `dc.language.iso` has some weird values, like "En" and "English"

## 2016-03-17

- It turns out `hi` is the ISO 639 language code for Hindi, but these should be in `dc.language.iso` instead of `dc.language`
- I fixed the eleven items with `hi` as well as some using the incorrect `vn` for Vietnamese
- Start discussing CG core with Abenet and Sisay
- Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches
- The patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing

## 2016-03-18

- Merge Atmire fixes into `5_x-prod`
- Discuss thumbnails with Francesca from Bioversity
- Some of their items end up with thumbnails that have a big white border around them:

![Excessive whitespace in thumbnail](../images/2016/03/bioversity-thumbnail-bad.jpg)

- Turns out we can add `-trim` to the GraphicsMagick options to trim the whitespace

![Trimmed thumbnail](../images/2016/03/bioversity-thumbnail-good.jpg)

- Command used:

```
$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
```

- Also, it looks like adding `-sharpen 0x1.0` really improves the quality of the image for only a few KB

## 2016-03-21

- Fix 66 site errors in Google's webmaster tools
- I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed
- We also have 1,300 "soft 404" errors for URLs like: https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity
- I've marked them as fixed as well since the ones I tested were working fine
- This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem...
- Results pages like this give items that Google already knows from the sitemap: https://cgspace.cgiar.org/discover?filtertype=author&filter_relational_operator=equals&filter=Orth%2C+A.
- There are some access denied errors on JSPUI links (of course! we forbid them!), but I'm not sure why Google is trying to index them...
- For example:
  - This: https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf
  - Linked from: https://cgspace.cgiar.org/jspui/handle/10568/809
- I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!
- Google says the first time it saw this particular error was September 29, 2015... so maybe it accidentally saw it somehow...
- On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content

![CGSpace pages in Google index](../images/2016/03/google-index.png)

- Turns out this is a problem with DSpace's `robots.txt`, and there's a Jira ticket since December, 2015: https://jira.duraspace.org/browse/DS-2962
- I am not sure if I want to apply it yet
- For now I've just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools

![URL parameters cause millions of dynamic pages](../images/2016/03/url-parameters.png)
![Setting pages with the filter_0 param not to show in search results](../images/2016/03/url-parameters2.png)

- Move AVCD collection to new community and update `move_collection.sh` script: https://gist.github.com/alanorth/392c4660e8b022d99dfa
- It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs
- De-deploy CGSpace with latest `5_x-prod` branch
- Run updates on CGSpace and reboot server (new kernel, `4.5.0`)
- Deploy Let's Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks

## 2016-03-22

- Merge robots.txt patch and disallow indexing of browse pages as our sitemap is consumed correctly ([#198](https://github.com/ilri/DSpace/issues/198))
Add notes for 2016-03 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-02 15:52:14 +01:00			`+++`
			`date = "2016-03-02T16:50:00+03:00"`
			`author = "Alan Orth"`
			`title = "March, 2016"`
			`tags = ["notes"]`
			`image = "../images/bg.jpg"`

			`+++`
			`## 2016-03-02`

			`- Looking at issues with author authorities on CGSpace`
			- For some reason we still have the `index-lucene-update` cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module
Update notes for 2016-03-02 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-02 18:31:16 +01:00			`- Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server`
Add notes for 2016-03-07 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-07 18:23:16 +01:00
			`## 2016-03-07`

			`- Troubleshooting the issues with the slew of commits for Atmire modules in [#182](https://github.com/ilri/DSpace/pull/182)`
			- Their changes on `5_x-dev` branch work, but it is messy as hell with merge commits and old branch base
			- When I rebase their branch on the latest `5_x-prod` I get blank white pages
			`- I identified one commit that causes the issue and let them know`
Update notes for 2016-03-07 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-07 20:22:38 +01:00			`- Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:`

			```
			`Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device`
			```
Add notes for 2016-03-08 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-08 17:57:02 +01:00
			`## 2016-03-08`

			`- Add a few new filters to Atmire's Listings and Reports module ([#180](https://github.com/ilri/DSpace/issues/180))`
			`- We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were`
Add notes for 2016-03-10 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-10 17:00:43 +01:00
			`## 2016-03-10`

			`- Disable the lucene cron job on CGSpace as it shouldn't be needed anymore`
			`- Discuss ORCiD and duplicate authors on Yammer`
			`- Request new documentation for Atmire CUA and L&R modules, as ours are from 2013`
			`- Walk Sisay through some data cleaning workflows in OpenRefine`
			`- Start cleaning up the configuration for Atmire's CUA module ([#184](https://github.com/ilri/DSpace/issues/185))`
			`- It is very messed up because some labels are incorrect, fields are missing, etc`
Update notes for 2016-03-10 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-10 17:04:27 +01:00
			`![Mixed up label in Atmire CUA](../images/2016/03/cua-label-mixup.png)`
Update notes for 2015-03-10 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-10 17:07:54 +01:00
			`- Update documentation for Atmire modules`
Add notes for 2016-03-11 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-11 18:16:33 +01:00
			`## 2016-03-11`

			`- As I was looking at the CUA config I realized our Discovery config is all messed up and confusing`
			`- I've opened an issue to track some of that work ([#186](https://github.com/ilri/DSpace/issues/186))`
			- I did some major cleanup work on Discovery and XMLUI stuff related to the `dc.type` indexes ([#187](https://github.com/ilri/DSpace/pull/187))
			- We had been confusing `dc.type` (a Dublin Core value) with `dc.type.output` (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.
			- There is still some more work to be done to remove references to old `outputtype` and `output`
Add notes for 2016-03-14 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-14 19:34:54 +01:00
			`## 2016-03-14`

			`- Fix some items that had invalid dates (I noticed them in the log during a re-indexing)`
			- Reset `search.index.*` to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): [#188](https://github.com/ilri/DSpace/pull/188)
Update notes for 2016-03-14 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-16 07:53:44 +01:00			`- Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) ([#186](https://github.com/ilri/DSpace/issues/186))`
			`- Also four or so center-specific subject strings were missing for Discovery`

			`![Missing XMLUI string](../images/2016/03/missing-xmlui-string.png)`

Adjust notes for 2016-03-15 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-16 07:54:23 +01:00			`## 2016-03-15`

Update notes for 2016-03-14 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-16 07:53:44 +01:00			`- Create simple theme for new AVCD community just for a unique Google Tracking ID ([#191](https://github.com/ilri/DSpace/pull/191))`
Update notes for 2016-03-16 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-17 07:45:51 +01:00
			`## 2016-03-16`

			`- Still having problems deploying Atmire's CUA updates and fixes from January!`
			`- More discussion on the GitHub issue here: https://github.com/ilri/DSpace/pull/182`
			`- Clean up Atmire CUA config ([#193](https://github.com/ilri/DSpace/pull/193))`
			- Help Sisay with some PostgreSQL queries to clean up the incorrect `dc.contributor.corporateauthor` field
			- I noticed that we have some weird values in `dc.language`:

			```
			`# select * from metadatavalue where metadata_field_id=37;`
			`metadata_value_id \| resource_id \| metadata_field_id \| text_value \| text_lang \| place \| authority \| confidence \| resource_type_id`
			`-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------`
			`1942571 \| 35342 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942468 \| 35345 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942479 \| 35337 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942505 \| 35336 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942519 \| 35338 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942535 \| 35340 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942555 \| 35341 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942588 \| 35343 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942610 \| 35346 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942624 \| 35347 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			`1942639 \| 35339 \| 37 \| hi \| \| 1 \| \| -1 \| 2`
			```

			- It seems this `dc.language` field isn't really used, but we should delete these values
			- Also, `dc.language.iso` has some weird values, like "En" and "English"
Add notes for 2016-03-17 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-17 14:56:24 +01:00
			`## 2016-03-17`

			- It turns out `hi` is the ISO 639 language code for Hindi, but these should be in `dc.language.iso` instead of `dc.language`
			- I fixed the eleven items with `hi` as well as some using the incorrect `vn` for Vietnamese
			`- Start discussing CG core with Abenet and Sisay`
			`- Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches`
Add notes for 2016-03-18 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-18 15:20:24 +01:00			`- The patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing`

			`## 2016-03-18`

			- Merge Atmire fixes into `5_x-prod`
			`- Discuss thumbnails with Francesca from Bioversity`
			`- Some of their items end up with thumbnails that have a big white border around them:`

			`![Excessive whitespace in thumbnail](../images/2016/03/bioversity-thumbnail-bad.jpg)`

			- Turns out we can add `-trim` to the GraphicsMagick options to trim the whitespace

			`![Trimmed thumbnail](../images/2016/03/bioversity-thumbnail-good.jpg)`

			`- Command used:`

			```
			`$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg`
			```

			- Also, it looks like adding `-sharpen 0x1.0` really improves the quality of the image for only a few KB
Add notes for 2016-03-21 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-21 18:44:45 +01:00
			`## 2016-03-21`

			`- Fix 66 site errors in Google's webmaster tools`
			`- I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed`
			`- We also have 1,300 "soft 404" errors for URLs like: https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity`
			`- I've marked them as fixed as well since the ones I tested were working fine`
			`- This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem...`
			`- Results pages like this give items that Google already knows from the sitemap: https://cgspace.cgiar.org/discover?filtertype=author&filter_relational_operator=equals&filter=Orth%2C+A.`
			`- There are some access denied errors on JSPUI links (of course! we forbid them!), but I'm not sure why Google is trying to index them...`
			`- For example:`
			`- This: https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf`
			`- Linked from: https://cgspace.cgiar.org/jspui/handle/10568/809`
			`- I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time!`
			`- Google says the first time it saw this particular error was September 29, 2015... so maybe it accidentally saw it somehow...`
			`- On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content`
Update notes for 2016-03-21 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-22 09:42:18 +01:00
			`![CGSpace pages in Google index](../images/2016/03/google-index.png)`

Add notes for 2016-03-21 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-21 18:44:45 +01:00			- Turns out this is a problem with DSpace's `robots.txt`, and there's a Jira ticket since December, 2015: https://jira.duraspace.org/browse/DS-2962
			`- I am not sure if I want to apply it yet`
			`- For now I've just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools`

			`![URL parameters cause millions of dynamic pages](../images/2016/03/url-parameters.png)`
			`![Setting pages with the filter_0 param not to show in search results](../images/2016/03/url-parameters2.png)`

			- Move AVCD collection to new community and update `move_collection.sh` script: https://gist.github.com/alanorth/392c4660e8b022d99dfa
			`- It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs`
			- De-deploy CGSpace with latest `5_x-prod` branch
			- Run updates on CGSpace and reboot server (new kernel, `4.5.0`)
Update notes for 2016-03-21 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-21 18:46:14 +01:00			`- Deploy Let's Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks`
Add notes for 2016-03-22 Signed-off-by: Alan Orth <alan.orth@gmail.com> 2016-03-22 10:41:32 +01:00
			`## 2016-03-22`

			`- Merge robots.txt patch and disallow indexing of browse pages as our sitemap is consumed correctly ([#198](https://github.com/ilri/DSpace/issues/198))`