+++ date = "2016-03-02T16:50:00+03:00" author = "Alan Orth" title = "March, 2016" tags = ["notes"] image = "../images/bg.jpg" +++ ## 2016-03-02 - Looking at issues with author authorities on CGSpace - For some reason we still have the `index-lucene-update` cron job active on CGSpace, but I'm pretty sure we don't need it as of the latest few versions of Atmire's Listings and Reports module - Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server ## 2016-03-07 - Troubleshooting the issues with the slew of commits for Atmire modules in [#182](https://github.com/ilri/DSpace/pull/182) - Their changes on `5_x-dev` branch work, but it is messy as hell with merge commits and old branch base - When I rebase their branch on the latest `5_x-prod` I get blank white pages - I identified one commit that causes the issue and let them know - Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something: ``` Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device ``` ## 2016-03-08 - Add a few new filters to Atmire's Listings and Reports module ([#180](https://github.com/ilri/DSpace/issues/180)) - We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were ## 2016-03-10 - Disable the lucene cron job on CGSpace as it shouldn't be needed anymore - Discuss ORCiD and duplicate authors on Yammer - Request new documentation for Atmire CUA and L&R modules, as ours are from 2013 - Walk Sisay through some data cleaning workflows in OpenRefine - Start cleaning up the configuration for Atmire's CUA module ([#184](https://github.com/ilri/DSpace/issues/185)) - It is very messed up because some labels are incorrect, fields are missing, etc ![Mixed up label in Atmire CUA](../images/2016/03/cua-label-mixup.png) - Update documentation for Atmire modules ## 2016-03-11 - As I was looking at the CUA config I realized our Discovery config is all messed up and confusing - I've opened an issue to track some of that work ([#186](https://github.com/ilri/DSpace/issues/186)) - I did some major cleanup work on Discovery and XMLUI stuff related to the `dc.type` indexes ([#187](https://github.com/ilri/DSpace/pull/187)) - We had been confusing `dc.type` (a Dublin Core value) with `dc.type.output` (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc. - There is still some more work to be done to remove references to old `outputtype` and `output` ## 2016-03-14 - Fix some items that had invalid dates (I noticed them in the log during a re-indexing) - Reset `search.index.*` to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): [#188](https://github.com/ilri/DSpace/pull/188) - Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) ([#186](https://github.com/ilri/DSpace/issues/186)) - Also four or so center-specific subject strings were missing for Discovery ![Missing XMLUI string](../images/2016/03/missing-xmlui-string.png) ## 2016-03-15 - Create simple theme for new AVCD community just for a unique Google Tracking ID ([#191](https://github.com/ilri/DSpace/pull/191)) ## 2016-03-16 - Still having problems deploying Atmire's CUA updates and fixes from January! - More discussion on the GitHub issue here: https://github.com/ilri/DSpace/pull/182 - Clean up Atmire CUA config ([#193](https://github.com/ilri/DSpace/pull/193)) - Help Sisay with some PostgreSQL queries to clean up the incorrect `dc.contributor.corporateauthor` field - I noticed that we have some weird values in `dc.language`: ``` # select * from metadatavalue where metadata_field_id=37; metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------ 1942571 | 35342 | 37 | hi | | 1 | | -1 | 2 1942468 | 35345 | 37 | hi | | 1 | | -1 | 2 1942479 | 35337 | 37 | hi | | 1 | | -1 | 2 1942505 | 35336 | 37 | hi | | 1 | | -1 | 2 1942519 | 35338 | 37 | hi | | 1 | | -1 | 2 1942535 | 35340 | 37 | hi | | 1 | | -1 | 2 1942555 | 35341 | 37 | hi | | 1 | | -1 | 2 1942588 | 35343 | 37 | hi | | 1 | | -1 | 2 1942610 | 35346 | 37 | hi | | 1 | | -1 | 2 1942624 | 35347 | 37 | hi | | 1 | | -1 | 2 1942639 | 35339 | 37 | hi | | 1 | | -1 | 2 ``` - It seems this `dc.language` field isn't really used, but we should delete these values - Also, `dc.language.iso` has some weird values, like "En" and "English" ## 2016-03-17 - It turns out `hi` is the ISO 639 language code for Hindi, but these should be in `dc.language.iso` instead of `dc.language` - I fixed the eleven items with `hi` as well as some using the incorrect `vn` for Vietnamese - Start discussing CG core with Abenet and Sisay - Re-sync CGSpace database to DSpace Test for Atmire to do some tests about the problematic CUA patches - The patches work fine with a clean database, so the error was caused by some mismatch in CUA versions and the database during my testing ## 2016-03-18 - Merge Atmire fixes into `5_x-prod` - Discuss thumbnails with Francesca from Bioversity - Some of their items end up with thumbnails that have a big white border around them: ![Excessive whitespace in thumbnail](../images/2016/03/bioversity-thumbnail-bad.jpg) - Turns out we can add `-trim` to the GraphicsMagick options to trim the whitespace ![Trimmed thumbnail](../images/2016/03/bioversity-thumbnail-good.jpg) - Command used: ``` $ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg ``` - Also, it looks like adding `-sharpen 0x1.0` really improves the quality of the image for only a few KB ## 2016-03-21 - Fix 66 site errors in Google's webmaster tools - I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed - We also have 1,300 "soft 404" errors for URLs like: https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity - I've marked them as fixed as well since the ones I tested were working fine - This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem... - Results pages like this give items that Google already knows from the sitemap: https://cgspace.cgiar.org/discover?filtertype=author&filter_relational_operator=equals&filter=Orth%2C+A. - There are some access denied errors on JSPUI links (of course! we forbid them!), but I'm not sure why Google is trying to index them... - For example: - This: https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf - Linked from: https://cgspace.cgiar.org/jspui/handle/10568/809 - I will mark these errors as resolved because they are returning HTTP 403 on purpose, for a long time! - Google says the first time it saw this particular error was September 29, 2015... so maybe it accidentally saw it somehow... - On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content - Turns out this is a problem with DSpace's `robots.txt`, and there's a Jira ticket since December, 2015: https://jira.duraspace.org/browse/DS-2962 - I am not sure if I want to apply it yet - For now I've just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools ![URL parameters cause millions of dynamic pages](../images/2016/03/url-parameters.png) ![Setting pages with the filter_0 param not to show in search results](../images/2016/03/url-parameters2.png) - Move AVCD collection to new community and update `move_collection.sh` script: https://gist.github.com/alanorth/392c4660e8b022d99dfa - It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs - De-deploy CGSpace with latest `5_x-prod` branch - Run updates on CGSpace and reboot server (new kernel, `4.5.0`) - Deploy Let's Encrypt certificate for cgspace.cgiar.org, but still need to work it into the ansible playbooks