<li>Looking at issues with author authorities on CGSpace</li>
<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module</li>
<pre><code>Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre>
<li>As I was looking at the CUA config I realized our Discovery config is all messed up and confusing</li>
<li>I’ve opened an issue to track some of that work (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>I did some major cleanup work on Discovery and XMLUI stuff related to the <code>dc.type</code> indexes (<a href="https://github.com/ilri/DSpace/pull/187">#187</a>)</li>
<li>We had been confusing <code>dc.type</code> (a Dublin Core value) with <code>dc.type.output</code> (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.</li>
<li>There is still some more work to be done to remove references to old <code>outputtype</code> and <code>output</code></li>
<li>Fix some items that had invalid dates (I noticed them in the log during a re-indexing)</li>
<li>Reset <code>search.index.*</code> to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): <a href="https://github.com/ilri/DSpace/pull/188">#188</a></li>
<li>Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>Also four or so center-specific subject strings were missing for Discovery</li>
<li>It turns out <code>hi</code> is the ISO 639 language code for Hindi, but these should be in <code>dc.language.iso</code> instead of <code>dc.language</code></li>
<li>I fixed the eleven items with <code>hi</code> as well as some using the incorrect <code>vn</code> for Vietnamese</li>
<li>Start discussing CG core with Abenet and Sisay</li>
<li>Re-sync CGSpace database to DSpace Test so Atmire can run some tests on the problematic CUA patches</li>
<li>Fix 66 site errors in Google’s webmaster tools</li>
<li>I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc, so I just marked them all as fixed</li>
<li>We also have 1,300 “soft 404” errors for URLs like: <a href="https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity">https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity</a></li>
<li>I’ve marked them as fixed as well since the ones I tested were working fine</li>
<li>This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem…</li>
<li>Results pages like this give items that Google already knows from the sitemap: <a href="https://cgspace.cgiar.org/discover?filtertype=author&filter_relational_operator=equals&filter=Orth%2C+A">https://cgspace.cgiar.org/discover?filtertype=author&filter_relational_operator=equals&filter=Orth%2C+A</a>.</li>
<li>There are some access denied errors on JSPUI links (of course! we forbid them!), but I’m not sure why Google is trying to index them…</li>
<li>I will mark these errors as resolved because they have been returning HTTP 403 on purpose for a long time!</li>
<li>Google says the first time it saw this particular error was September 29, 2015… so maybe it accidentally saw it somehow…</li>
<li>On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content</li>
<li>Turns out this is a problem with DSpace’s <code>robots.txt</code>, and there has been a Jira ticket open since December 2015: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>I am not sure if I want to apply the fix from that ticket yet</li>
<li>For now I’ve just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools</li>
</ul>
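<p>The <code>robots.txt</code> idea from the notes above can be sketched with Python’s standard library. This is only an illustration under assumptions: the rules below are hypothetical and are not the contents of the DS-2962 patch, and <code>is_crawlable</code> is a helper name invented for this example. Note that <code>urllib.robotparser</code> matches rules by path prefix, so this sketch only covers top-level paths such as <code>/discover</code>.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the actual DS-2962 patch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /search-filter
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the given rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    item = "https://cgspace.cgiar.org/handle/10568/440"
    search = ("https://cgspace.cgiar.org/discover?filtertype=author"
              "&filter_relational_operator=equals&filter=Orth%2C+A")
    for url in (item, search):
        print(is_crawlable(url), url)
```

<p>With rules like these, item pages that are already in the sitemap stay crawlable while Discovery result pages are excluded. Whether that is the right trade-off for CGSpace is a separate decision.</p>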
<p><img src="../images/2016/03/url-parameters.png" alt="URL parameters cause millions of dynamic pages"/>
<img src="../images/2016/03/url-parameters2.png" alt="Setting pages with the filter_0 param not to show in search results"/></p>
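<p>To get a rough idea of which URLs in a crawl-error export are dynamic pages rather than items, something like the following could be used. It is a minimal sketch: the parameter names are assumptions taken from the URLs and screenshots in these notes, and <code>dynamic_params</code> is a hypothetical helper, not part of DSpace or Webmaster Tools.</p>

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed parameter names, copied from the URLs and screenshots in these notes.
DYNAMIC_PARAMS = {
    "filtertype",
    "filter_relational_operator",
    "filter",
    "filter_0",
    "type",
}


def dynamic_params(url):
    """Return the sorted query parameter names that mark `url` as a dynamic page."""
    query = urlsplit(url).query
    names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    return sorted(names & DYNAMIC_PARAMS)


if __name__ == "__main__":
    urls = [
        "https://cgspace.cgiar.org/handle/10568/440",
        "https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity",
    ]
    for url in urls:
        # An empty list means the URL carries none of the assumed parameters.
        print(dynamic_params(url), url)
```

<p>A URL that returns an empty list is a plain page; anything else is a candidate for the URL Parameters settings shown in the screenshots.</p>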
<ul>
<li>Move AVCD collection to new community and update <code>move_collection.sh</code> script: <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">https://gist.github.com/alanorth/392c4660e8b022d99dfa</a></li>
<li>It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs</li>
<li>Re-deploy CGSpace with latest <code>5_x-prod</code> branch</li>
<li>Run updates on CGSpace and reboot server (new kernel, <code>4.5.0</code>)</li>
<li>Merge <code>robots.txt</code> patch and disallow indexing of browse pages as our sitemap is consumed correctly (<a href="https://github.com/ilri/DSpace/issues/198">#198</a>)</li>