<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
<meta property="og:title" content=" March, 2016 · CGSpace Notes" />
<meta property="og:site_name" content="CGSpace Notes" />
<meta property="og:url" content="/cgspace-notes/2016-03/" />
<meta property="og:type" content="article" />
<meta property="og:article:published_time" content="2016-03-02T16:50:00+03:00" />
<meta property="og:article:tag" content="notes" />
<title>
March, 2016 · CGSpace Notes
</title>
<link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/main.css" />
<link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/github.css" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
<link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
<link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
</head>
<body>
<header class="global-header" style="background-image:url(../images/bg.jpg )">
<section class="header-text">
<h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
<div class="sns-links hidden-print">
</div>
<a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
<i class="fa fa-angle-left" aria-hidden="true"></i>
Home
</a>
</section>
</header>
<main class="container">
<article>
<header>
<h1 class="text-primary">March, 2016</h1>
<div class="post-meta clearfix">
<div class="post-date pull-left">
Posted on
<time datetime="2016-03-02T16:50:00+03:00">
Mar 2, 2016
</time>
</div>
<div class="pull-right">
<span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
</div>
</div>
</header>
<section>
<h2 id="2016-03-02:5a28ddf3ee658c043c064ccddb151717">2016-03-02</h2>
<ul>
<li>Looking at issues with author authorities on CGSpace</li>
<li>For some reason we still have the <code>index-lucene-update</code> cron job active on CGSpace, but I’m pretty sure we haven’t needed it since the last few versions of Atmire’s Listings and Reports module</li>
<li>Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match the environment on the CGSpace server</li>
</ul>
<h2 id="2016-03-07:5a28ddf3ee658c043c064ccddb151717">2016-03-07</h2>
<ul>
<li>Troubleshooting the issues with the slew of commits for Atmire modules in <a href="https://github.com/ilri/DSpace/pull/182">#182</a></li>
<li>Their changes on the <code>5_x-dev</code> branch work, but the history is messy as hell, with merge commits and an old branch base</li>
<li>When I rebase their branch on the latest <code>5_x-prod</code> I get blank white pages</li>
<li>I identified one commit that causes the issue and let them know</li>
<li>Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:</li>
</ul>
<pre><code>Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
</code></pre>
<h2 id="2016-03-08:5a28ddf3ee658c043c064ccddb151717">2016-03-08</h2>
<ul>
<li>Add a few new filters to Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/issues/180">#180</a>)</li>
<li>We had also wanted to add a few to the Content and Usage module but I have to ask the editors which ones they were</li>
</ul>
<h2 id="2016-03-10:5a28ddf3ee658c043c064ccddb151717">2016-03-10</h2>
<ul>
<li>Disable the lucene cron job on CGSpace as it shouldn’t be needed anymore</li>
<li>Discuss ORCiD and duplicate authors on Yammer</li>
<li>Request new documentation for Atmire CUA and L&amp;R modules, as ours are from 2013</li>
<li>Walk Sisay through some data cleaning workflows in OpenRefine</li>
<li>Start cleaning up the configuration for Atmire’s CUA module (<a href="https://github.com/ilri/DSpace/issues/185">#184</a>)</li>
<li>It is very messed up because some labels are incorrect, fields are missing, etc.</li>
</ul>
<p><img src="../images/2016/03/cua-label-mixup.png" alt="Mixed up label in Atmire CUA" /></p>
<ul>
<li>Update documentation for Atmire modules</li>
</ul>
<h2 id="2016-03-11:5a28ddf3ee658c043c064ccddb151717">2016-03-11</h2>
<ul>
<li>As I was looking at the CUA config I realized our Discovery config is all messed up and confusing</li>
<li>I’ve opened an issue to track some of that work (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>I did some major cleanup work on Discovery and XMLUI stuff related to the <code>dc.type</code> indexes (<a href="https://github.com/ilri/DSpace/pull/187">#187</a>)</li>
<li>We had been confusing <code>dc.type</code> (a Dublin Core value) with <code>dc.type.output</code> (a value we invented) for a few years and it had permeated all aspects of our data, indexes, item displays, etc.</li>
<li>There is still some more work to be done to remove references to the old <code>outputtype</code> and <code>output</code></li>
</ul>
<h2 id="2016-03-14:5a28ddf3ee658c043c064ccddb151717">2016-03-14</h2>
<ul>
<li>Fix some items that had invalid dates (I noticed them in the log during a re-indexing)</li>
<li>Reset <code>search.index.*</code> to the default, as it is only used by Lucene (deprecated by Discovery in DSpace 5.x): <a href="https://github.com/ilri/DSpace/pull/188">#188</a></li>
<li>Make titles in Discovery and Browse by more consistent (singular, sentence case, etc) (<a href="https://github.com/ilri/DSpace/issues/186">#186</a>)</li>
<li>Also four or so center-specific subject strings were missing for Discovery</li>
</ul>
<p><img src="../images/2016/03/missing-xmlui-string.png" alt="Missing XMLUI string" /></p>
<h2 id="2016-03-15:5a28ddf3ee658c043c064ccddb151717">2016-03-15</h2>
<ul>
<li>Create simple theme for new AVCD community just for a unique Google Tracking ID (<a href="https://github.com/ilri/DSpace/pull/191">#191</a>)</li>
</ul>
<h2 id="2016-03-16:5a28ddf3ee658c043c064ccddb151717">2016-03-16</h2>
<ul>
<li>Still having problems deploying Atmire’s CUA updates and fixes from January!</li>
<li>More discussion on the GitHub issue here: <a href="https://github.com/ilri/DSpace/pull/182">https://github.com/ilri/DSpace/pull/182</a></li>
<li>Clean up Atmire CUA config (<a href="https://github.com/ilri/DSpace/pull/193">#193</a>)</li>
<li>Help Sisay with some PostgreSQL queries to clean up the incorrect <code>dc.contributor.corporateauthor</code> field</li>
<li>I noticed that we have some weird values in <code>dc.language</code>:</li>
</ul>
<pre><code># select * from metadatavalue where metadata_field_id=37;
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           1942571 |       35342 |                37 | hi         |           |     1 |           |         -1 |                2
           1942468 |       35345 |                37 | hi         |           |     1 |           |         -1 |                2
           1942479 |       35337 |                37 | hi         |           |     1 |           |         -1 |                2
           1942505 |       35336 |                37 | hi         |           |     1 |           |         -1 |                2
           1942519 |       35338 |                37 | hi         |           |     1 |           |         -1 |                2
           1942535 |       35340 |                37 | hi         |           |     1 |           |         -1 |                2
           1942555 |       35341 |                37 | hi         |           |     1 |           |         -1 |                2
           1942588 |       35343 |                37 | hi         |           |     1 |           |         -1 |                2
           1942610 |       35346 |                37 | hi         |           |     1 |           |         -1 |                2
           1942624 |       35347 |                37 | hi         |           |     1 |           |         -1 |                2
           1942639 |       35339 |                37 | hi         |           |     1 |           |         -1 |                2
</code></pre>
<ul>
<li>It seems this <code>dc.language</code> field isn’t really used, but we should delete these values</li>
<li>Also, <code>dc.language.iso</code> has some weird values, like “En” and “English”</li>
</ul>
<h2 id="2016-03-17:5a28ddf3ee658c043c064ccddb151717">2016-03-17</h2>
<ul>
<li>It turns out <code>hi</code> is the ISO 639 language code for Hindi, but these should be in <code>dc.language.iso</code> instead of <code>dc.language</code></li>
<li>I fixed the eleven items with <code>hi</code> as well as some using the incorrect <code>vn</code> for Vietnamese</li>
<li>Start discussing CG core with Abenet and Sisay</li>
<li>Re-sync CGSpace database to DSpace Test so Atmire can test the problematic CUA patches</li>
<li>The patches work fine with a clean database, so the error was caused by some mismatch between the CUA versions and the database during my testing</li>
</ul>
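<p>A rough sketch of how the language cleanup above could be scripted once the values have been exported as plain strings. The mapping table and the <code>normalize_language</code> helper are hypothetical and only cover the cases mentioned in these notes (“En”, “English”, <code>hi</code>, and <code>vn</code>); anything unrecognized is returned as <code>None</code> for manual review:</p>
<pre><code># Hypothetical mapping: only the cases mentioned in these notes are covered.
CANONICAL_CODES = {
    "en": "en",
    "english": "en",
    "hi": "hi",  # Hindi
    "vn": "vi",  # "vn" is the country code for Viet Nam; the language code is "vi"
}


def normalize_language(value):
    """Return a lowercase two-letter language code, or None if unrecognized."""
    key = value.strip().lower()
    return CANONICAL_CODES.get(key)


if __name__ == "__main__":
    for raw in ["En", "English", "vn", "hi", "Klingon"]:
        print(repr(raw), "maps to", repr(normalize_language(raw)))
</code></pre>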
<h2 id="2016-03-18:5a28ddf3ee658c043c064ccddb151717">2016-03-18</h2>
<ul>
<li>Merge Atmire fixes into <code>5_x-prod</code></li>
<li>Discuss thumbnails with Francesca from Bioversity</li>
<li>Some of their items end up with thumbnails that have a big white border around them:</li>
</ul>
<p><img src="../images/2016/03/bioversity-thumbnail-bad.jpg" alt="Excessive whitespace in thumbnail" /></p>
<ul>
<li>Turns out we can add <code>-trim</code> to the GraphicsMagick options to trim the whitespace</li>
</ul>
<p><img src="../images/2016/03/bioversity-thumbnail-good.jpg" alt="Trimmed thumbnail" /></p>
<ul>
<li>Command used:</li>
</ul>
<pre><code>$ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
</code></pre>
<ul>
<li>Also, it looks like adding <code>-sharpen 0x1.0</code> noticeably improves the quality of the image at a cost of only a few KB</li>
</ul>
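<p>A minimal sketch of wrapping the GraphicsMagick command above in Python, with <code>-sharpen 0x1.0</code> added. The function names are hypothetical, and placing <code>-sharpen</code> after <code>-flatten</code> is an assumption, not something confirmed in these notes:</p>
<pre><code>import subprocess


def build_thumbnail_command(pdf_path, output_path, sharpen=True):
    """Build a GraphicsMagick argument list based on the command in the notes."""
    args = ["gm", "convert", "-trim", "-quality", "82",
            "-thumbnail", "x300", "-flatten"]
    if sharpen:
        # Assumption: -sharpen goes with the other options, before the input file
        args += ["-sharpen", "0x1.0"]
    # The "[0]" suffix selects the first page of the PDF
    args += [pdf_path + "[0]", output_path]
    return args


def make_thumbnail(pdf_path, output_path, sharpen=True):
    """Run the command; requires the gm binary to be installed."""
    subprocess.run(build_thumbnail_command(pdf_path, output_path, sharpen), check=True)


if __name__ == "__main__":
    print(build_thumbnail_command("Descriptor for Butia_EN-2015_2021.pdf", "cover.jpg"))
</code></pre>
<p>Passing the arguments as a list means the spaces and brackets in the file name need no shell escaping.</p>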
<h2 id="2016-03-21:5a28ddf3ee658c043c064ccddb151717">2016-03-21</h2>
<ul>
<li>Fix 66 site errors in Google’s webmaster tools</li>
<li>I looked at a bunch of them and they were old URLs, weird things linked from non-existent items, etc., so I just marked them all as fixed</li>
<li>We also have 1,300 “soft 404” errors for URLs like: <a href="https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity">https://cgspace.cgiar.org/handle/10568/440/browse?type=bioversity</a></li>
<li>I’ve marked them as fixed as well since the ones I tested were working fine</li>
<li>This raises another question, as many of these pages are linked from Discovery search results and might create a duplicate content problem…</li>
<li>Results pages like this give items that Google already knows from the sitemap: <a href="https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A">https://cgspace.cgiar.org/discover?filtertype=author&amp;filter_relational_operator=equals&amp;filter=Orth%2C+A</a>.</li>
<li>There are some access denied errors on JSPUI links (of course! we forbid them!), but I’m not sure why Google is trying to index them…</li>
<li>For example:
<ul>
<li>This: <a href="https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf">https://cgspace.cgiar.org/jspui/bitstream/10568/809/1/main-page.pdf</a></li>
<li>Linked from: <a href="https://cgspace.cgiar.org/jspui/handle/10568/809">https://cgspace.cgiar.org/jspui/handle/10568/809</a></li>
</ul></li>
<li>I will mark these errors as resolved because they have been returning HTTP 403 on purpose for a long time!</li>
<li>Google says the first time it saw this particular error was September 29, 2015… so maybe it accidentally saw it somehow…</li>
<li>On a related note, we have 51,000 items indexed from the sitemap, but 500,000 items in the Google index, so we DEFINITELY have a problem with duplicate content</li>
</ul>
<p><img src="../images/2016/03/google-index.png" alt="CGSpace pages in Google index" /></p>
<ul>
<li>Turns out this is a problem with DSpace’s <code>robots.txt</code>, and there has been a Jira ticket open since December 2015: <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>I am not sure if I want to apply the proposed change yet</li>
<li>For now I’ve just set a bunch of these dynamic pages to not appear in search results by using the URL Parameters tool in Webmaster Tools</li>
</ul>
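<p>A small sketch of how a candidate <code>robots.txt</code> could be checked before applying it, using Python’s standard <code>urllib.robotparser</code>. The rules below are hypothetical and are not taken from DS-2962; they only illustrate blocking the dynamic Discovery pages while leaving item pages crawlable:</p>
<pre><code>from urllib.robotparser import RobotFileParser

# Hypothetical rules; the change proposed in DS-2962 may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /search-filter
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the given rules allow the agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in [
        "https://cgspace.cgiar.org/discover?filtertype=author",
        "https://cgspace.cgiar.org/handle/10568/809",
    ]:
        print(url, "crawlable:", is_crawlable(url))
</code></pre>
<p>Note that <code>urllib.robotparser</code> matches rules by path prefix only, so this check would not cover wildcard patterns.</p>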
<p><img src="../images/2016/03/url-parameters.png" alt="URL parameters cause millions of dynamic pages" />
<img src="../images/2016/03/url-parameters2.png" alt="Setting pages with the filter_0 param not to show in search results" /></p>
<ul>
<li>Move AVCD collection to new community and update <code>move_collection.sh</code> script: <a href="https://gist.github.com/alanorth/392c4660e8b022d99dfa">https://gist.github.com/alanorth/392c4660e8b022d99dfa</a></li>
<li>It seems Feedburner can do HTTPS now, so we might be able to update our feeds and simplify the nginx configs</li>
<li>Re-deploy CGSpace with latest <code>5_x-prod</code> branch</li>
<li>Run updates on CGSpace and reboot server (new kernel, <code>4.5.0</code>)</li>
<li>Deploy Let’s Encrypt certificate for cgspace.cgiar.org, but still need to work it into the Ansible playbooks</li>
</ul>
</section>
<footer>
<section class="author-info row">
<div class="author-avatar col-md-2">
</div>
<div class="author-meta col-md-6">
<h1 class="author-name text-primary">Alan Orth</h1>
</div>
</section>
<ul class="pager">
<li class="previous"><a href="/cgspace-notes/2016-02/"><span aria-hidden="true">←</span> Older</a></li>
<li class="next disabled"><a href="#">Newer <span aria-hidden="true">→</span></a></li>
</ul>
</footer>
</article>
</main>
<footer class="container global-footer">
<div class="copyright-note pull-left">
</div>
<div class="sns-links hidden-print">
</div>
</footer>
<script src="/cgspace-notes/js/highlight.pack.js"></script>
<script>
hljs.initHighlightingOnLoad();
</script>
</body>
</html>