mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-23 13:34:32 +01:00
299 lines
12 KiB
HTML
<!DOCTYPE html>
<html lang="en-us">
<head prefix="og: http://ogp.me/ns#">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1" />
<meta property="og:title" content=" April, 2016 · CGSpace Notes" />
<meta property="og:site_name" content="CGSpace Notes" />
<meta property="og:url" content="/cgspace-notes/2016-04/" />
<meta property="og:type" content="article" />
<meta property="og:article:published_time" content="2016-04-04T11:06:00+03:00" />
<meta property="og:article:tag" content="notes" />
<title>
April, 2016 · CGSpace Notes
</title>
<link rel="stylesheet" href="/cgspace-notes/css/bootstrap.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/main.css" />
<link rel="stylesheet" href="/cgspace-notes/css/font-awesome.min.css" />
<link rel="stylesheet" href="/cgspace-notes/css/github.css" />
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Source+Sans+Pro:200,300,400" type="text/css">
<link rel="shortcut icon" href="/cgspace-notes/images/favicon.ico" />
<link rel="apple-touch-icon" href="/cgspace-notes/images/apple-touch-icon.png" />
</head>
<body>
<header class="global-header" style="background-image:url(../images/bg.jpg )">
<section class="header-text">
<h1><a href="/cgspace-notes/">CGSpace Notes</a></h1>
<div class="sns-links hidden-print">
</div>
<a href="/cgspace-notes/" class="btn-header btn-back hidden-xs">
<i class="fa fa-angle-left" aria-hidden="true"></i>
Home
</a>
</section>
</header>
<main class="container">
<article>
<header>
<h1 class="text-primary">April, 2016</h1>
<div class="post-meta clearfix">
<div class="post-date pull-left">
Posted on
<time datetime="2016-04-04T11:06:00+03:00">
Apr 4, 2016
</time>
</div>
<div class="pull-right">
<span class="post-tag small"><a href="/cgspace-notes//tags/notes">#notes</a></span>
</div>
</div>
</header>
<section>
<h2 id="2016-04-04:c88be15f5b2f07c85f7742556a955e47">2016-04-04</h2>
<ul>
<li>Looking at log file usage on CGSpace, I noticed that we need to work on our cron setup a bit</li>
<li>We are backing up all logs in the log folder, including useless stuff like solr, cocoon, handle-plugin, etc</li>
<li>After running DSpace for over five years I’ve never needed to look in any log file other than dspace.log, let alone one from last year!</li>
<li>Backing up only the logs we need will save us a few gigs of backup space we’re paying for on S3</li>
<li>Also, I noticed the <code>checker</code> log has some errors we should pay attention to:</li>
</ul>
<pre><code>Run start time: 03/06/2016 04:00:22
Error retrieving bitstream ID 71274 from asset store.
java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:146)
at edu.sdsc.grid.io.local.LocalFileInputStream.open(LocalFileInputStream.java:171)
at edu.sdsc.grid.io.GeneralFileInputStream.&lt;init&gt;(GeneralFileInputStream.java:145)
at edu.sdsc.grid.io.local.LocalFileInputStream.&lt;init&gt;(LocalFileInputStream.java:139)
at edu.sdsc.grid.io.FileFactory.newFileInputStream(FileFactory.java:630)
at org.dspace.storage.bitstore.BitstreamStorageManager.retrieve(BitstreamStorageManager.java:525)
at org.dspace.checker.BitstreamDAO.getBitstream(BitstreamDAO.java:60)
at org.dspace.checker.CheckerCommand.processBitstream(CheckerCommand.java:303)
at org.dspace.checker.CheckerCommand.checkBitstream(CheckerCommand.java:171)
at org.dspace.checker.CheckerCommand.process(CheckerCommand.java:120)
at org.dspace.app.checker.ChecksumChecker.main(ChecksumChecker.java:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
******************************************************
</code></pre>
<ul>
<li>So the checker runs as the <code>tomcat7</code> Unix user, which seems to have a default limit of 1024 open files in its shell</li>
<li>For what it’s worth, we have been setting the actual Tomcat 7 process’s limit to 16384 for a few years (in <code>/etc/default/tomcat7</code>)</li>
<li>Looks like cron will read limits from <code>/etc/security/limits.*</code> so we can do something for the tomcat7 user there</li>
<li>Submit pull request for Tomcat 7 limits in Ansible dspace role (<a href="https://github.com/ilri/rmg-ansible-public/pull/30">#30</a>)</li>
</ul>
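<ul>
<li>A minimal sketch of what such limits entries could look like, not the contents of the pull request: the helper name <code>nofile_limits</code>, the file name in the comment, and reusing 16384 as the value are all my assumptions</li>
</ul>
<pre><code># Hypothetical helper: print soft and hard open-file (nofile) limit
# entries for one user, in the "domain type item value" layout used by
# /etc/security/limits.conf and the files under /etc/security/limits.d/.
nofile_limits() {
    user="$1"
    limit="$2"
    printf '%s soft nofile %s\n' "$user" "$limit"
    printf '%s hard nofile %s\n' "$user" "$limit"
}

# Preview the entries only. To apply them they would have to be saved
# as root in a file such as /etc/security/limits.d/tomcat7.conf
# (file name assumed here for illustration).
nofile_limits tomcat7 16384

# Show the open-file limit of the current shell for comparison.
ulimit -n
</code></pre>
<ul>
<li>Running <code>ulimit -n</code> from a cron job owned by the tomcat7 user would be one way to confirm whether cron really picks up the new limit</li>
</ul>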
<h2 id="2016-04-05:c88be15f5b2f07c85f7742556a955e47">2016-04-05</h2>
<ul>
<li>Reduce Amazon S3 storage used for logs from 46 GB to 6 GB by deleting a bunch of logs we don’t need!</li>
</ul>
<pre><code># s3cmd ls s3://cgspace.cgiar.org/log/ &gt; /tmp/s3-logs.txt
# grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
# grep solr.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
</code></pre>
<ul>
<li>Also, adjust the cron jobs for backups so they only back up <code>dspace.log</code> and some stats files (.dat)</li>
<li>Try to do some metadata field migrations using the Atmire batch UI (<code>dc.Species</code> → <code>cg.species</code>) but it took several hours and even missed a few records</li>
</ul>
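<ul>
<li>For next time, a dry-run variant of the S3 deletions above might be safer. This is only a sketch: the helper names and the sample listing are made up, and <code>grep -F</code> is my addition so that the dot in a name like <code>solr.log</code> is matched literally</li>
</ul>
<pre><code># Hypothetical helper: read an "s3cmd ls" listing on stdin and print
# the S3 URL (4th column) of every entry whose name contains $1.
# -F treats the pattern as a fixed string instead of a regex.
urls_for_log() {
    grep -F "$1" | awk '{print $4}'
}

# Made-up sample listing in the date, time, size, URL column layout.
sample_listing() {
    printf '%s\n' \
        '2016-03-06 04:05 1024 s3://cgspace.cgiar.org/log/solr.log.2016-03-05' \
        '2016-03-06 04:05 2048 s3://cgspace.cgiar.org/log/dspace.log.2016-03-05'
}

# Dry run: xargs passes the URLs to echo, so the delete command is
# printed for review instead of being executed.
sample_listing | urls_for_log solr.log | xargs echo s3cmd del
</code></pre>
<ul>
<li>Dropping the <code>echo</code> and feeding in the real listing from <code>/tmp/s3-logs.txt</code> would turn the preview into the actual deletion</li>
</ul>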
<h2 id="2016-04-06:c88be15f5b2f07c85f7742556a955e47">2016-04-06</h2>
<ul>
<li>A better way to move metadata on this scale is via SQL, for example <code>dc.type.output</code> → <code>dc.type</code> (their IDs in the metadatafieldregistry are 66 and 109, respectively):</li>
</ul>
<pre><code>dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
UPDATE 40852
</code></pre>
<ul>
<li>After that an <code>index-discovery -bf</code> is required</li>
<li>Start working on metadata migrations, add 25 or so new metadata fields to CGSpace</li>
</ul>
<h2 id="2016-04-07:c88be15f5b2f07c85f7742556a955e47">2016-04-07</h2>
<ul>
<li>Write shell script to do the migration of fields: <a href="https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b">https://gist.github.com/alanorth/72a70aca856d76f24c127a6e67b3342b</a></li>
<li>Testing with a few fields it seems to work well:</li>
</ul>
<pre><code>$ ./migrate-fields.sh
UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
UPDATE 40883
UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
UPDATE 21420
UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
UPDATE 51258
</code></pre>
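<ul>
<li>The general shape of such a migration can be sketched as below. This is not the script from the gist: the function names are hypothetical and wrapping the statements in one transaction is my own assumption, only the three field ID pairs come from the test run above</li>
</ul>
<pre><code># Hypothetical helper: read "old new" pairs of metadata_field_id values
# on stdin and print one UPDATE per pair, wrapped in a transaction.
migration_sql() {
    printf 'BEGIN;\n'
    while read -r old new; do
        printf 'UPDATE metadatavalue SET metadata_field_id=%s WHERE metadata_field_id=%s;\n' "$new" "$old"
    done
    printf 'COMMIT;\n'
}

# Field ID pairs (old, then new) from the test run above.
field_pairs() {
    printf '%s\n' '66 109' '72 202' '76 203'
}

# Dry run: print the SQL for review instead of executing it.
field_pairs | migration_sql
</code></pre>
<ul>
<li>Piping the output into <code>psql dspacetest</code> would apply it (database name assumed from the prompts in these notes), and an <code>index-discovery -bf</code> would still be needed afterwards</li>
</ul>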
<h2 id="2016-04-08:c88be15f5b2f07c85f7742556a955e47">2016-04-08</h2>
<ul>
<li>Discuss metadata renaming with Abenet; we decided it’s better to start with the center-specific subjects like ILRI, CIFOR, CCAFS, IWMI, and CPWF</li>
<li>I’ve e-mailed CCAFS and CPWF people to ask them how much time it will take for them to update their systems to cope with this change</li>
</ul>
<h2 id="2016-04-10:c88be15f5b2f07c85f7742556a955e47">2016-04-10</h2>
<ul>
<li>Looking at the DOI issue <a href="https://www.yammer.com/dspacedevelopers/#/Threads/show?threadId=678507860">reported by Leroy from CIAT a few weeks ago</a></li>
<li>It seems the <code>dx.doi.org</code> URLs are much more common in our repository!</li>
</ul>
<pre><code>dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
count
-------
5638
(1 row)

dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://doi.org%';
count
-------
3
</code></pre>
<ul>
<li>I will manually edit the <code>dc.identifier.doi</code> in <a href="https://cgspace.cgiar.org/handle/10568/72509?show=full">10568/72509</a> and tweet the link, then check back in a week to see if the donut gets updated</li>
</ul>
<h2 id="2016-04-11:c88be15f5b2f07c85f7742556a955e47">2016-04-11</h2>
<ul>
<li>The donut is already updated and shows the correct number now</li>
<li>CCAFS people say it will only take them an hour to update their code for the metadata renames, so I proposed we tentatively do it on Monday the 18th</li>
</ul>
<h2 id="2016-04-12:c88be15f5b2f07c85f7742556a955e47">2016-04-12</h2>
<ul>
<li>Looking at quality of WLE data (<code>cg.subject.iwmi</code>) in SQL:</li>
</ul>
<pre><code>dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
</code></pre>
<ul>
<li>Listings and Reports is still not returning reliable data for <code>dc.type</code></li>
<li>I think we need to ask Atmire, as their documentation isn’t too clear on the format of the filter configs</li>
<li>Alternatively, I want to see whether it behaves better if I move all the data from <code>dc.type.output</code> to <code>dc.type</code> and then re-index</li>
<li>Looking at our <code>input-forms.xml</code> I see we have two sets of ILRI subjects, but one has a few extra subjects</li>
<li>Remove one set of ILRI subjects and remove duplicate <code>VALUE CHAINS</code> from existing list (<a href="https://github.com/ilri/DSpace/pull/216">#216</a>)</li>
<li>I decided to keep the set of subjects that had <code>FMD</code> and <code>RANGELANDS</code> added, as those additions appear to have been requested, so it might be the newer list</li>
<li>I found 226 blank metadatavalues:</li>
</ul>
<pre><code>dspacetest=# select * from metadatavalue where resource_type_id=2 and text_value='';
</code></pre>
<ul>
<li>I think we should delete them and do a full re-index:</li>
</ul>
<pre><code>dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 226
</code></pre>
<ul>
<li>In other news, moving <code>dc.type.output</code> to <code>dc.type</code> and re-indexing seems to have fixed the Listings and Reports issue from above</li>
<li>Unfortunately this isn’t a very good solution, because Listings and Reports config should allow us to filter on <code>dc.type.*</code> but the documentation isn’t very clear and I couldn’t reach Atmire today</li>
<li>We want to do the <code>dc.type.output</code> move on CGSpace anyway, but we should wait as it might affect external people!</li>
</ul>
</section>
<footer>
<section class="author-info row">
<div class="author-avatar col-md-2">
</div>
<div class="author-meta col-md-6">
<h1 class="author-name text-primary">Alan Orth</h1>
</div>
</section>
<ul class="pager">
<li class="previous"><a href="/cgspace-notes/2016-03/"><span aria-hidden="true">←</span> Older</a></li>
<li class="next disabled"><a href="#">Newer <span aria-hidden="true">→</span></a></li>
</ul>
</footer>
</article>
</main>
<footer class="container global-footer">
<div class="copyright-note pull-left">
</div>
<div class="sns-links hidden-print">
</div>
</footer>
<script src="/cgspace-notes/js/highlight.pack.js"></script>
<script>
hljs.initHighlightingOnLoad();
</script>
</body>
</html>