<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<meta property="og:title" content="March, 2019" />
<meta property="og:description" content="2019-03-01

I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
Looking at the other half of Udana’s WLE records from 2018-11

I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
68.15% &#xFFFD; 9.45 instead of 68.15% ± 9.45
2003&#xFFFD;2013 instead of 2003–2013

I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-03/" />
<meta property="article:published_time" content="2019-03-01T12:16:30+01:00"/>
<meta property="article:modified_time" content="2019-03-07T12:23:32+02:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2019"/>
<meta name="twitter:description" content="2019-03-01

I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
Looking at the other half of Udana’s WLE records from 2018-11

I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
68.15% &#xFFFD; 9.45 instead of 68.15% ± 9.45
2003&#xFFFD;2013 instead of 2003–2013

I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.54.0" />

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-03/",
"wordCount": "798",
"datePublished": "2019-03-01T12:16:30+01:00",
"dateModified": "2019-03-07T12:23:32+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>

<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2019-03/">

<title>March, 2019 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet" integrity="sha384-G5B34w7DFTumWTswxYzTX7NWfbvQEg1HbFFEg6ItN03uTAAoS2qkPS/fu3LhuuSA" crossorigin="anonymous">

</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2019-03/">March, 2019</a></h2>
<p class="blog-post-meta"><time datetime="2019-03-01T12:16:30+01:00">Fri Mar 01, 2019</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
</header>
<h2 id="2019-03-01">2019-03-01</h2>

<ul>
<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
<li>Looking at the other half of Udana’s WLE records from 2018-11

<ul>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
<li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li>
<li>68.15% &#xFFFD; 9.45 instead of 68.15% ± 9.45</li>
<li>2003&#xFFFD;2013 instead of 2003–2013</li>
</ul></li>
<li>I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs</li>
</ul>
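
<ul>
<li>A minimal sketch of how damaged abstracts like these could be found automatically, assuming the records are in a CSV and the damage shows up as the Unicode replacement character (U+FFFD). The column name <code>dc.description.abstract</code> and the helper <code>find_damaged_rows</code> are assumptions for illustration, not something taken from the actual WLE spreadsheet:</li>
</ul>

```python
import csv
import io

# U+FFFD is what decoders emit when bytes could not be interpreted
REPLACEMENT_CHAR = "\ufffd"


def find_damaged_rows(csv_text, column="dc.description.abstract"):
    """Return (row number, value) pairs whose column contains U+FFFD.

    The column name is an assumption; adjust it to the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    damaged = []
    # Row 1 is the header, so data rows are numbered from 2
    for row_number, row in enumerate(reader, start=2):
        value = row.get(column) or ""
        if REPLACEMENT_CHAR in value:
            damaged.append((row_number, value))
    return damaged


# Made-up sample data mirroring the two examples in the notes
sample = (
    "id,dc.description.abstract\n"
    "1,68.15% \ufffd 9.45\n"
    "2,2003\u20132013\n"
)
damaged_rows = find_damaged_rows(sample)
```

<ul>
<li>This only flags rows for review; the correct characters still have to come from the original source text</li>
</ul>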

<h2 id="2019-03-03">2019-03-03</h2>

<ul>
<li>Trying to finally upload IITA’s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:</li>
</ul>

<pre><code>$ mkdir 2019-03-03-IITA-Feb14
$ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
</code></pre>

<ul>
<li>As I was inspecting the archive I noticed that there were some problems with the bitstreams:

<ul>
<li>First, Sisay didn’t include the bitstream descriptions</li>
<li>Second, only five items had bitstreams and I remember in the discussion with IITA that there should have been nine!</li>
<li>I had to refer to the original CSV from January to find the file names, then download and add them to the export contents manually!</li>
</ul></li>
<li>After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:</li>
</ul>

<pre><code>$ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
</code></pre>

<ul>
<li>DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something</li>
<li>After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the <code>dspace cleanup</code> script</li>
<li>Merge the IITA research theme changes from last month to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)

<ul>
<li>I will deploy to CGSpace soon and then think about how to batch tag all IITA’s existing items with this metadata</li>
</ul></li>
<li>Deploy Tomcat 7.0.93 on CGSpace (linode18) after having tested it on DSpace Test (linode19) for a week</li>
</ul>
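
<ul>
<li>The re-mapping step could also be scripted instead of done in OpenRefine. This is only a sketch under assumptions: the collection handles in <code>COLLECTION_BY_TYPE</code> are hypothetical placeholders, and the <code>collection</code> and <code>dc.type</code> field names are assumed to match the exported metadata CSV:</li>
</ul>

```python
# Hypothetical handles: replace with the real owning collections
COLLECTION_BY_TYPE = {
    "Journal Article": "10568/00001",
    "Book Chapter": "10568/00002",
}


def remap_collections(rows, type_field="dc.type", collection_field="collection"):
    """Return copies of the rows with the owning collection chosen by type.

    Rows whose type has no mapping keep their current collection so that
    they can be reviewed by hand.
    """
    remapped = []
    for row in rows:
        updated = dict(row)  # copy so the input rows are left untouched
        item_type = row.get(type_field, "")
        updated[collection_field] = COLLECTION_BY_TYPE.get(
            item_type, row.get(collection_field)
        )
        remapped.append(updated)
    return remapped


# Made-up rows standing in for the exported metadata
rows = [
    {"id": "1", "collection": "10568/99832", "dc.type": "Journal Article"},
    {"id": "2", "collection": "10568/99832", "dc.type": "Poster"},
]
result = remap_collections(rows)
```

<ul>
<li>Unmapped types are left in the temporary collection on purpose, so nothing is moved by guesswork</li>
</ul>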

<h2 id="2019-03-06">2019-03-06</h2>

<ul>
<li>Abenet was having problems with a CIP user account; I think that the user could not register</li>
<li>I suspect it’s related to the email issue that ICT hasn’t responded about since last week</li>
<li>As I thought, I still cannot send emails from CGSpace:</li>
</ul>

<pre><code>$ dspace test-email

About to send test email:
- To: blah@stfu.com
- Subject: DSpace test email
- Server: smtp.office365.com

Error sending email:
- Error: javax.mail.AuthenticationFailedException
</code></pre>

<ul>
<li>I will send a follow-up to ICT to ask them to reset the password</li>
</ul>

<h2 id="2019-03-07">2019-03-07</h2>

<ul>
<li>ICT reset the email password and I confirmed that it is working now</li>
<li>Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month by dumping the terms themselves using <code>csvcut</code>, applying the XML controlled vocabulary format in vim, and then checking with tidy for good measure:</li>
</ul>

<pre><code>$ csvcut -c name 2019-02-22-subjects.csv &gt; dspace/config/controlled-vocabularies/dc-subject.xml
$ # apply formatting in XML file
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
</code></pre>
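
<ul>
<li>The manual formatting step in vim could be sketched in a script as well. This assumes the usual DSpace controlled vocabulary layout of a root <code>node</code> with an <code>isComposedBy</code> child holding one <code>node</code> per term; the root <code>id</code> and <code>label</code> values and the function name are my own placeholders:</li>
</ul>

```python
import xml.etree.ElementTree as ET


def build_vocabulary(terms, root_id="dc-subject", root_label="AGROVOC"):
    """Build a controlled vocabulary document from a list of terms.

    root_id and root_label are placeholder values, not taken from the
    real configuration.
    """
    root = ET.Element("node", {"id": root_id, "label": root_label})
    composed_by = ET.SubElement(root, "isComposedBy")
    for term in terms:
        term = term.strip()
        if term:  # skip blank lines left over from the CSV dump
            ET.SubElement(composed_by, "node", {"id": term, "label": term})
    # ElementTree escapes special characters in attribute values for us
    return ET.tostring(root, encoding="unicode")


# Made-up terms; in practice these would come from the csvcut output
vocabulary_xml = build_vocabulary(["MAIZE", "  SOIL FERTILITY ", ""])
```

<ul>
<li>Running tidy on the result afterwards would still be worthwhile, since this sketch does not indent the output</li>
</ul>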

<ul>
<li>I tested the AGROVOC controlled vocabulary locally and will deploy it on DSpace Test soon so people can see it</li>
<li>Atmire noticed my message about the “solr_update_time_stamp” error on the dspace-tech mailing list and created an issue on their tracker to discuss it with me

<ul>
<li>They say the error is harmless, but has nevertheless been fixed in their newer module versions</li>
</ul></li>
</ul>

<h2 id="2019-03-08">2019-03-08</h2>

<ul>
<li>There’s an issue with CGSpace right now where all items are giving a blank page in the XMLUI

<ul>
<li>Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID!</li>
<li>I don’t see anything unusual in the Tomcat logs, though there are over a thousand of those <code>solr_update_time_stamp</code> errors:</li>
</ul></li>
</ul>

<pre><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
1076
</code></pre>

<ul>
<li>I restarted Tomcat and it’s OK now…</li>
<li>Skype meeting with Peter and Abenet and Sisay

<ul>
<li>We want to try to crowdsource the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500</li>
<li>We will share a Google Docs spreadsheet with the partners and ask them to mark the deletions and corrections</li>
<li>Abenet and Alan will spend some time identifying the correct DCTERMS fields to move to, preferring them over CG Core 2.0 fields as we want to be globally compliant (using information from the SEO crosswalks)</li>
</ul></li>
</ul>

<!-- vim: set sw=2 ts=2: -->

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 ml-auto blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">

<li><a href="/cgspace-notes/2019-03/">March, 2019</a></li>

<li><a href="/cgspace-notes/2019-02/">February, 2019</a></li>

<li><a href="/cgspace-notes/2019-01/">January, 2019</a></li>

<li><a href="/cgspace-notes/2018-12/">December, 2018</a></li>

<li><a href="/cgspace-notes/2018-11/">November, 2018</a></li>

</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">

<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>

Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>