mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-05 14:53:02 +01:00
370 lines
14 KiB
HTML
<!DOCTYPE html>
<html lang="en">

<head>

<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->

<meta name="description" content="">
<meta name="author" content="Alan Orth">

<!-- OpenGraph Metadata: http://ogp.me/ -->
<meta property="og:title" content="November, 2016">
<meta property="og:description" content="">

<meta property="og:type" content="article">
<meta property="article:published_time" content="2016-11-01T09:21:00+03:00">
<meta property="article:author" content="Alan Orth">

<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-11/">

<!-- Metadata for Twitter: https://dev.twitter.com/cards/markup -->

<meta property="twitter:card" content="summary">

<meta property="twitter:title" content="November, 2016">
<meta property="twitter:description" content="">

<meta name="generator" content="Hugo 0.17" />

<base href="https://alanorth.github.io/cgspace-notes/">
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-11/">

<title>November, 2016 | CGSpace Notes</title>

<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet">

<!-- RSS 2.0 feed -->
<link href="https://alanorth.github.io/cgspace-notes/index.xml" type="application/rss+xml" rel="alternate">
</head>

<body>

<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>

</nav>
</div>
</div>

<header class="blog-header">
<div class="container">
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>

</div>
</header>

<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">

<article class="blog-post">
<header>
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2016-11/">November, 2016</a></h2>
<p class="blog-post-meta"><time datetime="2016-11-01T09:21:00+03:00">Tue Nov 01, 2016</time> by Alan Orth in

<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>

</p>
</header>

<h2 id="2016-11-01">2016-11-01</h2>

<ul>
<li>Add <code>dc.type</code> to the output options for Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
</ul>

<p><img src="2016/11/listings-and-reports.png" alt="Listings and Reports with output type" /></p>

<h2 id="2016-11-02">2016-11-02</h2>

<ul>
<li>Migrate DSpace Test to DSpace 5.5 (<a href="https://gist.github.com/alanorth/61013895c6efe7095d7f81000953d1cf">notes</a>)</li>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (<a href="https://github.com/ilri/DSpace/issues/63">#63</a>)</li>
<li>Indexing Discovery on DSpace Test took 332 minutes, which is roughly four times as long as it usually takes</li>
<li>At the end it appeared to finish correctly, but there were lots of errors right after it finished:</li>
</ul>

<pre><code>2016-11-02 15:09:48,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
2016-11-02 15:09:48,584 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
2016-11-02 15:09:48,589 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
2016-11-02 15:09:48,590 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index
2016-11-02 15:09:48,590 INFO org.dspace.discovery.IndexClient @ Done with indexing
2016-11-02 15:09:48,600 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76456 to Index
2016-11-02 15:09:48,613 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/55536 to Index
2016-11-02 15:09:48,616 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
java.lang.NullPointerException
    at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
    at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
</code></pre>

<ul>
<li>DSpace is still up, and a few minutes later I see the default DSpace indexer is still running</li>
<li>Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:</li>
</ul>

<pre><code>2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476
</code></pre>
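
<ul>
<li>A rough Python sketch for confirming this kind of overlap: tally log lines per logger class, so two indexers writing at the same time stand out. The regular expression assumes the log layout shown in the excerpts above, and <code>count_by_logger</code> is a hypothetical helper name, not part of DSpace:</li>
</ul>

<pre><code>import re
from collections import Counter

# Assumed log layout: "DATE TIME LEVEL logger.name @ message", as in the excerpts above.
LOG_LINE = re.compile(r"^\S+ \S+ (?:INFO|WARN|ERROR|DEBUG)\s+(\S+) @")


def count_by_logger(lines):
    """Tally log lines per logger class so overlapping indexers stand out."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The four interleaved lines from the log excerpt above.
sample = [
    "2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index",
    "2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index",
    "2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557",
    "2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476",
]

for logger, count in sorted(count_by_logger(sample).items()):
    print(logger, count)
</code></pre>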

<ul>
<li>I will raise a ticket with Atmire to ask them about it</li>
</ul>

<h2 id="2016-11-06">2016-11-06</h2>

<ul>
<li>After re-deploying and re-indexing I didn’t see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take</li>
</ul>

<h2 id="2016-11-07">2016-11-07</h2>

<ul>
<li>Horrible one-liner to get the Linode IDs from certain Ansible host vars:</li>
</ul>

<pre><code>$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
</code></pre>

<ul>
<li>I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps because of the <code>:</code></li>
<li>I’ll export these and fix them in batch:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
COPY 22
</code></pre>

<ul>
<li>Test running the replacements:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in the author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li>
</ul>

<h2 id="2016-11-08">2016-11-08</h2>

<ul>
<li>Atmire’s Listings and Reports module seems to be broken on DSpace 5.5</li>
</ul>

<p><img src="2016/11/listings-and-reports-55.png" alt="Listings and Reports broken in DSpace 5.5" /></p>

<ul>
<li>I’ve filed a ticket with Atmire</li>
<li>Thinking about batch updates for ORCIDs and authors</li>
<li>Playing with <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> in Python to query Solr</li>
<li>All records in the authority core are either <code>authority_type:orcid</code> or <code>authority_type:person</code></li>
<li>There is a <code>deleted</code> field and all items seem to be <code>false</code>, but it might be an important sanity check to remember</li>
<li>The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL</li>
<li>Dump of the top 210 authors in CGSpace:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
</code></pre>
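
<ul>
<li>A minimal Python sketch of how the CSV-driven batch update might be prepared. This is only an illustration: the CSV column names <code>author</code> and <code>authority_id</code> are hypothetical, the data is placeholder data, and I’m assuming <code>metadatavalue</code> stores the authority ID in an <code>authority</code> column. It only builds the parameters for a parametrized <code>UPDATE</code> and does not touch the database:</li>
</ul>

<pre><code>import csv
import io

# Assumption: metadatavalue has an "authority" column holding the authority ID.
# The %s placeholders follow the psycopg2 parameter style.
UPDATE_SQL = (
    "UPDATE metadatavalue SET authority = %s "
    "WHERE resource_type_id = 2 AND metadata_field_id = 3 AND text_value = %s"
)


def build_update_params(csv_file):
    """Return (authority_id, author_name) tuples for rows that have an authority ID."""
    reader = csv.DictReader(csv_file)
    return [
        (row["authority_id"], row["author"])
        for row in reader
        if row["authority_id"].strip()
    ]


# Placeholder data; the column names "author" and "authority_id" are hypothetical.
example_csv = io.StringIO(
    "author,authority_id\n"
    '"Example, Author A.",00000000-0000-0000-0000-000000000000\n'
    '"Example, Author B.",\n'
)

# Rows without an authority ID are skipped, so only the first author is printed.
for params in build_update_params(example_csv):
    print(UPDATE_SQL, params)
</code></pre>

<ul>
<li>With a database driver such as psycopg2, each tuple could be passed along with <code>UPDATE_SQL</code> to <code>cursor.execute()</code> inside one transaction, so the whole batch can be rolled back if the row counts look wrong</li>
</ul>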

<h2 id="2016-11-09">2016-11-09</h2>

<ul>
<li>CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the <code>5_x-prod</code> branch, and rebooted the server</li>
<li>The error was <code>Timeout waiting for idle object</code> but I haven’t looked into the Tomcat logs to see what happened</li>
<li>Also, I ran the corrections for CRPs from earlier this week</li>
</ul>

<h2 id="2016-11-10">2016-11-10</h2>

<ul>
<li>Helping Megan Zandstra and CIAT with some questions about the REST API</li>
<li>Playing with <code>find-by-metadata-field</code>, this works:</li>
</ul>

<pre><code>$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
</code></pre>

<ul>
<li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li>
</ul>

<pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
 text_value | text_lang
------------+-----------
 SEEDS      |
 SEEDS      |
 SEEDS      | en_US
(3 rows)
</code></pre>

<ul>
<li>So basically, the text language here could be null, blank, or en_US</li>
<li>To query metadata with these properties, you can do:</li>
</ul>

<pre><code>$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
55
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
34
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
</code></pre>
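
<ul>
<li>The same three requests could be sketched in Python with only the standard library. This mirrors the <code>curl</code> calls above; the helper names <code>build_request</code> and <code>count_items</code> are my own, and nothing is sent until <code>count_items</code> is called against a running DSpace:</li>
</ul>

<pre><code>import json
import urllib.request

REST_URL = "http://localhost:8080/rest/items/find-by-metadata-field"


def build_request(key, value, language=None):
    """Build the POST request; omit "language" entirely when it is None."""
    payload = {"key": key, "value": value}
    if language is not None:
        payload["language"] = language
    return urllib.request.Request(
        REST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def count_items(request):
    """Send the request and return the number of items in the JSON array."""
    with urllib.request.urlopen(request) as response:
        return len(json.load(response))


# One request per variant tried above: no language key, blank, and en_US.
requests_by_language = {
    language: build_request("cg.subject.ilri", "SEEDS", language)
    for language in (None, "", "en_US")
}

# Show the JSON bodies without contacting the server.
for language, request in requests_by_language.items():
    print(repr(language), request.data.decode("utf-8"))
</code></pre>

<ul>
<li>For example, <code>count_items(requests_by_language["en_US"])</code> would be the equivalent of piping the third <code>curl</code> call through <code>jq length</code></li>
</ul>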

<ul>
<li>The results (55+34=89) don’t seem to match those from the database:</li>
</ul>

<pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
 count
-------
    15
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
 count
-------
     4
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
 count
-------
    66
</code></pre>

<ul>
<li>So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85…</li>
<li>And the <code>find-by-metadata-field</code> endpoint doesn’t seem to have a way to get all items with the field, or a wildcard value</li>
<li>I’ll ask a question on the dspace-tech mailing list</li>
<li>And speaking of <code>text_lang</code>, this is interesting:</li>
</ul>

<pre><code>dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------

 ethnob
 en
 spa
 EN
 es
 frn
 en_
 en_US

 EN_US
 eng
 en_U
 fr
(14 rows)
</code></pre>

<ul>
<li>Generate a list of all these so I can fix them in batch:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
COPY 14
</code></pre>
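
<ul>
<li>A small Python sketch of how the variants might be normalized before a batch fix. The mapping is purely hypothetical (which canonical code each variant should become is a guess, not a decision), and anything not in the mapping is returned as <code>None</code> so it can be reviewed by hand:</li>
</ul>

<pre><code># Hypothetical mapping: the canonical code for each variant is a guess.
CANONICAL = {
    "en": "en_US", "en_": "en_US", "en_u": "en_US", "en_us": "en_US", "eng": "en_US",
    "es": "es", "spa": "es",
    "fr": "fr", "frn": "fr",
}


def normalize_text_lang(raw):
    """Return the canonical code for a raw text_lang, or None if it needs manual review."""
    if raw is None:
        return None
    return CANONICAL.get(raw.strip().lower())


# The non-null values from the query above (the blank string is one of the two empty rows).
seen = ["", "ethnob", "en", "spa", "EN", "es", "frn", "en_", "en_US", "EN_US", "eng", "en_U", "fr"]

for raw in seen:
    print(repr(raw), "maps to", normalize_text_lang(raw))
</code></pre>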

<ul>
<li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li>
</ul>

<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
UPDATE 85
</code></pre>

</article>

</div> <!-- /.blog-main -->

<aside class="col-sm-3 offset-sm-1 blog-sidebar">

<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">

<li><a href="/cgspace-notes/2016-11/">November, 2016</a></li>

<li><a href="/cgspace-notes/2016-10/">October, 2016</a></li>

<li><a href="/cgspace-notes/2016-09/">September, 2016</a></li>

<li><a href="/cgspace-notes/2016-08/">August, 2016</a></li>

<li><a href="/cgspace-notes/2016-07/">July, 2016</a></li>

</ol>
</section>

<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">

<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>

<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>

<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>

</ol>
</section>

</aside>

</div> <!-- /.row -->
</div> <!-- /.container -->

<footer class="blog-footer">
<p>

Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.

</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>

</body>

</html>