1
0
mirror of https://github.com/alanorth/cgspace-notes.git synced 2025-01-12 06:53:21 +01:00

603 lines
41 KiB
HTML
Raw Normal View History

2023-07-04 08:03:36 +03:00
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="November, 2016" />
<meta property="og:description" content="2016-11-01
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-11/" />
<meta property="article:published_time" content="2016-11-01T09:21:00+03:00" />
<meta property="article:modified_time" content="2020-04-13T15:30:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2016"/>
<meta name="twitter:description" content="2016-11-01
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
"/>
2023-09-30 13:07:23 +03:00
<meta name="generator" content="Hugo 0.119.0">
2023-07-04 08:03:36 +03:00
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2016",
"url": "https://alanorth.github.io/cgspace-notes/2016-11/",
"wordCount": "2825",
"datePublished": "2016-11-01T09:21:00+03:00",
"dateModified": "2020-04-13T15:30:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-11/">
<title>November, 2016 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-11/">November, 2016</a></h2>
<p class="blog-post-meta">
<time datetime="2016-11-01T09:21:00+03:00">Tue Nov 01, 2016</time>
in
<span class="fas fa-tag" aria-hidden="true"></span>&nbsp;<a href="/tags/notes/" rel="tag">Notes</a>
</p>
</header>
<h2 id="2016-11-01">2016-11-01</h2>
<ul>
<li>Add <code>dc.type</code> to the output options for Atmire&rsquo;s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
</ul>
<p><img src="/cgspace-notes/2016/11/listings-and-reports.png" alt="Listings and Reports with output type"></p>
<h2 id="2016-11-02">2016-11-02</h2>
<ul>
<li>Migrate DSpace Test to DSpace 5.5 (<a href="https://gist.github.com/alanorth/61013895c6efe7095d7f81000953d1cf">notes</a>)</li>
<li>Run all updates on DSpace Test and reboot the server</li>
<li>Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (<a href="https://github.com/ilri/DSpace/issues/63">#63</a>)</li>
<li>Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes</li>
<li>At the end it appeared to finish correctly but there were lots of errors right after it finished:</li>
</ul>
<pre tabindex="0"><code>2016-11-02 15:09:48,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
2016-11-02 15:09:48,584 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
2016-11-02 15:09:48,589 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
2016-11-02 15:09:48,590 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index
2016-11-02 15:09:48,590 INFO org.dspace.discovery.IndexClient @ Done with indexing
2016-11-02 15:09:48,600 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76456 to Index
2016-11-02 15:09:48,613 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/55536 to Index
2016-11-02 15:09:48,616 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
java.lang.NullPointerException
at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
</code></pre><ul>
<li>DSpace is still up, and a few minutes later I see the default DSpace indexer is still running</li>
<li>Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:</li>
</ul>
<pre tabindex="0"><code>2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476
</code></pre><ul>
<li>I will raise a ticket with Atmire to ask them</li>
</ul>
<h2 id="2016-11-06">2016-11-06</h2>
<ul>
<li>After re-deploying and re-indexing I didn&rsquo;t see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take</li>
</ul>
<h2 id="2016-11-07">2016-11-07</h2>
<ul>
<li>Horrible one liner to get Linode ID from certain Ansible host vars:</li>
</ul>
<pre tabindex="0"><code>$ grep -A 3 contact_info * | grep -E &#34;(Orth|Sisay|Peter|Daniel|Tsega)&#34; | awk -F&#39;-&#39; &#39;{print $1}&#39; | grep linode | uniq | xargs grep linode_id
</code></pre><ul>
<li>I noticed some weird CRPs in the database, and they don&rsquo;t show up in Discovery for some reason, perhaps the <code>:</code></li>
<li>I&rsquo;ll export these and fix them in batch:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
COPY 22
</code></pre><ul>
<li>Test running the replacements:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li>
</ul>
<h2 id="2016-11-08">2016-11-08</h2>
<ul>
<li>Atmire&rsquo;s Listings and Reports module seems to be broken on DSpace 5.5</li>
</ul>
<p><img src="/cgspace-notes/2016/11/listings-and-reports-55.png" alt="Listings and Reports broken in DSpace 5.5"></p>
<ul>
<li>I&rsquo;ve filed a ticket with Atmire</li>
<li>Thinking about batch updates for ORCIDs and authors</li>
<li>Playing with <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> in Python to query Solr</li>
<li>All records in the authority core are either <code>authority_type:orcid</code> or <code>authority_type:person</code></li>
<li>There is a <code>deleted</code> field and all items seem to be <code>false</code>, but might be important sanity check to remember</li>
<li>The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL</li>
<li>Dump of the top ~200 authors in CGSpace:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
</code></pre><h2 id="2016-11-09">2016-11-09</h2>
<ul>
<li>CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the <code>5_x-prod</code> branch, and rebooted the server</li>
<li>The error was <code>Timeout waiting for idle object</code> but I haven&rsquo;t looked into the Tomcat logs to see what happened</li>
<li>Also, I ran the corrections for CRPs from earlier this week</li>
</ul>
<h2 id="2016-11-10">2016-11-10</h2>
<ul>
<li>Helping Megan Zandstra and CIAT with some questions about the REST API</li>
<li>Playing with <code>find-by-metadata-field</code>, this works:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39;
</code></pre><ul>
<li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li>
</ul>
<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
text_value | text_lang
------------+-----------
SEEDS |
SEEDS |
SEEDS | en_US
(3 rows)
</code></pre><ul>
<li>So basically, the text language here could be null, blank, or en_US</li>
<li>To query metadata with these properties, you can do:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
55
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
34
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>The results (55+34=89) don&rsquo;t seem to match those from the database:</li>
</ul>
<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang is null;
count
-------
15
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;&#39;;
count
-------
4
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;en_US&#39;;
count
-------
66
</code></pre><ul>
<li>So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85&hellip;</li>
<li>And the <code>find-by-metadata-field</code> endpoint doesn&rsquo;t seem to have a way to get all items with the field, or a wildcard value</li>
<li>I&rsquo;ll ask a question on the dspace-tech mailing list</li>
<li>And speaking of <code>text_lang</code>, this is interesting:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en
spa
EN
es
frn
en_
en_US
EN_US
eng
en_U
fr
(14 rows)
</code></pre><ul>
<li>Generate a list of all these so I can maybe fix them in batch:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
COPY 14
</code></pre><ul>
<li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
UPDATE 85
</code></pre><ul>
<li>The <code>fix-metadata.py</code> script I have is meant for specific metadata values, so if I want to update some <code>text_lang</code> values I should just do it directly in the database</li>
<li>For example, on a limited set:</li>
</ul>
<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;LIVESTOCK&#39; and text_lang=&#39;&#39;;
UPDATE 420
</code></pre><ul>
<li>And assuming I want to do it for all fields:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang=&#39;&#39;;
UPDATE 183726
</code></pre><ul>
<li>After that restarted Tomcat and PostgreSQL (because I&rsquo;m superstitious about caches) and now I see the following in REST API query:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
71
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
0
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>Not sure what&rsquo;s going on, but Discovery shows 83 values, and database shows 85, so I&rsquo;m going to reindex Discovery just in case</li>
</ul>
<h2 id="2016-11-14">2016-11-14</h2>
<ul>
<li>I applied Atmire&rsquo;s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li>
<li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire&rsquo;s installation procedure must have changed</li>
<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li>
<li>After adding that to <code>server.xml</code> bots matching the pattern in the configuration will all use ONE session, just like normal users:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:29 GMT
Server: nginx
Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none
$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:35 GMT
Server: nginx
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
</code></pre><ul>
<li>The first one gets a session, and any after thatwithin 60 secondswill be internally mapped to the same session by Tomcat</li>
<li>This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!</li>
</ul>
<h2 id="2016-11-15">2016-11-15</h2>
<ul>
<li>The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:</li>
</ul>
<p><img src="/cgspace-notes/2016/11/dspacetest-tomcat-jvm-day.png" alt="Tomcat JVM heap (day) after setting up the Crawler Session Manager">
<img src="/cgspace-notes/2016/11/dspacetest-tomcat-jvm-week.png" alt="Tomcat JVM heap (week) after setting up the Crawler Session Manager"></p>
<ul>
<li>Seems the default regex doesn&rsquo;t catch Baidu, though:</li>
</ul>
<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
</code></pre><ul>
<li>Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:</li>
</ul>
<pre tabindex="0"><code>&lt;!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers --&gt;
&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&#34; /&gt;
</code></pre><ul>
<li>Looking at the bots that were active yesterday it seems the above regex should be sufficient:</li>
</ul>
<pre tabindex="0"><code>$ grep -o -E &#39;Mozilla/5\.0 \(compatible;.*\&#34;&#39; /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
</code></pre><ul>
<li>Neat maven trick to exclude some modules from being built:</li>
</ul>
<pre tabindex="0"><code>$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre><ul>
<li>We absolutely don&rsquo;t use those modules, so we shouldn&rsquo;t build them in the first place</li>
</ul>
<h2 id="2016-11-17">2016-11-17</h2>
<ul>
<li>Generate a list of journal titles for Peter and Abenet to look through so we can make a controlled vocabulary out of them:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc) to /tmp/journal-titles.csv with csv;
COPY 2515
</code></pre><ul>
<li>Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test</li>
<li>Test an update old, non-HTTPS links to the CCAFS website in CGSpace metadata:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 164
dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 7
</code></pre><ul>
<li>Had to run it twice to get all (not sure about &ldquo;global&rdquo; regex in PostgreSQL)</li>
<li>Run the updates on CGSpace as well</li>
<li>Run through some collections and manually regenerate some PDF thumbnails for items from before 2016 on DSpace Test to compare with CGSpace</li>
<li>I&rsquo;m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn&rsquo;t as good</li>
<li>The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>In related news, I&rsquo;m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace&rsquo;s media filter has made thumbnails of THEM):</li>
</ul>
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where text_value like &#39;%.jpg.jpg&#39;;
</code></pre><ul>
<li>I&rsquo;m not sure if there&rsquo;s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore&hellip;</li>
</ul>
<h2 id="2016-11-18">2016-11-18</h2>
<ul>
<li>Enable Tomcat Crawler Session Manager on CGSpace</li>
</ul>
<h2 id="2016-11-21">2016-11-21</h2>
<ul>
<li>More work on Ansible playbooks for PostgreSQL 9.3→9.5 and Java 7→8 work</li>
<li>CGSpace virtual managers meeting</li>
<li>I need to look into making the item thumbnail clickable</li>
<li>Macaroni Bros said they tested the DSpace Test (DSpace 5.5) REST API for CCAFS and WLE sites and it works as expected</li>
</ul>
<h2 id="2016-11-23">2016-11-23</h2>
<ul>
<li>Upgrade Java from 7 to 8 on CGSpace</li>
<li>I had started planning the inplace PostgreSQL 9.3→9.5 upgrade but decided that I will have to <code>pg_dump</code> and <code>pg_restore</code> when I move to the new server soon anyways, so there&rsquo;s no need to upgrade the database right now</li>
<li>Chat with Carlos about CGCore and the CGSpace metadata registry</li>
<li>Dump CGSpace metadata field registry for Carlos: <a href="https://gist.github.com/alanorth/8cbd0bb2704d4bbec78025b4742f8e70">https://gist.github.com/alanorth/8cbd0bb2704d4bbec78025b4742f8e70</a></li>
<li>Send some feedback to Carlos on CG Core so they can better understand how DSpace/CGSpace uses metadata</li>
<li>Notes about PostgreSQL tuning from James: <a href="https://paste.fedoraproject.org/488776/14798952/">https://paste.fedoraproject.org/488776/14798952/</a></li>
<li>Play with Creative Commons stuff in DSpace submission step</li>
<li>It seems to work but it doesn&rsquo;t let you choose a version of CC (like 4.0), and we would need to customize the XMLUI item display so it doesn&rsquo;t display the gross CC badges</li>
</ul>
<h2 id="2016-11-24">2016-11-24</h2>
<ul>
<li>Bizuwork was testing DSpace Test on DSPace 5.5 and noticed that the Listings and Reports module seems to be case sensitive, whereas CGSpace&rsquo;s Listings and Reports isn&rsquo;t (ie, a search for &ldquo;orth, alan&rdquo; vs &ldquo;Orth, Alan&rdquo; returns the same results on CGSpace, but different on DSpace Test)</li>
<li>I have raised a ticket with Atmire</li>
<li>Looks like this issue is actually the new Listings and Reports module honoring the Solr search queries more correctly</li>
</ul>
<h2 id="2016-11-27">2016-11-27</h2>
<ul>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>Deploy DSpace 5.5 on CGSpace:
<ul>
<li>maven package</li>
<li>stop tomcat</li>
<li>backup postgresql</li>
<li>run Atmire 5.5 schema deletions</li>
<li>delete the deployed spring folder</li>
<li>ant update</li>
<li>run system updates</li>
<li>reboot server</li>
</ul>
</li>
<li>Need to do updates for ansible infrastructure role defaults, and switch the GitHub branch to the new 5.5 one</li>
<li>Testing DSpace 5.5 on CGSpace, it seems CUA&rsquo;s export as XLS works for Usage statistics, but not Content statistics</li>
<li>I will raise a bug with Atmire</li>
</ul>
<h2 id="2016-11-28">2016-11-28</h2>
<ul>
<li>One user says they are still getting a blank page when he logs in (just CGSpace header, but no community list)</li>
<li>Looking at the Catlina logs I see there is some super long-running indexing process going on:</li>
</ul>
<pre tabindex="0"><code>INFO: FrameworkServlet &#39;oai&#39;: initialization completed in 2600 ms
[&gt; ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
[&gt; ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 15 hour(s) 35 minute(s) 23 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 14 hour(s) 5 minute(s) 56 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 11 hour(s) 23 minute(s) 49 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 11 hour(s) 21 minute(s) 57 seconds. timestamp: 2016-11-28 03:00:20
</code></pre><ul>
<li>It says OAI, and seems to start at 3:00 AM, but I only see the <code>filter-media</code> cron job set to start then</li>
<li>Double checking the <a href="https://wiki.lyrasis.org/display/DSDOC5x/Upgrading+DSpace">DSpace 5.x upgrade notes</a> for anything I missed, or troubleshooting tips</li>
<li>Running some manual processes just in case:</li>
</ul>
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dcterms-types.xml
$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dublin-core-types.xml
$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/eperson-types.xml
$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/workflow-types.xml
</code></pre><ul>
<li>Start working on paper for KM4Dev journal</li>
<li>Wow, Bram from Atmire pointed out this solution for using multiple handles with one DSpace instance: <a href="https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296">https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296</a></li>
<li>We might be able to migrate the <a href="http://library.cgiar.org/">CGIAR Library</a> now, as they had wanted to keep their handles</li>
</ul>
<h2 id="2016-11-29">2016-11-29</h2>
<ul>
<li>Sisay tried deleting and re-creating Goshu&rsquo;s account but he still can&rsquo;t see any communities on the homepage after he logs in</li>
<li>Around the time of his login I see this in the DSpace logs:</li>
</ul>
<pre tabindex="0"><code>2016-11-29 07:56:36,350 INFO org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org
2016-11-29 07:56:36,350 INFO org.dspace.authenticate.PasswordAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:authenticate:attempting password auth of user=g.cherinet@cgiar.org
2016-11-29 07:56:36,352 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:email=g.cherinet@cgiar.org, realm=null, result=2
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item&#39;s bitstream stats
2016-11-29 07:56:36,608 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,701 INFO org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: Error executing query
at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1618)
at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1600)
at org.dspace.discovery.SolrServiceImpl.search(SolrServiceImpl.java:1583)
at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.performSearch(SidebarFacetsTransformer.java:165)
at org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer.addOptions(SidebarFacetsTransformer.java:174)
at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
at sun.reflect.GeneratedMethodAccessor277.invoke(Unknown Source)
...
</code></pre><ul>
<li>At about the same time in the solr log I see a super long query:</li>
</ul>
<pre tabindex="0"><code>2016-11-29 07:56:36,734 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&amp;fl=dateIssued.year,handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=dateIssued.year:[*+TO+*]&amp;fq=read:(g0+OR+e574+OR+g0+OR+g3+OR+g9+OR+g10+OR+g14+OR+g16+OR+g18+OR+g20+OR+g23+OR+g24+OR+g2072+OR+g2074+OR+g28+OR+g2076+OR+g29+OR+g2078+OR+g2080+OR+g34+OR+g2082+OR+g2084+OR+g38+OR+g2086+OR+g2088+OR+g2091+OR+g43+OR+g2092+OR+g2093+OR+g2095+OR+g2097+OR+g50+OR+g2099+OR+g51+OR+g2103+OR+g62+OR+g65+OR+g2115+OR+g2117+OR+g2119+OR+g2121+OR+g2123+OR+g2125+OR+g77+OR+g78+OR+g79+OR+g2127+OR+g80+OR+g2129+OR+g2131+OR+g2133+OR+g2134+OR+g2135+OR+g2136+OR+g2137+OR+g2138+OR+g2139+OR+g2140+OR+g2141+OR+g2142+OR+g2148+OR+g2149+OR+g2150+OR+g2151+OR+g2152+OR+g2153+OR+g2154+OR+g2156+OR+g2165+OR+g2167+OR+g2171+OR+g2174+OR+g2175+OR+g129+OR+g2182+OR+g2186+OR+g2189+OR+g153+OR+g158+OR+g166+OR+g167+OR+g168+OR+g169+OR+g2225+OR+g179+OR+g2227+OR+g2229+OR+g183+OR+g2231+OR+g184+OR+g2233+OR+g186+OR+g2235+OR+g2237+OR+g191+OR+g192+OR+g193+OR+g202+OR+g203+OR+g204+OR+g205+OR+g207+OR+g208+OR+g218+OR+g219+OR+g222+OR+g223+OR+g230+OR+g231+OR+g238+OR+g241+OR+g244+OR+g254+OR+g255+OR+g262+OR+g265+OR+g268+OR+g269+OR+g273+OR+g276+OR+g277+OR+g279+OR+g282+OR+g2332+OR+g2335+OR+g2338+OR+g292+OR+g293+OR+g2341+OR+g296+OR+g2344+OR+g297+OR+g2347+OR+g301+OR+g2350+OR+g303+OR+g305+OR+g2356+OR+g310+OR+g311+OR+g2359+OR+g313+OR+g2362+OR+g2365+OR+g2368+OR+g321+OR+g2371+OR+g325+OR+g2374+OR+g328+OR+g2377+OR+g2380+OR+g333+OR+g2383+OR+g2386+OR+g2389+OR+g342+OR+g343+OR+g2392+OR+g345+OR+g2395+OR+g348+OR+g2398+OR+g2401+OR+g2404+OR+g2407+OR+g364+OR+g366+OR+g2425+OR+g2427+OR+g385+OR+g387+OR+g388+OR+g389+OR+g2442+OR+g395+OR+g2443+OR+g2444+OR+g401+OR+g403+OR+g405+OR+g408+OR+g2457+OR+g2458+OR+g411+OR+g2459+OR+g414+OR+g2463+OR+g417+OR+g2465+OR+g2467+OR+g421+OR+g2469+OR+g2471+OR+g424+OR+g2473+OR+g2475+OR+g2476+OR+g429+OR+g433+OR+g2481+OR+g2482+OR+g2483+OR+g443+OR+g444+OR+g445+OR+g446+OR+g448+OR+g453+OR+g455+OR+g456+OR+g457+OR+g458+OR+g459+OR+g461+OR+g462+OR+g463+OR+g464+OR+g465+OR+g467+OR+g468+OR+g469+OR+g474+OR+g476+OR+g477+OR+g480+OR+g483+OR+g484+OR+g493+OR+g496+OR+g497+OR+g498+OR+g500+OR+g502+OR+g504+OR+g505+OR+g2559+OR+g2560+OR+g513+OR+g2561+OR+g515+OR+g516+OR+g518+OR+g519+OR+g2567+OR+g520+OR+g521+OR+g522+OR+g2570+OR+g523+OR+g2571+OR+g524+OR+g525+OR+g2573+OR+g526+OR+g2574+OR+g527+OR+g528+OR+g2576+OR+g529+OR+g531+OR+g2579+OR+g533+OR+g534+OR+g2582+OR+g535+OR+g2584+OR+g538+OR+g2586+OR+g540+OR+g2588+OR+g541+OR+g543+OR+g544+OR+g545+OR+g546+OR+g548+OR+g2596+OR+g549+OR+g551+OR+g555+OR+g556+OR+g558+OR+g561+OR+g569+OR+g570+OR+g571+OR+g2619+OR+g572+OR+g2620+OR+g573+OR+g2621+OR+g2622+OR+g575+OR+g578+OR+g581+OR+g582+OR+g584+OR+g585+OR+g586+OR+g587+OR+g588+OR+g590+OR+g591+OR+g593+OR+g595+OR+g596+OR+g598+OR+g599+OR+g601+OR+g602+OR+g603+OR+g604+OR+g605+OR+g606+OR+g608+OR+g609+OR+g610+OR+g612+OR+g614+OR+g616+OR+g620+OR+g621+OR+g623+OR+g630+OR+g635+OR+g636+OR+g646+OR+g649+OR+g683+OR+g684+OR+g687+OR+g689+OR+g691+OR+g695+OR+g697+OR+g698+OR+g699+OR+g700+OR+g701+OR+g707+OR+g708+OR+g709+OR+g710+OR+g711+OR+g712+OR+g713+OR+g714+OR+g715+OR+g716+OR+g717+OR+g719+OR+g720+OR+g729+OR+g732+OR+g733+OR+g734+OR+g736+OR+g737+OR+g738+OR+g2786+OR+g752+OR+g754+OR+g2804+OR+g757+OR+g2805+OR+g2806+OR+g760+OR+g761+OR+g2810+OR+g2815+OR+g769+OR+g771+OR+g773+OR+g776+OR+g786+OR+g787+OR+g788+OR+g789+OR+g791+OR+g792+OR+g793+OR+g794+OR+g795+OR+g796+OR+g798+OR+g800+OR+g802+OR+g803+OR+g806+OR+g808+OR+g810+OR+g814+OR+g815+OR+g817+OR+g829+OR+g830+OR+g849+OR+g893+OR+g895+OR+g898+OR+g902+OR+g903+OR+g917+OR+g919+OR+g921+OR+g922+OR+g923+OR+g924+OR+g925+OR+g926+OR+g927+OR+g928+OR+g929+OR+g930+OR+g932+OR+g933+OR+g934+OR+g938+OR+g939+OR+g944+OR+g945+OR+g946+OR+g947+OR+g948+OR+g949+OR+g950+OR+g951+OR+g953+OR+g954+OR+g955+OR+g956+OR+g958+OR+g959+OR+g960+OR+g963+OR+g964+OR+g965+OR+g968+OR+g969+OR+g970+OR+g971+OR+g972+OR+g973+OR+g974+OR+g976+OR+g978+OR+g979+OR+g984+OR+g985+OR+g987+OR+g988+OR+g991+OR+g993+OR+g994+OR+g999+OR+g1000+OR+g1003
</code></pre><ul>
<li>Which, according to some old threads on DSpace Tech, means that the user has a lot of permissions (from groups or on the individual eperson) which increases the Solr query size / query URL</li>
<li>It might be fixed by increasing the Tomcat <code>maxHttpHeaderSize</code>, which is <a href="http://tomcat.apache.org/tomcat-7.0-doc/config/http.html">8192 (or 8KB) by default</a></li>
<li>I&rsquo;ve increased the <code>maxHttpHeaderSize</code> to 16384 on DSpace Test and the user said he is now able to see the communities on the homepage</li>
<li>I will make the changes on CGSpace soon</li>
<li>A few users are reporting having issues with their workflows, they get the following message: &ldquo;You are not allowed to perform this task&rdquo;</li>
<li>Might be the same as <a href="https://jira.duraspace.org/browse/DS-2920">DS-2920</a> on the bug tracker</li>
</ul>
<h2 id="2016-11-30">2016-11-30</h2>
<ul>
<li>The <code>maxHttpHeaderSize</code> fix worked on CGSpace (user is able to see the community list on the homepage)</li>
<li>The &ldquo;take task&rdquo; cache fix worked on DSpace Test but it&rsquo;s not an official patch, so I&rsquo;ll have to report the bug to DSpace people and try to get advice</li>
<li>More work on the KM4Dev Journal article</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
2023-10-04 09:24:33 +03:00
<li><a href="/cgspace-notes/2023-10/">October, 2023</a></li>
2023-09-02 17:37:15 +03:00
<li><a href="/cgspace-notes/2023-09/">September, 2023</a></li>
2023-08-04 18:05:44 +03:00
<li><a href="/cgspace-notes/2023-08/">August, 2023</a></li>
2023-07-04 08:03:36 +03:00
<li><a href="/cgspace-notes/2023-07/">July, 2023</a></li>
<li><a href="/cgspace-notes/2023-06/">June, 2023</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>