Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions


@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
"/>
-<meta name="generator" content="Hugo 0.92.2" />
+<meta name="generator" content="Hugo 0.93.1" />
@@ -160,7 +160,7 @@ java.lang.NullPointerException
<ul>
<li>Horrible one-liner to get the Linode ID from certain Ansible host vars:</li>
</ul>
-<pre tabindex="0"><code>$ grep -A 3 contact_info * | grep -E &quot;(Orth|Sisay|Peter|Daniel|Tsega)&quot; | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
+<pre tabindex="0"><code>$ grep -A 3 contact_info * | grep -E &#34;(Orth|Sisay|Peter|Daniel|Tsega)&#34; | awk -F&#39;-&#39; &#39;{print $1}&#39; | grep linode | uniq | xargs grep linode_id
</code></pre><ul>
<li>I noticed some weird CRPs in the database, and they don&rsquo;t show up in Discovery for some reason, perhaps the <code>:</code></li>
<li>I&rsquo;ll export these and fix them in batch:</li>
@@ -170,7 +170,7 @@ COPY 22
</code></pre><ul>
<li>Test running the replacements:</li>
</ul>
-<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
+<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p &#39;fuuu&#39;
</code></pre><ul>
<li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li>
</ul>
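The fix-metadata-values.py replacement step above reads a CSV that maps incorrect values to corrections and applies them to the database. A minimal sketch of that logic (the column handling and parameterized-SQL approach are assumptions for illustration, not the script's actual internals):

```python
import csv
import io

def build_updates(csv_text, field, correct_col, metadata_field_id):
    """Build parameterized UPDATE statements from a CSV mapping the
    old value (in the `field` column) to the corrected value (in
    `correct_col`). Rows with no actual correction are skipped."""
    sql = ("UPDATE metadatavalue SET text_value = %s "
           "WHERE resource_type_id = 2 AND metadata_field_id = %s "
           "AND text_value = %s")
    updates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        old, new = row[field], row[correct_col]
        if new and new != old:
            updates.append((sql, (new, metadata_field_id, old)))
    return updates

# Example with made-up CRP values (the real CSV had 22 rows):
csv_text = "cg.contributor.crp,correct\nWLE:,WLE\nAAS,AAS\n"
for sql, params in build_updates(csv_text, "cg.contributor.crp", "correct", 230):
    print(params)
```

Only the row whose corrected value differs produces an UPDATE; identical rows are left alone, which is why a dry run like the `-t correct` test above is safe to repeat.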
@@ -200,11 +200,11 @@ COPY 22
<li>Helping Megan Zandstra and CIAT with some questions about the REST API</li>
<li>Playing with <code>find-by-metadata-field</code>, this works:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}'
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39;
</code></pre><ul>
<li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li>
</ul>
-<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
+<pre tabindex="0"><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
text_value | text_lang
------------+-----------
SEEDS |
@@ -215,23 +215,23 @@ COPY 22
<li>So basically, the text language here could be null, blank, or en_US</li>
<li>To query metadata with these properties, you can do:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
55
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
34
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>The results (55+34=89) don&rsquo;t seem to match those from the database:</li>
</ul>
-<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
+<pre tabindex="0"><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang is null;
count
-------
15
-dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
+dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;&#39;;
count
-------
4
-dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
+dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39; and text_lang=&#39;en_US&#39;;
count
-------
66
@@ -267,27 +267,27 @@ COPY 14
</code></pre><ul>
<li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li>
</ul>
-<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
+<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=&#39;en_US&#39; where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;SEEDS&#39;;
UPDATE 85
</code></pre><ul>
<li>The <code>fix-metadata.py</code> script I have is meant for specific metadata values, so if I want to update some <code>text_lang</code> values I should just do it directly in the database</li>
<li>For example, on a limited set:</li>
</ul>
-<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
+<pre tabindex="0"><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value=&#39;LIVESTOCK&#39; and text_lang=&#39;&#39;;
UPDATE 420
</code></pre><ul>
<li>And assuming I want to do it for all fields:</li>
</ul>
-<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
+<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang=&#39;&#39;;
UPDATE 183726
</code></pre><ul>
<li>After that I restarted Tomcat and PostgreSQL (because I&rsquo;m superstitious about caches) and now I see the following in the REST API query:</li>
</ul>
<pre tabindex="0"><code>$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;}' | jq length
<pre tabindex="0"><code>$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;}&#39; | jq length
71
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;&#34;}&#39; | jq length
0
$ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: application/json&quot; -X POST &quot;http://localhost:8080/rest/items/find-by-metadata-field&quot; -d '{&quot;key&quot;: &quot;cg.subject.ilri&quot;,&quot;value&quot;: &quot;SEEDS&quot;, &quot;language&quot;:&quot;en_US&quot;}' | jq length
$ curl -s -H &#34;accept: application/json&#34; -H &#34;Content-Type: application/json&#34; -X POST &#34;http://localhost:8080/rest/items/find-by-metadata-field&#34; -d &#39;{&#34;key&#34;: &#34;cg.subject.ilri&#34;,&#34;value&#34;: &#34;SEEDS&#34;, &#34;language&#34;:&#34;en_US&#34;}&#39; | jq length
</code></pre><ul>
<li>Not sure what&rsquo;s going on, but Discovery shows 83 values and the database shows 85, so I&rsquo;m going to reindex Discovery just in case</li>
</ul>
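The mismatches above come down to exact matching on <code>text_lang</code>: NULL, empty string, and <code>en_US</code> are three distinct values, and neither the REST API nor a SQL equality query sees across them. A small model with made-up rows (the counts here are illustrative, not CGSpace's) shows the behaviour:

```python
# Each row is (text_value, text_lang); None models SQL NULL.
rows = [
    ("SEEDS", None), ("SEEDS", None),
    ("SEEDS", ""),
    ("SEEDS", "en_US"), ("SEEDS", "en_US"), ("SEEDS", "en_US"),
]

def count(value, lang):
    """Exact match on both value and language, like find-by-metadata-field."""
    return sum(1 for v, l in rows if v == value and l == lang)

print(count("SEEDS", None))    # matches only NULL-language rows
print(count("SEEDS", ""))      # matches only empty-string rows
print(count("SEEDS", "en_US")) # matches only en_US rows
# No single query sees all six rows unless text_lang is normalized first,
# which is what the UPDATE ... set text_lang=NULL statements above do.
```

This is why normalizing the empty-string languages to NULL changed which query returned results: the rows did not move, the partitioning did.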
@@ -298,7 +298,7 @@ $ curl -s -H &quot;accept: application/json&quot; -H &quot;Content-Type: applica
<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li>
<li>After adding that to <code>server.xml</code>, bots matching the pattern in the configuration will all use ONE session, just like normal users:</li>
</ul>
-<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -312,7 +312,7 @@ Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
+$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -336,7 +336,7 @@ X-Cocoon-Version: 2.2.0
<ul>
<li>Seems the default regex doesn&rsquo;t catch Baidu, though:</li>
</ul>
-<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+<pre tabindex="0"><code>$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -349,7 +349,7 @@ Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
-$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
+$ http --print h https://dspacetest.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#39;
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
@@ -365,17 +365,17 @@ X-Cocoon-Version: 2.2.0
<li>Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:</li>
</ul>
<pre tabindex="0"><code>&lt;!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers --&gt;
-&lt;Valve className=&quot;org.apache.catalina.valves.CrawlerSessionManagerValve&quot;
-crawlerUserAgents=&quot;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&quot; /&gt;
+&lt;Valve className=&#34;org.apache.catalina.valves.CrawlerSessionManagerValve&#34;
+crawlerUserAgents=&#34;.*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*&#34; /&gt;
</code></pre><ul>
<li>Looking at the bots that were active yesterday it seems the above regex should be sufficient:</li>
</ul>
-<pre tabindex="0"><code>$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\&quot;' /var/log/nginx/access.log | sort | uniq
-Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&quot; &quot;-&quot;
-Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&quot; &quot;-&quot;
-Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&quot; &quot;-&quot;
-Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&quot; &quot;-&quot;
-Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&quot; &quot;-&quot;
+<pre tabindex="0"><code>$ grep -o -E &#39;Mozilla/5\.0 \(compatible;.*\&#34;&#39; /var/log/nginx/access.log | sort | uniq
+Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)&#34; &#34;-&#34;
+Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)&#34; &#34;-&#34;
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&#34; &#34;-&#34;
+Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
+Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)&#34; &#34;-&#34;
</code></pre><ul>
<li>Neat Maven trick to exclude some modules from being built:</li>
</ul>
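The valve's crawlerUserAgents pattern can be sanity-checked against the user agents seen in the nginx logs. A quick check (assuming the valve does full-string matching, as java.util.regex's <code>matches()</code> does); note the Yandex agents only match because their info URLs contain the substring "bots":

```python
import re

# The crawlerUserAgents pattern from the valve configuration above
pattern = re.compile(
    r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*"
)

# User agents observed in the nginx access log
agents = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]

for ua in agents:
    # fullmatch requires the whole string to match, hence the .* anchors
    print(bool(pattern.fullmatch(ua)))
```

All five print True, while an ordinary browser user agent does not match, so real users keep getting individual sessions.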
@@ -393,9 +393,9 @@ COPY 2515
<li>Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test</li>
<li>Test updating old, non-HTTPS links to the CCAFS website in CGSpace metadata:</li>
</ul>
-<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
+<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 164
-dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
+dspace=# update metadatavalue set text_value = regexp_replace(text_value, &#39;http://ccafs.cgiar.org&#39;,&#39;https://ccafs.cgiar.org&#39;) where resource_type_id=2 and text_value like &#39;%http://ccafs.cgiar.org%&#39;;
UPDATE 7
</code></pre><ul>
<li>Had to run it twice to get them all; PostgreSQL&rsquo;s <code>regexp_replace</code> replaces only the first match in each value unless the <code>g</code> flag is passed</li>
@@ -404,11 +404,11 @@ UPDATE 7
<li>I&rsquo;m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn&rsquo;t as good</li>
<li>The results were very good; I think that after we upgrade to 5.5 I will do it, perhaps one community or collection at a time:</li>
</ul>
-<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &quot;ImageMagick PDF Thumbnail&quot;
+<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p &#34;ImageMagick PDF Thumbnail&#34;
</code></pre><ul>
<li>In related news, I&rsquo;m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace&rsquo;s media filter has made thumbnails of THEM):</li>
</ul>
-<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
+<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where text_value like &#39;%.jpg.jpg&#39;;
</code></pre><ul>
<li>I&rsquo;m not sure if there&rsquo;s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore&hellip;</li>
</ul>
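A hypothetical helper for the <code>%.jpg.jpg</code> check above (not part of DSpace; just a Python sketch of flagging double-thumbnail filenames and recovering the original name):

```python
def is_double_thumbnail(name):
    """Mirror the SQL predicate text_value LIKE '%.jpg.jpg':
    a thumbnail that was generated from an existing thumbnail."""
    return name.endswith(".jpg.jpg")

def original_name(name):
    """Strip the redundant suffix: 'photo.jpg.jpg' -> 'photo.jpg'."""
    return name[:-len(".jpg")] if is_double_thumbnail(name) else name

names = ["report.pdf.jpg", "photo.jpg.jpg"]
print([original_name(n) for n in names if is_double_thumbnail(n)])
```

The hard part, as noted above, is not identifying them but swapping the bundle contents and cleaning the assetstore afterwards.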
@@ -464,7 +464,7 @@ UPDATE 7
<li>One user says he is still getting a blank page when he logs in (just the CGSpace header, but no community list)</li>
<li>Looking at the Catalina logs I see there is some super long-running indexing process going on:</li>
</ul>
-<pre tabindex="0"><code>INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
+<pre tabindex="0"><code>INFO: FrameworkServlet &#39;oai&#39;: initialization completed in 2600 ms
[&gt; ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
[&gt; ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
[&gt; ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
@@ -497,7 +497,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
2016-11-29 07:56:36,545 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item stats
2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
-2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item's bitstream stats
+2016-11-29 07:56:36,583 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Processing item&#39;s bitstream stats
2016-11-29 07:56:36,608 INFO com.atmire.utils.UpdateSolrStatsMetadata @ Solr metadata up-to-date
2016-11-29 07:56:36,701 INFO org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ facets for scope, null: 23
2016-11-29 07:56:36,747 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets