Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@@ -30,7 +30,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -124,7 +124,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
<li>I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:</li>
</ul>
<pre tabindex="0"><code>02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5240)
@@ -139,7 +139,7 @@ I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name &#39;conversionService&#39; defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39; of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property &#39;converters&#39; with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name &#39;com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2&#39;: Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations:
</code></pre><ul>
<li>Full log here: <a href="https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2">https://gist.github.com/alanorth/1e4ae567b853fea9d9dbf1a030ecd8c2</a></li>
<li>XMLUI fails to load, but the REST, SOLR, JSPUI, etc work</li>
@@ -191,13 +191,13 @@ requests:
method: GET
url: https://dspacetest.cgiar.org/rest/test
validate:
raw: &quot;REST api is running.&quot;
raw: &#34;REST api is running.&#34;
login:
url: https://dspacetest.cgiar.org/rest/login
method: POST
data:
json: {&quot;email&quot;:&quot;test@dspace&quot;,&quot;password&quot;:&quot;thepass&quot;}
json: {&#34;email&#34;:&#34;test@dspace&#34;,&#34;password&#34;:&#34;thepass&#34;}
status:
url: https://dspacetest.cgiar.org/rest/status
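<ul>
<li>The same three checks can be scripted directly against the REST API; this is a minimal sketch (not from the original notes) using Python and the requests library, with the placeholder credentials from the test configuration above and the assumption that the DSpace 5.x login endpoint returns a token to be sent back in the <code>rest-dspace-token</code> header:</li>
</ul>
<pre tabindex="0"><code>#!/usr/bin/env python3
# Sketch only: exercise the same DSpace Test REST endpoints as the config above.
import requests

base = 'https://dspacetest.cgiar.org/rest'

# /rest/test should answer with the plain string "REST api is running."
r = requests.get(base + '/test')
assert 'REST api is running.' in r.text

# /rest/login with the placeholder credentials; assumption: DSpace 5.x returns
# a token in the response body
r = requests.post(base + '/login', json={'email': 'test@dspace', 'password': 'thepass'})
r.raise_for_status()
token = r.text.strip()

# /rest/status reports whether that token is authenticated
r = requests.get(base + '/status', headers={'rest-dspace-token': token})
print(r.status_code, r.text)
</code></pre>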
@@ -229,15 +229,15 @@ $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
</code></pre><ul>
<li>Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:</li>
</ul>
<pre tabindex="0"><code>update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
<pre tabindex="0"><code>update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI Juornal&#39;;
UPDATE 1
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI journal&#39;;
UPDATE 23
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='YES';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;YES&#39;;
UPDATE 1
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value='NO';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;NO&#39;;
DELETE 17
update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI';
update metadatavalue set text_value=&#39;ISI Journal&#39; where resource_type_id=2 and metadata_field_id=226 and text_value=&#39;ISI&#39;;
UPDATE 15
</code></pre><ul>
<li>Start working on adding metadata for access and usage rights that we started earlier in 2018 (and also in 2017)</li>
@@ -246,7 +246,7 @@ UPDATE 15
<li>Linode said that CGSpace (linode18) had a high CPU load earlier today</li>
<li>When I looked, I see it&rsquo;s the same Russian IP that I noticed last month:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;10/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;10/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
1459 157.55.39.202
1579 95.108.181.88
1615 157.55.39.147
@@ -260,7 +260,7 @@ UPDATE 15
</code></pre><ul>
<li>And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):</li>
</ul>
<pre tabindex="0"><code># grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10
<pre tabindex="0"><code># grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51&#39; dspace.log.2018-09-10
14133
</code></pre><ul>
<li>The user agent is still the same:</li>
@@ -270,7 +270,7 @@ UPDATE 15
<li>I added <code>.*crawl.*</code> to the Tomcat Session Crawler Manager Valve, so I&rsquo;m not sure why the bot is creating so many sessions&hellip;</li>
<li>I just tested that user agent on CGSpace and it <em>does not</em> create a new session:</li>
</ul>
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
<pre tabindex="0"><code>$ http --print Hh https://cgspace.cgiar.org &#39;User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)&#39;
GET / HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
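<ul>
<li>The same check can be scripted by looking for a new <code>JSESSIONID</code> cookie in the response headers; a minimal sketch (not from the original notes) with Python and requests:</li>
</ul>
<pre tabindex="0"><code># Sketch only: request the CGSpace homepage with the crawler user agent above
# and report whether Tomcat issues a new JSESSIONID (i.e. creates a session).
import requests

ua = 'Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
r = requests.get('https://cgspace.cgiar.org/', headers={'User-Agent': ua})

set_cookie = r.headers.get('Set-Cookie', '')
print('Set-Cookie:', set_cookie or '(none)')
print('new JSESSIONID issued' if 'JSESSIONID' in set_cookie else 'no new JSESSIONID')
</code></pre>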
@@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
<li>Linode says that CGSpace (linode18) has had high CPU for the past two hours</li>
<li>The top IP addresses today are:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &quot;13/Sep/2018&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E &#34;13/Sep/2018&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
32 46.229.161.131
38 104.198.9.108
39 66.249.64.91
@@ -333,9 +333,9 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
</code></pre><ul>
<li>And the top two addresses seem to be re-using their Tomcat sessions properly:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92&#39; dspace.log.2018-09-13 | sort | uniq
7
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77&#39; dspace.log.2018-09-13 | sort | uniq
2
</code></pre><ul>
<li>So I&rsquo;m not sure what&rsquo;s going on</li>
@@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>There are some example queries on the <a href="https://wiki.lyrasis.org/display/DSPACE/Solr">DSpace Solr wiki</a></li>
<li>For example, this query returns 1655 rows for item <a href="https://cgspace.cgiar.org/handle/10568/10630">10568/10630</a>:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&#39;
</code></pre><ul>
<li>The id in the Solr query is the item&rsquo;s database id (get it from the REST API or something)</li>
<li>Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire&rsquo;s statlet shows, though the query logic here is confusing:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>According to the <a href="https://wiki.apache.org/solr/SolrQuerySyntax">SolrQuerySyntax</a> page on the Apache wiki, the <code>[* TO *]</code> syntax just selects a range (in this case all values for a field)</li>
<li>So it seems to be:
@@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
</li>
<li>What the shit, I think I&rsquo;m right: the simplified logic in <em>this</em> query returns the same 889:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=-(statistics_type:[*+TO+*]+-statistics_type:view)&#39;
</code></pre><ul>
<li>And if I simplify the <code>statistics_type</code> logic the same way, it still returns the same 889!</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>As for item views, I suppose that&rsquo;s just the same query, minus the <code>bundleName:ORIGINAL</code>:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view'
<pre tabindex="0"><code>$ http &#39;http://localhost:3000/solr/statistics/select?indent=on&amp;rows=0&amp;q=type:0+owningItem:11576&amp;fq=isBot:false&amp;fq=-bundleName:ORIGINAL&amp;fq=statistics_type:view&#39;
</code></pre><ul>
<li>That one returns 766, which is exactly 1655 minus 889&hellip;</li>
<li>Also, Solr&rsquo;s <code>fq</code> is similar to the regular <code>q</code> query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries</li>
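<ul>
<li>Putting the simplified logic together, a minimal sketch (not from the original notes) that computes views and downloads for an item with Python and requests, against the same local Solr port forward used above:</li>
</ul>
<pre tabindex="0"><code># Sketch only: count views and downloads for one item using the simplified
# filter queries worked out above. item_id is the internal database id.
import requests

def count(item_id, bundle_filter):
    params = {
        'q': 'type:0 owningItem:{}'.format(item_id),
        'fq': ['isBot:false', 'statistics_type:view', bundle_filter],
        'rows': 0,
        'wt': 'json',
    }
    r = requests.get('http://localhost:3000/solr/statistics/select', params=params)
    return r.json()['response']['numFound']

item_id = 11576
print('downloads:', count(item_id, 'bundleName:ORIGINAL'))   # records in the ORIGINAL bundle (the downloads number above)
print('views:', count(item_id, '-bundleName:ORIGINAL'))      # records outside the ORIGINAL bundle (the views number above)
</code></pre>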
@@ -432,11 +432,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
<li>It uses the Python-based <a href="https://falcon.readthedocs.io">Falcon</a> web framework and talks to Solr directly using the <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> library (which seems to have issues in Python 3.7 currently)</li>
<li>After deploying on DSpace Test I can then get the stats for an item using its ID:</li>
</ul>
<pre tabindex="0"><code>$ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
<pre tabindex="0"><code>$ http -b &#39;https://dspacetest.cgiar.org/rest/statistics/item?id=110988&#39;
{
&quot;downloads&quot;: 2,
&quot;id&quot;: 110988,
&quot;views&quot;: 15
&#34;downloads&#34;: 2,
&#34;id&#34;: 110988,
&#34;views&#34;: 15
}
</code></pre><ul>
<li>The numbers are different than those that come from Atmire&rsquo;s statlets for some reason, but as I&rsquo;m querying Solr directly, I have no idea where their numbers come from!</li>
@@ -533,7 +533,7 @@ sqlite&gt; INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
<pre tabindex="0"><code># python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
Type &#34;help&#34;, &#34;copyright&#34;, &#34;credits&#34; or &#34;license&#34; for more information.
&gt;&gt;&gt; import sqlite3
&gt;&gt;&gt; print(sqlite3.sqlite_version)
3.24.0
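# A minimal sketch (not from the original notes): the ON CONFLICT upsert shown
# above needs SQLite 3.24.0 or newer, so check the library version before
# relying on it from Python.
import sqlite3

assert sqlite3.sqlite_version_info &gt;= (3, 24, 0)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items(id INTEGER PRIMARY KEY, views INTEGER)')
# the first insert creates the row, the second updates it in place
conn.execute('INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE SET views=excluded.views')
conn.execute('INSERT INTO items(id, views) VALUES(0, 8) ON CONFLICT(id) DO UPDATE SET views=excluded.views')
print(conn.execute('SELECT id, views FROM items').fetchall())  # [(0, 8)]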
@@ -606,7 +606,7 @@ Indexing item downloads (page 260 of 260)
<li>I will have to keep an eye on that over the next few weeks to see if things stay as they are</li>
<li>I did a batch replacement of the access rights with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.identifier.status -t correct -m 206
</code></pre><ul>
<li>This changes &ldquo;Open Access&rdquo; to &ldquo;Unrestricted Access&rdquo; and &ldquo;Limited Access&rdquo; to &ldquo;Restricted Access&rdquo;</li>
<li>After that I did a full Discovery reindex:</li>
@@ -629,7 +629,7 @@ sys 2m18.485s
<li>Linode emailed to say that CGSpace&rsquo;s (linode19) CPU load was high for a few hours last night</li>
<li>Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;26/Sep/2018:(19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;26/Sep/2018:(19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
295 34.218.226.147
296 66.249.64.95
350 157.55.39.185
@@ -645,9 +645,9 @@ sys 2m18.485s
<li><code>68.6.87.12</code> is on Cox Communications in the US (?)</li>
<li>These hosts are not using proper user agents and are not re-using their Tomcat sessions:</li>
</ul>
<pre tabindex="0"><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
<pre tabindex="0"><code>$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180&#39; dspace.log.2018-09-26 | sort | uniq
5423
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
$ grep -c -E &#39;session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12&#39; dspace.log.2018-09-26 | sort | uniq
758
</code></pre><ul>
<li>I will add their IPs to the list of bad bots in nginx so we can add a &ldquo;bot&rdquo; user agent to them and let Tomcat&rsquo;s Crawler Session Manager Valve handle them</li>
@@ -659,8 +659,8 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
<li>Peter sent me a list of 43 author names to fix, but it had some encoding errors like <code>Belalcázar, John</code> like usual (I will tell him to stop trying to export as UTF-8 because it never seems to work)</li>
<li>I did batch replaces for both on CGSpace with my <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a> script:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.contributor.affiliation -t correct -m 211
$ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3
</code></pre><ul>
<li>Afterwards I started a full Discovery re-index:</li>
</ul>
@@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
<li>Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc</li>
<li>I think I should just batch export and update all languages&hellip;</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
</code></pre><ul>
<li>Then I can simply delete the &ldquo;Other&rdquo; and &ldquo;other&rdquo; ones because that&rsquo;s not useful at all:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Other&#39;;
DELETE 6
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;other&#39;;
DELETE 79
</code></pre><ul>
<li>Looking through the list I see some weird language codes like <code>gh</code>, so I checked out those items:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
resource_id
-------------
94530
@@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
</code></pre><ul>
<li>Those items are from Ghana, so the submitter apparently thought <code>gh</code> was a language&hellip; I can safely delete them:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
<pre tabindex="0"><code>dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
DELETE 2
</code></pre><ul>
<li>The next issue would be <code>jn</code>:</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
<pre tabindex="0"><code>dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
resource_id
-------------
94001
@@ -718,12 +718,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
<li>Those items are about Japan, so I will update them to be <code>ja</code></li>
<li>Other replacements:</li>
</ul>
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jp';
<pre tabindex="0"><code>DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;gh&#39;;
UPDATE metadatavalue SET text_value=&#39;fr&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;fn&#39;;
UPDATE metadatavalue SET text_value=&#39;hi&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;in&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;Ja&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jn&#39;;
UPDATE metadatavalue SET text_value=&#39;ja&#39; WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;language&#39; and qualifier = &#39;iso&#39;) AND text_value=&#39;jp&#39;;
</code></pre><ul>
<li>Then there are 12 items with <code>en|hi</code>, but they were all in one collection so I just exported it as a CSV and then re-imported the corrected metadata (a sketch of that CSV fix follows this list)</li>
</ul>
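<ul>
<li>A minimal sketch (not from the original notes) of that CSV fix, with a hypothetical file name and a placeholder for the corrected value, since the exact export column and final value are not recorded here:</li>
</ul>
<pre tabindex="0"><code># Sketch only: rewrite the en|hi language values in an exported collection CSV
# before re-importing it with the DSpace metadata import tool.
import csv

with open('collection-export.csv', newline='') as f:   # hypothetical file name
    reader = csv.DictReader(f)
    rows = list(reader)
    fields = reader.fieldnames

for row in rows:
    for field in fields:
        # the exported column name varies (e.g. dc.language.iso[en_US]), so match the prefix
        if field.startswith('dc.language.iso') and row[field] == 'en|hi':
            row[field] = 'en||hi'  # placeholder: DSpace CSVs separate multiple values with ||

with open('collection-export-fixed.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
</code></pre>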