I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:

$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace

During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:

There is insufficient memory for the Java Runtime Environment to continue.
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-07/" />
<meta name="generator" content="Hugo 0.55.5" />
<h2 id="2018-07-01">2018-07-01</h2>

<ul>
<li><p>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</p>

<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
</code></pre></li>

<li><p>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with Java running out of memory:</p>

<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
</code></pre></li>

<li><p>As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process (see the sketch after this list):</p>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
</code></pre></li>

<li><p>Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:</p>

<pre><code>$ sudo su - postgres
$ psql dspace
dspace=# commit
dspace=# \q
$ exit
$ dspace database migrate ignored
</code></pre></li>

<li><p>After that I started Tomcat 7 and DSpace seems to be working; now I need to tell our colleagues to try stuff and report any issues they find</p></li>
</ul>
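
<p>For reference, the heap change above happens in the Tomcat defaults file — a minimal sketch assuming a Debian-style <code>/etc/default/tomcat7</code>, which may not be exactly how this server is configured:</p>

<pre><code># /etc/default/tomcat7 (hypothetical location and flags)
# Reduce the heap ceiling from 5120m to 4096m to leave room for the Maven build
JAVA_OPTS="-Djava.awt.headless=true -Xms4096m -Xmx4096m -Dfile.encoding=UTF-8"
</code></pre>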

<h2 id="2018-07-02">2018-07-02</h2>

<h2 id="2018-07-03">2018-07-03</h2>

<ul>
<li><p>Finally finish with the CIFOR Archive records (a total of 2448):</p>

<ul>
<li>I mapped the 50 items that were duplicates from elsewhere in CGSpace into <a href="https://cgspace.cgiar.org/handle/10568/16702">CIFOR Archive</a></li>
<li>I did one last check of the remaining 2398 items and found eight that have a <code>cg.identifier.doi</code> that links to some URL other than a DOI, so I moved those to <code>cg.identifier.url</code> and <code>cg.identifier.googleurl</code> as appropriate</li>
<li>Also, thirteen items had a DOI in their citation, but did not have a <code>cg.identifier.doi</code> field, so I added those (see the sketch after this list)</li>
</ul></li>

<li><p>Then I imported those 2398 items in two batches (to deal with memory issues):</p>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
</code></pre></li>

<li><p>I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:</p>

<pre><code>dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
 count
-------
   785
dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
 count
-------
     4
</code></pre></li>

<li><p>I think I should fix those, as well as some other garbage values like “test” and “dspace.ilri.org” etc:</p>

<pre><code>dspace=# begin;
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=222 and metadata_value_id in (2299312, 10684, 10700, 996403);
DELETE 4
dspace=# commit;
</code></pre></li>

<li><p>Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88), and for some reason I get autowire errors on Catalina startup with 8.5.32:</p>

<pre><code>03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5256)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
</code></pre></li>

<li><p>Gotta check that out later…</p></li>
</ul>
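
<p>For the record, a query along these lines could find the items whose citation contains a DOI but that lack a <code>cg.identifier.doi</code> — a sketch, not the exact query I ran, and it assumes the field lookups in <code>metadatafieldregistry</code> are unambiguous:</p>

<pre><code>dspace=# select resource_id, text_value from metadatavalue
  where resource_type_id=2
    and metadata_field_id = (select metadata_field_id from metadatafieldregistry
      where element = 'identifier' and qualifier = 'citation')  -- assumed unique
    and text_value ~ '10\.[0-9]{4,}/'                           -- looks like a DOI
    and resource_id not in (
      select resource_id from metadatavalue
      where resource_type_id=2
        and metadata_field_id = (select metadata_field_id from metadatafieldregistry
          where element = 'identifier' and qualifier = 'doi')); -- assumed unique
</code></pre>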

<h2 id="2018-07-04">2018-07-04</h2>

<ul>
<li>I was tempted to do the Linode instance upgrade on CGSpace (linode18), but after looking closely at the system backups I noticed that Solr isn’t being backed up to S3</li>
<li>I apparently noticed this—and fixed it!—in <a href="/cgspace-notes/2016-07/">2016-07</a>, but it doesn’t look like the backup has been updated since then!</li>
<li>It looks like I added Solr to the <code>backup_to_s3.sh</code> script, but that script is not even being used (<code>s3cmd</code> is run directly from root’s crontab)</li>
<li><p>For now I have just initiated a manual S3 backup of the Solr data:</p>

<pre><code># s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
</code></pre></li>

<li><p>But I need to add this to cron (see the sketch after this list)!</p></li>

<li><p>I wonder if I should convert some of the cron jobs to systemd services / timers…</p></li>

<li><p>I sent a note to all our users on Yammer to ask them about possible maintenance on Sunday, July 14th</p></li>

<li><p>Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (<a href="https://github.com/ilri/DSpace/issues/384">#384</a>)</p></li>

<li><p>I regenerated the list of names for all our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</p>

<pre><code>$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
</code></pre></li>

<li><p>But after comparing to the existing list of names I didn’t see much change, so I just ignored it</p></li>
</ul>
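
<p>A sketch of the crontab entry I have in mind for the Solr S3 sync — the schedule is illustrative and this is not deployed yet:</p>

<pre><code># hypothetical entry for root's crontab, e.g. daily at 04:00
0 4 * * *   /usr/bin/s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/ > /dev/null
</code></pre>

<p>A systemd timer with a oneshot service would do the same job with better logging in the journal.</p>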

<h2 id="2018-07-09">2018-07-09</h2>

<ul>
<li>Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg</li>
<li><p>Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s <code>catalina.out</code>:</p>

<pre><code>Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
</code></pre></li>

<li><p>I’m not sure if it’s the same error, but I see this in DSpace’s <code>solr.log</code>:</p>

<pre><code>2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
</code></pre></li>

<li><p>I see a strange error around that time in <code>dspace.log.2018-07-08</code>:</p>

<pre><code>2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
</code></pre></li>

<li><p>But not sure what caused that…</p></li>

<li><p>I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8 PM GMT</p></li>

<li><p>Looking in the nginx logs I see the top ten IP addresses active today:</p>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1691 40.77.167.84
   1701 40.77.167.69
   1718 50.116.102.77
   1872 137.108.70.6
   2172 157.55.39.234
   2190 207.46.13.47
   2848 178.154.200.38
   4367 35.227.26.162
   4387 70.32.83.92
   4738 95.108.181.88
</code></pre></li>

<li><p>Of those, <em>all</em> except <code>70.32.83.92</code> and <code>50.116.102.77</code> are <em>NOT</em> re-using their Tomcat sessions, for example from the XMLUI logs:</p>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
4435
</code></pre></li>

<li><p><code>95.108.181.88</code> appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve (see the sketch after this list)</p></li>

<li><p><code>70.32.83.92</code> is on MediaTemple but I’m not sure who it is. They are mostly hitting REST so I guess that’s fine</p></li>

<li><p><code>35.227.26.162</code> doesn’t declare a user agent and is on Google Cloud, so I should probably mark them as a bot in nginx</p></li>

<li><p><code>178.154.200.38</code> is Yandex again</p></li>

<li><p><code>207.46.13.47</code> is Bing</p></li>

<li><p><code>157.55.39.234</code> is Bing</p></li>

<li><p><code>137.108.70.6</code> is our old friend CORE bot</p></li>

<li><p><code>50.116.102.77</code> doesn’t declare a user agent and lives on HostGator, but mostly just hits the REST API so I guess that’s fine</p></li>

<li><p><code>40.77.167.84</code> is Bing again</p></li>

<li><p>Interestingly, the first time that I saw <code>35.227.26.162</code> was on 2018-06-08</p></li>

<li><p>I’ve added <code>35.227.26.162</code> to the bot tagging logic in the nginx vhost</p></li>
</ul>
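
<p>For reference, a sketch of the Crawler Session Manager Valve in Tomcat’s <code>server.xml</code> — the <code>crawlerUserAgents</code> regex here is illustrative, not our exact production value:</p>

<pre><code><!-- Force all matching user agents to share one session per crawler -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yandex.*|.*spider.*" />
</code></pre>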

<h2 id="2018-07-10">2018-07-10</h2>

<ul>
<li>All were tested and merged to the <code>5_x-prod</code> branch and will be deployed on CGSpace this coming weekend when I do the Linode server upgrade</li>
<li>I need to get them onto the 5.8 testing branch too, either via cherry-picking or by rebasing after we finish testing Atmire’s 5.8 pull request (<a href="https://github.com/ilri/DSpace/pull/378">#378</a>)</li>
<li>Linode sent an alert about CPU usage on CGSpace again, at about 13:00 UTC</li>
<li><p>These are the top ten users in the last two hours:</p>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     81 193.95.22.113
     82 50.116.102.77
    112 40.77.167.90
    117 196.190.95.98
    120 178.154.200.38
    215 40.77.167.96
    243 41.204.190.40
    415 95.108.181.88
    695 35.227.26.162
    697 213.139.52.250
</code></pre></li>

<li><p>Looks like <code>213.139.52.250</code> is Moayad testing his new CGSpace visualization thing:</p>

<pre><code>213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
</code></pre></li>

<li><p>He said there was a bug that caused his app to request a bunch of invalid URLs</p></li>

<li><p>I’ll have to keep an eye on this and see how their platform evolves</p></li>
</ul>

<h2 id="2018-07-11">2018-07-11</h2>

<ul>
<li>Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM</li>
<li><p>Here are the top ten IPs from last night and this morning:</p>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     48 66.249.64.91
     50 35.227.26.162
     57 157.55.39.234
     59 157.55.39.71
     62 147.99.27.190
     82 95.108.181.88
     92 40.77.167.90
     97 183.128.40.185
     97 240e:f0:44:fa53:745a:8afe:d221:1232
   3634 208.110.72.10
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "12/Jul/2018:00" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     25 216.244.66.198
     38 40.77.167.185
     46 66.249.64.93
     56 157.55.39.71
     60 35.227.26.162
     65 157.55.39.234
     83 95.108.181.88
     87 66.249.64.91
     96 40.77.167.90
   7075 208.110.72.10
</code></pre></li>

<li><p>We have never seen <code>208.110.72.10</code> before… so that’s interesting!</p></li>

<li><p>The user agent for these requests is: <code>Pcore-HTTP/v0.44.0</code></p></li>

<li><p>A brief Google search doesn’t turn up any information about what this bot is, but lots of users complaining about it</p></li>

<li><p>This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:</p>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
  17098 208.110.72.10
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
1161
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-12
1885
</code></pre></li>

<li><p>I think the problem is that, despite the bot requesting <code>robots.txt</code>, it almost exclusively requests dynamic pages from <code>/discover</code>:</p>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
  13364 GET /discover
    993 GET /search-filter
    804 GET /browse
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep robots
208.110.72.10 - - [12/Jul/2018:00:22:28 +0000] "GET /robots.txt HTTP/1.1" 200 1301 "https://cgspace.cgiar.org/robots.txt" "Pcore-HTTP/v0.44.0"
</code></pre></li>

<li><p>So this bot is just like Baiduspider, and I need to add it to the nginx rate limiting (see the sketch after this list)</p></li>

<li><p>I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case</p></li>

<li><p>Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):</p>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
COPY 4518
dspace=# \q
$ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
</code></pre></li>

<li><p>We also need to discuss standardizing our countries and comparing our ORCID iDs</p></li>
</ul>
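
<p>A sketch of the kind of nginx rate limiting I mean — the zone name, rate, and matched agents are illustrative, not our actual vhost configuration:</p>

<pre><code># map the bad bots to a non-empty key so they land in the limited zone;
# requests with an empty key are not rate limited, so normal users are unaffected
map $http_user_agent $bot_limit_key {
    default         '';
    ~Pcore-HTTP     $binary_remote_addr;
    ~Baiduspider    $binary_remote_addr;
}

limit_req_zone $bot_limit_key zone=badbots:10m rate=1r/s;

server {
    location / {
        limit_req zone=badbots burst=5;
    }
}
</code></pre>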

<h2 id="2018-07-13">2018-07-13</h2>

<ul>
<li><p>Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard (see the sketch after this list):</p>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
COPY 4518
</code></pre></li>
</ul>
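
<p>Once they send corrections back, the fix would presumably use my <code>fix-metadata-values.py</code> script as in past cleanups — a sketch with a hypothetical input file and password:</p>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-07-13-affiliations-corrected.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
</code></pre>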

<h2 id="2018-07-15">2018-07-15</h2>

<ul>
<li>Peter had asked a question about how mapped items are displayed in the Altmetric dashboard</li>
<li>For example, <a href="https://cgspace.cgiar.org/handle/10568/82810">10568/82810</a> is mapped to four collections, but only shows up in one “department” in their dashboard</li>
<li>Altmetric help said that <a href="https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/82810">according to OAI that item is only in one department</a></li>
<li><p>I noticed that indeed there was only one collection listed, so I forced an OAI re-import on CGSpace:</p>

<pre><code>$ dspace oai import -c
OAI 2.0 manager action started
Full import
Total: 73925 items
Purging cached OAI responses.
OAI 2.0 manager action ended. It took 697 seconds.
</code></pre></li>

<li><p>Now I see four collections in OAI for that item!</p></li>

<li><p>I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change</p></li>

<li><p>ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!</p>

<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1020
$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1158
</code></pre></li>

<li><p>I combined the two lists and regenerated the names for all of our ORCID iDs using my <a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b">resolve-orcids.py</a> script:</p>

<pre><code>$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
$ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
</code></pre></li>

<li><p>Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via <code>% !sort</code> and then checked the formatting with tidy:</p>

<pre><code>$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre></li>

<li><p>I will check with the CGSpace team to see if they want me to add these to CGSpace</p></li>

<li><p>Help Udana from WLE understand some Altmetrics concepts</p></li>
</ul>

<h2 id="2018-07-18">2018-07-18</h2>

<ul>
<li>I suggested that we should have a wider meeting about this, and that I would post that on Yammer</li>
<li>I was curious about how and when Altmetric harvests the OAI, so I looked in nginx’s OAI log</li>
<li>For every day in the past week I only see about 50 to 100 requests per day, but then about nine days ago I see 1,500 requests</li>
<li><p>In there I see two bots making about 750 requests each, and this one is probably Altmetric:</p>

<pre><code>178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
...
178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 200 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
</code></pre></li>

<li><p>So if they are getting 100 records per OAI request it would take them 739 requests</p></li>

<li><p>I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve… does OAI use Tomcat sessions?</p></li>

<li><p>Appears not:</p>

<pre><code>$ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
</code></pre></li>
</ul>

<h2 id="2018-07-19">2018-07-19</h2>

<ul>
<li>I told the IWMI people that they can use <code>sort_by=3</code> in their OpenSearch query to sort the results by <code>dc.date.accessioned</code> instead of <code>dc.date.issued</code> (see the example after this list)</li>
<li>They say that it is a burden for them to capture the issue dates, so I cautioned them that this is for their own benefit and for future posterity, and that everyone else on CGSpace manages to capture the issue dates!</li>
<li><p>For future reference, as I had previously noted in <a href="/cgspace-notes/2018-04/">2018-04</a>, sort options are configured in <code>dspace.cfg</code>, for example:</p>

<pre><code>webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
</code></pre></li>

<li><p>Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)</p></li>

<li><p>I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace’s database and a re-generated Discovery index, and it worked fine</p></li>

<li><p>I finally informed Atmire that we’re ready to proceed with deploying this to CGSpace and that they should advise whether we should wait because of the SNAPSHOT versions in <code>pom.xml</code></p></li>

<li><p>There is no word on the issue I reported with Tomcat 8.5.32 yet, though…</p></li>
</ul>
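
<p>An illustrative OpenSearch request with that parameter — the query term is made up and it assumes the default <code>open-search/discover</code> endpoint, but <code>sort_by=3</code> maps to the <code>dateaccessioned</code> sort option shown above:</p>

<pre><code>$ http 'https://cgspace.cgiar.org/open-search/discover?query=water&sort_by=3&order=desc'
</code></pre>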

<h2 id="2018-07-23">2018-07-23</h2>

<ul>
<li>Still discussing dates with IWMI</li>
<li><p>I looked in the database to see the breakdown of date formats used in <code>dc.date.issued</code>, i.e. YYYY, YYYY-MM, or YYYY-MM-DD:</p>

<pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
 count
-------
 53292
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
 count
-------
  3818
(1 row)
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';
 count
-------
 17357
</code></pre></li>

<li><p>So it looks like YYYY is the most numerous, followed by YYYY-MM-DD, then YYYY-MM</p></li>
</ul>

<h2 id="2018-07-26">2018-07-26</h2>