I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
Looking at the other half of Udana’s WLE records from 2018-11
I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
68.15% <20> 9.45 instead of 68.15% ± 9.45
2003<EFBFBD>2013 instead of 2003–2013
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
<li>I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good</li>
<li>I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…</li>
<li>Looking at the other half of Udana’s WLE records from 2018-11
<ul>
<li>I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)</li>
<li>I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items</li>
<li>Most worryingly, there are encoding errors in the abstracts for eleven items, for example:</li>
<li>68.15% <20> 9.45 instead of 68.15% ± 9.45</li>
<li>2003<EFBFBD>2013 instead of 2003–2013</li>
</ul></li>
<li>I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs</li>
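<li><p>A minimal sketch of how abstracts could be screened for this kind of damage before import, assuming the abstracts are available as Python strings (for example read from the CSV). The function name <code>suspicious_chars</code> is hypothetical. It only flags the Unicode replacement character (U+FFFD) and stray control characters, so it would not catch every possible mis-encoding:</p>

```python
import unicodedata

# U+FFFD is what a decoder emits when it cannot interpret a byte sequence
REPLACEMENT_CHARACTER = "\ufffd"


def suspicious_chars(text):
    """Return (position, code point) pairs for characters that usually signal encoding damage."""
    found = []
    for position, char in enumerate(text):
        is_control = unicodedata.category(char) == "Cc" and char not in "\t\n\r"
        if char == REPLACEMENT_CHARACTER or is_control:
            found.append((position, f"U+{ord(char):04X}"))
    return found


# Legitimate non-ASCII such as "±" and "–" is not flagged
print(suspicious_chars("2003\ufffd2013"))
print(suspicious_chars("68.15% ± 9.45"))
```
</li>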
<li>As I was inspecting the archive I noticed that there were some problems with the bitstreams:
<ul>
<li>First, Sisay didn’t include the bitstream descriptions</li>
<li>Second, only five items had bitstreams and I remember in the discussion with IITA that there should have been nine!</li>
<li>I had to refer to the original CSV from January to find the file names, then download and add them to the export contents manually!</li>
</ul></li>
<li>After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:</li>
<li>DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something</li>
<li>After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the <code>dspace cleanup</code> script</li>
<li>Merge the IITA research theme changes from last month to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/413">#413</a>)
<ul>
<li>I will deploy to CGSpace soon and then think about how to batch tag all IITA’s existing items with this metadata</li>
<li>Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using <code>csvcut</code> and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:</li>
</ul>
<pre><code>$ csvcut -c name 2019-02-22-subjects.csv > dspace/config/controlled-vocabularies/dc-contributor-author.xml
</code></pre>
<li>Atmire noticed my message about the “solr_update_time_stamp” error on the dspace-tech mailing list and created an issue on their tracker to discuss it with me
<ul>
<li>They say the error is harmless, but has nevertheless been fixed in their newer module versions</li>
<li>There’s an issue with CGSpace right now where all items are giving a blank page in the XMLUI
<ul>
<li>Interestingly, if I check an item in the REST API it is also mostly blank: only the title and the ID!</li>
<li>I don’t see anything unusual in the Tomcat logs, though there are thousands of those <code>solr_update_time_stamp</code> errors:</li>
</ul></li>
</ul>
<pre><code># journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
1076
</code></pre>
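<p>To see whether these errors cluster on particular days, the total can be broken down by date. This is only a sketch: it assumes the default <code>journalctl</code> output where each line starts with a month and day (for example <code>Mar 08</code>), and the function name <code>count_by_day</code> is made up for illustration:</p>

```python
from collections import Counter

NEEDLE = "Multiple update components target the same field:solr_update_time_stamp"


def count_by_day(lines):
    """Count matching log lines per day, keyed on the leading month and day fields."""
    counts = Counter()
    for line in lines:
        if NEEDLE in line:
            month, day = line.split()[:2]
            # Pad single-digit days so "Mar 8" and "Mar 08" land in the same bucket
            counts[f"{month} {day.zfill(2)}"] += 1
    return counts


# Usage idea (not run here): pass the lines of `journalctl -u tomcat7` output,
# e.g. count_by_day(open("tomcat7-journal.txt")), where the file name is hypothetical.
```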
<ul>
<li>I restarted Tomcat and it’s OK now…</li>
<li>Skype meeting with Peter and Abenet and Sisay
<ul>
<li>We want to try to crowd source the correction of invalid AGROVOC terms starting with the ~313 invalid ones from our top 1500</li>
<li>We will share a Google Docs spreadsheet with the partners and ask them to mark the deletions and corrections</li>
<li>Abenet and Alan to spend some time identifying correct DCTERMS fields to move to, with preference over CG Core 2.0 as we want to be globally compliant (use information from SEO crosswalks)</li>
<li>I need to follow up on the privacy page that Sisay worked on</li>
<li>We want to try to migrate the 600 <a href="https://livestock.cgiar.org">Livestock CRP blog posts</a> to CGSpace, Peter will try to export the XML from WordPress so I can try to parse it with a script</li>
<li>I shared a post on Yammer asking our editors to try the AGROVOC controlled list</li>
<li>The SPDX legal committee had a meeting and discussed the addition of CC-BY-ND-3.0-IGO and other IGO licenses to their list, but it seems unlikely (<a href="https://github.com/spdx/license-list-XML/issues/767#issuecomment-470709673">spdx/license-list-XML/issues/767</a>)</li>
<li>The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms</li>
<li>I see 46 occurrences of these with this query:</li>
</ul>
<pre><code>dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
</code></pre>
<ul>
<li>I can replace these globally using the following SQL:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
UPDATE 43
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
UPDATE 44
</code></pre>
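<p>For checking the rewrite outside the database, here is a rough Python equivalent that handles both hostnames in one pass. It is a sketch rather than what was run on CGSpace: the function name <code>upgrade_feedburner_links</code> is hypothetical, and unlike the SQL patterns it escapes the dots so they only match a literal <code>.</code>:</p>

```python
import re

# Optional "feeds." prefix, with dots escaped so they match literally
FEEDBURNER_HTTP = re.compile(r"http://((?:feeds\.)?feedburner\.)")


def upgrade_feedburner_links(text):
    """Rewrite plain-HTTP Feedburner links to HTTPS, leaving everything else untouched."""
    return FEEDBURNER_HTTP.sub(r"https://\1", text)


print(upgrade_feedburner_links('<form action="http://feedburner.google.com/fb/a/mailverify">'))
```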
<ul>
<li>I ran the corrections on CGSpace and DSpace Test</li>
<li>Working on tagging IITA’s items with their new research theme (<code>cg.identifier.iitatheme</code>) based on their existing IITA subjects (see <a href="/cgspace-notes/2018-02/">notes from 2019-02</a>)</li>
<li>I exported the entire IITA community from CGSpace and then used <code>csvcut</code> to extract only the needed fields:</li>
<li><p>After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a <code>||</code>)</p></li>
<li><p>I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:</p></li>
</ul>
<pre><code>if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
</code></pre>
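<p>The logic of that GREL expression, written out in Python to make the two cases explicit. This is only an illustration: <code>add_theme</code> is a hypothetical helper, and the duplicate check is an extra safeguard that the GREL expression above does not have:</p>

```python
def add_theme(value, theme, separator="||"):
    """Set the theme on a blank cell, or append it to an existing multi-value cell."""
    # Mirrors GREL's isBlank(): null, empty, or whitespace-only
    if value is None or value.strip() == "":
        return theme
    # Extra safeguard (not in the GREL above): don't tag the same theme twice
    if theme in value.split(separator):
        return value
    return value + separator + theme


print(add_theme("", "PLANT PRODUCTION & HEALTH"))
print(add_theme("BIOTECH & PLANT BREEDING", "PLANT PRODUCTION & HEALTH"))
```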
<ul>
<li>Then it’s more annoying because there are four IITA subject columns…</li>
<li>In total this would add research themes to 1,755 items</li>
<li>I want to double check one last time with Bosede that they would like to do this, because I also see that this will tag a few hundred items from the 1970s and 1980s</li>
<li>This is a bit ugly, but it works (using the <a href="https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL helper function</a> to resolve ID to handle):</li>
</ul>
<pre><code>for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
</code></pre>
<ul>
<li>Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:</li>
<li>And looking at those items more closely, only one of them has an <em>issue date</em> after 2018-04, so I will only update that one (as the country’s name only changed in 2018-04)</li>
</ul>
<pre><code>java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
at org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.prepareStatement(PoolingDataSource.java:313)
at org.dspace.storage.rdbms.DatabaseManager.queryTable(DatabaseManager.java:220)
at org.dspace.authorize.AuthorizeManager.getPolicies(AuthorizeManager.java:612)
at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:154)
at org.dspace.content.crosswalk.METSRightsCrosswalk.disseminateElement(METSRightsCrosswalk.java:300)
</code></pre>
<ul>
<li>Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently</li>
<li>I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today <em>might</em> be related?</li>
</ul>
<pre><code>SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
java.util.EmptyStackException
at java.util.Stack.peek(Stack.java:102)
at java.util.Stack.pop(Stack.java:84)
at org.apache.cocoon.callstack.CallStack.leave(CallStack.java:54)
at org.apache.cocoon.servletservice.CallStackHelper.leaveServlet(CallStackHelper.java:85)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:484)
at org.apache.cocoon.servletservice.ServletServiceContext$PathDispatcher.forward(ServletServiceContext.java:443)
at org.apache.cocoon.servletservice.spring.ServletFactoryBean$ServiceInterceptor.invoke(ServletFactoryBean.java:264)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
at com.sun.proxy.$Proxy90.service(Unknown Source)
at org.dspace.springmvc.CocoonView.render(CocoonView.java:113)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1180)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:950)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:778)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:624)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:731)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.rdf.negotiation.NegotiationFilter.doFilter(NegotiationFilter.java:59)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:78)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:494)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.CrawlerSessionManagerValve.invoke(CrawlerSessionManagerValve.java:234)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
</code></pre>
<ul>
<li>For now I will just restart Tomcat…</li>
<li>Last week Felix from Earlham said that they finished testing on DSpace Test so I made backups of some things there and re-deployed the system on Ubuntu 18.04
<ul>
<li>During re-deployment I hit a few issues with the <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a> and made some minor improvements</li>
<li>There seems to be an <a href="https://bugs.launchpad.net/ubuntu/+source/nodejs/+bug/1794589">issue with nodejs’s dependencies now</a>, which causes npm to get uninstalled when installing the certbot dependencies (due to a conflict in libssl dependencies)</li>
<li>I re-worked the playbooks to use Node.js from the upstream official repository for now</li>