<li>Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to <code>input-forms.xml</code> (<a href="https://github.com/ilri/DSpace/pull/358">#358</a>)</li>
<li>Merge the ORCID integration stuff into <code>5_x-prod</code> for deployment on CGSpace soon (<a href="https://github.com/ilri/DSpace/pull/359">#359</a>)</li>
<li>Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server</li>
<li>Run all system updates on DSpace Test and reboot server</li>
<li>I ran the <a href="https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c">orcid-authority-to-item.py</a> script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata</li>
<li><p>Apply the proposed PostgreSQL indexes from DS-3636 (pull request <a href="https://github.com/DSpace/DSpace/pull/1791/">#1791</a>) on CGSpace (linode18)</p></li>
<li>Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata</li>
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>On second inspection it looks like <code>dc.description.provenance</code> fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…</li>
<li>If I skip that, there are about 2,000, which seems more like a reasonable number of fields that users have edited manually, or fucked up during CSV import, etc:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
</code></pre>
<ul>
<li>In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine</li>
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>
<pre><code>if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
</code></pre>
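<p>For comparison, here is a minimal Python sketch of the same set-or-append logic. The function name <code>append_orcid</code> is hypothetical and the existing value in the second call is a placeholder; it assumes the <code>||</code> multi-value separator used in the GREL expression above and treats only <code>None</code> and the empty string as blank.</p>
<pre><code class="language-python">def append_orcid(value, orcid, separator="||"):
    """Return the cell value with the ORCID entry set or appended."""
    # Assumption: only None and the empty string count as blank here
    if value is None or value == "":
        return orcid
    return value + separator + orcid


entry = "Hernan Ceballos: 0000-0002-8744-7918"
# Blank cell: the entry becomes the whole value
print(append_orcid("", entry))
# Non-blank cell (placeholder value): the entry is appended after the separator
print(append_orcid("Placeholder Author: 0000-0000-0000-0000", entry))
</code></pre>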
<ul>
<li>One thing that bothers me is that this won’t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
<li>I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching <code>cg.creator.id</code> fields: <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a></li>
<li>The CSV should have two columns: author name and ORCID identifier:</li>
</ul>
<pre><code>dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
</code></pre>
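<p>A rough sketch of the matching step, not the actual script from the gist: it reads a CSV in the format above and pairs each matched author with an ORCID entry that keeps the author’s <code>place</code>. The function names and the sample item are hypothetical, and database access is left out entirely.</p>
<pre><code class="language-python">import csv
import io

# Sample input in the two-column format shown above
SAMPLE_CSV = '''dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
'''


def load_orcid_map(csv_file):
    """Map each author name variant to its ORCID entry."""
    reader = csv.DictReader(csv_file)
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def creator_ids_for_item(authors, orcid_map):
    """Given (place, name) pairs for one item, return (place, cg.creator.id)
    pairs so that each identifier keeps the place of its author."""
    return [
        (place, orcid_map[name])
        for place, name in sorted(authors)
        if name in orcid_map
    ]


orcid_map = load_orcid_map(io.StringIO(SAMPLE_CSV))
# Hypothetical item with two authors; only the second one is in the CSV
item_authors = [(1, "Doe, Jane"), (2, "Orth, A.")]
print(creator_ids_for_item(item_authors, orcid_map))
</code></pre>
<p>Carrying the <code>place</code> along is the point of the sketch: it is what a purely cell-based edit in OpenRefine cannot do.</p>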
<ul>
<li>I didn’t integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors</li>
<li>Give James Stapleton input on Sisay’s KRAs</li>
<li>Create a pull request to disable ORCID authority integration for <code>dc.contributor.author</code> in the submission forms and XMLUI display (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn’t find them</li>
<li>I made a quick fix and it’s working now (<a href="https://github.com/ilri/DSpace/pull/364">#364</a>)</li>
<li>I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)</li>
<li>I deleted the Linode and created another one (linode6624164) in the Fremont, CA region</li>
<li>After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image</li>
<li>Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page</li>
<li>It looks like the CCAFS publications page fetches the donut using its DOI, whereas CGSpace queries via Handle</li>
<li>I will write to Altmetric support and ask them, as perhaps it’s part of a larger issue</li>
<li>The full error is here: <a href="https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca">https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca</a></li>
<li>If I do a report for “Orth, Alan” with the same custom layout it works!</li>
<li>I submitted a ticket to Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589</a></li>
<li>ICT made the DNS updates for dspacetest.cgiar.org late last night</li>
<li>I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164</li>
<li>Looking at the CRP subjects on CGSpace I see there is one blank one so I’ll just fix it:</li>
</ul>
<pre><code>dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
</code></pre>
<ul>
<li>Copy all CRP subjects to a CSV to do the mass updates:</li>
</ul>
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
</code></pre>
<ul>
<li>Once I prepare the new input forms (<a href="https://github.com/ilri/DSpace/issues/362">#362</a>) I will need to do the batch corrections:</li>
<li>ICT responded that they “fixed” the CGSpace connectivity issue in Nairobi without telling me the problem</li>
<li>When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week</li>
<li>I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!</li>
<li>So they updated the wrong fucking DNS records</li>
<li>Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export</li>
<li>It appears to be this one: <a href="https://cgspace.cgiar.org/handle/10568/83473?show=full">https://cgspace.cgiar.org/handle/10568/83473?show=full</a></li>
<li>The title is “Untitled” and there is some metadata but indeed the citation is missing</li>
</ul>
<pre><code>org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
</code></pre>
<ul>
<li>I have no idea why it crashed</li>
<li>I ran all system updates and rebooted it</li>
<li>Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect</li>
<li>I will remove it from the controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/367">#367</a>) and update any items using the old one:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
UPDATE 1
</code></pre>
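<p>As a side note, ORCID iDs end in an ISO 7064 MOD 11-2 check character, so simple typos can be caught before they reach the controlled vocabulary. The sketch below uses hypothetical function names. Both the old and the new identifier from the statement above pass this check, so it would not have caught this particular correction; it only guards against mistyped digits.</p>
<pre><code class="language-python">def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, the digits, and the trailing check character."""
    digits = orcid.replace("-", "")
    base = digits[:15]
    if len(digits) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == digits[15].upper()


print(is_valid_orcid("0000-0002-5224-8644"))  # new identifier
print(is_valid_orcid("0000-0002-6344-195X"))  # old identifier
print(is_valid_orcid("0000-0002-5224-8645"))  # last character deliberately altered
</code></pre>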
<ul>
<li>Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits</li>
<li>Merge the changes to CRP names to the <code>5_x-prod</code> branch and deploy on CGSpace (<a href="https://github.com/ilri/DSpace/pull/363">#363</a>)</li>
<li>Run corrections for CRP names in the database:</li>
</ul>
<pre><code>2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field "dc_contributor_author".
java.lang.IllegalArgumentException: No choices plugin was configured for field "dc_contributor_author".
    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
    at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
    at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
    at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
</code></pre>
<ul>
<li>Looks like the indexing gets confused that there is still data in the <code>authority</code> column</li>
<li>Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!</li>
<li>Since we’ve migrated the ORCID identifiers associated with the authority data to the <code>cg.creator.id</code> field we can nullify the authorities remaining in the database:</li>
</ul>
<pre><code class="language-sql">dspace=# UPDATE metadatavalue SET authority=NULL WHERE resource_type_id=2 AND metadata_field_id=3 AND authority IS NOT NULL;
UPDATE 195463
</code></pre>
<ul>
<li>After this the indexing works as usual and item counts and facets are back to normal</li>
<li>Send Peter a list of all authors to correct:</li>
</ul>
<pre><code class="language-sql">dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv header;
COPY 56156
</code></pre>
<ul>
<li>Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names</li>
<li>I guess we need to give it more RAM because it now has CGSpace’s large Solr core</li>
<li>I will increase the memory from 3072m to 4096m</li>
<li>Update <a href="https://github.com/ilri/rmg-ansible-public">Ansible playbooks</a> to use <a href="https://jdbc.postgresql.org/">PostgreSQL JDBC driver</a> 42.2.2</li>
<li>Deploy the new JDBC driver on DSpace Test</li>
<li>I’m also curious to see how long the <code>dspace index-discovery -b</code> takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes</li>
<li>Update my Mirage 2 setup notes for Ubuntu 18.04: <a href="https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5">https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5</a></li>
<li>But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/))
)
</code></pre>
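<p>The same check could be done outside OpenRefine. This is a minimal Python sketch, assuming the same set of characters as the GREL expression above; the function name and the sample values are hypothetical.</p>
<pre><code class="language-python">import re

# Characters known to be invalid: parentheses, pipe, U+FFFD (replacement
# character), U+00A0 (no-break space), and U+200A (hair space)
INVALID_CHARS = re.compile("[(|)\uFFFD\u00A0\u200A]")


def has_invalid_chars(value):
    """Return True if the value contains any known-invalid character."""
    return INVALID_CHARS.search(value) is not None


print(has_invalid_chars("Orth, Alan"))
print(has_invalid_chars("Orth, Alan (ILRI)"))
print(has_invalid_chars("Orth,\u00A0Alan"))
</code></pre>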
<ul>
<li>And here’s one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>
<pre><code>or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
)
</code></pre>
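<p>If the flagging were done in a script rather than in OpenRefine, it might look roughly like this. The sketch splits rows on the same three case-insensitive markers and writes the flagged ones to a separate CSV (in memory here). The column names and sample rows are placeholders, and this is not code from <code>fix-metadata-values.py</code>.</p>
<pre><code class="language-python">import csv
import io
import re

# Same markers as the GREL expression, matched case-insensitively
MARKERS = re.compile("delete|remove|check", re.IGNORECASE)


def split_flagged(rows, column):
    """Split rows into (flagged, clean) by whether the column has a marker."""
    flagged = [row for row in rows if MARKERS.search(row[column])]
    clean = [row for row in rows if not MARKERS.search(row[column])]
    return flagged, clean


# Hypothetical corrections sheet; the column names are placeholders
rows = [
    {"author": "Orth, Alan", "correct": "Orth, Alan"},
    {"author": "Orth, A.", "correct": "DELETE"},
    {"author": "Orth", "correct": "check this one"},
]
flagged, clean = split_flagged(rows, "correct")

# Write only the flagged rows to their own CSV
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["author", "correct"])
writer.writeheader()
writer.writerows(flagged)
print(out.getvalue())
</code></pre>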
<ul>
<li><p>So I guess the routine in OpenRefine is:</p>
<ul>
<li>Transform: trim leading/trailing whitespace</li>
</ul>
</li>
<li>Atmire got back to me about the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589">Listings and Reports issue</a> and said it’s caused by items that have missing <code>dc.identifier.citation</code> fields</li>
<li>They will send a fix</li>
</ul>
<h2 id="2018-03-27">2018-03-27</h2>
<ul>
<li>Atmire got back with an updated quote about the DSpace 5.8 compatibility so I’ve forwarded it to Peter</li>
</ul>
<h2 id="2018-03-28">2018-03-28</h2>
<ul>
<li>DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m</li>
<li>Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (<a href="https://github.com/ilri/DSpace/pull/370">#370</a>) for Abenet</li>
</ul>
<pre><code>Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
</code></pre>
<ul>
<li>That’s weird because we just updated them last week…</li>
<li>Create a pull request to enable searching by ORCID identifier (<code>cg.creator.id</code>) in Discovery and Listings and Reports (<a href="https://github.com/ilri/DSpace/pull/371">#371</a>)</li>
<li>I will test it on DSpace Test first!</li>
<li>Fix one missing XMLUI string for “Access Status” (cg.identifier.status)</li>