<li>Follow up with Carol and Francesca from Bioversity as they were on holiday during the mid-to-late August
<ul>
<li>I told them to check the <ahref="https://dspacetest.cgiar.org/handle/10568/103999">temporary collection on DSpace Test</a> where I uploaded the 1,427 items so they can see how it will look</li>
<li>Also, I told them to advise me about the strange file extensions (.7z, .zip, .lck)</li>
<li>Also, I reminded Abenet to check the metadata, as the institutional authors at least will need some modification</li>
<li>It looks like these items were uploaded by Sisay on 2018-12-19 so we can use the <ahref="https://cgspace.cgiar.org/handle/10568/68616/discover?filtertype_1=dateAccessioned&filter_relational_operator_1=contains&filter_1=2018-12-19&submit_apply_filter=&query=">accession date as a filter</a> to narrow it down to 230 items (of which only 104 have PDFs, according to the Daniel1807.xls input input file)</li>
<li>Now I just checked a few manually and they are correct in the original input file, so something must have happened when Sisay was processing them for upload</li>
<li>I think we can skip the MODS crosswalk for now because it is only used in <ahref="https://wiki.lyrasis.org/display/DSDOC5x/DSpace+AIP+Format#DSpaceAIPFormat-MODSSchema">AIP exports that are meant for non-DSpace systems</a></li>
<li>We should probably do the QDC crosswalk as well as those in <code>xhtml-head-item.properties</code>…</li>
<li>Ouch, there is potentially a lot of work in the OAI metadata formats like DIM, METS, and QDC (see <code>dspace/config/crosswalks/oai/*.xsl</code>)</li>
<li>In general I think I should only modify the left side of the crosswalk mappings (ie, where metadata is coming from) so we maintain the same exact output for search engines, etc</li>
<li>Maria Garruccio asked me to add two new Bioversity ORCID identifiers to CGSpace so I created a <ahref="https://github.com/ilri/DSpace/pull/431">pull request</a></li>
<li>Marissa Van Epp asked me to add new CCAFS Phase II project tags to CGSpace so I created a <ahref="https://github.com/ilri/DSpace/pull/432">pull request</a>
<ul>
<li>I will wait until I hear from her to merge it because there is one tag that seems to be a duplicate because its name (PII-WA_agrosylvopast) is similar to one that already exists (PII-WA_AgroSylvopastoralSystems)</li>
<li>I have updated my <ahref="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">notes on the possible changes</a> and done more work on the XMLUI replacements</li>
<li>Deploy <ahref="https://jdbc.postgresql.org/">PostgreSQL JDBC driver</a> version 42.2.7 on DSpace Test and update the <ahref="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
<li>Update nginx TLS cipher suite to the latest <ahref="https://ssl-config.mozilla.org/#server=nginx&server-version=1.16.1&config=intermediate&openssl-version=1.0.2g">Mozilla intermediate recommendations for nginx 1.16.0 and openssl 1.0.2</a>
<pretabindex="0"><code>2019-09-15 15:32:18,137 WARN org.apache.cocoon.components.xslt.TraxErrorListener - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
<pretabindex="0"><code>2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
<li>There is only <ahref="https://github.com/pgjdbc/pgjdbc/issues/1567">one minor fix to a usecase we aren’t using</a> so I will deploy this on the servers the next time I do updates</li>
<li>Start looking at IITA’s latest round of batch updates that Sisay had <ahref="https://dspacetest.cgiar.org/handle/10568/105486">uploaded to DSpace Test</a> earlier this month
<li>For posterity, IITA’s original input file was 20196th.xls and Sisay uploaded it as “IITA_Sep_06” to DSpace Test</li>
<li>Sisay said he did ran the csv-metadata-quality script on the records, but I assume he didn’t run the unsafe fixes or AGROVOC checks because I still see unneccessary Unicode, excessive whitespace, one invalid ISBN, missing dates and a few invalid AGROVOC fields</li>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
<pretabindex="0"><code>dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
<li>Deploy a fresh snapshot of CGSpace’s PostgreSQL database on DSpace Test so we can get more accurate duplicate checking with the upcoming Bioversity and IITA migrations</li>
<li>I told them it’s probably best if we have Francesco produce a new export from Typo 3</li>
<li>But on second thought I think that I’ve already done so much work on this file as it is that I should fix what I can here and then do a new import to DSpace Test with the PDFs</li>
<li>Other corrections would be to replace “Inst.” and “Instit.” with “Institute” and remove those blank ISSNs from the citations</li>
<li>I was going preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
<li>There are a <em>few dozen</em> that have completely fucked up names due to some encoding error</li>
<li>To make matters worse, when I tried to download them, some of the links in the “URL” column that Francesco included are wrong, so I had to go to the permalink and get a link that worked</li>
<li>I’m still waiting to hear what Carol and Francesca want to do with the <code>1195.pdf.LCK</code> file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)</li>
<li>I wrote two fairly long GREL expressions to clean up the institutional author names in the <code>dc.contributor.author</code> and <code>dc.identifier.citation</code> fields using OpenRefine
<li>I imported the 1,427 Bioversity records with bitstreams to a new collection called <ahref="https://dspacetest.cgiar.org/handle/10568/103688">2019-09-20 Bioversity Migration Test</a> on DSpace Test (after splitting them in two batches of about 700 each):</li>
<li>Re-upload the <ahref="https://dspacetest.cgiar.org/handle/10568/105116">IITA Sept 6 (20196th.xls) records to DSpace Test</a> after I did the re-sync yesterday
<ul>
<li>Then I looked at the records again and sent some feedback about three duplicates to Bosede</li>
<li>Also I noticed that many journal articles have the journal and page information in the citation, but are missing <code>dc.source</code> and <code>dc.format.extent</code> fields</li>
<li>fasttext is likely the best, but <ahref="https://github.com/facebookresearch/fastText/issues/909">prints a blank link to the console when loading a model</a></li>
<li>langid seems to be the best considering the above experiences</li>
<li>Bosede fixed a few of the things I mentioned in her Sept 6 batch records, but there were still issues
<ul>
<li>I sent her a bit more feedback because when I asked her to delete a duplicate, she deleted the <em>existing</em> item on DSpace Test rather than the new one in the new batch file!</li>
<li>I fixed two incorrect languages after analyzing it with my beta language detection in the csv-metadata-quality tool</li>
<li>Give more feedback to Bosede about the <ahref="https://dspacetest.cgiar.org/handle/10568/105116">IITA Sept 6 (20196th.xls) records on DSpace Test</a>
<ul>
<li>I told her to delete one item that appears to be a duplicate, or to fix its citation to be correct if she thinks it is not a duplicate</li>
<li>I deleted another item that I had previously identified as a duplicate that she had fixed by incorrectly deleting the original (ugh)</li>
<li>Get a list of institutions from CCAFS’s Clarisa API and try to parse it with <code>jq</code>, do some small cleanups and add a header in <code>sed</code>, and then pass it through <code>csvcut</code> to add line numbers:</li>
<li>Peter will respond to ICARDA’s request to deposit items in to CGSpace, with a caveat that we agree on some vocabulary standards for institutions, countries, regions, etc</li>
<li>We discussed using ISO 3166 for countries, though Peter doesn’t like the formal names like “Moldova, Republic of” and “Tanzania, United Republic of”
<li>The Debian <code>iso-codes</code> package has ISO 3166-1 with “common name”, “name”, and “official name” representations, for example:
<ul>
<li>common_name: Tanzania</li>
<li>name: Tanzania, United Republic of</li>
<li>official_name: United Republic of Tanzania</li>
<li>Peter said that a new server for DSpace Test is fine, so I can proceed with the normal process of getting approval from Michael Victor and ICT when I have time (recommend moving from $40 to $80/month Linode, with 16GB RAM)</li>
<li>I need to ask Atmire for a quote to upgrade CGSpace to DSpace 6 with all current modules so we can see how many more credits we need</li>