Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…
<li>Look at Bioversity's latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…</li>
<li>I tried to extract the filenames and construct a URL to download the PDFs with my <code>generate-thumbnails.py</code> script, but there seem to be several paths for PDFs so I can't guess it properly</li>
<li>I will have to wait for Francesco to respond about the PDFs, or perhaps proceed with a metadata-only upload so we can do other checks on DSpace Test</li>
<li>Daniel Haile-Michael asked about using a logical OR with the DSpace OpenSearch, but I looked in the DSpace manual and it does not seem to be possible</li>
<li>It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains</li>
<li>Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04's <ahref="https://ssl-config.mozilla.org/#server=nginx&server-version=1.16.0&config=intermediate&openssl-version=1.1.0g&hsts=false&ocsp=false">OpenSSL 1.1.0g with nginx 1.16.0</a></li>
<li>Run all system updates on AReS dev server (linode20) and reboot it</li>
<li>Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:</li>
<li>For example, <code>Wild_cherry_Prunus_avium_859.pdf</code> is here (with double underscore): <ahref="https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf">https://www.bioversityinternational.org/fileadmin/_migrated/uploads/tx_news/Wild_cherry__Prunus_avium__859.pdf</a></li>
<li>For some reason the host header in the proxy pass is not set so nginx on DSpace Test makes a request to the upstream nginx on an IP-based virtual host</li>
<li>The upstream nginx returns HTTP 444 because we configured it to not answer when a request does not send a valid hostname</li>
<li>Though I am really wondering why this happened now, because the configuration has been working for months…</li>
<li>Improve the output of the suspicious characters check in <ahref="https://github.com/alanorth/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.0</li>
<li>Looking at the 128 IITA records (20195TH.xls) that Sisay uploadd to DSpace Test last month: <ahref="https://dspacetest.cgiar.org/handle/10568/102361">IITA_July_29</a>
<ul>
<li>The records are pretty clean because Sisay ran them through the csv-metadata-quality tool</li>
<li>I fixed one incorrect country (MELBOURNE)</li>
<li>I normalized all DOIs to be <ahref="https://doi.org">https://doi.org</a> format</li>
<li>This item is using the wrong Google Books link: <ahref="https://dspacetest.cgiar.org/handle/10568/102593">https://dspacetest.cgiar.org/handle/10568/102593</a></li>
<li>The French abstract here has copy/paste errors: <ahref="https://dspacetest.cgiar.org/handle/10568/102491">https://dspacetest.cgiar.org/handle/10568/102491</a></li>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
<li><code>$ lein run ~/src/git/DSpace/2019-02-22-sponsorships.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
<li><ahref="https://dspacetest.cgiar.org/handle/10568/102513">https://dspacetest.cgiar.org/handle/10568/102513</a> is a duplicate of CIAT item: <ahref="https://cgspace.cgiar.org/handle/10568/44158">https://cgspace.cgiar.org/handle/10568/44158</a></li>
<li><ahref="https://dspacetest.cgiar.org/handle/10568/102512">https://dspacetest.cgiar.org/handle/10568/102512</a> is a duplicate of CIAT item: <ahref="https://cgspace.cgiar.org/handle/10568/43557">https://cgspace.cgiar.org/handle/10568/43557</a></li>
<li>Create and merge a pull request (<ahref="https://github.com/ilri/DSpace/pull/429">#429</a>) to add eleven new CCAFS Phase II Project Tags to CGSpace</li>
<li>Atmire responded to the <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=685">Solr cores issue</a> last week, but they could not reproduce the issue
<li>I told them not to continue, and that we would keep an eye on it and keep troubleshooting it (if neccessary) in the public eye on dspace-tech and Solr mailing lists</li>
<li>Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:</li>
<li>To make sure we didn't have memory issues I reduced Tomcat's JVM heap by 512m, increased the import processes's heap to 512m, and split the input file into two parts with about 700 each</li>
<li>Deploy latest <code>5_x-prod</code> branch on CGSpace (linode18), including the <ahref="https://github.com/ilri/DSpace/pull/429">new CCAFS project tags</a></li>
<li>Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linde18)</li>
<li>Unfortunately now the filename column has paths too, so I have to use a simple Python/Jython script in OpenRefine to get the basename of the files in the filename column:</li>
<li>Upload <ahref="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality repository to ILRI's GitHub organization</a></li>
<li>Fix a few invalid countries in IITA's <ahref="https://dspacetest.cgiar.org/handle/10568/102361">July 29</a> records (aka “20195TH.xls”)
<li>These were not caught by my csv-metadata-quality check script because of a logic error</li>
<li>Remove <code>dc.identified.uri</code> fields from test data, set <code>id</code> values to “-1”, add collection mappings according to <code>dc.type</code>, and Upload 126 IITA records to CGSpace</li>
<li>For example, when I'm working on the author corrections I want to do the basic checks on the corrected fields, but on the original fields so I would use <code>--exclude-fields dc.contributor.author</code> for example</li>
<li>Peter approved the related citation changes so I merged the <ahref="https://github.com/ilri/DSpace/pull/430">pull request on GitHub</a> and will deploy it to CGSpace this weekend</li>
<li>Add a safety feature to <code>fix-metadata-values.py</code> that skips correction values that contain the ‘|’ character</li>
<li>Help Francesco from Bioversity with the REST and OAI APIs on CGSpace
<ul>
<li>He is contracted by Bioversity to work on the migration from Typo3</li>
<li>I told him that the OAI interface only exposes Dublin Core fields in its default configuration and that he might want to use OAI to get the latest-changed items, then use REST API to get their metadata</li>
<li>Add a fix for missing space after commas to my <ahref="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script and tag version 0.2.2</li>
<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
<li>I notice that CG Core doesn't currently have a field for CGSpace's “alternative title” (<code>dc.title.alternative</code>), but DCTERMS has <code>dcterms.alternative</code> so I <ahref="https://github.com/AgriculturalSemantics/cg-core/issues/9">raised an issue about adding it</a></li>
<li>Need to assess the XSL changes to see if things like <code>not(@qualifier)]</code> still make sense after we move fields from DC to DCTERMS, as some fields will no longer have qualifiers</li>
<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/artifactbrowser/item-list-alterations.xsl</code>?</li>
<li>Do I need to edit the author links to remove <code>dc.contributor.author</code> in <code>0_CGIAR/xsl/aspect/discovery/discovery-item-list-alterations.xsl</code>?</li>
<li>I told him it is because of the “content disposition” that causes DSpace to tell the browser to open or download the file based on its file size (currently around 8 megabytes)</li>
<li>Peter asked why <ahref="https://hdl.handle.net/10568/97825">an item on CGSpace</a> has no Altmetric donut on the item view, but has one in our explorer
<li>So this is the same issue we had before, where Altmetric <em>knows</em> this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn't show it because it seems to a secondary handle or something</li>
<li>After reading the code I see that XSLT is reading the community titles from the DIM representation (stored in the <code>$dim</code> variable) created from METS</li>
<li>I modified the patterns in my sed script so that those lines are not replaced and then the community list works again</li>
<li>This is actually not a problem at all because this metadata is only used in the HTML meta tags in XMLUI community lists and has nothing to do with item metadata</li>