<meta property="og:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix: $ csvcut -c 'id,dc."/>
<li>I exported it, but a quick run through the <code>csv-metadata-quality</code> tool shows that there is some low-hanging fruit we can fix before I send him the data</li>
<li>I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script’s “unnecessary Unicode” fix:</li>
<li>Then I replace them in vim with <code>:% s/\%u00a0/ /g</code> because I can’t figure out the correct sed syntax to do it directly from the pipe above</li>
<li>I modified the script so it replaces the non-breaking spaces instead of removing them</li>
<li>Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because they affected too many thousands of rows):</li>
<li>Release <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.3.1">version 0.3.1 of the csv-metadata-quality script</a> with the non-breaking spaces change</li>
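<p>The non-breaking space fix described above can be sketched in Python roughly as follows. This is only an illustration of replacing U+00A0 with a regular space rather than dropping it; the function names <code>replace_nbsp</code> and <code>clean_csv</code> and the sample row are hypothetical and are not taken from csv-metadata-quality itself.</p>

```python
import csv
import io

NBSP = "\u00a0"  # non-breaking space (U+00A0)


def replace_nbsp(value: str) -> str:
    """Replace each non-breaking space with a regular space."""
    return value.replace(NBSP, " ")


def clean_csv(text: str, columns: list[str]) -> str:
    """Apply replace_nbsp to the named columns of a CSV document."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in columns:
            if row.get(column):
                row[column] = replace_nbsp(row[column])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: a title containing a non-breaking space
sample = "id,dc.title\n1,Water\u00a0management in the Nile\n"
cleaned = clean_csv(sample, ["dc.title"])
```

<p>Replacing instead of removing matters because dropping the character would join the two neighbouring words together.</p>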
<li>I suggested that if they still want to add value to those records (like adding countries, regions, etc.), they could do it after we migrate the records to CGSpace</li>
<li>Carol responded to tell me where to map the items with type Brochure, Journal Item, and Thesis, so I applied them to the <a href="https://dspacetest.cgiar.org/handle/10568/103688">collection on DSpace Test</a></li>
<li>Gabriela from CIP asked me if it was possible to generate an RSS feed of items that have the CIP subject “POTATO AGRI-FOOD SYSTEMS”
<ul>
<li>I noticed that there is a similar term “SWEETPOTATO AGRI-FOOD SYSTEMS” so I had to come up with a way to exclude that using the boolean “AND NOT” in the <a href="https://cgspace.cgiar.org/open-search/discover?query=cipsubject:POTATO%20AGRI%E2%80%90FOOD%20SYSTEMS%20AND%20NOT%20cipsubject:SWEETPOTATO%20AGRI%E2%80%90FOOD%20SYSTEMS&scope=10568/51671&sort_by=3&order=DESC">OpenSearch query</a></li>
<li>Again, the <code>sort_by=3</code> parameter is the accession date, as configured in <code>dspace.cfg</code></li>
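<p>As a rough sketch, the OpenSearch URL above could be assembled in Python like this. The helper name <code>build_feed_url</code> is hypothetical; the base URL, scope, and parameters are the ones from the link above. Note that the linked query encodes the hyphen in “AGRI‐FOOD” as <code>%E2%80%90</code>, which is U+2010 (HYPHEN) rather than the ASCII hyphen-minus, so the sketch uses that character too.</p>

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://cgspace.cgiar.org/open-search/discover"


def build_feed_url(include: str, exclude: str, scope: str) -> str:
    """Build an OpenSearch URL matching one CIP subject but not another."""
    query = f"cipsubject:{include} AND NOT cipsubject:{exclude}"
    params = {
        "query": query,
        "scope": scope,
        "sort_by": "3",  # accession date, as configured in dspace.cfg
        "order": "DESC",
    }
    # quote_via=quote encodes spaces as %20 instead of "+";
    # safe=":/" leaves the field separator and the handle slash unescaped
    return f"{BASE_URL}?{urlencode(params, safe=':/', quote_via=quote)}"


url = build_feed_url(
    "POTATO AGRI\u2010FOOD SYSTEMS",
    "SWEETPOTATO AGRI\u2010FOOD SYSTEMS",
    "10568/51671",
)
```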
<li>Continue working on identifying duplicates in the Bioversity migration
<ul>
<li>I have been recording the originals and duplicates in a spreadsheet so I can map them later</li>
<li>For now I am just reconciling any incorrect or missing metadata in the original items on CGSpace, deleting the duplicate from DSpace Test, and mapping the original to the correct place on CGSpace</li>
<li>So far I have deleted thirty duplicates and mapped fourteen</li>
<li>I was preparing to check the affiliations on the Bioversity records when I noticed that the last list of top affiliations I generated has some anomalies
<li>I would still like to perhaps (re)move institutional authors from <code>dc.contributor.author</code> to <code>cg.contributor.affiliation</code>, but I will have to run that by Francesca, Carol, and Abenet</li>
<li>I could use a custom text facet like this in OpenRefine to find authors that likely match the “Last, F.” pattern: <code>isNotNull(value.match(/^.*, \p{Lu}\.?.*$/))</code></li>
<li>The <code>\p{Lu}</code> is a cool <a href="https://www.regular-expressions.info/unicode.html">regex character class</a> to make sure this works for letters with accents</li>
<li>As cool as that is, it’s actually more effective to just search for authors that have “.” in them!</li>
<li>I’ve decided to add a <code>cg.contributor.affiliation</code> column to 1,025 items based on the logic above where the author name is not an actual person</li>
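<p>Outside OpenRefine, the two heuristics above could be approximated in Python along these lines. Python’s built-in <code>re</code> module does not support <code>\p{Lu}</code>, so this sketch checks the Unicode category with <code>unicodedata</code> instead; the function names and sample names are hypothetical, and the first function is only roughly equivalent to the OpenRefine expression.</p>

```python
import unicodedata


def looks_like_person(name: str) -> bool:
    """Return True if some ", " in name is followed by an uppercase letter.

    Approximates the pattern ^.*, \\p{Lu}\\.?.*$ from the notes above.
    """
    return any(
        # Category "Lu" is "Letter, uppercase", the class \p{Lu} refers to
        unicodedata.category(name[i + 2]) == "Lu"
        for i in range(len(name) - 2)
        if name[i : i + 2] == ", "
    )


def has_period(name: str) -> bool:
    """The simpler heuristic: the name contains a "." somewhere."""
    return "." in name


# Hypothetical sample values
authors = ["Orth, A.", "Álvarez, É.", "Bioversity International"]
institutional = [a for a in authors if not looks_like_person(a)]
```

<p>Names that fail both checks are candidates for institutional authors, though a manual review is still needed before moving anything.</p>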
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
<li>I talked to Peter about the Bioversity items and he said that we should add the institutional authors back to <code>dc.contributor.author</code>, because I had moved them to <code>cg.contributor.affiliation</code>
<li>It’s really stupid, but for some reason the handles are included even though I specified the <code>-m</code> option, so after the export I removed the <code>dc.identifier.uri</code> metadata values from the items</li>
<li>I had forgotten (again) that the <code>dspace export</code> command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…</li>
<li>After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them</li>
<li>Create a test user for Mohammad Salem to test depositing from MEL to DSpace Test, as the last one I had created in 2019-08 was cleared when we re-synchronized DSpace Test with CGSpace recently.</li>
<li>Move the CGSpace CG Core v2 notes from a <a href="https://gist.github.com/alanorth/2db39e91f48d116e00a4edffd6ba6409">GitHub Gist</a> to a <a href="/cgspace-notes/cgspace-cgcorev2-migration/">page</a> on this site for the sake of archiving and searchability</li>
<li>I noticed that the page title is messed up on the item view, and after over an hour of troubleshooting it I couldn’t figure out why</li>
<li>It seems to be because the <code>dc.title</code>→<code>dcterms.title</code> modifications cause the title metadata to disappear from DRI’s <code>&lt;pageMeta&gt;</code> and therefore the title is not accessible to the XSL transformation</li>
<li>Also, I noticed a few places in the Java code where <code>dc.title</code> is hard-coded, so I think this might be one of the fields that we should just assume DSpace relies on internally</li>
<li>I will revert all changes to <code>dc.title</code> and <code>dc.title.alternative</code></li>
<li>TODO: there are similar issues with the <code>citation_author</code> metadata element missing from DRI, so I might have to revert those changes too</li>
<li>After more digging in the source I found out why the <code>dcterms.title</code> and <code>dcterms.creator</code> fields are not present in the DRI <code>pageMeta</code>…
<ul>
<li>The <code>pageMeta</code> element is constructed in <code>dspace-xmlui/src/main/java/org/dspace/app/xmlui/wing/IncludePageMeta.java</code> and the code does not consider any other schemas besides DC</li>
<li>I moved title and creator back to the original DC fields and then everything was working as expected in the pageMeta, so I guess we cannot use these in DCTERMS either!</li>