<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-04-13">2024-04-13</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-04-14">2024-04-14</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-04-15">2024-04-15</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again…</li>
<li>Delete ~260 duplicate metadata values using the elaborate SQL and sort method I documented here: <a href="https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418">https://github.com/DSpace/DSpace/issues/8253#issuecomment-1331756418</a></li>
<li>Tony noticed that the DSpace 7 REST API is very slow with the embeds so I profiled a bit:</li>
</ul>
<pre tabindex="0"><code>$ time curl -s -o /dev/null 'https://cgspace.cgiar.org/server/api/discover/search/objects?query=cg.identifier.project%3AIFPRI*&scope=8f1e9650-fe87-4e6e-889a-1cacfb747408&page=0&size=100&embed=thumbnail,bundles/bitstreams&sort=dcterms.issued,desc'
curl -s -o /dev/null 0.01s user 0.01s system 0% cpu 47.515 total
</code></pre><ul>
<li>Assist Deborah with an advanced query on CGSpace for biodiversity and health:</li>
</ul>
<pre tabindex="0"><code>dcterms.issued:[2010 TO 2024] AND dcterms.type:"Journal Article" AND (dc.title:"biodiversity" OR dcterms.subject:"biodiversity" OR dc.title:"health" OR dcterms.subject:"health")
</code></pre><ul>
<li>Remove CIMMYT URLs and citations from 277 journal articles on CGSpace since it is a bit tacky
<ul>
<li>I used this Jython expression in OpenRefine with <a href="https://citation.crosscite.org/docs.html">Crossref’s content negotiation</a> to get citations for all DOIs:</li>
</span></span><span style="display:flex;"><span>url <span style="color:#f92672">=</span> <span style="color:#e6db74">"https://api.crossref.org/works/"</span> <span style="color:#f92672">+</span> doi <span style="color:#f92672">+</span> <span style="color:#e6db74">"/transform/text/x-bibliography"</span>
<li>It took ten or so minutes for it to finish (and note this is Python 2 inside OpenRefine so I had to be careful with Unicode), but worked well!</li>
<li>Spend some time looking at duplicate DOIs again…</li>
<li>Refresh ORCID identifiers from ORCID API and update CGSpace metadata and controlled vocabulary</li>
</ul>
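<p>For reference, here is a minimal sketch of the same lookup in plain Python 3 rather than Jython. The helper names <code>citation_url</code> and <code>fetch_citation</code> and the <code>mailto</code> placeholder are my own illustration, not what I ran in OpenRefine; only the URL pattern comes from the expression above.</p>

```python
import urllib.parse
import urllib.request


def citation_url(doi):
    """Build the Crossref transform URL for a bare DOI such as 10.1234/abcd."""
    # Percent-encode unusual characters but keep the slash between prefix and suffix
    return (
        "https://api.crossref.org/works/"
        + urllib.parse.quote(doi, safe="/")
        + "/transform/text/x-bibliography"
    )


def fetch_citation(doi, mailto="you@example.org"):
    """Fetch a formatted citation. The mailto address is a placeholder."""
    request = urllib.request.Request(
        citation_url(doi),
        headers={"User-Agent": "citation-sketch (mailto:" + mailto + ")"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # Decode explicitly so non-ASCII author names survive intact
        return response.read().decode("utf-8").strip()
```

<p>In Python 3 the response bytes decode to a proper <code>str</code>, so the Unicode care needed under Python 2 mostly goes away.</p>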
<h2 id="2024-04-20">2024-04-20</h2>
<ul>
<li>I read an <a href="https://github.com/greenelab/scihub/issues/9">interesting thread about DOI casing</a>
<ul>
<li>Apparently the DOI specification says ASCII characters in DOIs are case insensitive</li>
<li>Indeed, <a href="https://www.crossref.org/documentation/member-setup/constructing-your-dois/">Crossref recommends lower case</a> for all DOIs</li>
<li>I was curious about the DOIs in our database so I checked before and after lower casing:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-before.txt;
</span></span><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT(lower(text_value)) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=220 AND text_value IS NOT NULL AND text_value !='') TO /tmp/dois-sql-after.txt;
</span></span></code></pre></div><ul>
<li>I need to investigate options for lower casing these in the repository, for example in a curation task, and in all workflows around DSpace metadata…</li>
</ul>
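<p>The before and after comparison can also be sketched in Python. This is only an illustration and assumes a plain text export with one DOI per line, like the two files written by the <code>\COPY</code> commands; the function names are hypothetical. It lower cases ASCII characters only, since that is all the DOI specification treats as case insensitive.</p>

```python
def ascii_lower(value):
    """Lower case only ASCII characters and leave everything else untouched."""
    return "".join(c.lower() if c.isascii() else c for c in value)


def count_case_variants(dois):
    """Return (distinct values as stored, distinct values after lower casing)."""
    cleaned = [d.strip() for d in dois if d and d.strip()]
    before = set(cleaned)
    after = {ascii_lower(d) for d in cleaned}
    return len(before), len(after)


def read_lines(path):
    """Read an export with one DOI per line."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

<p>Calling <code>count_case_variants(read_lines("/tmp/dois-sql-before.txt"))</code> would give both numbers in one go; the difference between them is the number of DOIs that only differ by case.</p>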
<h2 id="2024-04-23">2024-04-23</h2>
<ul>
<li>Spent some time writing a Java curation task to normalize DOIs in items when they enter the workflow edit step
<ul>
<li>The workflow curation tasks are not documented very well but I got a basic configuration working</li>
<li>I found a bug in DSpace curation tasks and discussed on Slack</li>
<li>I finalized the <code>NormalizeDOIs</code> curation task and released v7.6.1.1 of the <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a> project</li>
</ul>
</li>
</ul>
<h2 id="2024-04-24">2024-04-24</h2>
<ul>
<li>A bit more testing of the curation tasks
<ul>
<li>I tested a patch by Mark Wood</li>
</ul>
</li>
<li>I added support for normalizing DOIs to this same format to my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> project</li>
</ul>
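<p>As a rough sketch of what such a normalization could look like in Python: this is not the code in csv-metadata-quality or the Java curation task, and the target format (a lower cased <code>https://doi.org/</code> URL) and the list of prefixes to strip are assumptions for illustration only.</p>

```python
import re

# Prefixes commonly seen in front of a DOI (this list is an assumption)
_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Return the DOI as a lower cased https://doi.org/ URL, if it looks like a DOI."""
    bare = _PREFIX.sub("", value.strip())
    if not bare.startswith("10."):
        # Not recognizably a DOI, so hand the value back unchanged
        return value
    # Lower case ASCII characters only, leaving any other characters as they are
    lowered = "".join(c.lower() if c.isascii() else c for c in bare)
    return "https://doi.org/" + lowered
```

<p>Returning unrecognized values unchanged keeps the function safe to run over a whole metadata column, including cells that contain something other than a DOI.</p>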
<h2 id="2024-04-25">2024-04-25</h2>
<ul>
<li>I lowercased the remaining 3,900 DOIs on CGSpace that had uppercase ASCII characters</li>
<li>Start working on the IFPRI 2020–2021 batch migration
<ul>
<li>I modified my <code>check_duplicates.py</code> script to check for DOIs instead of titles, and use a similarity of 1.0 to make sure the match is exact</li>
</ul>
</li>
<li>I noticed something in the Tomcat log:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>tomcat9[690]: WARNING: The HTTP response header [Content-Disposition] with value [attachment; filename="Literature review on Women’s Empowerment and their Resilience2.pdf"] has been removed from the response because it is invalid
</span></span><span style="display:flex;"><span>tomcat9[690]: java.lang.IllegalArgumentException: The Unicode character [’] at code point [8,217] cannot be encoded as it is outside the permitted range of 0 to 255
</span></span></code></pre></div><ul>
<li>I found the bitstream’s ID and then used the <code>ds6_bitstream2itemhandle</code> <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper function</a> to find the item’s handle
<ul>
<li>Then I replaced the curly quote with a regular quote in all bitstreams</li>