mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2023-05-13
@@ -18,7 +18,7 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-05/" />
<meta property="article:published_time" content="2024-05-01T10:39:00+03:00" />
<meta property="article:modified_time" content="2024-05-01T17:10:05+03:00" />
<meta property="article:modified_time" content="2024-05-05T21:43:52+03:00" />
@@ -32,7 +32,7 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
"/>
<meta name="generator" content="Hugo 0.125.5">
<meta name="generator" content="Hugo 0.125.7">
@@ -42,9 +42,9 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
"@type": "BlogPosting",
"headline": "May, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-05/",
"wordCount": "41",
"wordCount": "359",
"datePublished": "2024-05-01T10:39:00+03:00",
"dateModified": "2024-05-01T17:10:05+03:00",
"dateModified": "2024-05-05T21:43:52+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -130,6 +130,60 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
<ul>
<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-05-06">2024-05-06</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-05-07">2024-05-07</h2>
<ul>
<li>Discuss RSS feeds and OpenSearch with IWMI
<ul>
<li>It seems our OpenSearch feed settings are using the defaults, so I need to copy some of those over from our old DSpace 6 branch</li>
</ul>
</li>
<li>I saw a patch for an interesting issue on DSpace GitHub: <a href="https://github.com/DSpace/DSpace/issues/9544">Error submitting or deleting items - URI too long when user is in a large number of groups</a>
<ul>
<li>I hadn’t realized it, but we have lots of those errors:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ zstdgrep -a <span style="color:#e6db74">'URI Too Long'</span> log/dspace.log-2024-04-* | wc -l
</span></span><span style="display:flex;"><span>1423
</span></span></code></pre></div><ul>
<li>Spend some time looking at duplicate DOIs again…</li>
</ul>
<h2 id="2024-05-08">2024-05-08</h2>
<ul>
<li>Spend some time looking at duplicate DOIs again…
<ul>
<li>I finally finished looking at the duplicate DOIs for journal articles</li>
<li>I updated the list of handle redirects and there are 386 of them!</li>
</ul>
</li>
</ul>
<h2 id="2024-05-09">2024-05-09</h2>
<ul>
<li>Spend some time working on the IFPRI 2020–2021 batch
<ul>
<li>I started by checking for exact duplicates (1.0 similarity) using DOI, type, and issue date</li>
</ul>
</li>
</ul>
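The exact-duplicate check described above could be sketched roughly as follows. This is only an illustration of matching on a (DOI, type, issue date) key, not the tool actually used for the batch; the field names `id`, `doi`, `type`, and `issue_date` and the sample records are hypothetical.

```python
from collections import defaultdict


def find_exact_duplicates(records):
    """Group record IDs that share the same (DOI, type, issue date) key.

    Only exact matches on the normalized key are reported; this is a
    simplification and does not compute any fuzzy similarity score.
    """
    groups = defaultdict(list)
    for record in records:
        key = (
            record["doi"].strip().lower(),   # DOIs are case-insensitive
            record["type"].strip().lower(),
            record["issue_date"].strip(),
        )
        groups[key].append(record["id"])
    # Keep only keys that occur more than once
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data, for illustration only
records = [
    {"id": "a", "doi": "https://doi.org/10.1234/example-1", "type": "Journal Article", "issue_date": "2020-05"},
    {"id": "b", "doi": "https://doi.org/10.1234/EXAMPLE-1", "type": "Journal Article", "issue_date": "2020-05"},
    {"id": "c", "doi": "https://doi.org/10.1234/example-2", "type": "Report", "issue_date": "2021-01"},
]

duplicates = find_exact_duplicates(records)
for key, ids in duplicates.items():
    print(key, ids)
```

Lowercasing the DOI before comparison is a design choice here, so that differences in letter case alone do not hide a duplicate.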
<h2 id="2024-05-12">2024-05-12</h2>
<ul>
<li>I couldn’t figure out how to do a complex join on withdrawn items along with their metadata, so I pulled out a few fields (titles, handles, and provenance) separately:</li>
</ul>
<pre tabindex="0"><code class="language-psql" data-lang="psql">dspace=# \COPY (SELECT i.uuid, m.text_value AS uri FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=25) TO /tmp/withdrawn-handles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS title FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=64) TO /tmp/withdrawn-titles.csv CSV HEADER;
dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN metadatavalue m ON i.uuid = m.dspace_object_id WHERE withdrawn AND m.metadata_field_id=28 AND m.text_value LIKE 'Submitted by%') TO /tmp/withdrawn-submitted-by.csv CSV HEADER;
</code></pre><ul>
<li>Then I joined them:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvjoin -c uuid /tmp/withdrawn-titles.csv /tmp/withdrawn-handles.csv /tmp/withdrawn-submitted-by.csv > /tmp/withdrawn.csv
</span></span></code></pre></div><ul>
<li>This gives me insight into who submitted 334 of the duplicates over the past few years…</li>
<li>I fixed a few hundred titles with leading/trailing whitespace, newlines, and ligatures like ff, fi, fl, ffi, and ffl</li>
</ul>
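A title cleanup like the one mentioned above might look something like this minimal sketch. The function name `clean_title` is hypothetical, and collapsing embedded newlines into single spaces is an assumption about the intended result; the notes do not say which tool was actually used.

```python
import re

# Latin ligature code points mapped to their plain-letter equivalents
LIGATURES = str.maketrans(
    {
        "\ufb00": "ff",
        "\ufb01": "fi",
        "\ufb02": "fl",
        "\ufb03": "ffi",
        "\ufb04": "ffl",
    }
)


def clean_title(title: str) -> str:
    """Replace ligatures, collapse whitespace runs, and trim the ends."""
    without_ligatures = title.translate(LIGATURES)
    # \s+ matches spaces, tabs, and newlines; each run becomes one space
    return re.sub(r"\s+", " ", without_ligatures).strip()


# Hypothetical example title containing the "ffi" and "fl" ligatures
print(clean_title("  E\ufb03cient \ufb02ood\nforecasting \n"))
```

An explicit mapping is used instead of Unicode NFKC normalization so that only the five ligatures are touched; NFKC would also rewrite other compatibility characters, which may or may not be wanted in titles.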
<!-- raw HTML omitted -->