mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2024-05-27
@@ -18,7 +18,7 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-05/" />
<meta property="article:published_time" content="2024-05-01T10:39:00+03:00" />
<meta property="article:modified_time" content="2024-05-13T16:24:11+03:00" />
<meta property="article:modified_time" content="2024-05-20T17:34:14+03:00" />
@@ -42,9 +42,9 @@ Then I did some work to add missing abstracts (about 900!), volumes, issues, lic
"@type": "BlogPosting",
"headline": "May, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-05/",
"wordCount": "652",
"wordCount": "1023",
"datePublished": "2024-05-01T10:39:00+03:00",
"dateModified": "2024-05-13T16:24:11+03:00",
"dateModified": "2024-05-20T17:34:14+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -224,6 +224,74 @@ dspace=# \COPY (SELECT i.uuid, m.text_value AS submitted_by FROM item i JOIN met
<li>One thing I can say for sure is that the default similarity factor in my script is 0.6, and I rarely see legitimate duplicates at such a low similarity, so I might increase it to 0.7 to reduce the number of items I have to check</li>
<li>Also, the maximum difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months)</li>
</ul>
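<p>The two thresholds above could be combined roughly as in the following Python sketch. It is only an illustration: <code>title_similarity</code> and <code>is_candidate_duplicate</code> are hypothetical names, and <code>difflib</code>'s ratio is an assumed stand-in for whatever similarity measure the real duplicate checker uses.</p>

```python
from datetime import date
from difflib import SequenceMatcher

# Thresholds discussed in the notes: similarity raised from 0.6 to 0.7,
# and the issue date window narrowed from 365 to 270 days.
SIMILARITY_THRESHOLD = 0.7
MAX_DATE_DIFFERENCE_DAYS = 270


def title_similarity(a, b):
    """Return a 0..1 similarity ratio for two titles, ignoring case.

    Assumption: difflib's ratio stands in for the measure used by the
    real duplicate checker, which is not shown in these notes.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_candidate_duplicate(title_a, issued_a, title_b, issued_b):
    """Flag a pair only if the dates are close and the titles are similar."""
    days_apart = abs((issued_a - issued_b).days)
    if days_apart > MAX_DATE_DIFFERENCE_DAYS:
        return False
    return title_similarity(title_a, title_b) >= SIMILARITY_THRESHOLD
```

<p>Checking the date window first is cheap and skips the string comparison for pairs that could never qualify.</p>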
<h2 id="2024-05-22">2024-05-22</h2>
<ul>
<li>Finalize and upload the IFPRI 2020–2021 batch set
<ul>
<li>I used a new technique to get missing licenses via Crossref (it’s Python 2 because of OpenRefine’s Jython):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> urllib2
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build the Crossref works URL from the DOI in the current row</span>
</span></span><span style="display:flex;"><span>doi <span style="color:#f92672">=</span> cells[<span style="color:#e6db74">'cg.identifier.doi[en_US]'</span>]<span style="color:#f92672">.</span>value
</span></span><span style="display:flex;"><span>url <span style="color:#f92672">=</span> <span style="color:#e6db74">"https://api.crossref.org/works/"</span> <span style="color:#f92672">+</span> doi
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Identify ourselves to Crossref with a contact address in the User-Agent</span>
</span></span><span style="display:flex;"><span>useragent <span style="color:#f92672">=</span> <span style="color:#e6db74">"Python (mailto:a.o@cgiar.org)"</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>request <span style="color:#f92672">=</span> urllib2<span style="color:#f92672">.</span>Request(url<span style="color:#f92672">.</span>encode(<span style="color:#e6db74">"utf-8"</span>), headers<span style="color:#f92672">=</span>{<span style="color:#e6db74">"User-Agent"</span> : useragent})
</span></span><span style="display:flex;"><span>get <span style="color:#f92672">=</span> urllib2<span style="color:#f92672">.</span>urlopen(request)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># OpenRefine wraps Jython expressions in a function, so a top-level return is expected</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">return</span> get<span style="color:#f92672">.</span>read()<span style="color:#f92672">.</span>decode(<span style="color:#e6db74">'utf-8'</span>)
</span></span></code></pre></div><h2 id="2024-05-23">2024-05-23</h2>
<ul>
<li>Finalize last of the duplicates I found for the IFPRI 2020–2021 batch set (those that we missed initially due to mismatched types)</li>
<li>Export a new list of IFPRI redirects from CONTENTdm:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">'dc.description.provenance[en_US]'</span> -r <span style="color:#e6db74">'Original URLs? from IFPRI CONTENTdm'</span> cgspace.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvcut -c 'id,dc.description.provenance[en_US],dc.identifier.uri[en_US]' \
</span></span><span style="display:flex;"><span> | tee /tmp/ifpri-redirects.csv \
</span></span><span style="display:flex;"><span> | csvstat --count
</span></span><span style="display:flex;"><span>4004
</span></span></code></pre></div><p>I found a way to get abstracts from PLOS</p>
<ul>
<li>They offer an API that returns XML including the JATS-formatted abstracts</li>
<li>I created a new column in OpenRefine by fetching specially crafted URLs based on the DOIs using this GREL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>"https://journals.plos.org/plosone/article/file?id=" + cells['doi'].value + '&type=manuscript'
</span></span></code></pre></div><p>Then I used <code>value.parseXml()</code> on the resulting text to extract the abstract’s text:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>value.parseXml().select("abstract")[0].xmlText()
</span></span></code></pre></div><p>This doesn’t preserve <code>&lt;p&gt;</code> tags though…</p>
<ul>
<li>Oh, nice, this does!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>forEach(value.parseHtml().select("abstract p"), i, i.htmlText()).join("\r\n\r\n")
</span></span></code></pre></div><p>For each paragraph inside an abstract, get the inner text and join them as one string separated by two newlines…</p>
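<p>Outside OpenRefine, the same idea could be sketched in Python with the standard library's <code>xml.etree.ElementTree</code>. This is a rough equivalent and not the GREL itself: <code>SAMPLE_XML</code> is a made-up, heavily trimmed stand-in for a PLOS response, and <code>abstract_paragraphs</code> and <code>join_paragraphs</code> are hypothetical helper names.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a PLOS manuscript XML response
SAMPLE_XML = """<article><front><article-meta>
<abstract><p>First <italic>paragraph</italic>.</p><p>Second paragraph.</p></abstract>
</article-meta></front></article>"""


def abstract_paragraphs(xml_text):
    """Return the text of each <p> under the first <abstract>, in order."""
    root = ET.fromstring(xml_text)
    abstract = root.find(".//abstract")
    if abstract is None:
        return []
    # itertext() also collects text nested in inline markup such as <italic>
    return ["".join(p.itertext()).strip() for p in abstract.iter("p")]


def join_paragraphs(paragraphs):
    """Join paragraphs with a blank line, using the same separator as the GREL."""
    return "\r\n\r\n".join(paragraphs)
```

<p>Calling <code>join_paragraphs(abstract_paragraphs(SAMPLE_XML))</code> gives the two sample paragraphs separated by <code>\r\n\r\n</code>.</p>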
<ul>
<li>Ah, some articles have multiple abstracts, for example: <a href="https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript">https://journals.plos.org/plosone/article/file?id=https://doi.org/10.1371/journal.pntd.0001859&type=manuscript</a></li>
<li>I need to select the abstract that does <strong>not</strong> have any attributes (using <a href="https://jsoup.org/apidocs/org/jsoup/select/Selector.html">Jsoup selector syntax</a>)</li>
</ul>
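<p>As a cross-check outside OpenRefine, picking the abstract without attributes might look like this in Python. It is a sketch under assumptions: <code>SAMPLE_XML</code> is an invented example in which a typed summary abstract comes before the main one, and <code>main_abstract</code> is a hypothetical helper name.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a typed summary abstract followed by the main abstract
SAMPLE_XML = """<article><front><article-meta>
<abstract abstract-type="summary"><p>Author summary.</p></abstract>
<abstract><p>Background.</p><p>Results.</p></abstract>
</article-meta></front></article>"""


def main_abstract(xml_text):
    """Return the first <abstract> element with no attributes, or None."""
    root = ET.fromstring(xml_text)
    for abstract in root.iter("abstract"):
        # Element.attrib is an empty dict when the tag carries no attributes
        if not abstract.attrib:
            return abstract
    return None
```

<p>Returning <code>None</code> when every abstract carries attributes keeps that case visible instead of silently falling back to the wrong abstract.</p>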
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>forEach(value.parseXml().select("abstract:not([*]) p"), i, i.xmlText()).join("\r\n\r\n")
</span></span></code></pre></div><p>Testing <code>xsv</code> (Rust) versus <code>csvkit</code> (Python) to filter all items with DOIs from a DSpace dump with 118,000 items:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ time xsv search -s doi <span style="color:#e6db74">'doi\.org'</span> /tmp/cgspace-minimal.csv | xsv <span style="color:#66d9ef">select</span> doi | xsv count
</span></span><span style="display:flex;"><span>27339
</span></span><span style="display:flex;"><span>xsv search -s doi 'doi\.org' /tmp/cgspace-minimal.csv 0.06s user 0.03s system 98% cpu 0.091 total
</span></span><span style="display:flex;"><span>xsv select doi 0.02s user 0.02s system 40% cpu 0.091 total
</span></span><span style="display:flex;"><span>xsv count 0.01s user 0.00s system 9% cpu 0.090 total
</span></span><span style="display:flex;"><span>$ time csvgrep -c doi -m <span style="color:#e6db74">'doi.org'</span> /tmp/cgspace-minimal.csv | csvcut -c doi | csvstat --count
</span></span><span style="display:flex;"><span>27339
</span></span><span style="display:flex;"><span>csvgrep -c doi -m 'doi.org' /tmp/cgspace-minimal.csv 1.15s user 0.06s system 95% cpu 1.273 total
</span></span><span style="display:flex;"><span>csvcut -c doi 0.42s user 0.05s system 36% cpu 1.283 total
</span></span><span style="display:flex;"><span>csvstat --count 0.20s user 0.03s system 18% cpu 1.298 total
</span></span></code></pre></div><h2 id="2024-05-27">2024-05-27</h2>
<ul>
<li>Working on IFPRI datasets batch migration
<ul>
<li>732 items total</li>
<li>6 duplicates on CGSpace</li>
<li>6 duplicates within the set that need investigation</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->