mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-23 05:32:20 +01:00
500 lines
26 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="May, 2022" />
|
|
<meta property="og:description" content="2022-05-04
|
|
|
|
I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
|
|
|
|
18.207.136.176
|
|
185.189.36.248
|
|
50.118.223.78
|
|
52.70.76.123
|
|
3.236.10.11
|
|
|
|
|
|
Looking at the Solr statistics for 2022-04
|
|
|
|
52.191.137.59 is owned by Microsoft, but it is using a normal user agent and making tens of thousands of requests
|
|
64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
|
|
185.192.69.15 is in the Netherlands and is using a normal user agent, but is making excessive automated HTTP requests to paths forbidden in robots.txt
|
|
157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
|
|
52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
|
|
157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
|
|
207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
|
|
If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. I see a handful of IPs that made 41,000 requests
|
|
|
|
|
|
I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-05/" />
|
|
<meta property="article:published_time" content="2022-05-04T09:13:39+03:00" />
|
|
<meta property="article:modified_time" content="2022-05-30T16:00:02+03:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="May, 2022"/>
|
|
<meta name="twitter:description" content="2022-05-04
|
|
|
|
I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
|
|
|
|
18.207.136.176
|
|
185.189.36.248
|
|
50.118.223.78
|
|
52.70.76.123
|
|
3.236.10.11
|
|
|
|
|
|
Looking at the Solr statistics for 2022-04
|
|
|
|
52.191.137.59 is owned by Microsoft, but it is using a normal user agent and making tens of thousands of requests
|
|
64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
|
|
185.192.69.15 is in the Netherlands and is using a normal user agent, but is making excessive automated HTTP requests to paths forbidden in robots.txt
|
|
157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
|
|
52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
|
|
157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
|
|
207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
|
|
If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. I see a handful of IPs that made 41,000 requests
|
|
|
|
|
|
I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.102.3" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "May, 2022",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2022-05/",
|
|
"wordCount": "1673",
|
|
"datePublished": "2022-05-04T09:13:39+03:00",
|
|
"dateModified": "2022-05-30T16:00:02+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-05/">
|
|
|
|
<title>May, 2022 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-05/">May, 2022</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2022-05-04T09:13:39+03:00">Wed May 04, 2022</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2022-05-04">2022-05-04</h2>
|
|
<ul>
|
|
<li>I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
|
|
<ul>
|
|
<li>18.207.136.176</li>
|
|
<li>185.189.36.248</li>
|
|
<li>50.118.223.78</li>
|
|
<li>52.70.76.123</li>
|
|
<li>3.236.10.11</li>
|
|
</ul>
|
|
</li>
|
|
<li>Looking at the Solr statistics for 2022-04
|
|
<ul>
|
|
<li>52.191.137.59 is owned by Microsoft, but it is using a normal user agent and making tens of thousands of requests</li>
|
|
<li>64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc</li>
|
|
<li>185.192.69.15 is in the Netherlands and is using a normal user agent, but is making excessive automated HTTP requests to paths forbidden in robots.txt</li>
|
|
<li>157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
|
|
<li>52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
|
|
<li>157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
|
|
<li>207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
|
|
<li>If I query Solr for <code>time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.</code> I see a handful of IPs that made 41,000 requests</li>
|
|
</ul>
|
|
</li>
|
|
<li>I purged 93,974 hits from these IPs using my <code>check-spider-ip-hits.sh</code> script</li>
|
|
</ul>
|
|
<ul>
|
|
<li>Now looking at the Solr statistics by user agent I see:
|
|
<ul>
|
|
<li><code>SomeRandomText</code></li>
|
|
<li><code>RestSharp/106.11.7.0</code></li>
|
|
<li><code>MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)</code></li>
|
|
<li><code>wp_is_mobile</code></li>
|
|
<li><code>Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1</code></li>
|
|
<li><code>insomnia/2022.2.1</code></li>
|
|
<li><code>ZoteroTranslationServer</code></li>
|
|
<li><code>omgili/0.5 +http://omgili.com</code></li>
|
|
<li><code>curb</code></li>
|
|
<li><code>Sprout Social (Link Attachment)</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>I purged 2,900 hits from these user agents from Solr using my <code>check-spider-hits.sh</code> script</li>
|
|
<li>I made a <a href="https://github.com/atmire/COUNTER-Robots/pull/54">pull request to COUNTER-Robots</a> for some of these agents
|
|
<ul>
|
|
<li>In the meantime I will add them to our local overrides in DSpace</li>
|
|
</ul>
|
|
</li>
|
|
<li>Run all system updates on AReS server, update all Docker containers, and restart the server
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
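<ul>
<li>As a rough illustration of how agent patterns like the ones above can be applied to a list of user agents, here is a minimal Python sketch. The pattern list and the function names <code>is_bot</code> and <code>split_agents</code> are hypothetical; this is not the COUNTER-Robots list, not our DSpace local overrides, and not necessarily what <code>check-spider-hits.sh</code> does internally:</li>
</ul>

```python
import re

# Hypothetical patterns for illustration only. They are NOT the real
# COUNTER-Robots list or the actual DSpace local overrides.
EXAMPLE_PATTERNS = [
    r"RestSharp",
    r"MetaInspector",
    r"wp_is_mobile",
    r"insomnia",
    r"ZoteroTranslationServer",
    r"omgili",
    r"^curb$",
    r"Sprout Social",
]

# Assumption: patterns are treated as case-insensitive regular expressions.
COMPILED = [re.compile(p, re.IGNORECASE) for p in EXAMPLE_PATTERNS]


def is_bot(user_agent: str) -> bool:
    """Return True if any example pattern matches the user agent string."""
    return any(p.search(user_agent) for p in COMPILED)


def split_agents(user_agents):
    """Partition user agents into (bots, others), preserving input order."""
    bots, others = [], []
    for agent in user_agents:
        (bots if is_bot(agent) else others).append(agent)
    return bots, others
```

<ul>
<li>Anchoring short names like <code>curb</code> with <code>^</code> and <code>$</code> is a deliberate choice in this sketch so that unrelated agents that merely contain the same letters are not flagged</li>
</ul>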
|
|
<h2 id="2022-05-05">2022-05-05</h2>
|
|
<ul>
|
|
<li>Update PostgreSQL JDBC driver to 42.3.5 in the Ansible infrastructure playbooks and deploy on DSpace Test</li>
|
|
<li>Peter asked me how many items we add to CGSpace every year
|
|
<ul>
|
|
<li>I wrote a SQL query to check the number of items grouped by their accession dates since 2009:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT EXTRACT(year from text_value::date) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
|
|
</span></span><span style="display:flex;"><span> yyyy │ count
|
|
</span></span><span style="display:flex;"><span>──────┼───────
|
|
</span></span><span style="display:flex;"><span> 2022 │ 2073
|
|
</span></span><span style="display:flex;"><span> 2021 │ 6471
|
|
</span></span><span style="display:flex;"><span> 2020 │ 4074
|
|
</span></span><span style="display:flex;"><span> 2019 │ 7330
|
|
</span></span><span style="display:flex;"><span> 2018 │ 8899
|
|
</span></span><span style="display:flex;"><span> 2017 │ 6860
|
|
</span></span><span style="display:flex;"><span> 2016 │ 8451
|
|
</span></span><span style="display:flex;"><span> 2015 │ 15692
|
|
</span></span><span style="display:flex;"><span> 2014 │ 16479
|
|
</span></span><span style="display:flex;"><span> 2013 │ 4388
|
|
</span></span><span style="display:flex;"><span> 2012 │ 6472
|
|
</span></span><span style="display:flex;"><span> 2011 │ 2694
|
|
</span></span><span style="display:flex;"><span> 2010 │ 2457
|
|
</span></span><span style="display:flex;"><span> 2009 │ 293
|
|
</span></span></code></pre></div><ul>
|
|
<li>Note that I had an issue with casting <code>text_value</code> to date because one item had an accession date of <code>2016</code> instead of <code>2016-09-29T20:14:47Z</code>
|
|
<ul>
|
|
<li>Once I fixed that, PostgreSQL was able to <a href="https://www.postgresql.org/docs/12/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT">extract() the year</a></li>
|
|
<li>I also tried some other methods that worked, for example <code>TO_DATE()</code>:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT EXTRACT(year from TO_DATE(text_value, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
|
|
</span></span></code></pre></div><ul>
|
|
<li>But it seems PostgreSQL is smart enough to recognize date formatting in strings automatically when we cast, so we don’t need to spell out the format with <code>TO_DATE()</code> first</li>
|
|
<li>Another thing I noticed is that a few hundred items have accession dates from decades ago; perhaps this is due to importing items from the CGIAR Library?</li>
|
|
<li>I spent some time merging a few pull requests for DSpace 6.4 and porting one to <code>main</code> for DSpace 7.x</li>
|
|
<li>I also submitted a <a href="https://github.com/DSpace/DSpace/pull/8288">pull request to migrate Mirage 2’s build from bower and compass to yarn and node-sass</a></li>
|
|
</ul>
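<ul>
<li>The yearly grouping from this entry, including the problem of a bare <code>2016</code> value breaking the date cast, can be sketched outside the database too. This is a minimal Python sketch, assuming accession dates look like <code>2016-09-29T20:14:47Z</code>; the function names <code>accession_year</code> and <code>count_by_year</code> are hypothetical, and <code>strptime</code> is stricter than PostgreSQL's cast, so it will reject some values PostgreSQL would accept:</li>
</ul>

```python
from collections import Counter
from datetime import datetime


def accession_year(text_value: str):
    """Return the year of a full ISO 8601 UTC timestamp, or None if malformed."""
    try:
        return datetime.strptime(text_value, "%Y-%m-%dT%H:%M:%SZ").year
    except ValueError:
        # A bare year such as "2016" does not match the full timestamp format.
        return None


def count_by_year(values):
    """Count values per accession year and collect the ones that fail to parse."""
    counts = Counter()
    malformed = []
    for value in values:
        year = accession_year(value)
        if year is None:
            malformed.append(value)
        else:
            counts[year] += 1
    return counts, malformed
```

<ul>
<li>Collecting the malformed values separately, rather than failing on the first one, makes it easier to find and fix them before running the real query</li>
</ul>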
|
|
<h2 id="2022-05-07">2022-05-07</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2022-05-09">2022-05-09</h2>
|
|
<ul>
|
|
<li>Submit an issue to Atmire’s bug tracker inquiring about DSpace 6.4 support</li>
|
|
</ul>
|
|
<h2 id="2022-05-10">2022-05-10</h2>
|
|
<ul>
|
|
<li>Submit an updated <a href="https://github.com/DSpace/DSpace/pull/8292">pull request to migrate Mirage 2’s build from bower and compass to npm and node-sass</a>
|
|
<ul>
|
|
<li>This one is better than the previous one because it uses npm directly, which comes with the Node.js distribution, rather than requiring the user to install yarn</li>
|
|
<li>I also updated a bunch of grunt build deps</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2022-05-12">2022-05-12</h2>
|
|
<ul>
|
|
<li>CGSpace meeting with Abenet and Peter
|
|
<ul>
|
|
<li>We discussed the future of CGSpace and DSpace in general in the new One CGIAR</li>
|
|
<li>We discussed how to prepare for bringing in content from the Initiatives, and whether we need new metadata fields to support people from IFPRI, etc.</li>
|
|
<li>We discussed the need for good quality Drupal and WordPress modules so sites can harvest content from the repository</li>
|
|
<li>Peter asked me to send him a list of investors/funders/donors so he can clean it up and try to align it with ROR, and eventually do something like we do with country codes: adding the ROR IDs and potentially showing the badge on item views</li>
|
|
<li>We also discussed removing some Mirage 2 themes for old programs and CRPs that don’t have custom branding, i.e. only Google Analytics</li>
|
|
</ul>
|
|
</li>
|
|
<li>Export a list of donors for Peter to clean up:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-05-12-donors.csv WITH CSV HEADER;
|
|
</span></span><span style="display:flex;"><span>COPY 1184
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then I created a CSV from our <code>cg-creator-identifier.xml</code> controlled vocabulary and ran it against our database with <code>add-orcid-identifiers-csv.py</code> to see if it matched any author names that are missing ORCID identifiers in CGSpace</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-05-12-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> | tee /tmp/orcid.log
|
|
</span></span><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">"Adding ORCID"</span> /tmp/orcid.log
|
|
</span></span><span style="display:flex;"><span>85
|
|
</span></span></code></pre></div><ul>
|
|
<li>So it’s only eighty-five, but better than nothing…</li>
|
|
<li>I removed the custom Mirage 2 themes for some old projects:
|
|
<ul>
|
|
<li>AgriFood</li>
|
|
<li>AVCD</li>
|
|
<li>LIVES</li>
|
|
<li>FeedTheFuture</li>
|
|
<li>DrylandSystems</li>
|
|
<li>TechnicalConsortium</li>
|
|
<li>EADD</li>
|
|
</ul>
|
|
</li>
|
|
<li>That should knock off a few minutes of the maven build time!</li>
|
|
<li>I generated a report from the AReS nginx logs on linode18:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/access.log.* | grep <span style="color:#e6db74">'GET /explorer'</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED - -o /tmp/ares_report.html
|
|
</span></span></code></pre></div><h2 id="2022-05-13">2022-05-13</h2>
|
|
<ul>
|
|
<li>Peter finalized the corrections on donors from yesterday so I extracted them into fix/delete CSVs and ran them on CGSpace:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i 2022-05-13-fix-CGSpace-Donors.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f cg.contributor.donor -m <span style="color:#ae81ff">248</span> -t correct -d
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/delete-metadata-values.py -i 2022-05-13-delete-CGSpace-Donors.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f cg.contributor.donor -m <span style="color:#ae81ff">248</span> -d
|
|
</span></span></code></pre></div><ul>
|
|
<li>I cleaned up a few records manually (like some that had \r\n) then re-exported the donors and checked against the latest ROR dump:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2022-05-13-donors.csv -r v1.0-2022-03-17-ror-data.json -o /tmp/2022-05-13-ror.csv
|
|
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2022-05-13-ror.csv | wc -l
|
|
</span></span><span style="display:flex;"><span>230
|
|
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m false /tmp/2022-05-13-ror.csv | csvcut -c organization > /tmp/2022-05-13-ror-unmatched.csv
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then I sent Peter a list so he can try to update some from ROR</li>
|
|
<li>I did some work to upgrade the Mirage 2 build dependencies in our <code>6_x-prod</code> branch
|
|
<ul>
|
|
<li>I switched to Node.js 14 also</li>
|
|
</ul>
|
|
</li>
|
|
<li>Meeting with Margarita and Manuel from ABC to discuss uploading ~6,000 automatically-generated CRP policy reports from MARLO to CGSpace
|
|
<ul>
|
|
<li>They will try to provide the records and PDFs by mid June because they are still finalizing the reports for 2021</li>
|
|
<li>MARLO will be going offline because it was for the CRPs</li>
|
|
<li>We reviewed the metadata they have and gave them some advice on the formatting</li>
|
|
<li>Once we upload the records I will need to provide them with a mapping of the MARLO URLs to Handle URLs so they can set up redirects</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
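<ul>
<li>The ROR check earlier in this entry boils down to looking up each donor name in the ROR dump. Here is a minimal Python sketch of that idea, assuming each record in the dump has <code>id</code>, <code>name</code>, <code>aliases</code> and <code>acronyms</code> keys; the function names are hypothetical and the real <code>ror-lookup.py</code> may match differently:</li>
</ul>

```python
def build_ror_index(ror_records):
    """Map lower-cased names, aliases and acronyms to their ROR ID."""
    index = {}
    for record in ror_records:
        # Assumption: "aliases" and "acronyms" are lists of plain strings.
        names = [record["name"]] + record.get("aliases", []) + record.get("acronyms", [])
        for name in names:
            # Keep the first record seen for a given name.
            index.setdefault(name.strip().lower(), record["id"])
    return index


def match_organizations(organizations, index):
    """Return rows of (organization, matched, ror_id) for each input name."""
    rows = []
    for org in organizations:
        ror_id = index.get(org.strip().lower())
        rows.append((org, ror_id is not None, ror_id))
    return rows
```

<ul>
<li>This only does exact, case-insensitive matching, so names with small spelling differences stay unmatched and still need manual review</li>
</ul>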
|
|
<h2 id="2022-05-14">2022-05-14</h2>
|
|
<ul>
|
|
<li>Start a full Discovery index</li>
|
|
<li>Start an AReS harvest</li>
|
|
</ul>
|
|
<h2 id="2022-05-23">2022-05-23</h2>
|
|
<ul>
|
|
<li>Start an AReS harvest</li>
|
|
</ul>
|
|
<h2 id="2022-05-24">2022-05-24</h2>
|
|
<ul>
|
|
<li>Update CGSpace to latest <code>6_x-prod</code> branch, which removes a handful of Mirage 2 themes and migrates to Node.js 14 and some newer build deps</li>
|
|
<li>Run all system updates on CGSpace (linode18) and reboot it</li>
|
|
</ul>
|
|
<h2 id="2022-05-25">2022-05-25</h2>
|
|
<ul>
|
|
<li>Maria Garruccio sent me a handful of new ORCID identifiers for Alliance staff
|
|
<ul>
|
|
<li>We currently have 1349 unique identifiers and this adds about forty-five new ones (!):</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | sort | uniq | wc -l
|
|
</span></span><span style="display:flex;"><span>1349
|
|
</span></span><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new-abc-orcids.txt | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort | uniq > /tmp/2022-05-25-combined-orcids.txt
|
|
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2022-05-25-combined-orcids.txt
|
|
</span></span><span style="display:flex;"><span>1395 /tmp/2022-05-25-combined-orcids.txt
|
|
</span></span></code></pre></div><ul>
|
|
<li>After combining and filtering them I resolved their names using my <code>resolve-orcids.py</code> script:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-orcids.py -i /tmp/2022-05-25-combined-orcids.txt -o /tmp/2022-05-25-combined-orcids-names.txt
|
|
</span></span></code></pre></div><ul>
|
|
<li>There are some names that changed, so I need to run them through the <code>fix-metadata-values.py</code> script:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat 2022-05-25-update-orcids.csv
|
|
</span></span><span style="display:flex;"><span>cg.creator.identifier,correct
|
|
</span></span><span style="display:flex;"><span>"Andrea Fongar: 0000-0003-2084-1571","ANDREA CECILIA SANCHEZ BOGADO: 0000-0003-4549-6970"
|
|
</span></span><span style="display:flex;"><span>"Bekele Shiferaw: 0000-0002-3645-320X","Bekele A. Shiferaw: 0000-0002-3645-320X"
|
|
</span></span><span style="display:flex;"><span>"Henry Kpaka: 0000-0002-7480-2933","Henry Musa Kpaka: 0000-0002-7480-2933"
|
|
</span></span><span style="display:flex;"><span>"Josephine Agogbua: 0000-0001-6317-1227","Josephine Udunma Agogbua: 0000-0001-6317-1227"
|
|
</span></span><span style="display:flex;"><span>"Martha Lilia Del Río Duque: 0000-0002-0879-0292","Martha Del Río: 0000-0002-0879-0292"
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i 2022-05-25-update-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f cg.creator.identifier -m <span style="color:#ae81ff">247</span> -t correct -d -n
|
|
</span></span><span style="display:flex;"><span>Connected to database.
|
|
</span></span><span style="display:flex;"><span>Would fix 4 occurences of: Andrea Fongar: 0000-0003-2084-1571
|
|
</span></span><span style="display:flex;"><span>Would fix 1 occurences of: Bekele Shiferaw: 0000-0002-3645-320X
|
|
</span></span><span style="display:flex;"><span>Would fix 2 occurences of: Josephine Agogbua: 0000-0001-6317-1227
|
|
</span></span><span style="display:flex;"><span>Would fix 34 occurences of: Martha Lilia Del Río Duque: 0000-0002-0879-0292
|
|
</span></span></code></pre></div><h2 id="2022-05-26">2022-05-26</h2>
|
|
<ul>
|
|
<li>I extracted the names and ORCID identifiers from Maria’s spreadsheet and produced several CSV files with different name formats:
|
|
<ul>
|
|
<li>First Last (GREL: <code>cells['First Name'].value + ' ' + cells['Surname'].value</code>)</li>
|
|
<li>Last, First (GREL: <code>cells['Surname'].value + ", " + cells['First Name'].value</code>)</li>
|
|
<li>Last, F. (GREL: <code>cells['Surname'].value + ", " + cells['First Name'].value.substring(0, 1) + "."</code>)</li>
|
|
</ul>
|
|
</li>
|
|
<li>Then I constructed a CSV for each of these variations to use with <code>add-orcid-identifiers-csv.py</code>
|
|
<ul>
|
|
<li>In total I matched a bunch of authors and added 872 new metadata fields!</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
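<ul>
<li>The three GREL expressions above can also be sketched in Python. This is only an illustration: the function names are hypothetical, and pairing each variant with a <code>Name: ORCID</code> value is an assumption about what the input for <code>add-orcid-identifiers-csv.py</code> looks like:</li>
</ul>

```python
def name_variants(first_name: str, surname: str):
    """Return the 'First Last', 'Last, First' and 'Last, F.' forms of a name."""
    first_name = first_name.strip()
    surname = surname.strip()
    return [
        f"{first_name} {surname}",
        f"{surname}, {first_name}",
        f"{surname}, {first_name[:1]}.",
    ]


def variant_rows(first_name: str, surname: str, orcid: str):
    """Pair each name variant with a 'Name: ORCID' identifier value."""
    # Assumption: the identifier always uses the 'First Last' form of the name.
    identifier = f"{first_name.strip()} {surname.strip()}: {orcid}"
    return [(variant, identifier) for variant in name_variants(first_name, surname)]
```

<ul>
<li>For example, <code>name_variants("Bekele", "Shiferaw")</code> gives <code>Bekele Shiferaw</code>, <code>Shiferaw, Bekele</code> and <code>Shiferaw, B.</code></li>
</ul>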
|
|
<h2 id="2022-05-27">2022-05-27</h2>
|
|
<ul>
|
|
<li>Send a follow-up to Leroy from the Alliance to ask about the CIAT Library URLs
|
|
<ul>
|
|
<li>It seems that I forgot to attach the list of PDFs when I last communicated with him in 2022-03</li>
|
|
</ul>
|
|
</li>
|
|
<li>Meeting with Terry Bucknell from Overton.io</li>
|
|
</ul>
|
|
<h2 id="2022-05-28">2022-05-28</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2022-05-30">2022-05-30</h2>
|
|
<ul>
|
|
<li>Help IITA with some collection authorization issues on CGSpace</li>
|
|
<li>Finally looking into Peter’s Altmetric export from 2022-02
|
|
<ul>
|
|
<li>We want to try to compare some of the information about open access status with that in CGSpace</li>
|
|
<li>I created a new column for all items that have CGSpace handles using this GREL:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>"https://hdl.handle.net/" + value.match(/.*?(10568\/\d+).*?/)[0]
|
|
</span></span></code></pre></div><ul>
|
|
<li>With that I can do a join on the CGSpace metadata and perhaps clean up some items</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./bin/dspace metadata-export -f 2022-05-30-cgspace.csv
|
|
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,dc.identifier.uri[en_US],dcterms.accessRights[en_US],dcterms.license[en_US]'</span> 2022-05-30-cgspace.csv | sed <span style="color:#e6db74">'1 s/dc\.identifier\.uri\[en_US\]/dc.identifier.uri/'</span> > /tmp/cgspace.csv
|
|
</span></span><span style="display:flex;"><span>$ csvjoin -c <span style="color:#e6db74">'dc.identifier.uri'</span> ~/Downloads/2022-05-30-Altmetric-Research-Outputs-CGSpace.csv /tmp/cgspace.csv > /tmp/cgspace-altmetric.csv
|
|
</span></span></code></pre></div><ul>
|
|
<li>Examining the data in OpenRefine, I spot-checked a few records where Altmetric and CGSpace disagree, and in most cases I found Altmetric to be wrong…</li>
|
|
</ul>
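<ul>
<li>The handle extraction and the join from this entry can be sketched in Python as well. This is a minimal sketch, assuming both inputs are lists of dicts that share a <code>dc.identifier.uri</code> key; the function names are hypothetical, and the regular expression is a simplified version of the one in the GREL above:</li>
</ul>

```python
import re

# CGSpace handles use the 10568 prefix followed by a numeric suffix.
HANDLE_RE = re.compile(r"10568/\d+")


def handle_url(value: str):
    """Return the hdl.handle.net URL for the first CGSpace handle in value, or None."""
    match = HANDLE_RE.search(value)
    return f"https://hdl.handle.net/{match.group(0)}" if match else None


def join_on_handle(altmetric_rows, cgspace_rows, url_field="dc.identifier.uri"):
    """Inner-join two lists of dicts on the handle URL field."""
    cgspace_by_url = {row[url_field]: row for row in cgspace_rows}
    joined = []
    for row in altmetric_rows:
        other = cgspace_by_url.get(row.get(url_field))
        if other is not None:
            # On key collisions the Altmetric value wins in this sketch.
            joined.append({**other, **row})
    return joined
```

<ul>
<li>Rows without a matching handle are dropped, so comparing the input and output counts shows how many Altmetric records did not line up with CGSpace</li>
</ul>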
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|