<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="December, 2022" />
<meta property="og:description" content="2022-12-01
Fix some incorrect regions on CGSpace
I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpace (UN M.49 region)
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-12/" />
<meta property="article:published_time" content="2022-12-01T08:52:36+03:00" />
<meta property="article:modified_time" content="2022-12-15T16:41:04+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2022"/>
<meta name="twitter:description" content="2022-12-01
Fix some incorrect regions on CGSpace
I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpace (UN M.49 region)
"/>
<meta name="generator" content="Hugo 0.108.0">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "December, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-12/",
"wordCount": "1588",
"datePublished": "2022-12-01T08:52:36+03:00",
"dateModified": "2022-12-15T16:41:04+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-12/">
<title>December, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-12/">December, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-12-01T08:52:36+03:00">Thu Dec 01, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-12-01">2022-12-01</h2>
<ul>
<li>Fix some incorrect regions on CGSpace
<ul>
<li>I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions</li>
</ul>
</li>
<li>Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!</li>
<li>Replace &ldquo;East Asia&rdquo; with &ldquo;Eastern Asia&rdquo; region on CGSpace (UN M.49 region)</li>
</ul>
<ul>
<li>CGSpace and PRMS information session with Enrico and a bunch of researchers</li>
<li>I noticed some minor issues with SPDX licenses and AGROVOC terms in items submitted by TIP so I sent a message to Daniel from Alliance</li>
<li>I started a harvest on AReS since we&rsquo;ve updated so much metadata recently</li>
</ul>
<h2 id="2022-12-02">2022-12-02</h2>
<ul>
<li>File some issues related to metadata on the MEL issue tracker
<ul>
<li><a href="https://github.com/CodeObia/MEL/issues/11066">Only use &ldquo;Open Access&rdquo; or &ldquo;Limited Access&rdquo; access rights when publishing items on CGSpace</a></li>
<li><a href="https://github.com/CodeObia/MEL/issues/11067">Set the description when submitting bitstreams to CGSpace</a></li>
<li><a href="https://github.com/CodeObia/MEL/issues/11068">Some items have a Creative Commons license, but are Limited Access and bitstreams are locked</a></li>
</ul>
</li>
</ul>
<h2 id="2022-12-03">2022-12-03</h2>
<ul>
<li>I downloaded a fresh copy of CLARISA&rsquo;s institutions list as well as ROR&rsquo;s latest dump from 2022-12-01 to check how many are matching:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s https://api.clarisa.cgiar.org/api/institutions | json_pp &gt; ~/Downloads/2022-12-03-CLARISA-institutions.json
</span></span><span style="display:flex;"><span>$ jq -r <span style="color:#e6db74">&#39;.[] | .name&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.json &gt; ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i ~/Downloads/2022-12-03-CLARISA-institutions.txt -o /tmp/clarisa-ror-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/clarisa-ror-matches.csv | wc -l
</span></span><span style="display:flex;"><span>1864
</span></span><span style="display:flex;"><span>$ wc -l ~/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span><span style="display:flex;"><span>7060 /home/aorth/Downloads/2022-12-03-CLARISA-institutions.txt
</span></span></code></pre></div><ul>
<li>Out of the box they match 26.4%, but there are many institutions with multiple languages in the text value, as well as countries in parentheses so I think it could be higher</li>
<li>If I replace the slashes and remove the countries at the end there are slightly more matches, around 29%:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ sed -e <span style="color:#e6db74">&#39;s_ / _\n_&#39;</span> -e <span style="color:#e6db74">&#39;s_/_\n_&#39;</span> -e <span style="color:#e6db74">&#39;s/ \?(.*)$//&#39;</span> ~/Downloads/2022-12-03-CLARISA-institutions.txt &gt; ~/Downloads/2022-12-03-CLARISA-institutions-alan.txt
</span></span></code></pre></div><ul>
<li>I checked CGSpace&rsquo;s top 1,000 affiliations too, first exporting from PostgreSQL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2022-11-22-affiliations.csv;
</span></span></code></pre></div><ul>
<li>Then cutting (tab is the default delimiter):</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cut -f <span style="color:#ae81ff">1</span> /tmp/2022-11-22-affiliations.csv &gt; 2022-11-22-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i 2022-11-22-affiliations.txt -o /tmp/cgspace-matches.csv -r v1.15-2022-12-01-ror-data.json
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/cgspace-matches.csv | wc -l
</span></span><span style="display:flex;"><span>542
</span></span></code></pre></div><ul>
<li>So that&rsquo;s a 54% match for our top affiliations</li>
<li>I realized we should actually check affiliations and sponsors, since those are stored in separate fields
<ul>
<li>When I add those the matches go down a bit to 45%</li>
</ul>
</li>
<li>Oh man, I realized institutions like <code>Université d'Abomey Calavi</code> don&rsquo;t match in ROR because they are like this in the JSON:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>&#34;name&#34;: &#34;Universit\u00e9 d&#39;Abomey-Calavi&#34;
</span></span></code></pre></div><ul>
<li>So we likely match a bunch more than 50%&hellip;</li>
<li>I exported a list of affiliations and donors from CGSpace for Peter to look over and send corrections</li>
</ul>
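<ul>
<li>A minimal Python sketch of the kind of normalization that might raise the match rate, assuming the misses come mostly from slashes, trailing countries in parentheses, accents, and hyphens. The names <code>clean_institution_name</code> and <code>match_key</code> are made up for illustration and are not part of <code>ror-lookup.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import re
import unicodedata


def clean_institution_name(raw):
    """Split one line of the institutions list into candidate names.

    Roughly mirrors the sed expressions above: drop one trailing
    parenthesised country, then split on slashes.
    """
    without_country = re.sub(r" ?\([^()]*\)$", "", raw.strip())
    return [part.strip() for part in without_country.split("/") if part.strip()]


def match_key(name):
    """Build a comparison key: accents removed, hyphens as spaces, case-folded."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = no_accents.replace("-", " ")
    return " ".join(spaced.casefold().split())
</code></pre>
<ul>
<li>With this key, <code>Université d'Abomey Calavi</code> and the ROR name <code>Université d'Abomey-Calavi</code> compare equal, since a JSON parser already decodes <code>\u00e9</code> and the remaining difference is the hyphen. Comparing keys this loosely could also produce false matches, so results would still need a manual check.</li>
</ul>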
<h2 id="2022-12-05">2022-12-05</h2>
<ul>
<li>First day of PRMS technical workshop in Rome</li>
<li>Last night I submitted a CSV import with changes to 1,500 Alliance items (adding regions) and it hadn&rsquo;t completed after twenty-four hours so I canceled it
<ul>
<li>Not sure if there is some rollback that will happen or what state the database will be in, so I will wait a few hours to see what happens before trying to modify those items again</li>
<li>I started it again a few hours later with a subset of the items and 4GB of RAM instead of 2</li>
<li>It completed successfully&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2022-12-07">2022-12-07</h2>
<ul>
<li>I found a bug in my csv-metadata-quality script regarding the regions
<ul>
<li>I was accidentally checking <code>cg.coverage.subregion</code> due to a sloppy regex</li>
<li>This means I&rsquo;ve added a few thousand UN M.49 regions to the <code>cg.coverage.subregion</code> field in the last few days</li>
<li>I had to extract them from CGSpace and delete them using <code>delete-metadata-values.py</code></li>
</ul>
</li>
<li>My <a href="https://github.com/DSpace/DSpace/pull/8550">DSpace 7.x pull request to tell ImageMagick about the PDF CropBox</a> was merged</li>
<li>Start a harvest on AReS</li>
</ul>
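<ul>
<li>A small Python sketch of how this kind of regex bug can happen. The real pattern in csv-metadata-quality is not shown in these notes, so both patterns and the column names below are hypothetical:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import re

# Hypothetical column names, in the style of a DSpace CSV export
columns = [
    "cg.coverage.region[en_US]",
    "cg.coverage.subregion[en_US]",
    "dcterms.title[en_US]",
]

# An unanchored pattern matches any column containing "region"
sloppy = re.compile(r"region")

# Anchoring on the full field name, with an optional language suffix,
# leaves the subregion column alone
strict = re.compile(r"^cg\.coverage\.region(\[[A-Za-z_]*\])?$")


def matching_columns(pattern, names):
    """Return the column names that the pattern matches."""
    return [name for name in names if pattern.search(name)]
</code></pre>
<ul>
<li>Here <code>matching_columns(sloppy, columns)</code> returns both coverage columns, while <code>matching_columns(strict, columns)</code> returns only <code>cg.coverage.region[en_US]</code>.</li>
</ul>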
<h2 id="2022-12-08">2022-12-08</h2>
<ul>
<li>While on the plane I decided to fix some ORCID identifiers, as I had seen some poorly formatted ones
<ul>
<li>I couldn&rsquo;t remember the XPath syntax so this was kinda hacky:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ xmllint --xpath <span style="color:#e6db74">&#39;//node/isComposedBy/node()&#39;</span> dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE <span style="color:#e6db74">&#39;label=&#34;.*&#34;&#39;</span> | sed -e <span style="color:#e6db74">&#39;s/label=&#34;//&#39;</span> -e <span style="color:#e6db74">&#39;s/&#34;$//&#39;</span> &gt; /tmp/orcid-names.txt
</span></span><span style="display:flex;"><span>$ ./ilri/update-orcids.py -i /tmp/orcid-names.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -m <span style="color:#ae81ff">247</span>
</span></span></code></pre></div><ul>
<li>After that there were still some poorly formatted ones that my script didn&rsquo;t fix, so perhaps these are new ones not in our list
<ul>
<li>I dumped them and combined with the existing ones to resolve later:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=247 AND text_value LIKE &#39;%http%&#39;) to /tmp/orcid-formatting.txt;
</span></span><span style="display:flex;"><span>COPY 36
</span></span></code></pre></div><ul>
<li>I think there are really just some new ones&hellip;</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/orcid-formatting.txt| grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u &gt; /tmp/2022-12-08-orcids.txt
</span></span><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>1907
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2022-12-08-orcids.txt
</span></span><span style="display:flex;"><span>1939 /tmp/2022-12-08-orcids.txt
</span></span></code></pre></div><ul>
<li>Then I applied these updates on CGSpace</li>
<li>Maria mentioned that she was getting a lot more items in her daily subscription emails
<ul>
<li>I had a hunch it was related to me updating the <code>last_modified</code> timestamp after updating a bunch of countries, regions, etc. in items</li>
<li>Then today I noticed this option in <code>dspace.cfg</code>: <code>eperson.subscription.onlynew</code></li>
<li>By default DSpace sends notifications for modified items too! I&rsquo;ve disabled it now&hellip;</li>
</ul>
</li>
<li>I applied 498 fixes and two deletions to affiliations sent by Peter</li>
<li>I applied 206 fixes and eighty-one deletions to donors sent by Peter</li>
<li>I tried to figure out how to authenticate to the DSpace 7 REST API
<ul>
<li>First <a href="https://github.com/DSpace/RestContract/blob/main/csrf-tokens.md">you need a CSRF token</a>, before you can even try to authenticate</li>
<li>Then you can authenticate, but I can&rsquo;t get it to work:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -v https://dspace7test.ilri.org/server/api
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>dspace-xsrf-token: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>$ curl -v -X POST --data <span style="color:#e6db74">&#34;user=aorth@omg.com&amp;password=myPassword&#34;</span> <span style="color:#e6db74">&#34;https://dspace7test.ilri.org/server/authn/login&#34;</span> -H <span style="color:#e6db74">&#34;X-XSRF-TOKEN: 0b7861fb-9c8a-4eea-be70-b3be3bd0a0b4&#34;</span>
</span></span></code></pre></div><ul>
<li>Start a harvest on AReS</li>
</ul>
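<ul>
<li>A hedged Python sketch of the DSpace 7 login request from the authentication attempt above. It assumes, based on my reading of the REST contract linked there, that the login endpoint lives under <code>/server/api/authn/login</code> and that the CSRF token must be sent both in the <code>X-XSRF-TOKEN</code> header and in a <code>DSPACE-XSRF-COOKIE</code> cookie. The helper name <code>build_login_request</code> is made up, and the function only builds the request without sending it:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(api_base, csrf_token, user, password):
    """Build (but do not send) a login request for the DSpace 7 REST API."""
    body = urlencode({"user": user, "password": password}).encode("utf-8")
    return Request(
        api_base.rstrip("/") + "/authn/login",
        data=body,
        method="POST",
        headers={
            # Assumption: the same token goes in the header and in the cookie
            "X-XSRF-TOKEN": csrf_token,
            "Cookie": "DSPACE-XSRF-COOKIE=" + csrf_token,
        },
    )
</code></pre>
<ul>
<li>If those assumptions hold, the curl attempt may have failed because it posted to <code>/server/authn/login</code> and did not send the cookie. Passing the returned object to <code>urllib.request.urlopen</code> would send it, and a successful login should return a bearer token in the <code>Authorization</code> response header.</li>
</ul>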
<h2 id="2022-12-09">2022-12-09</h2>
<ul>
<li>I found a way to check the owner of a Handle prefix
<ul>
<li>You query the admin Handle for the prefix, i.e. <a href="https://hdl.handle.net/0.na/10568">https://hdl.handle.net/0.na/10568</a></li>
</ul>
</li>
</ul>
<h2 id="2022-12-11">2022-12-11</h2>
<ul>
<li>I got LDAP authentication working on DSpace 7</li>
</ul>
<h2 id="2022-12-12">2022-12-12</h2>
<ul>
<li>Submit some issues to MEL GitHub:
<ul>
<li><a href="https://github.com/CodeObia/MEL/issues/11081">Links to https://mel.cgiar.org/dspace/limited for Limited Access items on CGSpace</a></li>
<li><a href="https://github.com/CodeObia/MEL/issues/11083">Items submitted to CGSpace without Initiative</a></li>
</ul>
</li>
<li>PRMS planning meeting before tomorrow&rsquo;s meeting with researchers and submitters</li>
</ul>
<h2 id="2022-12-13">2022-12-13</h2>
<ul>
<li>I made some minor changes to csv-metadata-quality
<ul>
<li>I switched to using the SPDX license data as JSON directly from SPDX, instead of via the now-deprecated spdx-license-list package on PyPI</li>
</ul>
</li>
<li>I exported the Initiatives collection to tag missing regions</li>
<li>I submitted an issue to MEL GitHub:
<ul>
<li><a href="https://github.com/CodeObia/MEL/issues/11084">Set the description of bitstreams in the THUMBNAIL bundle to &ldquo;IM Thumbnail&rdquo; when submitting to CGSpace</a></li>
</ul>
</li>
<li>Submit a pull request to <a href="https://github.com/citizenlab/test-lists/pull/1199">fix the Handle link in the Citizen Lab test URLs for Iran</a>
<ul>
<li>I had originally submitted this in 2018, but it seems someone updated the URL in 2020&hellip; hmmm</li>
</ul>
</li>
<li>I normalized the <code>text_lang</code> values on CGSpace again:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 3050302
</span></span><span style="display:flex;"><span> en | 618
</span></span><span style="display:flex;"><span> | 605
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vi | 2
</span></span><span style="display:flex;"><span> es | 1
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(7 rows)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>dspace=# BEGIN;
</span></span><span style="display:flex;"><span>BEGIN
</span></span><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en&#39;, &#39;&#39;, NULL);
</span></span><span style="display:flex;"><span>UPDATE 1223
</span></span><span style="display:flex;"><span>dspace=# COMMIT;
</span></span><span style="display:flex;"><span>COMMIT
</span></span></code></pre></div><ul>
<li>I wrote an initial version of a script to map CGSpace items to Initiative collections based on their <code>cg.contributor.initiative</code> metadata
<ul>
<li>I am still considering if I want to add a mode to <em>un-map</em> items that are mapped to collections, but do not have the corresponding metadata tag</li>
</ul>
</li>
</ul>
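<ul>
<li>A note on the <code>text_lang</code> normalization above: the 1223 updated rows equal 618 + 605, which is consistent with <code>NULL</code> inside an <code>IN (...)</code> list never matching rows where the column is <code>NULL</code>. The sketch below shows that behavior using Python&rsquo;s built-in SQLite as a stand-in for PostgreSQL (an assumption; both follow SQL three-valued logic here), with a toy table in place of the real <code>metadatavalue</code>:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">import sqlite3

# Toy stand-in for the real table: one row per text_lang variant
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (text_lang TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?)",
    [("en_US",), ("en",), ("",), (None,)],
)


def count_where(condition):
    """Count rows matching a WHERE condition (trusted input only)."""
    query = "SELECT count(*) FROM metadatavalue WHERE " + condition
    return conn.execute(query).fetchone()[0]


# NULL inside the IN list compares as unknown, so the NULL row is skipped
in_list = count_where("text_lang IN ('en', '', NULL)")

# An explicit IS NULL test is needed to include it
with_is_null = count_where("text_lang IN ('en', '') OR text_lang IS NULL")
</code></pre>
<ul>
<li>Here <code>in_list</code> is 2 and <code>with_is_null</code> is 3, so rows with a <code>NULL</code> language would need an <code>OR text_lang IS NULL</code> clause to be normalized as well.</li>
</ul>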
<h2 id="2022-12-14">2022-12-14</h2>
<ul>
<li>Lots of work on PRMS related metadata issues with CGSpace
<ul>
<li>We noticed that PRMS uses <code>cg.identifier.dataurl</code> for the FAIR score, but not <code>cg.identifier.url</code></li>
<li>We don&rsquo;t use these consistently for datasets in CGSpace so I decided to move them to the dataurl field, but we will also ask the PRMS team to consider the normal URL field, as there are commonly other external resources related to the knowledge product there</li>
</ul>
</li>
<li>I updated the <code>move-metadata-values.py</code> script to use the latest best practices from my other scripts and some of the helper functions from <code>util.py</code>
<ul>
<li>Then I exported a list of text values pointing to Dataverse instances from <code>cg.identifier.url</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ \COPY (SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=219 AND (text_value LIKE &#39;%persistentId%&#39; OR text_value LIKE &#39;%20.500.11766.1/%&#39;)) to /tmp/data.txt;
</span></span><span style="display:flex;"><span>COPY 61
</span></span></code></pre></div><ul>
<li>Then I moved them to <code>cg.identifier.dataurl</code> on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/move-metadata-values.py -i /tmp/data.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;dom@in34sniper&#39;</span> -f cg.identifier.url -t cg.identifier.dataurl
</span></span></code></pre></div><ul>
<li>I still need to add a note to the CGSpace submission form to inform submitters about the correct field for dataset URLs</li>
<li>I finalized work on my new <code>fix-initiative-mappings.py</code> script
<ul>
<li>It has two modes:
<ol>
<li>Check item metadata to see which Initiatives are tagged and then map the item if it is not yet mapped to the corresponding Initiative collection</li>
<li>Check item collections to see which Initiatives are mapped and then unmap the item if the corresponding Initiative metadata is missing</li>
</ol>
</li>
<li>The second one is disabled by default until I can get more feedback from Abenet, Michael, and others</li>
</ul>
</li>
<li>After I applied a handful of collection mappings I started a harvest on AReS</li>
</ul>
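<ul>
<li>The two modes described above can be sketched as set differences. This is an illustration only, not the code in <code>fix-initiative-mappings.py</code>; the function name <code>plan_mappings</code> and its arguments are hypothetical:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def plan_mappings(tagged, mapped, allow_unmap=False):
    """Decide which Initiative collections an item should gain or lose.

    tagged: Initiative names found in the item metadata
    mapped: Initiative collections the item is currently mapped to
    """
    tagged = set(tagged)
    mapped = set(mapped)

    # Mode 1: tagged in metadata but not yet mapped
    to_map = sorted(tagged - mapped)

    # Mode 2: mapped but missing the metadata tag (off unless requested)
    to_unmap = sorted(mapped - tagged) if allow_unmap else []

    return to_map, to_unmap
</code></pre>
<ul>
<li>Keeping the unmap step behind a flag that defaults to off matches the note above that the second mode stays disabled until there is more feedback.</li>
</ul>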
<h2 id="2022-12-15">2022-12-15</h2>
<ul>
<li>I did some metadata quality checks on the Initiatives collection, adding some missing regions and removing a few duplicate ones</li>
</ul>
<h2 id="2022-12-18">2022-12-18</h2>
<ul>
<li>Load on the server is a bit high
<ul>
<li>Looking at the nginx logs I see someone from the University of Chicago (128.135.98.29) is using RStudio Desktop to query and scrape CGSpace</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code># grep -c &#39;RStudio Desktop&#39; /var/log/nginx/access.log
5570
</code></pre><ul>
<li>RStudio is already in the ILRI bot overrides for DSpace so it shouldn&rsquo;t be causing any extra hits, but I&rsquo;ll put an HTTP 403 in the nginx config to tell the user to use the REST API</li>
<li>Start a harvest on AReS</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-12/">December, 2022</a></li>
<li><a href="/cgspace-notes/2022-11/">November, 2022</a></li>
<li><a href="/cgspace-notes/2022-10/">October, 2022</a></li>
<li><a href="/cgspace-notes/2022-09/">September, 2022</a></li>
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>