mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-26 16:38:19 +01:00
697 lines
39 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="February, 2024" />
|
|
<meta property="og:description" content="2024-02-05
|
|
|
|
Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
|
|
Lower case all the AGROVOC subjects on CGSpace
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2024-01/" />
|
|
<meta property="article:published_time" content="2024-01-05T11:10:00+03:00" />
|
|
<meta property="article:modified_time" content="2024-02-29T09:41:44+03:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="February, 2024"/>
|
|
<meta name="twitter:description" content="2024-02-05
|
|
|
|
Delete duplicate metadata as described in my DSpace issue from last year: https://github.com/DSpace/DSpace/issues/8253
|
|
Lower case all the AGROVOC subjects on CGSpace
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.123.6">
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "February, 2024",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2024-01/",
|
|
"wordCount": "560",
|
|
"datePublished": "2024-01-05T11:10:00+03:00",
|
|
"dateModified": "2024-02-29T09:41:44+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2024-01/">
|
|
|
|
<title>February, 2024 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2024-01/">February, 2024</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2024-01-05T11:10:00+03:00">Fri Jan 05, 2024</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2024-02-05">2024-02-05</h2>
|
|
<ul>
|
|
<li>Delete duplicate metadata as described in my DSpace issue from last year: <a href="https://github.com/DSpace/DSpace/issues/8253">https://github.com/DSpace/DSpace/issues/8253</a></li>
|
|
<li>Lower case all the AGROVOC subjects on CGSpace</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>dspace<span style="color:#f92672">=#</span> <span style="color:#66d9ef">BEGIN</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">BEGIN</span>
</span></span><span style="display:flex;"><span>dspace<span style="color:#f92672">=*#</span> <span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> text_value<span style="color:#f92672">=</span><span style="color:#66d9ef">LOWER</span>(text_value) <span style="color:#66d9ef">WHERE</span> dspace_object_id <span style="color:#66d9ef">IN</span> (<span style="color:#66d9ef">SELECT</span> uuid <span style="color:#66d9ef">FROM</span> item) <span style="color:#66d9ef">AND</span> metadata_field_id<span style="color:#f92672">=</span><span style="color:#ae81ff">187</span> <span style="color:#66d9ef">AND</span> text_value <span style="color:#f92672">~</span> <span style="color:#e6db74">'[[:upper:]]'</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> <span style="color:#ae81ff">180</span>
</span></span><span style="display:flex;"><span>dspace<span style="color:#f92672">=*#</span> <span style="color:#66d9ef">COMMIT</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">COMMIT</span>
|
|
</span></span></code></pre></div><h2 id="2024-02-06">2024-02-06</h2>
|
|
<ul>
|
|
<li>Discuss IWMI using the CGSpace REST API for their new website</li>
|
|
<li>Export the IWMI community to extract their ORCID identifiers:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'cg.creator.identifier,cg.creator.identifier[en_US]'</span> ~/Downloads/2024-02-06-iwmi.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
</span></span><span style="display:flex;"><span> | sort -u \
</span></span><span style="display:flex;"><span> | tee /tmp/iwmi-orcids.txt \
</span></span><span style="display:flex;"><span> | wc -l
</span></span><span style="display:flex;"><span>353
</span></span><span style="display:flex;"><span>$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
|
|
</span></span></code></pre></div><ul>
|
|
<li>I noticed some similar-looking names in our list, so I clustered them in OpenRefine and manually checked a dozen or so to update our list</li>
|
|
</ul>
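<ul>
<li>For reference, a rough Python sketch of the same extraction as the <code>grep -oE</code> and <code>sort -u</code> steps above. The <code>extract_orcids</code> helper and the sample values are hypothetical, and it assumes the identifiers appear somewhere in free text such as <code>Name: 0000-0000-0000-0000</code>:</li>
</ul>

```python
import re

# Same pattern as the grep -oE call above: four hyphen-separated groups
# of four characters from [A-Z0-9]
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text: str) -> list[str]:
    """Return the sorted, de-duplicated identifiers found in text."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical sample values, not real CGSpace metadata
sample = (
    "Jane Doe: 0000-0002-1825-0097||John Doe: 0000-0001-5109-3700\n"
    "Jane Doe: 0000-0002-1825-0097\n"
)
orcids = extract_orcids(sample)
print(len(orcids))
```

<ul>
<li>Like the shell pipeline, this only matches on shape, so it says nothing about whether an identifier actually resolves on ORCID</li>
</ul>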
|
|
<h2 id="2024-02-07">2024-02-07</h2>
|
|
<ul>
|
|
<li>Maria asked me about the “missing” item from last week again
|
|
<ul>
|
|
<li>I can see it when I use the Admin search, but not in her workflow</li>
|
|
<li>It was submitted by TIP so I checked that user’s workspace and found it there</li>
|
|
<li>After depositing, it went into the workflow so Maria should be able to see it now</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-02-09">2024-02-09</h2>
|
|
<ul>
|
|
<li>Minor edits to CGSpace submission form</li>
|
|
<li>Upload 55 ISNAR book chapters to CGSpace from Peter</li>
|
|
</ul>
|
|
<h2 id="2024-02-19">2024-02-19</h2>
|
|
<ul>
|
|
<li>Looking into the collection mapping issue on CGSpace
|
|
<ul>
|
|
<li>It seems to be by design in DSpace 7: <a href="https://github.com/DSpace/dspace-angular/issues/1203">https://github.com/DSpace/dspace-angular/issues/1203</a></li>
|
|
<li>This is a massive setback for us…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-02-20">2024-02-20</h2>
|
|
<ul>
|
|
<li>Minor work on OpenRXV to fix a bug in the ng-select drop downs</li>
|
|
<li>Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits</li>
|
|
</ul>
|
|
<h2 id="2024-02-21">2024-02-21</h2>
|
|
<ul>
|
|
<li>Minor updates on OpenRXV, including one bug fix for missing mapped collections
|
|
<ul>
|
|
<li>Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-02-22">2024-02-22</h2>
|
|
<ul>
|
|
<li>Discuss tagging of datasets and re-work the submission form to encourage use of DOI field for any item that has a DOI, and the normal URL field if not
|
|
<ul>
|
|
<li>The “cg.identifier.dataurl” field will be used for “related” datasets</li>
|
|
<li>I still have to check and move some metadata for existing datasets</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-02-23">2024-02-23</h2>
|
|
<ul>
|
|
<li>This morning Tomcat died due to an OOM kill from the kernel:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
|
|
</span></span></code></pre></div><ul>
|
|
<li>I don’t see any abnormal pattern in my Grafana graphs, for JVM or system load… very weird</li>
|
|
<li>I updated the submission form on CGSpace to include the new changes to URLs for datasets
|
|
<ul>
|
|
<li>I also updated about 80 datasets to move the URLs to the correct field</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-02-25">2024-02-25</h2>
|
|
<ul>
|
|
<li>This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
|
|
</span></span></code></pre></div><ul>
|
|
<li>I don’t know why this is happening so often recently…</li>
|
|
</ul>
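<ul>
<li>To compare these kills more easily, here is a small Python sketch that pulls the kB-valued fields out of the kernel message and converts them to GiB. The helper names are hypothetical, and it assumes the kernel's <code>kB</code> means units of 1024 bytes:</li>
</ul>

```python
import re

# Matches "name:<digits>kB" fields in a kernel OOM-killer message
FIELD_PATTERN = re.compile(r"([a-z-]+):(\d+)kB")


def parse_oom_fields(line: str) -> dict[str, int]:
    """Map each kB-valued field in the line to its integer value in kB."""
    return {name: int(value) for name, value in FIELD_PATTERN.findall(line)}


def kb_to_gib(kb: int) -> float:
    """Convert kB (assumed to be 1024-byte units) to GiB."""
    return kb / (1024 * 1024)


line = (
    "kernel: Out of memory: Killed process 720768 (java) "
    "total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, "
    "shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0"
)
fields = parse_oom_fields(line)
print(f"anon-rss: {kb_to_gib(fields['anon-rss']):.2f} GiB")
```

<ul>
<li>For the 2024-02-25 kill this works out to roughly 8.87 GiB of anonymous resident memory for the Java process</li>
</ul>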
|
|
<h2 id="2024-02-27">2024-02-27</h2>
|
|
<ul>
|
|
<li>IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
|
|
<ul>
|
|
<li>I extracted the existing authors from our controlled vocabulary and combined them with IFPRI’s:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ xmllint --xpath <span style="color:#e6db74">'//node/isComposedBy/node()'</span> dspace/config/controlled-vocabularies/dc-contributor-author.xml <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | grep -oE 'label=".*"' \
</span></span><span style="display:flex;"><span> | sed -e 's/label="//' -e 's/"$//' > /tmp/authors
</span></span><span style="display:flex;"><span>$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors
|
|
</span></span></code></pre></div><h2 id="2024-02-28">2024-02-28</h2>
|
|
<ul>
|
|
<li>I figured out a way to add a new Angular component to handle all our relation fields</li>
|
|
</ul>
|
|
<h2 id="2024-02-29">2024-02-29</h2>
|
|
<ul>
|
|
<li>Clean up a bunch of metadata on CGSpace</li>
|
|
</ul>
|
|
<!-- raw HTML omitted -->
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2024-01/">February, 2024</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-12/">December, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-11/">November, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-10/">October, 2023</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|
|
tadata values:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
|
|
</code></pre><ul>
|
|
<li>Export a list of ORCID identifiers from PostgreSQL to look them up on ORCID and update our controlled vocabulary:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
|
|
</span></span><span style="display:flex;"><span>localhost/dspace7= ☘ \q
|
|
</span></span><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort -u > /tmp/2024-01-09-orcids.txt
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
|
|
</span></span></code></pre></div><ul>
|
|
<li>Then I updated existing ORCID identifiers in CGSpace:</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
|
|
</code></pre><ul>
|
|
<li>Bizu seems to be having issues due to belonging to too many groups
|
|
<ul>
|
|
<li>I see some messages from Solr in the DSpace log:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre tabindex="0"><code>2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
|
|
</code></pre><ul>
|
|
<li>There are 1,805 OR clauses in the full log!
|
|
<ul>
|
|
<li>We previously had this issue in 2020-01 and 2020-02 with DSpace 5 and DSpace 6</li>
|
|
<li>At the time the solution was to increase the <code>maxBooleanClauses</code> in Solr and to disable access rights awareness, but I don’t think we want to do the second one now</li>
|
|
<li>I saw many users of Solr in other applications increasing this to obscenely high numbers, so I think we should be OK to increase it from 1024 to 2048</li>
|
|
</ul>
|
|
</li>
|
|
<li>Re-visiting the DSpace user groomer to delete inactive users
|
|
<ul>
|
|
<li>In 2023-08 I noticed that this was now <a href="https://github.com/DSpace/DSpace/pull/2928">possible in DSpace 7</a></li>
|
|
<li>As a test I tried to delete all users who have been inactive since six years ago (January 9, 2018):</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
|
|
</span></span></code></pre></div><ul>
|
|
<li>I tested it on DSpace 7 Test and it worked… I am debating running it on CGSpace…
|
|
<ul>
|
|
<li>I see we have almost 9,000 users:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -L > /tmp/users-before.txt
|
|
</span></span><span style="display:flex;"><span>$ wc -l /tmp/users-before.txt
|
|
</span></span><span style="display:flex;"><span>8943 /tmp/users-before.txt
|
|
</span></span></code></pre></div><ul>
|
|
<li>I decided to do the same on CGSpace and it worked without errors</li>
|
|
<li>I finished working on the controlled vocabulary for publishers</li>
|
|
</ul>
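<ul>
<li>Regarding the Groomer cutoff above, a minimal Python sketch for computing the date to pass to <code>-b</code>. The <code>groomer_cutoff</code> helper is hypothetical, and the <code>MM/DD/YYYY</code> format is an assumption read from the single example above (<code>01/09/2018</code> for January 9, 2018):</li>
</ul>

```python
from datetime import date


def groomer_cutoff(today: date, years: int) -> str:
    """Return the date `years` before `today`, formatted as MM/DD/YYYY."""
    try:
        cutoff = today.replace(year=today.year - years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year, so fall back
        # to February 28
        cutoff = today.replace(year=today.year - years, day=28)
    return cutoff.strftime("%m/%d/%Y")


print(groomer_cutoff(date(2024, 1, 9), 6))  # prints 01/09/2018
```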
|
|
<h2 id="2024-01-10">2024-01-10</h2>
|
|
<ul>
|
|
<li>I spent some time deleting old groups on CGSpace</li>
|
|
<li>I looked into the use of the <code>cg.identifier.ciatproject</code> field and found there are only a handful of uses, with some even seeming to be a mistake:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
</span></span><span style="display:flex;"><span> cg.identifier.ciatproject │ count
</span></span><span style="display:flex;"><span>───────────────────────────┼───────
</span></span><span style="display:flex;"><span> D145                      │     4
</span></span><span style="display:flex;"><span> LAM_LivestockPlus         │     2
</span></span><span style="display:flex;"><span> A215                      │     1
</span></span><span style="display:flex;"><span> A217                      │     1
</span></span><span style="display:flex;"><span> A220                      │     1
</span></span><span style="display:flex;"><span> A223                      │     1
</span></span><span style="display:flex;"><span> A224                      │     1
</span></span><span style="display:flex;"><span> A227                      │     1
</span></span><span style="display:flex;"><span> A229                      │     1
</span></span><span style="display:flex;"><span> A230                      │     1
</span></span><span style="display:flex;"><span> CLIMATE CHANGE MITIGATION │     1
</span></span><span style="display:flex;"><span> LIVESTOCK                 │     1
</span></span><span style="display:flex;"><span>(12 rows)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 240.041 ms
|
|
</span></span></code></pre></div><ul>
|
|
<li>I think we can move those to a new <code>cg.identifier.project</code> if we create one</li>
|
|
<li>The <code>cg.identifier.cpwfproject</code> field is similarly sparse, but the CCAFS ones are widely used</li>
|
|
</ul>
|
|
<h2 id="2024-01-12">2024-01-12</h2>
|
|
<ul>
|
|
<li>Export a list of affiliations to do some cleanup:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
|
|
</span></span><span style="display:flex;"><span>COPY 11719
|
|
</span></span></code></pre></div><ul>
|
|
<li>I first did some clustering and editing in OpenRefine; next I’ll import those back into CGSpace and then do another export</li>
|
|
<li>Troubleshooting the statistics pages that aren’t working on DSpace 7
|
|
<ul>
|
|
<li>On a hunch, I queried for Solr statistics documents that <strong>did not have an <code>id</code> matching the 36-character UUID pattern</strong>:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl <span style="color:#e6db74">'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'</span>
|
|
</span></span><span style="display:flex;"><span>{
|
|
</span></span><span style="display:flex;"><span> "responseHeader":{
|
|
</span></span><span style="display:flex;"><span> "status":0,
|
|
</span></span><span style="display:flex;"><span> "QTime":0,
|
|
</span></span><span style="display:flex;"><span> "params":{
|
|
</span></span><span style="display:flex;"><span> "q":"-id:/.{36}/",
|
|
</span></span><span style="display:flex;"><span> "rows":"0"}},
|
|
</span></span><span style="display:flex;"><span> "response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
|
|
</span></span><span style="display:flex;"><span> }}
|
|
</span></span></code></pre></div><ul>
|
|
<li>They seem to come mostly from 2020, 2023, and 2024:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl <span style="color:#e6db74">'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'</span>
|
|
</span></span><span style="display:flex;"><span>{
|
|
</span></span><span style="display:flex;"><span> "responseHeader":{
|
|
</span></span><span style="display:flex;"><span> "status":0,
|
|
</span></span><span style="display:flex;"><span> "QTime":13,
|
|
</span></span><span style="display:flex;"><span> "params":{
|
|
</span></span><span style="display:flex;"><span> "facet.range":"time",
|
|
</span></span><span style="display:flex;"><span> "q":"-id:/.{36}/",
|
|
</span></span><span style="display:flex;"><span> "facet.range.gap":"+1YEAR",
|
|
</span></span><span style="display:flex;"><span> "rows":"0",
|
|
</span></span><span style="display:flex;"><span> "facet":"true",
|
|
</span></span><span style="display:flex;"><span> "facet.range.start":"2010-01-01T00:00:00Z",
|
|
</span></span><span style="display:flex;"><span> "facet.range.end":"NOW"}},
|
|
</span></span><span style="display:flex;"><span> "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
|
|
</span></span><span style="display:flex;"><span> },
|
|
</span></span><span style="display:flex;"><span> "facet_counts":{
|
|
</span></span><span style="display:flex;"><span> "facet_queries":{},
|
|
</span></span><span style="display:flex;"><span> "facet_fields":{},
|
|
</span></span><span style="display:flex;"><span> "facet_ranges":{
|
|
</span></span><span style="display:flex;"><span> "time":{
|
|
</span></span><span style="display:flex;"><span> "counts":[
|
|
</span></span><span style="display:flex;"><span> "2010-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2011-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2012-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2013-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2014-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2015-01-01T00:00:00Z",89,
|
|
</span></span><span style="display:flex;"><span> "2016-01-01T00:00:00Z",11,
|
|
</span></span><span style="display:flex;"><span> "2017-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2018-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2019-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2020-01-01T00:00:00Z",1339,
|
|
</span></span><span style="display:flex;"><span> "2021-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2022-01-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-01-01T00:00:00Z",653736,
|
|
</span></span><span style="display:flex;"><span> "2024-01-01T00:00:00Z",144993],
|
|
</span></span><span style="display:flex;"><span> "gap":"+1YEAR",
|
|
</span></span><span style="display:flex;"><span> "start":"2010-01-01T00:00:00Z",
|
|
</span></span><span style="display:flex;"><span> "end":"2025-01-01T00:00:00Z"}},
|
|
</span></span><span style="display:flex;"><span> "facet_intervals":{},
|
|
</span></span><span style="display:flex;"><span> "facet_heatmaps":{}}}
|
|
</span></span></code></pre></div><ul>
|
|
<li>They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl <span style="color:#e6db74">'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'</span>
|
|
</span></span><span style="display:flex;"><span>{
|
|
</span></span><span style="display:flex;"><span> "responseHeader":{
|
|
</span></span><span style="display:flex;"><span> "status":0,
|
|
</span></span><span style="display:flex;"><span> "QTime":196,
|
|
</span></span><span style="display:flex;"><span> "params":{
|
|
</span></span><span style="display:flex;"><span> "facet.range":"time",
|
|
</span></span><span style="display:flex;"><span> "q":"-id:/.{36}/",
|
|
</span></span><span style="display:flex;"><span> "facet.range.gap":"+1MONTH",
|
|
</span></span><span style="display:flex;"><span> "rows":"0",
|
|
</span></span><span style="display:flex;"><span> "facet":"true",
|
|
</span></span><span style="display:flex;"><span> "facet.range.start":"2023-01-01T00:00:00Z",
|
|
</span></span><span style="display:flex;"><span> "facet.range.end":"NOW"}},
|
|
</span></span><span style="display:flex;"><span> "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
|
|
</span></span><span style="display:flex;"><span> },
|
|
</span></span><span style="display:flex;"><span> "facet_counts":{
|
|
</span></span><span style="display:flex;"><span> "facet_queries":{},
|
|
</span></span><span style="display:flex;"><span> "facet_fields":{},
|
|
</span></span><span style="display:flex;"><span> "facet_ranges":{
|
|
</span></span><span style="display:flex;"><span> "time":{
|
|
</span></span><span style="display:flex;"><span> "counts":[
|
|
</span></span><span style="display:flex;"><span> "2023-01-01T00:00:00Z",1,
|
|
</span></span><span style="display:flex;"><span> "2023-02-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-03-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-04-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-05-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-06-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-07-01T00:00:00Z",0,
|
|
</span></span><span style="display:flex;"><span> "2023-08-01T00:00:00Z",27621,
|
|
</span></span><span style="display:flex;"><span> "2023-09-01T00:00:00Z",59165,
|
|
</span></span><span style="display:flex;"><span> "2023-10-01T00:00:00Z",115338,
|
|
</span></span><span style="display:flex;"><span> "2023-11-01T00:00:00Z",96147,
|
|
</span></span><span style="display:flex;"><span> "2023-12-01T00:00:00Z",355464,
|
|
</span></span><span style="display:flex;"><span> "2024-01-01T00:00:00Z",125429],
|
|
</span></span><span style="display:flex;"><span> "gap":"+1MONTH",
|
|
</span></span><span style="display:flex;"><span> "start":"2023-01-01T00:00:00Z",
|
|
</span></span><span style="display:flex;"><span> "end":"2024-02-01T00:00:00Z"}},
|
|
</span></span><span style="display:flex;"><span> "facet_intervals":{},
|
|
</span></span><span style="display:flex;"><span> "facet_heatmaps":{}}}
|
|
</span></span></code></pre></div><ul>
|
|
<li>I see that we had 31,744 statistic events yesterday, and 799 have no <code>id</code>!</li>
|
|
<li>I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
|
|
<ul>
|
|
<li>Several people said they have them, so it’s a bug of some sort in DSpace, not our configuration</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
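<ul>
<li>The range facet queries above can also be built and summarized from Python. This is only a sketch: the <code>build_facet_url</code> and <code>pair_counts</code> helpers are hypothetical, the base URL is the local Solr from the <code>curl</code> commands, and nothing here sends a request:</li>
</ul>

```python
from urllib.parse import urlencode


def build_facet_url(base_url: str, start: str, gap: str) -> str:
    """Build a select URL for documents whose id does not match /.{36}/."""
    params = {
        "q": "-id:/.{36}/",
        "rows": 0,
        "facet": "true",
        "facet.range": "time",
        "facet.range.start": start,
        "facet.range.end": "NOW",
        "facet.range.gap": gap,
    }
    # urlencode percent-encodes the regex and the "+" in the gap
    return f"{base_url}/select?{urlencode(params)}"


def pair_counts(flat: list) -> dict[str, int]:
    """Turn Solr's flat [label, count, label, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


url = build_facet_url(
    "http://localhost:8983/solr/statistics", "2023-01-01T00:00:00Z", "+1MONTH"
)

# The last three buckets from the monthly facet output above
counts = pair_counts([
    "2023-11-01T00:00:00Z", 96147,
    "2023-12-01T00:00:00Z", 355464,
    "2024-01-01T00:00:00Z", 125429,
])
busiest = max(counts, key=counts.get)

# Share of yesterday's events without a UUID-like id (799 of 31,744)
share = f"{799 / 31744:.1%}"
print(busiest, share)
```

<ul>
<li>In other words, about 2.5% of yesterday’s statistics events were affected, and 2023-12 is the busiest month of the three shown</li>
</ul>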
|
|
<h2 id="2024-01-13">2024-01-13</h2>
|
|
<ul>
|
|
<li>Yesterday alone we had 37,000 unique IPs making requests to nginx
|
|
<ul>
|
|
<li>I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
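<ul>
<li>A quick way to count how many client IPs fall inside that network, sketched in Python with the standard <code>ipaddress</code> module. The <code>count_in_network</code> helper and the sample addresses are hypothetical:</li>
</ul>

```python
import ipaddress

# The Amazon Singapore network noted above
NETWORK = ipaddress.ip_network("47.128.0.0/14")


def count_in_network(ips, network=NETWORK) -> int:
    """Count the addresses that fall inside network, skipping unparsable lines."""
    total = 0
    for raw in ips:
        try:
            if ipaddress.ip_address(raw.strip()) in network:
                total += 1
        except ValueError:
            # Not an IP address (for example a malformed log line)
            continue
    return total


# Hypothetical sample; a /14 spans 47.128.0.0 through 47.131.255.255
sample = ["47.128.10.20", "47.131.255.255", "47.132.0.1", "192.0.2.7", "not-an-ip"]
print(count_in_network(sample))  # prints 2
```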
|
|
<h2 id="2024-01-15">2024-01-15</h2>
|
|
<ul>
|
|
<li>Investigating the CSS selector warning that I’ve seen in PM2 logs:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>0|dspace-ui | 1 rules skipped due to selector errors:
|
|
</span></span><span style="display:flex;"><span>0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
|
|
</span></span></code></pre></div><ul>
|
|
<li>It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not invalid
|
|
<ul>
|
|
<li>But that led me to a more interesting issue with <code>inlineCritical</code> optimization for styles in Angular SSR that might be responsible for causing high load in the frontend</li>
|
|
<li>See: <a href="https://github.com/angular/angular/issues/42098">https://github.com/angular/angular/issues/42098</a></li>
|
|
<li>See: <a href="https://github.com/angular/universal/issues/2106">https://github.com/angular/universal/issues/2106</a></li>
|
|
<li>See: <a href="https://github.com/GoogleChromeLabs/critters/issues/78">https://github.com/GoogleChromeLabs/critters/issues/78</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>Since the production site was flapping a lot, I decided to try disabling <code>inlineCriticalCss</code></li>
|
|
<li>There have been on and off load issues with the Angular frontend today
|
|
<ul>
|
|
<li>I think I will just block all data center network blocks for now</li>
|
|
<li>In the last week I see almost 200,000 unique IPs:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk <span style="color:#e6db74">'{print $1}'</span> | sort -u |
|
|
</span></span><span style="display:flex;"><span>tee /tmp/ips.txt | wc -l
|
|
</span></span><span style="display:flex;"><span>196493
|
|
</span></span></code></pre></div><ul>
|
|
<li>Looking these IPs up, I see there are 18,000 coming from Comcast, 10,000 from AT&amp;T, 4,110 from Charter, 3,500 from Cox, and more from dozens of other residential ISPs
|
|
<ul>
|
|
<li>I highly doubt these are home users browsing CGSpace… seems super fishy</li>
|
|
<li>Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT</li>
|
|
<li>I will temporarily add a few new datacenter ISP network blocks to our rate limit:
|
|
<ul>
|
|
<li>16509 Amazon-02</li>
|
|
<li>701 UUNET</li>
|
|
<li>8075 Microsoft</li>
|
|
<li>15169 Google</li>
|
|
<li>14618 Amazon-AES</li>
|
|
<li>396982 Google Cloud</li>
|
|
</ul>
|
|
</li>
|
|
<li>The load on the server <em>immediately</em> dropped</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
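<ul>
<li>For reference, a Python sketch of the unique IP count from the <code>zcat | awk | sort -u</code> pipeline in this entry, assuming the client IP is the first whitespace-separated field of each nginx log line; the function names are hypothetical, and unlike <code>awk</code> this skips blank lines:</li>
</ul>

```python
import gzip


def open_log(path):
    """Open a plain or gzip-compressed log as text, similar to `zcat -f`."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def unique_client_ips(lines):
    """Return the set of first fields (client IPs) seen in `lines`."""
    ips = set()
    for line in lines:
        fields = line.split()
        if fields:
            ips.add(fields[0])
    return ips


def unique_ips_in_files(paths):
    """Union of client IPs across several log files."""
    ips = set()
    for path in paths:
        with open_log(path) as fh:
            ips |= unique_client_ips(fh)
    return ips


if __name__ == "__main__":
    # Made-up log lines, for illustration only.
    sample = [
        '192.0.2.1 - - [15/Jan/2024:10:00:00 +0100] "GET / HTTP/1.1" 200 512',
        '192.0.2.2 - - [15/Jan/2024:10:00:01 +0100] "GET /a HTTP/1.1" 200 100',
        '192.0.2.1 - - [15/Jan/2024:10:00:02 +0100] "GET /b HTTP/1.1" 429 0',
    ]
    print(len(unique_client_ips(sample)))
```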
|
|
<h2 id="2024-01-17">2024-01-17</h2>
|
|
<ul>
|
|
<li>It turns out AS701 (UUNET) is Verizon Business, which is used as an ISP for many staff at IFPRI
|
|
<ul>
|
|
<li>This was causing them to see HTTP 429 “too many requests” errors on CGSpace</li>
|
|
<li>I removed this ASN from the rate limiting</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2024-01-18">2024-01-18</h2>
|
|
<ul>
|
|
<li>Start looking at Solr stats again
|
|
<ul>
|
|
<li>I found one statistics record that has 22,000 of the same collection in <code>owningColl</code> and 22,000 of the same community in <code>owningComm</code></li>
|
|
<li>The record is from 2015 and I think it would be easier to delete it than fix it:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl http://localhost:8983/solr/statistics/update -H <span style="color:#e6db74">"Content-type: text/xml"</span> --data-binary <span style="color:#e6db74">'&lt;delete&gt;&lt;query&gt;uid:3b4eefba-a302-4172-a286-dcb25d70129e&lt;/query&gt;&lt;/delete&gt;'</span>
|
|
</span></span></code></pre></div><ul>
|
|
<li>Looking again, there are at least 1,000 such records, so I will need to come up with an actual solution to fix them</li>
|
|
<li>I’m noticing we have 1,800+ links to defunct resources on bioversityinternational.org in the <code>cg.link.permalink</code> field
|
|
<ul>
|
|
<li>I should ask Alliance if they have any plans to fix those, or upload them to CGSpace</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
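<ul>
<li>One possible direction for the statistics records with repeated <code>owningColl</code> and <code>owningComm</code> values, sketched under the untested assumption that the statistics core accepts Solr atomic updates keyed on <code>uid</code>; the function name and the sample document are hypothetical, and nothing is sent to Solr here:</li>
</ul>

```python
import json

# The two multi-valued fields where I saw the repeated values.
MULTI_VALUED = ("owningColl", "owningComm")


def dedupe_update(doc, fields=MULTI_VALUED):
    """Return an atomic-update dict that collapses repeated values.

    Returns None when the document has no repeated values to fix.
    """
    update = {"uid": doc["uid"]}
    for field in fields:
        values = doc.get(field, [])
        # dict.fromkeys keeps the first occurrence of each value, in order.
        unique = list(dict.fromkeys(values))
        if len(unique) != len(values):
            update[field] = {"set": unique}
    return update if len(update) > 1 else None


if __name__ == "__main__":
    # Made-up document standing in for a Solr result.
    doc = {
        "uid": "example-uid",
        "owningColl": ["coll-1"] * 3,
        "owningComm": ["comm-1", "comm-2"],
    }
    # A JSON list like this could be posted to the core's /update handler.
    print(json.dumps([dedupe_update(doc)]))
```

<ul>
<li>Whether this is preferable to deleting the records depends on whether the core's fields allow atomic updates, which I have not verified</li>
</ul>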
|
|
<h2 id="2024-01-22">2024-01-22</h2>
|
|
<ul>
|
|
<li>Meeting with IWMI about ORCID integration on CGSpace now that we’ve migrated to DSpace 7</li>
|
|
<li>File an issue for the inaccurate DSpace statistics: <a href="https://github.com/DSpace/DSpace/issues/9275">https://github.com/DSpace/DSpace/issues/9275</a></li>
|
|
</ul>
|
|
<h2 id="2024-01-23">2024-01-23</h2>
|
|
<ul>
|
|
<li>Meeting with IWMI about ORCID integration and the DSpace API for use with WordPress</li>
|
|
<li>IFPRI sent me a list of their author ORCIDs to add to our controlled vocabulary
|
|
<ul>
|
|
<li>I merged them with our current list, resolved their names on ORCID, and updated them in our database:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI<span style="color:#ae81ff">\ </span>ORCiD<span style="color:#ae81ff">\ </span>All.csv | grep -oE <span style="color:#e6db74">'[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}'</span> | sort -u > /tmp/2024-01-23-orcids.txt
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
|
|
</span></span><span style="display:flex;"><span>$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
|
|
</span></span></code></pre></div><ul>
|
|
<li>This adds about 400 new identifiers to the controlled vocabulary</li>
|
|
<li>I consolidated our various project identifier fields for closed programs into one <code>cg.identifier.project</code>:
|
|
<ul>
|
|
<li><code>cg.identifier.ccafsproject</code></li>
|
|
<li><code>cg.identifier.ccafsprojectpii</code></li>
|
|
<li><code>cg.identifier.ciatproject</code></li>
|
|
<li><code>cg.identifier.cpwfproject</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>I prefixed the existing 2,644 metadata values with “CCAFS”, “CIAT”, or “CPWF” so we can figure out where they came from if need be, and deleted the old fields from the metadata registry</li>
|
|
</ul>
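<ul>
<li>The ORCID extraction step in the 2024-01-23 commands (<code>grep -oE ... | sort -u</code>) could be sketched in Python like this, using the same pattern; the function name and the sample identifiers are made up, and Python's sort order may differ from <code>sort</code> depending on the locale:</li>
</ul>

```python
import re

# Same pattern as the grep -oE expression: four dash-separated groups of four.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(*texts):
    """Return the sorted, de-duplicated identifiers found in all `texts`."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)


if __name__ == "__main__":
    # Made-up identifiers standing in for the XML vocabulary and the CSV.
    vocabulary = 'label="Example Person: 0000-0000-0000-0001"'
    csv_text = "Name,ORCID\nA,https://orcid.org/0000-0000-0000-0001\nB,0000-0000-0000-002X\n"
    for orcid in extract_orcids(vocabulary, csv_text):
        print(orcid)
```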
|
|
<h2 id="2024-01-26">2024-01-26</h2>
|
|
<ul>
|
|
<li>Minor work on dspace-angular to clean up component styles</li>
|
|
<li>Add <code>cg.identifier.publicationRank</code> to CGSpace metadata registry and submission form</li>
|
|
</ul>
|
|
<h2 id="2024-01-29">2024-01-29</h2>
|
|
<ul>
|
|
<li>Rework the nginx bot and network limits slightly to remove some old patterns/networks and remove Google
|
|
<ul>
|
|
<li>The Google Scholar team contacted me to ask why their requests were timing out (well…)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2024-01/">February, 2024</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2024-01/">January, 2024</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-12/">December, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-11/">November, 2023</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2023-10/">October, 2023</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|