<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="October, 2021" />
<meta property="og:description" content="2021-10-01
Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-10/" />
<meta property="article:published_time" content="2021-10-01T11:14:07+03:00" />
<meta property="article:modified_time" content="2021-11-01T10:48:13+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2021"/>
<meta name="twitter:description" content="2021-10-01
Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
"/>
<meta name="generator" content="Hugo 0.111.3">
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-10/",
"wordCount": "4224",
"datePublished": "2021-10-01T11:14:07+03:00",
"dateModified": "2021-11-01T10:48:13+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2021-10/">
<title>October, 2021 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F&#43;GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2021-10/">October, 2021</a></h2>
<p class="blog-post-meta">
<time datetime="2021-10-01T11:14:07+03:00">Fri Oct 01, 2021</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2021-10-01">2021-10-01</h2>
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
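<ul>
<li>A minimal Python sketch of the same calculation, assuming the output of <code>ror-lookup.py</code> has a <code>matched</code> column containing <code>true</code> or <code>false</code> (the <code>match_rate</code> helper is hypothetical, not part of the ILRI scripts):</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def match_rate(rows):
    """Return (matched, total, percent) for rows that have a 'matched' column."""
    total = 0
    matched = 0
    for row in rows:
        total += 1
        # Assumption: the lookup script writes the literal string "true"
        if row["matched"].strip().lower() == "true":
            matched += 1
    percent = round(100 * matched / total, 2) if total else 0.0
    return matched, total, percent


# Usage with the file from the commands above:
# import csv
# with open("/tmp/2021-10-01-affiliations-matching.csv", newline="") as f:
#     print(match_rate(csv.DictReader(f)))
</code></pre>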
<h2 id="2021-10-03">2021-10-03</h2>
<ul>
<li>Dominique from IWMI asked me for information about how CGSpace partners are using CGSpace APIs to feed their websites</li>
<li>Start a fresh indexing on AReS</li>
<li>Udana sent me his file of 292 non-IWMI publications for the Virtual library on water management
<ul>
<li>He added licenses</li>
<li>I want to clean up the <code>dcterms.extent</code> field though because it has volume, issue, and pages there</li>
<li>I cloned the column several times and extracted values based on their positions, for example:
<ul>
<li>Volume: <code>value.partition(&quot;:&quot;)[0]</code></li>
<li>Issue: <code>value.partition(&quot;(&quot;)[2].partition(&quot;)&quot;)[0]</code></li>
<li>Page: <code>&quot;p. &quot; + value.replace(&quot;.&quot;, &quot;&quot;)</code></li>
</ul>
</li>
</ul>
</li>
</ul>
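<ul>
<li>The same idea as one plain Python function. This is only a sketch: it assumes extents shaped like <code>12(3):45-67</code>, which may not match every value in Udana&rsquo;s file, and <code>split_extent</code> is a hypothetical name:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def split_extent(value):
    """Split an extent like '12(3):45-67' into (volume, issue, pages)."""
    # Everything before the first colon holds the volume and optional issue
    before_colon, _, after_colon = value.partition(":")
    volume = before_colon.partition("(")[0].strip()
    # The issue is whatever sits between the first pair of parentheses
    issue = before_colon.partition("(")[2].partition(")")[0].strip()
    pages = after_colon.strip()
    if pages:
        pages = "p. " + pages
    return volume, issue, pages
</code></pre>
<ul>
<li>Values without an issue or page range come back with empty strings for the missing parts, so they are easy to facet on afterwards</li>
</ul>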
<h2 id="2021-10-04">2021-10-04</h2>
<ul>
<li>Start looking at the last month of Solr statistics on CGSpace
<ul>
<li>I see a number of IPs with &ldquo;normal&rdquo; user agents who clearly behave like bots
<ul>
<li>198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)</li>
<li>93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)</li>
<li>193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)</li>
<li>3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)</li>
<li>34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
<li>3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code>, from ASN 14618 (AMAZON-AES, US)</li>
</ul>
</li>
<li>Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">&#39;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span>543 /tmp/mozilla-4.0-ips.txt
</span></span></code></pre></div><ul>
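<li>A rough Python equivalent of the pipeline above, as a sketch only: it assumes nginx&rsquo;s default &ldquo;combined&rdquo; log format (client IP first, user agent as the last quoted field) and matches the user agent exactly rather than as a substring like <code>grep</code> does; <code>ips_for_user_agent</code> is a hypothetical helper:
<pre tabindex="0"><code class="language-python" data-lang="python">import re

# Assumption: "combined" log format, so the IP is the first field and the
# user agent is the last double-quoted field on the line.
LINE_RE = re.compile(r'^(\S+) .*"([^"]*)"$')


def ips_for_user_agent(lines, user_agent):
    """Return the sorted unique client IPs that sent the given user agent."""
    ips = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == user_agent:
            ips.add(match.group(1))
    return sorted(ips)
</code></pre>
</li>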
<li>Then I resolved the IPs and extracted the ones belonging to Amazon:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">&#34;</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">&#34;</span> -o /tmp/mozilla-4.0-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</span></span></code></pre></div><ul>
<li>I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon</li>
<li>Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> 1592 GET /handle/10947/2526
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/2527
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/34
</span></span><span style="display:flex;"><span> 1593 GET /handle/10947/6
</span></span><span style="display:flex;"><span> 1594 GET /handle/10947/1
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2515
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2516
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/101335
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/91688
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2517
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2518
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2519
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2708
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2871
</span></span><span style="display:flex;"><span> 1600 GET /handle/10568/89342
</span></span><span style="display:flex;"><span> 1600 GET /handle/10947/4467
</span></span><span style="display:flex;"><span> 1607 GET /handle/10568/103816
</span></span><span style="display:flex;"><span> 290382 GET /handle/10568/83389
</span></span></code></pre></div><ul>
<li>Before I purge all of those I will ask Samuel Stacey from the System Office, hopefully to get some insight&hellip;</li>
<li>Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR</li>
<li>Meeting with Michelle from Altmetric about their new CSV upload system
<ul>
<li>I sent her some examples of Handles that have DOIs, but no linked score (yet) to see if an association will be created when she uploads them</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-csv" data-lang="csv">doi,handle
10.1016/j.agsy.2021.103263,10568/115288
10.3389/fgene.2021.723360,10568/115287
10.3389/fpls.2021.720670,10568/115285
</code></pre><ul>
<li>Extract the AGROVOC subjects from IWMI&rsquo;s 292 publications to validate them against AGROVOC:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;dcterms.subject[en_US]&#39;</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">&#39;s/||/\n/g&#39;</span> -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> | sort -u &gt; /tmp/agrovoc.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc.txt -o /tmp/agrovoc-matches.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#e6db74">&#39;0&#39;</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> &gt; /tmp/invalid-agrovoc.csv
</span></span></code></pre></div><h2 id="2021-10-05">2021-10-05</h2>
<ul>
<li>Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
<ul>
<li>I added <code>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</code> to the list of bad bots in nginx</li>
<li>I purged all the Amazon IPs using this user agent, as well as the few other IPs I identified yesterday</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</span></span></code></pre></div><h2 id="2021-10-06">2021-10-06</h2>
<ul>
<li>Thinking about how we could check for duplicates before importing
<ul>
<li>I found out that <a href="https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/">PostgreSQL has a similarity function</a> in its <code>pg_trgm</code> extension:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; CREATE EXTENSION pg_trgm;
</span></span><span style="display:flex;"><span>localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines&#39;) &gt; 0.5;
</span></span><span style="display:flex;"><span> metadata_value_id │ text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> 3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
</span></span><span style="display:flex;"><span> 3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
</span></span><span style="display:flex;"><span>(2 rows)
</span></span></code></pre></div><ul>
<li>I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)</li>
<li>I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
<ul>
<li>I think I will check for similar titles, and if I find them I will print out the handles for verification</li>
<li>I could also proceed to check other metadata like type because those shouldn&rsquo;t vary too much</li>
</ul>
</li>
<li>I ran my new <code>check-duplicates.py</code> script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
<ul>
<li>Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!</li>
<li>This is with the similarity threshold at 0.5. I wonder if tweaking that higher will make the script run faster and eliminate some false positives</li>
<li>I re-ran it with higher thresholds, which eliminated all false positives, but it still took 24 to 25 minutes to run for 292 items!
<ul>
<li>0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total</li>
<li>0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total</li>
<li>0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total</li>
</ul>
</li>
</ul>
</li>
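<li>To get a feel for what the threshold means, here is a rough Python approximation of the trigram score. It is a sketch of how I understand <code>pg_trgm</code> to work (lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, then divide shared trigrams by all trigrams), not the extension&rsquo;s actual code, and it only handles ASCII letters and digits:
<pre tabindex="0"><code class="language-python" data-lang="python">import re


def trigrams(text):
    """Return the set of padded, lowercased trigrams for each word in text."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Two leading spaces and one trailing space, as pg_trgm pads words
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Approximate similarity(): shared trigrams divided by all trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))
</code></pre>
</li>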
<li>Some minor updates to csv-metadata-quality
<ul>
<li>Fix two issues with regular expressions in the duplicate items and experimental language checks</li>
<li>Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field</li>
</ul>
</li>
<li>Then I ran this new version of csv-metadata-quality on an export of IWMI&rsquo;s community, minus some fields I don&rsquo;t want to check:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -C <span style="color:#e6db74">&#39;dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-to-check.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</span></span></code></pre></div><ul>
<li>I noticed DSpace only detected 10 or 20 corrections in each CSV, and none of the duplicate metadata value removals were among them&hellip;
<ul>
<li>I cut a subset of the fields from the main CSV and tried again, but DSpace said &ldquo;no changes detected&rdquo;</li>
<li>The duplicates are definitely removed from the CSV, but DSpace doesn&rsquo;t detect them</li>
<li>I realized this is an issue I&rsquo;ve had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!</li>
<li>I found a comment from helix84 in a 2015 thread on the dspace-tech mailing list (&ldquo;No changes were detected&rdquo; when importing metadata via XMLUI) where he says:</li>
</ul>
</li>
</ul>
<blockquote>
<p>It&rsquo;s very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.</p>
</blockquote>
<ul>
<li>Shit, so that&rsquo;s worth looking into&hellip;</li>
</ul>
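<ul>
<li>A toy Python illustration of why that would hide my changes. This only models the behavior helix84 describes; it is not DSpace&rsquo;s code, and the subject values are made up:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def changed_as_list(old, new):
    """Comparison that is sensitive to order and to duplicates."""
    return list(old) != list(new)


def changed_as_set(old, new):
    """Comparison that ignores order and duplicates."""
    return set(old) != set(new)


# Hypothetical subject values before and after removing a duplicate
before = ["WATER MANAGEMENT", "IRRIGATION", "WATER MANAGEMENT"]
after = ["WATER MANAGEMENT", "IRRIGATION"]

print(changed_as_list(before, after))  # the removal is visible
print(changed_as_set(before, after))   # the removal is invisible
</code></pre>
<ul>
<li>If the import compares values as a set, removing a duplicate leaves the set unchanged, which would explain the &ldquo;no changes detected&rdquo; message</li>
</ul>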
<h2 id="2021-10-07">2021-10-07</h2>
<ul>
<li>I decided to upload the cleaned IWMI community by moving the cleaned metadata field from <code>dcterms.subject[en_US]</code> to <code>dcterms.subject[en_Fu]</code> temporarily, uploading them, then moving them back, and uploading again
<ul>
<li>I started by copying just a handful of fields from the iwmi.csv community export:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
</span></span><span style="display:flex;"><span># Copy and blank columns in OpenRefine
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</span></span></code></pre></div><ul>
<li>It takes a few hours per 2,000 items because DSpace processes them so slowly&hellip; sigh&hellip;</li>
</ul>
<h2 id="2021-10-08">2021-10-08</h2>
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 2603711
</span></span><span style="display:flex;"><span> en_Fu | 115568
</span></span><span style="display:flex;"><span> en | 8818
</span></span><span style="display:flex;"><span> | 5286
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vn | 2
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(7 rows)
</span></span><span style="display:flex;"><span>cgspace=# BEGIN;
</span></span><span style="display:flex;"><span>cgspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en_Fu&#39;, &#39;en&#39;, &#39;&#39;);
</span></span><span style="display:flex;"><span>UPDATE 129673
</span></span><span style="display:flex;"><span>cgspace=# COMMIT;
</span></span></code></pre></div><ul>
<li>So all this effort was to remove ~400 duplicate metadata values in the IWMI community, hmmm:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>391
</span></span></code></pre></div><ul>
<li>I tried to export ILRI&rsquo;s community, but ran into the export bug (DS-4211)
<ul>
<li>After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>32070
</span></span><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort -u | wc -l
</span></span><span style="display:flex;"><span>19315
</span></span></code></pre></div><ul>
<li>It seems there are only about 200 duplicate values in this subset of fields in ILRI&rsquo;s community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>220
</span></span></code></pre></div><ul>
<li>I found a cool way to select only the items with corrections
<ul>
<li>First, extract a handful of fields from the CSV with csvcut</li>
<li>Second, clean the CSV with csv-metadata-quality</li>
<li>Third, rename the columns to something obvious in the cleaned CSV</li>
<li>Fourth, use csvjoin to merge the cleaned file with the original</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ sed -i -e <span style="color:#e6db74">&#39;1s/en_US/en_Fu/g&#39;</span> /tmp/ilri-deduplicated-items-cleaned.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv &gt; /tmp/ilri-deduplicated-items-cleaned-joined.csv
</span></span></code></pre></div><ul>
<li>Then I imported the file into OpenRefine and used a custom text facet with a GREL expression like this to identify the rows with changes:</li>
</ul>
<pre tabindex="0"><code>if(cells[&#39;dcterms.subject[en_US]&#39;].value == cells[&#39;dcterms.subject[en_Fu]&#39;].value,&#34;same&#34;,&#34;different&#34;)
</code></pre><ul>
<li>I starred these rows and then blanked out the original field so DSpace would see it as a removal and add the new column
<ul>
<li>After these are uploaded I will normalize the <code>text_lang</code> fields in PostgreSQL again</li>
</ul>
</li>
<li>I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>7720
</span></span></code></pre></div><ul>
<li>I applied these to the CIAT community, so in total that&rsquo;s over 8,000 duplicate metadata values removed in a handful of fields&hellip;</li>
</ul>
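<ul>
<li>The csvjoin and GREL comparison from above could also be done in one pass in Python. This is a sketch, assuming both files share an <code>id</code> column and the same field name (so without renaming the cleaned columns to <code>en_Fu</code>); <code>changed_ids</code> is a hypothetical helper:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def changed_ids(original_rows, cleaned_rows, field):
    """Return the ids whose value for field differs between the two files."""
    cleaned_by_id = {row["id"]: row[field] for row in cleaned_rows}
    changed = []
    for row in original_rows:
        item_id = row["id"]
        # Only compare items that are present in both files
        if item_id in cleaned_by_id and row[field] != cleaned_by_id[item_id]:
            changed.append(item_id)
    return changed


# Usage with the files from the commands above:
# import csv
# with open("/tmp/ilri-deduplicated-items.csv", newline="") as a, \
#      open("/tmp/ilri-deduplicated-items-cleaned.csv", newline="") as b:
#     print(changed_ids(csv.DictReader(a), list(csv.DictReader(b)),
#                       "dcterms.subject[en_US]"))
</code></pre>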
<h2 id="2021-10-09">2021-10-09</h2>
<ul>
<li>I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there</li>
<li>Also of note, there are some other fixes too, for example in IITA&rsquo;s community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c -E <span style="color:#e6db74">&#39;(Fixing|Removing) (duplicate|excessive|invalid)&#39;</span> /tmp/out.log
</span></span><span style="display:flex;"><span>249
</span></span></code></pre></div><ul>
<li>I ran a full Discovery re-indexing on CGSpace</li>
<li>Then I exported all of CGSpace and extracted the ISSNs and ISBNs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]&#39;</span> /tmp/cgspace.csv &gt; /tmp/cgspace-issn-isbn.csv
</span></span></code></pre></div><ul>
<li>I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs</li>
</ul>
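<ul>
<li>One way to spot invalid ISSNs is to verify the check digit. A minimal Python sketch of the standard ISSN check (weights 8 down to 2 on the first seven digits, modulo 11, with <code>X</code> standing for 10); this is an illustration, not necessarily how the cleanup above was done, and <code>issn_is_valid</code> is a hypothetical helper:</li>
</ul>
<pre tabindex="0"><code class="language-python" data-lang="python">def issn_is_valid(issn):
    """Check the length, digits, and check digit of an ISSN like '0378-5955'."""
    digits = issn.replace("-", "").upper()
    if len(digits) != 8:
        return False
    if not all(c in "0123456789" for c in digits[:7]):
        return False
    # Weighted sum of the first seven digits with weights 8, 7, ..., 2
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected
</code></pre>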
<h2 id="2021-10-10">2021-10-10</h2>
<ul>
<li>Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on <code>metadata-export</code> (DS-4211)</li>
<li>First create a new PostgreSQL 13 container:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
</span></span><span style="display:flex;"><span>$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#39;CREATE EXTENSION pgcrypto;&#39;</span>
</span></span></code></pre></div><ul>
<li>Then edit the settings in <code>dspace/config/local.cfg</code> and build the backend server with Java 11:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ mvn package
</span></span><span style="display:flex;"><span>$ cd dspace/target/dspace-installer
</span></span><span style="display:flex;"><span>$ ant fresh_install
</span></span><span style="display:flex;"><span># fix database not being fully ready, causing Tomcat to fail to start the server application
</span></span><span style="display:flex;"><span>$ ~/dspace7/bin/dspace database migrate
</span></span></code></pre></div><ul>
<li>Copy Solr configs and start Solr:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
</span></span><span style="display:flex;"><span>$ ~/src/solr-8.8.2/bin/solr start
</span></span></code></pre></div><ul>
<li>Start my local Tomcat 9 instance:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user start tomcat9@dspace7
</span></span></code></pre></div><ul>
<li>This works, so now I will drop the default database and import a dump from CGSpace</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user stop tomcat9@dspace7
</span></span><span style="display:flex;"><span>$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest superuser;&#39;</span>
</span></span><span style="display:flex;"><span>$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest dspace-2021-10-09.backup
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest nosuperuser;&#39;</span>
</span></span></code></pre></div><ul>
<li>Delete Atmire migrations and some others that were &ldquo;unresolved&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE description LIKE &#39;%Atmire%&#39; OR description LIKE &#39;%CUA%&#39; OR description LIKE &#39;%cua%&#39;;&#34;</span>
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE version IN (&#39;5.0.2017.09.25&#39;, &#39;6.0.2017.01.30&#39;, &#39;6.0.2017.09.25&#39;);&#34;</span>
</span></span></code></pre></div><ul>
<li>Now DSpace 7 starts with my CGSpace data&hellip; nice
<ul>
<li>The Discovery indexing still takes seven hours&hellip; fuck</li>
</ul>
</li>
<li>I tested the <code>metadata-export</code> on DSpace 7.1-SNAPSHOT and it still has the duplicate items issue introduced by DS-4211
<ul>
<li>I filed a GitHub issue and notified nwoodward: <a href="https://github.com/DSpace/DSpace/issues/7988">https://github.com/DSpace/DSpace/issues/7988</a></li>
</ul>
</li>
<li>Start a full reindex on AReS</li>
</ul>
<h2 id="2021-10-11">2021-10-11</h2>
<ul>
<li>Start a full Discovery reindex on my local DSpace 6.3 instance:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
</span></span><span style="display:flex;"><span>Changes have been processed
</span></span><span style="display:flex;"><span>836140:6543.6
</span></span></code></pre></div><ul>
<li>So that&rsquo;s 1.8 hours versus 7 on DSpace 7, with the same database!</li>
<li>Several users wrote to me that CGSpace was slow recently
<ul>
<li>Looking at the PostgreSQL database I see connections look normal, but locks for <code>dspaceWeb</code> are high:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
</span></span><span style="display:flex;"><span>53
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
</span></span><span style="display:flex;"><span>1697
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceWeb&#39;&#34;</span> | wc -l
</span></span><span style="display:flex;"><span>1681
</span></span></code></pre></div><ul>
<li>Looking at Munin, I see there is indeed a higher number of locks starting on the morning of 2021-10-07:</li>
</ul>
<p><img src="/cgspace-notes/2021/10/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
<ul>
<li>The only thing I did on 2021-10-07 was import a few thousand metadata corrections&hellip;</li>
<li>I restarted PostgreSQL (instead of restarting Tomcat), so let&rsquo;s see if that helps</li>
<li>I filed <a href="https://github.com/DSpace/DSpace/issues/7989">a bug for the DSpace 6/7 duplicate values metadata import issue</a></li>
<li>I tested the two patches for removing abandoned submissions from the workflow but unfortunately it seems that they are for the configurable aka XML workflow, and we are using the basic workflow</li>
<li>I discussed PostgreSQL issues with some people on the DSpace Slack
<ul>
<li>Looking at postgresqltuner.pl and <a href="https://pgtune.leopard.in.ua">https://pgtune.leopard.in.ua</a> I realized that there were some settings that I hadn&rsquo;t changed in a few years that I probably need to re-evaluate</li>
<li>For example, <code>random_page_cost</code> is recommended to be 1.1 in the PostgreSQL 10 docs (default is 4.0, but we have used 1 since 2017, when it came up on Hacker News)</li>
<li>Also, <code>effective_io_concurrency</code> is recommended to be &ldquo;hundreds&rdquo; if you are using an SSD (default is 1)</li>
</ul>
</li>
<li>I also enabled the <code>pg_stat_statements</code> extension to try to understand what queries are being run the most often, and how long they take</li>
</ul>
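<p>As a sanity check on the reindex timing earlier in this entry, the <code>836140:6543.6</code> line comes from the <code>%M:%e</code> format of GNU <code>time</code>, that is, maximum resident set size in kilobytes and elapsed wall-clock seconds. A small Python sketch (the function name is hypothetical) converts it to friendlier units:</p>

```python
def parse_time_output(line):
    """Split '%M:%e' output into (max RSS in MiB, elapsed time in hours)."""
    max_rss_kb, elapsed_seconds = line.strip().split(":")
    # %M is reported in kilobytes and %e in seconds
    return int(max_rss_kb) / 1024, float(elapsed_seconds) / 3600


rss_mib, hours = parse_time_output("836140:6543.6")
print(f"{rss_mib:.1f} MiB, {hours:.2f} hours")
```

<p>This prints <code>816.5 MiB, 1.82 hours</code>, consistent with the 1.8 hours quoted above.</p>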
<h2 id="2021-10-12">2021-10-12</h2>
<ul>
<li>I looked again at the duplicate items query I was doing with trigrams recently and found a few new things
<ul>
<li>Looking at the <code>EXPLAIN ANALYZE</code> plan for the query I noticed it wasn&rsquo;t using any indexes</li>
<li>I <a href="https://dba.stackexchange.com/questions/103821/best-index-for-similarity-function/103823">read on StackExchange</a> that, if we want to make use of indexes, we need to use the similarity operator (<code>%</code>), not the function <code>similarity()</code> because &ldquo;index support is bound to operators in Postgres, not to functions&rdquo;</li>
<li>A note about the query plan output is that we need to read it from the bottom up!</li>
<li>So with the similarity operator we need to set the threshold like this now:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.5;
</span></span></code></pre></div><ul>
<li>Next I experimented with using GIN or GiST indexes on <code>metadatavalue</code>, but they were slower than the existing DSpace indexes
<ul>
<li>I tested a few variations of the query I had been using and found it&rsquo;s <em>much</em> faster if I use the similarity operator and keep the condition that object IDs are in the item table&hellip;</li>
</ul>
</li>
</ul>
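<p>For intuition about what the similarity threshold means, here is a rough Python approximation of how <code>pg_trgm</code> scores two strings: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character windows, and the score is the number of shared trigrams divided by the number of distinct trigrams overall. This is a sketch under assumptions (ASCII-only text, and the helper names <code>trigrams</code> and <code>similarity</code> are my own), not the actual implementation of the extension:</p>

```python
import re


def trigrams(text):
    """Approximate pg_trgm: lowercase, split on non-alphanumerics, pad each word."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (a Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))
```

<p>With this scoring, <code>similarity("word", "two words")</code> is 4/11, or about 0.36, while identical titles score 1.0 and therefore pass any threshold. The <code>%</code> operator is true when this score exceeds <code>pg_trgm.similarity_threshold</code>.</p>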
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
</span></span></code></pre></div><ul>
<li>Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!</li>
<li>I still don&rsquo;t understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate</li>
<li>So to summarize, here are the queries from best to worst, all returning the same result:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 683.165 ms
</span></span><span style="display:flex;"><span>Time: 635.364 ms
</span></span><span style="display:flex;"><span>Time: 674.666 ms
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 1584.765 ms (00:01.585)
</span></span><span style="display:flex;"><span>Time: 1665.594 ms (00:01.666)
</span></span><span style="display:flex;"><span>Time: 1623.726 ms (00:01.624)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 4028.939 ms (00:04.029)
</span></span><span style="display:flex;"><span>Time: 4022.239 ms (00:04.022)
</span></span><span style="display:flex;"><span>Time: 4061.820 ms (00:04.062)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
</span></span><span style="display:flex;"><span>localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 4358.713 ms (00:04.359)
</span></span><span style="display:flex;"><span>Time: 4301.248 ms (00:04.301)
</span></span><span style="display:flex;"><span>Time: 4417.909 ms (00:04.418)
</span></span></code></pre></div><h2 id="2021-10-13">2021-10-13</h2>
<ul>
<li>I looked into the <a href="https://github.com/DSpace/DSpace/issues/7946">REST API issue where fields without qualifiers throw an HTTP 500</a>
<ul>
<li>The fix is to check if the qualifier is not null AND not empty in dspace-api</li>
<li>I submitted a fix: <a href="https://github.com/DSpace/DSpace/pull/7993">https://github.com/DSpace/DSpace/pull/7993</a></li>
</ul>
</li>
</ul>
<h2 id="2021-10-14">2021-10-14</h2>
<ul>
<li>Someone in the DSpace community already posted a fix for the DSpace 6/7 duplicate items export bug!
<ul>
<li>I tested it and it works so I left feedback: <a href="https://github.com/DSpace/DSpace/pull/7995">https://github.com/DSpace/DSpace/pull/7995</a></li>
</ul>
</li>
<li>Altmetric support got back to us about the missing DOI-to-Handle link and said it was due to the TLS certificate chain on CGSpace
<ul>
<li>I checked and everything is actually working fine, so it could be their backend servers are old and don&rsquo;t support the new Let&rsquo;s Encrypt trust path</li>
<li>I asked them to put me in touch with their backend developers directly</li>
</ul>
</li>
</ul>
<h2 id="2021-10-17">2021-10-17</h2>
<ul>
<li>Revert the ssl-cert change on the Ansible infrastructure scripts so that nginx uses a manually generated &ldquo;snakeoil&rdquo; TLS certificate
<ul>
<li>The ssl-cert one is easier because it&rsquo;s automatic, but it includes the hostname in the bogus cert so it&rsquo;s an unnecessary leak of information</li>
</ul>
</li>
<li>I started doing some tests to upgrade Elasticsearch from 7.6.2 to 7.7, 7.8, 7.9, and eventually 7.10 on OpenRXV
<ul>
<li>I tested harvesting, reporting, filtering, and various admin actions with each version and they all worked fine, with no errors in any logs as far as I can see</li>
<li>This fixes a bunch of issues, updates Java from 13 to 15, and moves the base image from CentOS 7 to 8, so it pays off a decent amount of technical debt!</li>
<li>I even tried Elasticsearch 7.13.2, which has Java 16, and it works fine&hellip;</li>
<li>I submitted a pull request: <a href="https://github.com/ilri/OpenRXV/pull/126">https://github.com/ilri/OpenRXV/pull/126</a></li>
</ul>
</li>
</ul>
<h2 id="2021-10-20">2021-10-20</h2>
<ul>
<li>Meeting with Big Data and CGIAR repository players about the feasibility of moving to a single repository
<ul>
<li>We discussed several options, for example moving all DSpaces to CGSpace along with their permanent identifiers</li>
<li>The issue would be for centers like IFPRI, which don&rsquo;t use DSpace and have integrations between their current repository and their website, etc.</li>
</ul>
</li>
</ul>
<h2 id="2021-10-21">2021-10-21</h2>
<ul>
<li>Udana from IWMI contacted me to ask if I could do a one-off AReS harvest because they have some new items they need to report on</li>
</ul>
<h2 id="2021-10-22">2021-10-22</h2>
<ul>
<li>Abenet and others contacted me to say that the LDAP login was not working on CGSpace
<ul>
<li>I checked with <code>ldapsearch</code> and it is indeed not working:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;booo&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=fuuu)&#34;</span>
</span></span><span style="display:flex;"><span>Enter LDAP Password:
</span></span><span style="display:flex;"><span>ldap_bind: Invalid credentials (49)
</span></span><span style="display:flex;"><span> additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
</span></span></code></pre></div><ul>
<li>I sent a message to ILRI ICT to ask them to check the account
<ul>
<li>They reset the password so I ran all system updates and rebooted the server since users weren&rsquo;t able to log in anyways</li>
</ul>
</li>
</ul>
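<p>The <code>data 52e</code> fragment in the <code>ldapsearch</code> output above is an Active Directory sub-code for LDAP result 49. The Python sketch below maps the commonly documented sub-codes to readable reasons; the table comes from general Active Directory references rather than from this server, and the function name is hypothetical:</p>

```python
import re

# Commonly documented Active Directory sub-codes for LDAP result 49
AD_BIND_ERRORS = {
    "525": "user not found",
    "52e": "invalid credentials",
    "530": "not permitted to log on at this time",
    "531": "not permitted to log on at this workstation",
    "532": "password expired",
    "533": "account disabled",
    "701": "account expired",
    "773": "user must reset password",
    "775": "account locked out",
}


def explain_bind_error(message):
    """Return a readable reason for the 'data XXX' sub-code, or None if absent."""
    match = re.search(r"\bdata ([0-9a-fA-F]+)\b", message)
    if match is None:
        return None
    code = match.group(1).lower()
    return AD_BIND_ERRORS.get(code, f"unknown sub-code {code}")
```

<p>Applied to the message above it returns <code>invalid credentials</code>, which fits with ICT having to reset the password for the bind account.</p>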
<h2 id="2021-10-24">2021-10-24</h2>
<ul>
<li>CIP was asking about CGSpace stats again
<ul>
<li>The last time I helped them with this was in 2021-04, when I extracted stats for their community from the DSpace Statistics API</li>
</ul>
</li>
<li>In looking at the CIP stats request I got curious if there were any hits from all those Russian IPs before 2021-07 that I could purge
<ul>
<li>Sure enough there were a few hundred IPs belonging to those ASNs:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http <span style="color:#e6db74">&#39;localhost:8081/solr/statistics/select?q=time%3A2021-04*&amp;fl=ip&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&amp;facet.limit=200000&amp;facet.mincount=1&#39;</span> &gt; /tmp/2021-04-ips.json
</span></span><span style="display:flex;"><span># Hacky way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">&#39;</span>t figure out how to only print them and not the facet counts, so I just use grep and sed
</span></span><span style="display:flex;"><span>$ jq <span style="color:#e6db74">&#39;.facet_counts.facet_fields.ip[]&#39;</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">&#39;^&#34;&#39;</span> | sed -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> &gt; /tmp/ips.txt
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u &gt; /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>125 /tmp/networks-to-block.txt
</span></span><span style="display:flex;"><span>$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt &gt; /tmp/ips-to-purge.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/ips-to-purge.txt
</span></span><span style="display:flex;"><span>202
</span></span></code></pre></div><ul>
<li>Attempting to purge those only shows about 3,500 hits, but I will do it anyways
<ul>
<li>Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged</li>
</ul>
</li>
<li>I also purged another 5306 hits after checking the IPv4 list from AbuseIPDB.com</li>
</ul>
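<p>The jq, grep, and sed pipeline above works around the way Solr interleaves facet values and counts in one flat list. As an alternative sketch in Python, assuming that default flat layout and using made-up addresses from documentation-only ranges, the values can be sliced out directly and then filtered by network much like <code>grepcidr</code> does (both function names are hypothetical):</p>

```python
import ipaddress
import json


def facet_values(solr_response, field):
    """Return facet values only, dropping the counts Solr interleaves with them."""
    flat = solr_response["facet_counts"]["facet_fields"][field]
    # The list alternates value, count, value, count, ...
    return flat[0::2]


def ips_in_networks(ips, networks):
    """Keep the addresses that fall inside any of the given CIDR networks."""
    parsed = [ipaddress.ip_network(n, strict=False) for n in networks]
    return [ip for ip in ips if any(ipaddress.ip_address(ip) in net for net in parsed)]


# Hypothetical response using documentation-only address ranges
sample = json.loads(
    '{"facet_counts": {"facet_fields": {"ip": '
    '["192.0.2.10", 120, "198.51.100.7", 45, "203.0.113.99", 3]}}}'
)
ips = facet_values(sample, "ip")
print(ips_in_networks(ips, ["192.0.2.0/24", "203.0.113.0/24"]))
```

<p>This prints <code>['192.0.2.10', '203.0.113.99']</code>. In practice the response would be loaded from a file such as <code>/tmp/2021-04-ips.json</code> and the networks from <code>/tmp/networks-to-block.txt</code>.</p>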
<h2 id="2021-10-25">2021-10-25</h2>
<ul>
<li>Help CIP colleagues with view and download statistics for their community in 2020 and 2021</li>
</ul>
<h2 id="2021-10-27">2021-10-27</h2>
<ul>
<li>Help ICARDA colleagues with GLDC reports on AReS
<ul>
<li>There was an issue due to differences in CRP metadata between repositories</li>
</ul>
</li>
</ul>
<h2 id="2021-10-28">2021-10-28</h2>
<ul>
<li>Meeting with Medha and a bunch of others about the FAIRscribe tool they have been developing
<ul>
<li>Seems it is a submission tool like MEL</li>
</ul>
</li>
</ul>
<h2 id="2021-10-29">2021-10-29</h2>
<ul>
<li>Linode alerted me that CGSpace (linode18) has had high outbound traffic for the last two hours
<ul>
<li>This has happened a few other times this week so I decided to go look at the Solr stats for today</li>
<li>I see 93.158.91.62 is making thousands of requests to Discover with a normal user agent:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36
</code></pre><ul>
<li>Even more annoying, they are not re-using their session ID:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=&#39;</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>4888
</span></span></code></pre></div><ul>
<li>This IP has made 36,000 requests to CGSpace&hellip;</li>
<li>The IP is owned by <a href="https://internetvikings.com">Internet Vikings</a> in Sweden</li>
<li>I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent</li>
<li>I see another one in Sweden a few days ago (192.36.109.131), also using the same exact user agent as above, but belonging to <a href="http://webb.resilans.se/">Resilans AB</a>
<ul>
<li>I purged another 74,619 hits from this bot</li>
</ul>
</li>
<li>I added these two IPs to the nginx IP bot identifier</li>
<li>Jesus, I found a few Russian IPs attempting SQL injection and path traversal, for example:</li>
</ul>
<pre tabindex="0"><code>45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] &#34;GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&amp;OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1&#34; 200 143070 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11&#34;
</code></pre><ul>
<li>I reported them to AbuseIPDB.com and purged their hits:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
</span></span><span style="display:flex;"><span>Purging 6364 hits from 45.9.20.71 in statistics
</span></span><span style="display:flex;"><span>Purging 8039 hits from 45.146.166.157 in statistics
</span></span><span style="display:flex;"><span>Purging 3383 hits from 45.155.204.82 in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
</span></span></code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
<ul>
<li>Update Docker containers for AReS on linode20 and run a fresh harvest</li>
<li>Found a strange IP (94.71.3.44) making 51,000 requests today with the user agent &ldquo;Microsoft Internet Explorer&rdquo;
<ul>
<li>It is in Greece, and it seems to be requesting each item&rsquo;s XMLUI full metadata view, so I suspect it&rsquo;s Gardian actually</li>
<li>I found it making another 25,000 requests yesterday&hellip;</li>
<li>I purged them from Solr</li>
</ul>
</li>
<li>Found 20,000 hits from Qualys (according to AbuseIPDB.com) using normal user agents&hellip; ugh, must be some ILRI ICT scan</li>
<li>Found more requests from a Swedish IP (93.158.90.34) using that weird Firefox user agent that I noticed a few weeks ago:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0
</code></pre><ul>
<li>That&rsquo;s from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it&rsquo;s <a href="https://availo.se">Availo Networks AB</a></li>
<li>There&rsquo;s another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it&rsquo;s using a normal user agent</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
</span></span><span style="display:flex;"><span>3991
</span></span><span style="display:flex;"><span># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE &#39;GET /rest/(collections|handle|items)&#39; | sort | uniq -c
</span></span><span style="display:flex;"><span> 3154 GET /rest/collections
</span></span><span style="display:flex;"><span> 427 GET /rest/handle
</span></span><span style="display:flex;"><span> 410 GET /rest/items
</span></span></code></pre></div><ul>
<li>It requested the <a href="https://cgspace.cgiar.org/handle/10568/75560">CIAT Story Maps</a> collection over 3,000 times last month&hellip;
<ul>
<li>I will purge those hits</li>
</ul>
</li>
</ul>
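<p>One caveat about the <code>zgrep</code> commands above: the dots in the IP act as regex wildcards and the pattern can match anywhere in a line, so a slightly different address could be counted too. A stricter count can be sketched in Python, assuming the client address is the first field of each nginx log line (as in the access log excerpt from 2021-10-29) and that the lines have already been read from the rotated logs; the function name is hypothetical:</p>

```python
import re
from collections import Counter

ENDPOINT = re.compile(r"GET /rest/(collections|handle|items)")


def count_endpoints(lines, ip):
    """Count REST endpoint hits for one client, like grep -oE | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        # Match the address only as the first field, so that
        # 3.225.28.105 does not also match 13.225.28.105
        if not line.startswith(ip + " "):
            continue
        match = ENDPOINT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

<p>Compressed logs can be fed in line by line with <code>gzip.open(path, "rt")</code> from the standard library.</p>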
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2023-04/">April, 2023</a></li>
<li><a href="/cgspace-notes/2023-03/">March, 2023</a></li>
<li><a href="/cgspace-notes/2023-02/">February, 2023</a></li>
<li><a href="/cgspace-notes/2023-01/">January, 2023</a></li>
<li><a href="/cgspace-notes/2022-12/">December, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>