mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -136,15 +136,15 @@ So we have 1879/7100 (26.46%) matching already
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
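<p>The match rate above can also be computed directly from the matching CSV. The following is a minimal sketch, assuming the file has a <code>matched</code> column holding <code>true</code>/<code>false</code> (as the <code>csvgrep -c matched -m true</code> call suggests); the <code>organization</code> column name and the sample rows are made up for illustration.</p>

```python
import csv
import io


def match_rate(csv_file):
    """Return (matched, total) row counts from a CSV with a 'matched' column."""
    matched = 0
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        # Assumption: the column holds the strings "true" or "false"
        if row["matched"].strip().lower() == "true":
            matched += 1
    return matched, total


# Tiny made-up sample standing in for the real matching CSV
sample = io.StringIO(
    "organization,matched\n"
    "Example University,true\n"
    "Example Institute,false\n"
    "Example Centre,true\n"
    "Example Lab,false\n"
)
matched, total = match_rate(sample)
print(f"{matched}/{total} ({matched / total:.2%}) matching")
```

<p>Reading the CSV with a real parser avoids miscounting if an affiliation ever contains an embedded newline, which <code>wc -l</code> would count as an extra line.</p>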
<h2 id="2021-10-03">2021-10-03</h2>
@ -185,37 +185,37 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq > /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span># wc -l /tmp/mozilla-4.0-ips.txt
</span></span><span style="display:flex;"><span>543 /tmp/mozilla-4.0-ips.txt
</span></span></code></pre></div><ul>
<li>Then I resolved the IPs and extracted the ones belonging to Amazon:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">"</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">"</span> -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">"</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">"</span> -o /tmp/mozilla-4.0-ips.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</span></span></code></pre></div><ul>
<li>I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon</li>
<li>Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> 1592 GET /handle/10947/2526
1592 GET /handle/10947/2527
1592 GET /handle/10947/34
1593 GET /handle/10947/6
1594 GET /handle/10947/1
1598 GET /handle/10947/2515
1598 GET /handle/10947/2516
1599 GET /handle/10568/101335
1599 GET /handle/10568/91688
1599 GET /handle/10947/2517
1599 GET /handle/10947/2518
1599 GET /handle/10947/2519
1599 GET /handle/10947/2708
1599 GET /handle/10947/2871
1600 GET /handle/10568/89342
1600 GET /handle/10947/4467
1607 GET /handle/10568/103816
290382 GET /handle/10568/83389
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span> 1592 GET /handle/10947/2526
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/2527
</span></span><span style="display:flex;"><span> 1592 GET /handle/10947/34
</span></span><span style="display:flex;"><span> 1593 GET /handle/10947/6
</span></span><span style="display:flex;"><span> 1594 GET /handle/10947/1
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2515
</span></span><span style="display:flex;"><span> 1598 GET /handle/10947/2516
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/101335
</span></span><span style="display:flex;"><span> 1599 GET /handle/10568/91688
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2517
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2518
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2519
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2708
</span></span><span style="display:flex;"><span> 1599 GET /handle/10947/2871
</span></span><span style="display:flex;"><span> 1600 GET /handle/10568/89342
</span></span><span style="display:flex;"><span> 1600 GET /handle/10947/4467
</span></span><span style="display:flex;"><span> 1607 GET /handle/10568/103816
</span></span><span style="display:flex;"><span> 290382 GET /handle/10568/83389
</span></span></code></pre></div><ul>
<li>Before I purge all of those I will ask Samuel Stacey from the System Office, hopefully to get some insight…</li>
<li>Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR</li>
<li>Meeting with Michelle from Altmetric about their new CSV upload system
@ -231,10 +231,10 @@ $ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ip
</code></pre><ul>
<li>Extract the AGROVOC subjects from IWMI’s 292 publications to validate them against AGROVOC:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'dcterms.subject[en_US]'</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">'s/||/\n/g'</span> -e <span style="color:#e6db74">'s/"//g'</span> | sort -u > /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c <span style="color:#e6db74">'number of matches'</span> -m <span style="color:#e6db74">'0'</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> > /tmp/invalid-agrovoc.csv
</code></pre></div><h2 id="2021-10-05">2021-10-05</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'dcterms.subject[en_US]'</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">'s/||/\n/g'</span> -e <span style="color:#e6db74">'s/"//g'</span> | sort -u > /tmp/agrovoc.txt
</span></span><span style="display:flex;"><span>$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c <span style="color:#e6db74">'number of matches'</span> -m <span style="color:#e6db74">'0'</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> > /tmp/invalid-agrovoc.csv
</span></span></code></pre></div><h2 id="2021-10-05">2021-10-05</h2>
<ul>
<li>Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
<ul>
@ -243,11 +243,11 @@ $ csvgrep -c <span style="color:#e6db74">'number of matches'</span> -m <
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</code></pre></div><h2 id="2021-10-06">2021-10-06</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</span></span></code></pre></div><h2 id="2021-10-06">2021-10-06</h2>
<ul>
<li>Thinking about how we could check for duplicates before importing
<ul>
@ -255,14 +255,14 @@ $ csvgrep -c <span style="color:#e6db74">'number of matches'</span> -m <
</ul>
</li>
</ul>
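<p>For checking titles before they ever reach the database, trigram similarity can be approximated in plain Python. This is a rough sketch of the idea behind <code>pg_trgm</code>'s <code>SIMILARITY()</code>, not a faithful reimplementation: it only handles ASCII letters and digits, and the helper names (<code>trigrams</code>, <code>similarity</code>, <code>find_similar</code>) are hypothetical.</p>

```python
import re


def trigrams(text):
    """Build a set of trigrams roughly the way pg_trgm does: lowercase,
    split into alphanumeric words, pad each word with two leading spaces
    and one trailing space, then take every 3-character window."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_similar(title, candidates, threshold=0.5):
    """Return the candidates whose similarity to title exceeds threshold."""
    return [c for c in candidates if similarity(title, c) > threshold]


title = ("Molecular marker based genetic diversity assessment of "
         "Striga resistant maize inbred lines")
existing = [title.upper(), "An unrelated report about water management"]
print(find_similar(title, existing))
```

<p>Comparing every new title against every existing one in Python is quadratic, so for a whole repository the database approach with a trigram index is likely the better fit; the sketch is mainly useful for small batches.</p>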
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > CREATE EXTENSION pg_trgm;
</span></span><span style="display:flex;"><span>localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
</span></span><span style="display:flex;"><span> metadata_value_id │ text_value │ dspace_object_id
</span></span><span style="display:flex;"><span>───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
</span></span><span style="display:flex;"><span> 3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
</span></span><span style="display:flex;"><span> 3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
</span></span><span style="display:flex;"><span>(2 rows)
</span></span></code></pre></div><ul>
<li>I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)</li>
<li>I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
<ul>
@ -291,10 +291,10 @@ localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id
</li>
<li>Then I ran this new version of csv-metadata-quality on an export of IWMI’s community, minus some fields I don’t want to check:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -C <span style="color:#e6db74">'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection'</span> ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -C <span style="color:#e6db74">'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection'</span> ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</span></span></code></pre></div><ul>
<li>I noticed each CSV only had 10 or 20 corrections, and in particular none of the duplicate metadata values were removed in the CSVs…
<ul>
<li>I cut a subset of the fields from the main CSV and tried again, but DSpace said “no changes detected”</li>
@ -319,54 +319,54 @@ Try doing it in two imports. In first import, remove all authors. In second impo
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]'</span> ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]'</span> ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
</span></span><span style="display:flex;"><span># Copy and blank columns in OpenRefine
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</span></span></code></pre></div><ul>
<li>It takes a few hours per 2,000 items because DSpace processes them so slowly… sigh…</li>
</ul>
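<p>The batching step can be sketched in Python for cases where <code>xsv</code> is not available. This is an illustration of the idea only, assuming each batch needs its own copy of the header row; the function name <code>split_rows</code> and the sample data are made up.</p>

```python
import csv
import io


def split_rows(reader, size):
    """Yield (header, rows) chunks of at most `size` data rows each."""
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to split
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == size:
            yield header, chunk
            chunk = []
    if chunk:
        # Final, possibly shorter chunk
        yield header, chunk


# Made-up sample: five items split into batches of two
sample = io.StringIO("id,title\n1,a\n2,b\n3,c\n4,d\n5,e\n")
for number, (header, rows) in enumerate(split_rows(csv.reader(sample), 2)):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    print(f"batch {number}: {len(rows)} rows")
```

<p>In real use the <code>io.StringIO</code> objects would be replaced with files opened with <code>newline=""</code>, and the batch size with 2000.</p>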
<h2 id="2021-10-08">2021-10-08</h2>
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
</span></span><span style="display:flex;"><span> text_lang | count
</span></span><span style="display:flex;"><span>-----------+---------
</span></span><span style="display:flex;"><span> en_US | 2603711
</span></span><span style="display:flex;"><span> en_Fu | 115568
</span></span><span style="display:flex;"><span> en | 8818
</span></span><span style="display:flex;"><span> | 5286
</span></span><span style="display:flex;"><span> fr | 2
</span></span><span style="display:flex;"><span> vn | 2
</span></span><span style="display:flex;"><span> | 0
</span></span><span style="display:flex;"><span>(7 rows)
</span></span><span style="display:flex;"><span>cgspace=# BEGIN;
</span></span><span style="display:flex;"><span>cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
</span></span><span style="display:flex;"><span>UPDATE 129673
</span></span><span style="display:flex;"><span>cgspace=# COMMIT;
</span></span></code></pre></div><ul>
<li>So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
391
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
</span></span><span style="display:flex;"><span>391
</span></span></code></pre></div><ul>
<li>I tried to export ILRI’s community, but ran into the export bug (DS-4211)
<ul>
<li>After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">'1d'</span> | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="color:#e6db74">'1d'</span> | wc -l
19315
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">'1d'</span> | wc -l
</span></span><span style="display:flex;"><span>32070
</span></span><span style="display:flex;"><span>$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="color:#e6db74">'1d'</span> | wc -l
</span></span><span style="display:flex;"><span>19315
</span></span></code></pre></div><ul>
<li>It seems there are only about 200 duplicate values in this subset of fields in ILRI’s community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
220
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
</span></span><span style="display:flex;"><span>220
</span></span></code></pre></div><ul>
<li>I found a cool way to select only the items with corrections
<ul>
<li>First, extract a handful of fields from the CSV with csvcut</li>
@ -376,14 +376,14 @@ $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="col
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]'</span> /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e <span style="color:#e6db74">'1s/en_US/en_Fu/g'</span> /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]'</span> /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
</span></span><span style="display:flex;"><span>$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
</span></span><span style="display:flex;"><span>$ sed -i -e <span style="color:#e6db74">'1s/en_US/en_Fu/g'</span> /tmp/ilri-deduplicated-items-cleaned.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
</span></span></code></pre></div><ul>
<li>Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:</li>
</ul>
<pre tabindex="0"><code>if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
<pre tabindex="0"><code>if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
</code></pre><ul>
<li>I starred these rows, then blanked out the original field so DSpace would see it as a removal, and added the new column
<ul>
@ -392,9 +392,9 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
</li>
<li>I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
7720
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c <span style="color:#e6db74">'Removing duplicate value'</span> /tmp/out.log
</span></span><span style="display:flex;"><span>7720
</span></span></code></pre></div><ul>
<li>I applied these to the CIAT community, so in total that’s over 8,000 duplicate metadata values removed in a handful of fields…</li>
</ul>
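<p>The "select only the items with corrections" step can also be done without OpenRefine by comparing the original and cleaned CSVs by <code>id</code>. The sketch below is an alternative to the <code>csvjoin</code> plus GREL facet approach, not the method used in these notes; the function name <code>changed_ids</code>, the <code>subject</code> column, and the sample rows are made up.</p>

```python
import csv
import io


def changed_ids(original_file, cleaned_file, field):
    """Return the ids whose `field` value differs between the two CSVs."""
    cleaned = {row["id"]: row[field] for row in csv.DictReader(cleaned_file)}
    changed = []
    for row in csv.DictReader(original_file):
        item_id = row["id"]
        # Only items present in both files can be compared
        if item_id in cleaned and cleaned[item_id] != row[field]:
            changed.append(item_id)
    return changed


# Made-up sample: item 1 has a duplicate value that the cleaner removed
original = io.StringIO("id,subject\n1,a||a||b\n2,c\n")
cleaned = io.StringIO("id,subject\n1,a||b\n2,c\n")
print(changed_ids(original, cleaned, "subject"))
```

<p>Only the returned ids then need to be re-imported, which keeps the batches small.</p>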
<h2 id="2021-10-09">2021-10-09</h2>
@ -402,14 +402,14 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
<li>I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there</li>
<li>Also of note, there are some other fixes too, for example in IITA’s community:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c -E <span style="color:#e6db74">'(Fixing|Removing) (duplicate|excessive|invalid)'</span> /tmp/out.log
249
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep -c -E <span style="color:#e6db74">'(Fixing|Removing) (duplicate|excessive|invalid)'</span> /tmp/out.log
</span></span><span style="display:flex;"><span>249
</span></span></code></pre></div><ul>
<li>I ran a full Discovery re-indexing on CGSpace</li>
<li>Then I exported all of CGSpace and extracted the ISSNs and ISBNs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]'</span> /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]'</span> /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs</li>
|
||||
</ul>
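<p>A minimal sketch of how invalid or mixed identifiers like these could be flagged automatically. It assumes the standard ISSN (mod 11) and ISBN-13 (mod 10) check-digit rules. The function names are hypothetical and are not part of the scripts used in these notes.</p>

```python
import re


def is_valid_issn(value: str) -> bool:
    """Check an ISSN such as '0378-5955' using its mod-11 check digit."""
    match = re.fullmatch(r"(\d{4})-?(\d{3})([\dXx])", value.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run from 8 down to 2 over the first seven digits
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def is_valid_isbn13(value: str) -> bool:
    """Check an ISBN-13 such as '978-0-306-40615-7' using its mod-10 check digit."""
    digits = re.sub(r"[\s-]", "", value)
    if not re.fullmatch(r"\d{13}", digits):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def classify(value: str) -> str:
    """Return 'issn', 'isbn13' or 'invalid' (hypothetical labels)."""
    if is_valid_issn(value):
        return "issn"
    if is_valid_isbn13(value):
        return "isbn13"
    return "invalid"
```

<p>Running <code>classify()</code> over each value in an ISSN column would surface ISBNs that were entered in the wrong field, as well as values that fail their check digit.</p>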
|
||||
<h2 id="2021-10-10">2021-10-10</h2>
|
||||
@ -417,42 +417,42 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
|
||||
<li>Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on <code>metadata-export</code> (DS-4211)</li>
|
||||
<li>First create a new PostgreSQL 13 container:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
|
||||
$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
|
||||
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
|
||||
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">'CREATE EXTENSION pgcrypto;'</span>
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
|
||||
</span></span><span style="display:flex;"><span>$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
|
||||
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
|
||||
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">'CREATE EXTENSION pgcrypto;'</span>
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Then edit the settings in <code>dspace/config/local.cfg</code> and build the backend server with Java 11:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mvn package
|
||||
$ cd dspace/target/dspace-installer
|
||||
$ ant fresh_install
|
||||
# fix database not being fully ready, causing Tomcat to fail to start the server application
|
||||
$ ~/dspace7/bin/dspace database migrate
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ mvn package
|
||||
</span></span><span style="display:flex;"><span>$ cd dspace/target/dspace-installer
|
||||
</span></span><span style="display:flex;"><span>$ ant fresh_install
|
||||
</span></span><span style="display:flex;"><span># fix database not being fully ready, causing Tomcat to fail to start the server application
|
||||
</span></span><span style="display:flex;"><span>$ ~/dspace7/bin/dspace database migrate
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Copy Solr configs and start Solr:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
|
||||
$ ~/src/solr-8.8.2/bin/solr start
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
|
||||
</span></span><span style="display:flex;"><span>$ ~/src/solr-8.8.2/bin/solr start
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Start my local Tomcat 9 instance:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user start tomcat9@dspace7
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user start tomcat9@dspace7
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>This works, so now I will drop the default database and import a dump from CGSpace</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user stop tomcat9@dspace7
|
||||
$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
|
||||
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
|
||||
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">'alter user dspacetest superuser;'</span>
|
||||
$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest -h localhost dspace-2021-10-09.backup
|
||||
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">'alter user dspacetest nosuperuser;'</span>
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ systemctl --user stop tomcat9@dspace7
|
||||
</span></span><span style="display:flex;"><span>$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
|
||||
</span></span><span style="display:flex;"><span>$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
|
||||
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">'alter user dspacetest superuser;'</span>
|
||||
</span></span><span style="display:flex;"><span>$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest -h localhost dspace-2021-10-09.backup
|
||||
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">'alter user dspacetest nosuperuser;'</span>
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Delete Atmire migrations and some others that were “unresolved”:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">"DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"</span>
|
||||
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">"DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"</span>
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">"DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"</span>
|
||||
</span></span><span style="display:flex;"><span>$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">"DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"</span>
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Now DSpace 7 starts with my CGSpace data… nice
|
||||
<ul>
|
||||
<li>The Discovery indexing still takes seven hours… fuck</li>
|
||||
@ -469,11 +469,11 @@ $ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspac
|
||||
<ul>
|
||||
<li>Start a full Discovery reindex on my local DSpace 6.3 instance:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
|
||||
Loading @mire database changes for module MQM
|
||||
Changes have been processed
|
||||
836140:6543.6
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
|
||||
</span></span><span style="display:flex;"><span>Loading @mire database changes for module MQM
|
||||
</span></span><span style="display:flex;"><span>Changes have been processed
|
||||
</span></span><span style="display:flex;"><span>836140:6543.6
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>So that’s 1.8 hours versus 7 on DSpace 7, with the same database!</li>
|
||||
<li>Several users wrote to me that CGSpace was slow recently
|
||||
<ul>
|
||||
@ -481,13 +481,13 @@ Changes have been processed
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_stat_activity'</span> | wc -l
|
||||
53
|
||||
$ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid"</span> | wc -l
|
||||
1697
|
||||
$ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'"</span> | wc -l
|
||||
1681
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_stat_activity'</span> | wc -l
|
||||
</span></span><span style="display:flex;"><span>53
|
||||
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid"</span> | wc -l
|
||||
</span></span><span style="display:flex;"><span>1697
|
||||
</span></span><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'"</span> | wc -l
|
||||
</span></span><span style="display:flex;"><span>1681
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Looking at Munin, I see there is indeed a higher number of locks starting on the morning of 2021-10-07:</li>
|
||||
</ul>
|
||||
<p><img src="/cgspace-notes/2021/10/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
|
||||
@ -516,71 +516,71 @@ $ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN p
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Next I experimented with using GIN or GiST indexes on <code>metadatavalue</code>, but they were slower than the existing DSpace indexes
|
||||
<ul>
|
||||
<li>I tested a few variations of the query I had been using and found it’s <em>much</em> faster if I use the similarity operator and keep the condition that object IDs are in the item table…</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
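<p>For intuition about what the similarity operator compares against <code>pg_trgm.similarity_threshold</code>, here is a rough sketch of trigram similarity in Python. It is an approximation that assumes ASCII alphanumeric words and the padding rules described in the pg_trgm documentation (two spaces before each word, one after). It is not the extension's actual implementation.</p>

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm's trigram set: lowercase the text, split it into
    alphanumeric words, pad each word with two leading spaces and one
    trailing space, then take every three-character window."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Roughly what `a % b` tests once the threshold has been SET.
    The 0.5 default mirrors the value set in these notes."""
    return similarity(a, b) >= threshold
```

<p>With this definition, two titles that differ only in case or punctuation score 1.0, while unrelated titles score close to 0.</p>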
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
text_value │ dspace_object_id
|
||||
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
(1 row)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
|
||||
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
</span></span><span style="display:flex;"><span>(1 row)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!</li>
|
||||
<li>I still don’t understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate</li>
|
||||
<li>So to summarize, here are the queries ordered from fastest to slowest, all returning the same result:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
|
||||
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
text_value │ dspace_object_id
|
||||
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
(1 row)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 683.165 ms
|
||||
Time: 635.364 ms
|
||||
Time: 674.666 ms
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
|
||||
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
text_value │ dspace_object_id
|
||||
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
(1 row)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 1584.765 ms (00:01.585)
|
||||
Time: 1665.594 ms (00:01.666)
|
||||
Time: 1623.726 ms (00:01.624)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
|
||||
text_value │ dspace_object_id
|
||||
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
(1 row)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4028.939 ms (00:04.029)
|
||||
Time: 4022.239 ms (00:04.022)
|
||||
Time: 4061.820 ms (00:04.062)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
|
||||
text_value │ dspace_object_id
|
||||
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
(1 row)
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4358.713 ms (00:04.359)
|
||||
Time: 4301.248 ms (00:04.301)
|
||||
Time: 4417.909 ms (00:04.418)
|
||||
</code></pre></div><h2 id="2021-10-13">2021-10-13</h2>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
|
||||
</span></span><span style="display:flex;"><span>localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
|
||||
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
</span></span><span style="display:flex;"><span>(1 row)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 683.165 ms
|
||||
</span></span><span style="display:flex;"><span>Time: 635.364 ms
|
||||
</span></span><span style="display:flex;"><span>Time: 674.666 ms
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
</span></span><span style="display:flex;"><span>localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
|
||||
</span></span><span style="display:flex;"><span>localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
|
||||
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
|
||||
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
</span></span><span style="display:flex;"><span>(1 row)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 1584.765 ms (00:01.585)
|
||||
</span></span><span style="display:flex;"><span>Time: 1665.594 ms (00:01.666)
|
||||
</span></span><span style="display:flex;"><span>Time: 1623.726 ms (00:01.624)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
</span></span><span style="display:flex;"><span>localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
|
||||
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
|
||||
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
</span></span><span style="display:flex;"><span>(1 row)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 4028.939 ms (00:04.029)
|
||||
</span></span><span style="display:flex;"><span>Time: 4022.239 ms (00:04.022)
|
||||
</span></span><span style="display:flex;"><span>Time: 4061.820 ms (00:04.062)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= > DISCARD ALL;
|
||||
</span></span><span style="display:flex;"><span>localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
|
||||
</span></span><span style="display:flex;"><span> text_value │ dspace_object_id
|
||||
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
|
||||
</span></span><span style="display:flex;"><span> Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
|
||||
</span></span><span style="display:flex;"><span>(1 row)
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 4358.713 ms (00:04.359)
|
||||
</span></span><span style="display:flex;"><span>Time: 4301.248 ms (00:04.301)
|
||||
</span></span><span style="display:flex;"><span>Time: 4417.909 ms (00:04.418)
|
||||
</span></span></code></pre></div><h2 id="2021-10-13">2021-10-13</h2>
|
||||
<ul>
|
||||
<li>I looked into the <a href="https://github.com/DSpace/DSpace/issues/7946">REST API issue where fields without qualifiers throw an HTTP 500</a>
|
||||
<ul>
|
||||
@ -640,11 +640,11 @@ Time: 4417.909 ms (00:04.418)
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">"dc=cgiarad,dc=org"</span> -D <span style="color:#e6db74">"booo"</span> -W <span style="color:#e6db74">"(sAMAccountName=fuuu)"</span>
|
||||
Enter LDAP Password:
|
||||
ldap_bind: Invalid credentials (49)
|
||||
additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">"dc=cgiarad,dc=org"</span> -D <span style="color:#e6db74">"booo"</span> -W <span style="color:#e6db74">"(sAMAccountName=fuuu)"</span>
|
||||
</span></span><span style="display:flex;"><span>Enter LDAP Password:
|
||||
</span></span><span style="display:flex;"><span>ldap_bind: Invalid credentials (49)
|
||||
</span></span><span style="display:flex;"><span> additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>I sent a message to ILRI ICT to ask them to check the account
|
||||
<ul>
|
||||
<li>They reset the password so I ran all system updates and rebooted the server since users weren’t able to log in anyways</li>
|
||||
@ -664,17 +664,17 @@ ldap_bind: Invalid credentials (49)
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
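<p>A minimal sketch of pulling only the IP strings out of a Solr facet response and keeping those that fall inside a list of CIDR networks, which is what the jq, sed and grepcidr steps in this entry do. It assumes Solr's default flat JSON facet layout, where values and counts alternate in one list. The function names and the sample addresses (documentation ranges) are hypothetical.</p>

```python
import ipaddress


def facet_values(solr_response: dict, field: str) -> list:
    """Return only the facet values, skipping the counts.

    Assumes the flat layout [value1, count1, value2, count2, ...]."""
    flat = solr_response["facet_counts"]["facet_fields"][field]
    return flat[::2]


def ips_in_networks(ips: list, networks: list) -> list:
    """Keep the IPs that belong to at least one of the given CIDR networks."""
    nets = [ipaddress.ip_network(n, strict=False) for n in networks]
    return [
        ip for ip in ips
        if any(ipaddress.ip_address(ip) in net for net in nets)
    ]


# Hypothetical sample data, not taken from the real statistics core
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "ip": ["192.0.2.10", 5, "198.51.100.7", 2, "2001:db8::1", 1]
        }
    }
}
sample_networks = ["192.0.2.0/24", "2001:db8::/32"]

all_ips = facet_values(sample_response, "ip")
ips_to_purge = ips_in_networks(all_ips, sample_networks)
```

<p>Slicing with <code>[::2]</code> avoids the need to strip quotes or filter out the counts afterwards.</p>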
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ http <span style="color:#e6db74">'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1'</span> > /tmp/2021-04-ips.json
|
||||
# Ghetto way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">'</span>t figure out how to only print them and not the facet counts, so I just use sed
|
||||
$ jq <span style="color:#e6db74">'.facet_counts.facet_fields.ip[]'</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">'^"'</span> | sed -e <span style="color:#e6db74">'s/"//g'</span> > /tmp/ips.txt
|
||||
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
|
||||
$ csvgrep -c asn -r <span style="color:#e6db74">'^(49453|46844|206485|62282|36352|35913|35624|8100)$'</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
|
||||
$ wc -l /tmp/networks-to-block.txt
|
||||
125 /tmp/networks-to-block.txt
|
||||
$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
|
||||
$ wc -l /tmp/ips-to-purge.txt
|
||||
202
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ http <span style="color:#e6db74">'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1'</span> > /tmp/2021-04-ips.json
|
||||
</span></span><span style="display:flex;"><span># Ghetto way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">'</span>t figure out how to only print them and not the facet counts, so I just use sed
|
||||
</span></span><span style="display:flex;"><span>$ jq <span style="color:#e6db74">'.facet_counts.facet_fields.ip[]'</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">'^"'</span> | sed -e <span style="color:#e6db74">'s/"//g'</span> > /tmp/ips.txt
|
||||
</span></span><span style="display:flex;"><span>$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
|
||||
</span></span><span style="display:flex;"><span>$ csvgrep -c asn -r <span style="color:#e6db74">'^(49453|46844|206485|62282|36352|35913|35624|8100)$'</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
|
||||
</span></span><span style="display:flex;"><span>$ wc -l /tmp/networks-to-block.txt
|
||||
</span></span><span style="display:flex;"><span>125 /tmp/networks-to-block.txt
|
||||
</span></span><span style="display:flex;"><span>$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
|
||||
</span></span><span style="display:flex;"><span>$ wc -l /tmp/ips-to-purge.txt
|
||||
</span></span><span style="display:flex;"><span>202
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>Attempting to purge those only shows about 3,500 hits, but I will do it anyways
|
||||
<ul>
|
||||
<li>Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged</li>
|
||||
@ -715,9 +715,9 @@ $ wc -l /tmp/ips-to-purge.txt
|
||||
</code></pre><ul>
|
||||
<li>Even more annoying, they are not re-using their session ID:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
|
||||
4888
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
|
||||
</span></span><span style="display:flex;"><span>4888
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>This IP has made 36,000 requests to CGSpace…</li>
|
||||
<li>The IP is owned by <a href="internetvikings.com">Internet Vikings</a> in Sweden</li>
|
||||
<li>I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent</li>
|
||||
@ -729,17 +729,17 @@ $ wc -l /tmp/ips-to-purge.txt
|
||||
<li>I added these two IPs to the nginx IP bot identifier</li>
|
||||
<li>Jesus I found a few Russian IPs attempting SQL injection and path traversal, ie:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] "GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1" 200 143070 "https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf" "Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11"
|
||||
<pre tabindex="0"><code>45.9.20.71 - - [20/Oct/2021:02:31:15 +0200] "GET /bitstream/handle/10568/1820/Rhodesgrass.pdf?sequence=4&OoxD=6591%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23 HTTP/1.1" 200 143070 "https://cgspace.cgiar.org:443/bitstream/handle/10568/1820/Rhodesgrass.pdf" "Mozilla/5.0 (X11; U; Linux i686; es-AR; rv:1.8.1.11) Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/2.0.0.11"
|
||||
</code></pre><ul>
|
||||
<li>I reported them to AbuseIPDB.com and purged their hits:</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
|
||||
Purging 6364 hits from 45.9.20.71 in statistics
|
||||
Purging 8039 hits from 45.146.166.157 in statistics
|
||||
Purging 3383 hits from 45.155.204.82 in statistics
|
||||
<span style="color:#960050;background-color:#1e0010">
|
||||
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
|
||||
</code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
|
||||
</span></span><span style="display:flex;"><span>Purging 6364 hits from 45.9.20.71 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 8039 hits from 45.146.166.157 in statistics
|
||||
</span></span><span style="display:flex;"><span>Purging 3383 hits from 45.155.204.82 in statistics
|
||||
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
|
||||
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
|
||||
</span></span></code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
|
||||
<ul>
|
||||
<li>Update Docker containers for AReS on linode20 and run a fresh harvest</li>
|
||||
<li>Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent “Microsoft Internet Explorer”
|
||||
@ -757,13 +757,13 @@ Purging 3383 hits from 45.155.204.82 in statistics
|
||||
<li>That’s from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it’s <a href="availo.se">Availo Networks AB</a></li>
|
||||
<li>There’s another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it’s using a normal user agent</li>
|
||||
</ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
|
||||
3991
|
||||
~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
|
||||
3154 GET /rest/collections
|
||||
427 GET /rest/handle
|
||||
410 GET /rest/items
|
||||
</code></pre></div><ul>
|
||||
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
|
||||
</span></span><span style="display:flex;"><span>3991
|
||||
</span></span><span style="display:flex;"><span>~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
|
||||
</span></span><span style="display:flex;"><span> 3154 GET /rest/collections
|
||||
</span></span><span style="display:flex;"><span> 427 GET /rest/handle
|
||||
</span></span><span style="display:flex;"><span> 410 GET /rest/items
|
||||
</span></span></code></pre></div><ul>
|
||||
<li>It requested the <a href="https://cgspace.cgiar.org/handle/10568/75560">CIAT Story Maps</a> collection over 3,000 times last month…
|
||||
<ul>
|
||||
<li>I will purge those hits</li>
|
||||
|