mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -94,11 +94,11 @@
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2022-03/'>Read more →</a>
</article>
@@ -170,13 +170,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-12/'>Read more →</a>
</article>
@@ -199,9 +199,9 @@ Purging 455 hits from WhatsApp in statistics
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">'time:2019-*'</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">'time:2019-*'</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@@ -223,15 +223,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
</span></span><span style="display:flex;"><span>ations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@@ -288,8 +288,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@@ -313,9 +313,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>